Lilith

A Distributed Video Server

John Kemp, ESPN

About Me


  • Architect at Nokia for ~9 years
  • Career focus on security, identity and the Web
  • Member of ESPN Technology Innovation team
  • Prototype interesting things using interesting tech, loosely aimed at television and video
  • User of server-side JavaScript (on and off) since 1996

About Lilith

  • Protocol to upload/play video (files)
  • Uses websockets to send "messages"
  • Riak Core used to distribute file replicas
  • JavaScript browser client for upload/playback

  • Controller of d(a)emons and Node killer

This is an experiment


Why?

  • Learn me some Erlang for great good. Or maybe not.
  • Learn how to use Riak Core to build a distributed system that is not a K/V store (video replicas)
  • Use Cowboy Websockets because I've heard it's cool
  • Work loosely with something that ESPN actually does (store and play video clips)

What I believed about Erlang


  • A functional language would be inherently scalable without me trying too hard
  • Process-oriented model would be inherently scalable without me trying too hard
  • Cowboy websockets would be more scalable than socket.io on Node.js
  • Riak Core would get me the rest of the way from Erlang/OTP to a fully fault-tolerant and scalable system
  • Learning a "weird" language would gain me reputation in my team!

Basic protocol


  • Binary messages containing chunks of file data
  • Control messages in "plaintext" to start transactions
  • Websockets chosen for their potential to perform better than "full" HTTP
  • Event-based model makes it easy to show a progress bar (for example)
  • Lots of browser examples to figure out how to use!
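A minimal sketch of how incoming frames might be classified on the server, assuming the two prefixes that appear in the handler code of this talk ("Upload:" on binary frames, "Play:" on text frames). The module name, the returned tuples, and the idea that everything after the prefix is the payload or file name are assumptions for illustration, not Lilith's actual wire format.

```erlang
-module(lilith_frame).
-export([parse/1]).

%% Hypothetical classifier for websocket frames.
%% Assumption: a binary frame carries "Upload:" followed by file data,
%% and a text frame carries "Play:" followed by a file name.
parse({binary, <<"Upload:", Payload/binary>>}) ->
    {upload_chunk, Payload};
parse({text, <<"Play:", Name/binary>>}) ->
    {play, Name};
parse(_Other) ->
    %% Anything else is rejected rather than guessed at.
    {error, unknown_frame}.
```

For example, `lilith_frame:parse({text, <<"Play:clip.mp4">>})` returns `{play, <<"clip.mp4">>}`. Matching on the binary prefix in the function head keeps the dispatch declarative, which is the same style the Cowboy handler clauses use.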

Websockets with Cowboy

  1. Route

    Dispatch = [{'_', [{[<<"index.html">>], index_handler, []},
                       {[<<"player.html">>], player_handler, []},
                       {'_', new_ws_handler, []}
                      ]}]

  2. Handle

    websocket_handle({binary, <<"Upload:", Msg/binary>>}, Req, State) ->
      lager:debug("Received binary upload command"),
      ...

    websocket_handle({text, <<"Play:", Msg/binary>>}, Req, State) ->
      lager:debug("Received text play command"),
      ...

Riak Core

  • Distributed, scalable, fault-tolerant
  • Implementation of the ideas in the Amazon Dynamo paper


Which enables...


  • No single point of failure
  • Self-healing (supervised)
  • Scale cheaply by adding commodity hardware


THIS IS NOT YOUR FATHER'S KV STORE!

Like this...

Getting started...

  • TL;DR
  • Implement your core code as a Riak Core vnode
  • -> fill out handle_command/3 and other vnode functions

  • Create a basic command set as your interface
  • ... "play", "upload"

  • Deploy and start a cluster

Getting twisted

The "vnode"

  • Implement your command set here as 'handle_command' operations:

    handle_command({upload, ReqID, Filename, Size}, _Sender, State)

  • Start a riak_core_vnode_master
  • vnodes are Erlang processes -> need supervision to be fault-tolerant
  • Lots I didn't understand yet, so left unimplemented (for now)
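A rough sketch of what the command set could look like in vnode form. It only mirrors the shape of the `init/1` and `handle_command/3` callbacks and does not depend on Riak Core, so it is not a working vnode. The module name, the `play` command tuple, the per-vnode bookkeeping in an orddict, and the reply tuples are all assumptions for illustration.

```erlang
-module(lilith_vnode_sketch).
-export([init/1, handle_command/3]).

%% Hypothetical vnode state: the partition we own plus a
%% Filename -> Size table of uploads seen so far.
-record(state, {partition, uploads}).

init([Partition]) ->
    {ok, #state{partition = Partition, uploads = orddict:new()}}.

%% "upload": remember the file and acknowledge with the request id.
handle_command({upload, ReqID, Filename, Size}, _Sender,
               State = #state{uploads = Uploads}) ->
    NewState = State#state{uploads = orddict:store(Filename, Size, Uploads)},
    {reply, {ReqID, ok}, NewState};
%% "play" (assumed command shape): report the stored size or notfound.
handle_command({play, ReqID, Filename}, _Sender,
               State = #state{uploads = Uploads}) ->
    case orddict:find(Filename, Uploads) of
        {ok, Size} -> {reply, {ReqID, ok, Size}, State};
        error      -> {reply, {ReqID, error, notfound}, State}
    end;
%% Unknown commands are ignored in this sketch.
handle_command(_Unknown, _Sender, State) ->
    {noreply, State}.
```

Tagging every reply with the request id lets the caller match replies to requests, whichever vnode answers first.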

Man vs. Finite State Machine


waiting({error, ReqID, Val}, SD0=#state{from=From, num_r=NumR0, replies=Replies0}) ->
...
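
To make the `waiting` state a little more concrete, here is a simplified sketch of a state function that collects replies until a quorum is reached. It returns gen_fsm-style tuples but is written as a plain module so it can be tried in the shell. The quorum field `r`, the helper functions, and the omission of any reply to the original caller are assumptions, not what Lilith does.

```erlang
-module(lilith_fsm_sketch).
-export([new/2, waiting/2, replies/1]).

%% Hypothetical state data: request id, required replies (r),
%% replies counted so far (num_r) and the replies themselves.
-record(state, {req_id, r, num_r = 0, replies = []}).

new(ReqID, R) ->
    #state{req_id = ReqID, r = R}.

replies(#state{replies = Replies}) ->
    Replies.

%% Count ok and error replies for our request id; stop once R have arrived.
waiting({Tag, ReqID, Val},
        SD0 = #state{req_id = ReqID, r = R, num_r = NumR0, replies = Replies0})
  when Tag =:= ok; Tag =:= error ->
    NumR = NumR0 + 1,
    SD = SD0#state{num_r = NumR, replies = [{Tag, Val} | Replies0]},
    if
        NumR >= R -> {stop, normal, SD};
        true      -> {next_state, waiting, SD}
    end;
%% Replies for other request ids leave the state untouched.
waiting(_Stale, SD) ->
    {next_state, waiting, SD}.
```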
                        

Ports, NIFs and the like

  • Intent was to use a C++ library to acquire file metadata
  • Multiple options make this harder than I'd like
  • I didn't really get here yet!

Putting it all together

The good, the bad, the ugly

  • No good way to print "anything" (lager:debug pretty print often didn't)
  • Not easy to see which 'terms' had to match between files in Riak Core, or where to configure what. For example:

    ok = riak_core:register_vnode_module(lilith_chunkr_vnode),
    ok = riak_core_node_watcher:service_up(lilith_chunkr, self()),

  • No single, reliable way to 'make' (there are lots of individual ways that are more or less reliable)
  • Changing app.config to parameterize the Cowboy port was not intuitive and didn't work with Rebar's make devrel
  • Distributed system (+FSM) debugging
  • ... BUT IT WORKED!

Erlang vs. Node Smackdown!

  • Node code is (was?) easier (for me) to write
  • Is there a Riak Core equivalent in JavaScript?
  • Tests _seem_ to show Erlang/WS scales better than Node.js
  • Comparisons are odious... but
  • Erlang _style_ seems to suit high-scale systems development

What do I believe about Erlang now?

The elegance of recursion


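%% Walk the list until its first element is a tuple tagged with Name.
%% Head is passed through unchanged and returned with the remaining Tail.
lookup(Name, [{Name, Buffer, F, Downloaded} | Tail], Head) ->
    {ok, Buffer, F, Downloaded, Head, Tail};
lookup(Name, [_ | Tail], Head) ->
    lookup(Name, Tail, Head);
%% An empty list means no entry matched.
lookup(_, [], _) ->
    {ok, notfound}.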
                        

Patterns


receive
    {ReqID, ok} -> ok;
    {ReqID, ok, Val} -> {ok, Val};
    {ReqID, error, Val} -> {error, Val}
after Timeout ->
    {error, timeout}
end.
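
The same pattern can be wrapped in a function and tried from the shell. This is a small self-contained sketch: the module name, the function names and the `demo/0` helper are made up for illustration, and the spawned process simply stands in for a vnode sending a reply.

```erlang
-module(lilith_wait).
-export([wait_for_reqid/2, demo/0]).

%% Block until a reply tagged with ReqID arrives, or give up after Timeout ms.
wait_for_reqid(ReqID, Timeout) ->
    receive
        {ReqID, ok} -> ok;
        {ReqID, ok, Val} -> {ok, Val};
        {ReqID, error, Val} -> {error, Val}
    after Timeout ->
        {error, timeout}
    end.

%% One request that is answered and one that nobody answers.
demo() ->
    ReqID = make_ref(),
    Caller = self(),
    spawn(fun() -> Caller ! {ReqID, ok, <<"chunk">>} end),
    Answered = wait_for_reqid(ReqID, 1000),
    Unanswered = wait_for_reqid(make_ref(), 10),
    {Answered, Unanswered}.
```

Because `ReqID` is already bound when the `receive` runs, only messages carrying that exact reference match; anything else stays in the mailbox.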
                        

Attachments


14:20:47.307 [debug] Supervisor ranch_sup started ranch_listener_sup:start_link(http, 100, 
ranch_tcp, [{port,8080}], cowboy_protocol, [{env,[{dispatch,[{'_',[{[<<"index.html">>],
index_handler,[]},{[<<"player.html">>],player_handler...}...]}]}]}]) at pid <0.144.0>

14:20:47.307 [info] Application lilith started on node 'lilith@127.0.0.1'

Eshell V5.9.3.1  (abort with ^G)

(lilith@127.0.0.1)1> 
                        

A cluster of easy


make devrel

for d in dev/dev*; do $d/bin/lilith start; done
for d in dev/dev*; do $d/bin/lilith ping; done
for d in dev/dev{2,3}; do $d/bin/lilith-admin join lilith1@127.0.0.1; done
                        

Summary

  • I love recursive list operations
  • Pattern matching and term-based model make for very natural coding
  • Console is ideal for exploratory programming and debugging -- ideal for an eager novice!
  • Clustering nodes is built into the system
  • Riak Core provides additional redundancy and scalability features
  • This story is not over...

Thanks for all the fish