First, we have two machines on the internal network:

10.80.3.97 acts as the master; all the code lives on this machine.
10.80.3.98 acts as the slave, which boots over the network from the master.

As described in the kernel documentation, we can start a node on each of the two machines as follows (note that the Erlang VMs should ideally be the same version and installed in the same directory):
On the master (10.80.3.97):

erl -kernel start_boot_server true boot_server_slaves '[{10,80,3,97}, {10,80,3,98}]' -name master@10.80.3.97 -setcookie a

On the slave (10.80.3.98):

erl -name slave@10.80.3.98 -loader inet -hosts "10.80.3.97" -id master@10.80.3.97 -setcookie a
The slave has no code on it at all; over the network it can be booted disklessly. Next, let's trace through how this works.

First, let's look at the code of the init module:
do_boot(Init, Flags, Start) ->
    process_flag(trap_exit, true),
    {Pgm0, Nodes, Id, Path} = prim_load_flags(Flags),
    Root = b2s(get_flag('-root', Flags)),
    PathFls = path_flags(Flags),
    Pgm = b2s(Pgm0),
    _Pid = start_prim_loader(Init, b2a(Id), Pgm, bs2as(Nodes),
                             bs2ss(Path), PathFls),
    BootFile = bootfile(Flags, Root),
    BootList = get_boot(BootFile, Root),
    LoadMode = b2a(get_flag('-mode', Flags, false)),
    Deb = b2a(get_flag('-init_debug', Flags, false)),
    catch ?ON_LOAD_HANDLER ! {init_debug_flag, Deb},
    BootVars = get_flag_args('-boot_var', Flags),
    ParallelLoad =
        (Pgm =:= "efile") and (erlang:system_info(thread_pool_size) > 0),
    PathChoice = code_path_choice(),
    eval_script(BootList, Init, PathFls, {Root, BootVars}, Path,
                {true, LoadMode, ParallelLoad}, Deb, PathChoice),
    %% To help identifying Purify windows that pop up,
    %% print the node name into the Purify log.
    (catch erlang:system_info({purify, "Node: " ++ atom_to_list(node())})),
    start_em(Start).
We can see that do_boot obtains the relevant flags via prim_load_flags, then starts erl_prim_loader via start_prim_loader.

erl_prim_loader then decides, based on the method given by the -loader flag, whether to load code from the local file system or over the network.

In erl_prim_loader:
start_it("inet", Id, Pid, Hosts) ->
    process_flag(trap_exit, true),
    ?dbg(inet, {Id,Pid,Hosts}),
    AL = ipv4_list(Hosts),
    ?dbg(addresses, AL),
    {ok,Tcp} = find_master(AL),
    init_ack(Pid),
    PS = prim_init(),
    State = #state {loader = inet,
                    hosts = AL,
                    id = Id,
                    data = Tcp,
                    timeout = ?IDLE_TIMEOUT,
                    n_timeouts = ?N_TIMEOUTS,
                    prim_state = PS},
    loop(State, Pid, []);
The loader first has to discover the master via find_master, using the addresses given by the -hosts flag.
find_master(AL) ->
    find_master(AL, ?EBOOT_RETRY, ?EBOOT_REQUEST_DELAY,
                ?EBOOT_SHORT_RETRY_SLEEP, ?EBOOT_UNSUCCESSFUL_TRIES,
                ?EBOOT_LONG_RETRY_SLEEP).

find_master(AL, Retry, ReqDelay, SReSleep, Tries, LReSleep) ->
    {ok,U} = ll_udp_open(0),
    find_master(U, Retry, AL, ReqDelay, SReSleep, [], Tries, LReSleep).

%%
%% Master connect loop
%%
find_master(U, Retry, AddrL, ReqDelay, SReSleep, Ignore, Tries, LReSleep) ->
    case find_loop(U, Retry, AddrL, ReqDelay, SReSleep,
                   Ignore, Tries, LReSleep) of
        [] ->
            find_master(U, Retry, AddrL, ReqDelay, SReSleep, Ignore,
                        Tries, LReSleep);
        Servers ->
            ?dbg(servers, Servers),
            case connect_master(Servers) of
                {ok, Socket} ->
                    ll_close(U),
                    {ok, Socket};
                _Error ->
                    find_master(U, Retry, AddrL, ReqDelay, SReSleep,
                                Servers ++ Ignore, Tries, LReSleep)
            end
    end.

connect_master([{_Prio,IP,Port} | Servers]) ->
    case ll_tcp_connect(0, IP, Port) of
        {ok, S} -> {ok, S};
        _Error -> connect_master(Servers)
    end;
connect_master([]) ->
    {error, ebusy}.
It then sends a request message to a well-known UDP port to learn the master's TCP port, and connects to that TCP port, over which all subsequent load requests are made.
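Judging from the erl_boot_server source, the UDP handshake is a simple request/reply: the slave sends the string "EBOOTQ" followed by its emulator version, and the master answers with "EBOOTR", a priority byte, the TCP listen port as a 16-bit big-endian integer, and its own version. A minimal sketch of querying a boot server by hand (query_boot_server is a hypothetical helper name; it assumes a boot server is already running on Host and that the local emulator version matches the server's):

```erlang
%% Sketch: ask a running erl_boot_server on Host which TCP port to use.
%% Assumes the local emulator version matches the server's, since the
%% server compares the whole request string against "EBOOTQ" ++ Version.
query_boot_server(Host) ->
    {ok, U} = gen_udp:open(0, [binary, {active, false}]),
    Req = ["EBOOTQ", erlang:system_info(version)],
    ok = gen_udp:send(U, Host, 4368, Req),   % 4368 is the default EBOOT port
    Result =
        case gen_udp:recv(U, 0, 5000) of
            {ok, {_IP, _Port,
                  <<"EBOOTR", Priority, TcpPort:16, Version/binary>>}} ->
                {ok, Priority, TcpPort, binary_to_list(Version)};
            Other ->
                {error, Other}
        end,
    gen_udp:close(U),
    Result.
```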
So where do this UDP port and TCP port live on the master node? That is the job of the erl_boot_server module in kernel.

Recall the flags we passed when starting the master node:
-kernel start_boot_server true boot_server_slaves '[{10,80,3,97}, {10,80,3,98}]'
This means that when kernel starts, a boot_server process is started on this node, with the value of boot_server_slaves passed in as a start argument; the slave list is then managed by the boot server.

Let's look at the boot server's initialization code:
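As a side note, the slave list does not have to be fixed at boot time: erl_boot_server also exports functions for starting the server and managing the allowed slaves at runtime. A quick sketch, to be run in a shell on the master node (the 10.80.3.99 address is just an illustrative extra slave):

```erlang
%% On the master node: start a boot server and manage its slave list.
{ok, _Pid} = erl_boot_server:start([{10,80,3,98}]),
ok = erl_boot_server:add_slave({10,80,3,99}),
Slaves = erl_boot_server:which_slaves(),
ok = erl_boot_server:delete_slave({10,80,3,99}).
```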
init(Slaves) ->
    {ok, U} = gen_udp:open(?EBOOT_PORT, []),
    {ok, L} = gen_tcp:listen(0, [binary,{packet,4}]),
    {ok, Port} = inet:port(L),
    {ok, UPort} = inet:port(U),
    Ref = make_ref(),
    Pid = proc_lib:spawn_link(?MODULE, boot_init, [Ref]),
    gen_tcp:controlling_process(L, Pid),
    Pid ! {Ref, L},
    %% We trap exit inorder to restart boot_init and udp_port
    process_flag(trap_exit, true),
    {ok, #state {priority = 0,
                 version = erlang:system_info(version),
                 udp_sock = U,
                 udp_port = UPort,
                 listen_sock = L,
                 listen_port = Port,
                 slaves = ordsets:from_list(Slaves),
                 bootp = Pid
                }}.
We can see that boot_server opens a UDP port (4368 by default) and at the same time listens on a randomly chosen TCP port.

The UDP port's job is to let other nodes ask for the actual port of the TCP service, and to check whether the asking node is in the allowed slaves list. All the real remote requests are then served over the TCP connection.

Concretely, in our example the master loads code with the efile method while the slave uses the inet method, loading code remotely through the service provided by the master's boot_server (the low-level loading is actually done by the prim_file and prim_inet modules respectively).
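On the slave you can watch this remote loading in action: erl_prim_loader:get_file/1 fetches a file through whichever loader method is active, so with -loader inet the call below travels over the boot server's TCP connection (the .beam path is hypothetical):

```erlang
%% In a shell on the slave node: fetch a file through the active loader.
%% With -loader inet this request is served by the master's boot server.
case erl_prim_loader:get_file("/path/to/some_module.beam") of
    {ok, Bin, FullName} ->
        io:format("fetched ~s (~p bytes)~n", [FullName, byte_size(Bin)]);
    error ->
        io:format("file not found~n")
end.
```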
Er, from the source it looks like the -id startup flag doesn't actually do much here; I'm not too clear on this one, so let's leave it at that for now, ok~