Storm source code analysis (3): the topology startup process
Starting a Storm topology means executing storm jar topology1.jar MAINCLASS ARG1 ARG2.
Since the script-parsing process was analyzed in an earlier post, this one focuses on how topology1.jar is executed.
Let's use ExclamationTopology from storm-starter as the example:
public class ExclamationTopology {
    public static class ExclamationBolt extends BaseRichBolt {
        OutputCollector _collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            _collector = collector;
        }

        @Override
        public void execute(Tuple tuple) {
            _collector.emit(tuple, new Values(tuple.getString(0) + "!!!"));
            _collector.ack(tuple);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("word", new TestWordSpout(), 10);
        builder.setBolt("exclaim1", new ExclamationBolt(), 3).shuffleGrouping("word");
        builder.setBolt("exclaim2", new ExclamationBolt(), 2).shuffleGrouping("exclaim1");

        Config conf = new Config();
        conf.setDebug(true);

        if (args != null && args.length > 0) {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
        } else {
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("test", conf, builder.createTopology());
            Utils.sleep(10000);
            cluster.killTopology("test");
            cluster.shutdown();
        }
    }
}
As the code shows, starting a topology involves three steps:
(1) Create a TopologyBuilder, and set the spout input sources and the bolts
(2) Create a Config and set the configuration options
(3) Submit the topology
Creating the TopologyBuilder
Creating the TopologyBuilder object is trivial; let's first look at setSpout():
public SpoutDeclarer setSpout(String id, IRichSpout spout, Number parallelism_hint) {
    validateUnusedId(id);
    initCommon(id, spout, parallelism_hint);
    _spouts.put(id, spout);
    return new SpoutGetter(id);
}
First, it checks whether the componentId has already been used; if so, it throws an error immediately.
Then it initializes the component: a ComponentCommon object is created and its fields are set, and the common is recorded in TopologyBuilder's member variable Map<String, ComponentCommon> _commons, keyed by componentId ("word" here). The code is as follows:
private void initCommon(String id, IComponent component, Number parallelism) {
    ComponentCommon common = new ComponentCommon();
    common.set_inputs(new HashMap<GlobalStreamId, Grouping>());
    if (parallelism != null) common.set_parallelism_hint(parallelism.intValue());
    Map conf = component.getComponentConfiguration();
    if (conf != null) common.set_json_conf(JSONValue.toJSONString(conf));
    _commons.put(id, common);
}
ComponentCommon is a thrift struct, defined in storm.thrift as follows:
struct ComponentCommon {
  1: required map<GlobalStreamId, Grouping> inputs;
  2: required map<string, StreamInfo> streams; // key is stream id
  3: optional i32 parallelism_hint; // how many threads across the cluster should be dedicated to this component

  // component specific configuration respects:
  // topology.debug: false
  // topology.max.task.parallelism: null // can replace isDistributed with this
  // topology.max.spout.pending: null
  // topology.kryo.register // this is the only additive one

  // component specific configuration
  4: optional string json_conf;
}
Finally, the spout itself is recorded in TopologyBuilder's member variable Map<String, IRichSpout> _spouts, again keyed by componentId ("word" here).
setBolt works the same way as setSpout: the common ends up in TopologyBuilder's Map<String, ComponentCommon> _commons, keyed by componentId ("exclaim1" here), and the bolt itself is recorded in TopologyBuilder's Map<String, IRichBolt> _bolts, with the same componentId ("exclaim1") as key.
Next, the .shuffleGrouping("word") part calls shuffleGrouping on the BoltDeclarer returned by setBolt.
This eventually reaches grouping(); since no streamId is specified here, "default" is used instead.
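The bookkeeping described above can be sketched with plain Java collections. This is a simplified stand-in for TopologyBuilder, not the real class: components are modeled as plain Objects and ComponentCommon is reduced to just a parallelism hint, so only the map structure and the duplicate-id check are shown:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for TopologyBuilder's internal bookkeeping:
// three maps keyed by componentId, plus duplicate-id validation.
public class BuilderSketch {
    private final Map<String, Object> spouts = new HashMap<>();   // like _spouts
    private final Map<String, Object> bolts = new HashMap<>();    // like _bolts
    private final Map<String, Integer> commons = new HashMap<>(); // like _commons (parallelism only)

    public void setSpout(String id, Object spout, int parallelism) {
        validateUnusedId(id);
        commons.put(id, parallelism); // corresponds to initCommon
        spouts.put(id, spout);
    }

    public void setBolt(String id, Object bolt, int parallelism) {
        validateUnusedId(id);
        commons.put(id, parallelism);
        bolts.put(id, bolt);
    }

    // reusing a componentId across spouts and bolts is an error
    private void validateUnusedId(String id) {
        if (spouts.containsKey(id) || bolts.containsKey(id)) {
            throw new IllegalArgumentException("Component id " + id + " already used");
        }
    }

    public int parallelismOf(String id) {
        return commons.get(id);
    }
}
```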
public BoltDeclarer shuffleGrouping(String componentId) {
    return shuffleGrouping(componentId, Utils.DEFAULT_STREAM_ID);
}

public BoltDeclarer shuffleGrouping(String componentId, String streamId) {
    return grouping(componentId, streamId, Grouping.shuffle(new NullStruct()));
}
The last argument to grouping here builds a Grouping object whose shuffle field is filled with a NullStruct. Grouping is a union defined in storm.thrift (thrift generates the corresponding Java code) that covers the many available grouping types.
private BoltDeclarer grouping(String componentId, String streamId, Grouping grouping) {
    _commons.get(_boltId).put_to_inputs(new GlobalStreamId(componentId, streamId), grouping);
    return this;
}
The grouping function takes the ComponentCommon previously recorded in _commons under the bolt's componentId and fills in its inputs field, which records where this bolt's input comes from. Taking the first setBolt as an example: the ComponentCommon for componentId "exclaim1" is fetched, and its inputs is set so that the stream from componentId "word" with streamId "default" serves as the first bolt's input source.
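The bookkeeping that grouping() performs can be sketched with plain Java collections. This is a simplified stand-in, not the real BoltDeclarer; GlobalStreamId is modeled here as a "component:stream" string key, and the class and field names are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of grouping(): look up the bolt's ComponentCommon
// (reduced here to just its inputs map) and record which upstream
// component/stream feeds it, and with which grouping type.
public class GroupingSketch {
    public static final String DEFAULT_STREAM_ID = "default";

    // componentId -> inputs ("component:stream" -> grouping type)
    public static Map<String, Map<String, String>> commons = new HashMap<>();

    // shuffleGrouping with one argument falls back to the default stream id
    public static void shuffleGrouping(String boltId, String componentId) {
        grouping(boltId, componentId, DEFAULT_STREAM_ID, "shuffle");
    }

    public static void grouping(String boltId, String componentId, String streamId, String how) {
        commons.computeIfAbsent(boltId, k -> new HashMap<>())
               .put(componentId + ":" + streamId, how);
    }
}
```

After shuffleGrouping("exclaim1", "word"), the entry under "exclaim1" says its input is the "word" component's "default" stream, shuffle-grouped, mirroring the GlobalStreamId("word", "default") -> Grouping.shuffle(...) entry the real code stores.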
Creating the Config
Config is straightforward: it extends Map and adds configuration entries to itself via setXxx() methods.
This example makes two set calls.
conf.setDebug(true) inserts the entry ("topology.debug" -> true) into the map, marking debug mode as on.
conf.setNumWorkers(3) likewise inserts ("topology.workers" -> 3), setting the worker count to 3.
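The two setters can be sketched as a HashMap subclass. This is a minimal stand-in, not Storm's Config class; the key strings match the ones mentioned above, but everything else (value types, validation) is omitted:

```java
import java.util.HashMap;

// Simplified Config: a HashMap whose setters insert well-known keys,
// mirroring conf.setDebug(true) and conf.setNumWorkers(3) above.
public class ConfigSketch extends HashMap<String, Object> {
    public void setDebug(boolean debug) {
        put("topology.debug", debug);
    }

    public void setNumWorkers(int workers) {
        put("topology.workers", workers);
    }
}
```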
Submitting the topology -- the key part
StormSubmitter.submitTopology(args[0], conf, builder.createTopology());
(1) createTopology
builder.createTopology() uses the previously constructed TopologyBuilder to build a StormTopology object.
public StormTopology createTopology() {
    Map<String, Bolt> boltSpecs = new HashMap<String, Bolt>();
    Map<String, SpoutSpec> spoutSpecs = new HashMap<String, SpoutSpec>();
    for (String boltId : _bolts.keySet()) {
        IRichBolt bolt = _bolts.get(boltId);
        ComponentCommon common = getComponentCommon(boltId, bolt);
        boltSpecs.put(boltId, new Bolt(ComponentObject.serialized_java(Utils.serialize(bolt)), common));
    }
    for (String spoutId : _spouts.keySet()) {
        IRichSpout spout = _spouts.get(spoutId);
        ComponentCommon common = getComponentCommon(spoutId, spout);
        spoutSpecs.put(spoutId, new SpoutSpec(ComponentObject.serialized_java(Utils.serialize(spout)), common));
    }
    return new StormTopology(spoutSpecs, boltSpecs, new HashMap<String, StateSpoutSpec>());
}
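The Utils.serialize call above boils down to standard java.io object serialization of the spout/bolt instance (which is why components must be Serializable), and ComponentObject.serialized_java just wraps the resulting bytes. A simplified sketch of that round trip, not the actual Utils implementation:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Sketch of what Utils.serialize / deserialize amount to: plain
// java.io object serialization of the component instance.
public class SerializeSketch {
    public static byte[] serialize(Serializable obj) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(obj);
            }
            return bos.toByteArray();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static Object deserialize(byte[] bytes) {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```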
Note that StormTopology is also a struct defined in storm.thrift:
struct StormTopology {
  // ids must be unique across maps
  // #workers to use is in conf
  1: required map<string, SpoutSpec> spouts;
  2: required map<string, Bolt> bolts;
  3: required map<string, StateSpoutSpec> state_spouts;
}
(2) submitTopology
Submitting to the cluster calls StormSubmitter.submitTopology():
public static void submitTopology(String name, Map stormConf, StormTopology topology)
        throws AlreadyAliveException, InvalidTopologyException {
    if (!Utils.isValidConf(stormConf)) {
        throw new IllegalArgumentException("Storm conf is not valid. Must be json-serializable");
    }
    stormConf = new HashMap(stormConf);
    stormConf.putAll(Utils.readCommandLineOpts());
    Map conf = Utils.readStormConfig();
    conf.putAll(stormConf);
    try {
        String serConf = JSONValue.toJSONString(stormConf);
        if (localNimbus != null) {
            LOG.info("Submitting topology " + name + " in local mode");
            localNimbus.submitTopology(name, null, serConf, topology);
        } else {
            submitJar(conf);
            NimbusClient client = NimbusClient.getConfiguredClient(conf);
            try {
                LOG.info("Submitting topology " + name + " in distributed mode with conf " + serConf);
                client.getClient().submitTopology(name, submittedJar, serConf, topology);
            } finally {
                client.close();
            }
        }
        LOG.info("Finished submitting topology: " + name);
    } catch (TException e) {
        throw new RuntimeException(e);
    }
}
The flow is:
1) Validate the supplied configuration, read the command-line options from storm.options, read the default configuration default.yaml, and read storm.yaml; all of these entries are merged into conf.
2) Call submitJar(conf) to upload the jar to the master.
a. The jar to upload is obtained from the command-line property "storm.jar";
b. The Config.NIMBUS_HOST and Config.NIMBUS_THRIFT_PORT values are read to create a NimbusClient, which internally wraps a thrift-based rpc client for the Nimbus rpc server; the NimbusClient constructor creates the rpc client and opens the connection to the rpc server;
c. client.getClient().beginFileUpload() tells Nimbus a file is about to be uploaded, and Nimbus returns an upload path. The jar file is then read in segments, each uploaded via uploadChunk to the path Nimbus announced; once all of the jar data has been uploaded, finishFileUpload tells Nimbus the file at that path is complete, and finally the rpc connection is closed. beginFileUpload, uploadChunk, and finishFileUpload are all methods of service Nimbus defined in storm.thrift, and Nimbus.Iface is implemented in nimbus.clj.
d. After the jar is uploaded successfully, another rpc client is created to call submitTopology on Nimbus, also declared in service Nimbus in storm.thrift. Calling this method effectively tells Nimbus to run the topology; the call carries the name, which is the topology's name.
private static void submitJar(Map conf) {
    if (submittedJar == null) {
        LOG.info("Jar not uploaded to master yet. Submitting jar...");
        String localJar = System.getProperty("storm.jar");
        submittedJar = submitJar(conf, localJar);
    } else {
        LOG.info("Jar already uploaded to master. Not submitting jar.");
    }
}

public static String submitJar(Map conf, String localJar) {
    if (localJar == null) {
        throw new RuntimeException("Must submit topologies using the 'storm' client script so that StormSubmitter knows which jar to upload.");
    }
    NimbusClient client = NimbusClient.getConfiguredClient(conf);
    try {
        String uploadLocation = client.getClient().beginFileUpload();
        LOG.info("Uploading topology jar " + localJar + " to assigned location: " + uploadLocation);
        BufferFileInputStream is = new BufferFileInputStream(localJar);
        while (true) {
            byte[] toSubmit = is.read();
            if (toSubmit.length == 0) break;
            client.getClient().uploadChunk(uploadLocation, ByteBuffer.wrap(toSubmit));
        }
        client.getClient().finishFileUpload(uploadLocation);
        LOG.info("Successfully uploaded topology jar to assigned location: " + uploadLocation);
        return uploadLocation;
    } catch (Exception e) {
        throw new RuntimeException(e);
    } finally {
        client.close();
    }
}

public static NimbusClient getConfiguredClient(Map conf) {
    String nimbusHost = (String) conf.get(Config.NIMBUS_HOST);
    int nimbusPort = Utils.getInt(conf.get(Config.NIMBUS_THRIFT_PORT));
    return new NimbusClient(nimbusHost, nimbusPort);
}

public NimbusClient(String host, int port) {
    try {
        if (host == null) {
            throw new IllegalArgumentException("Nimbus host is not set");
        }
        conn = new TFramedTransport(new TSocket(host, port)); // create the rpc connection
        client = new Nimbus.Client(new TBinaryProtocol(conn)); // create the rpc client
        conn.open(); // open the connection
    } catch (TException e) {
        throw new RuntimeException(e);
    }
}
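The beginFileUpload / uploadChunk / finishFileUpload sequence is just a chunked file transfer. The following sketch mimics the client-side loop against an in-memory "Nimbus" instead of the real thrift RPC; the chunk size, upload path, and all names here are illustrative:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Mimics the beginFileUpload / uploadChunk / finishFileUpload RPC
// sequence with an in-memory "Nimbus". Not the real thrift client.
public class UploadSketch {
    static final int CHUNK = 4; // tiny chunk size for demonstration
    private final Map<String, ByteArrayOutputStream> uploads = new HashMap<>();

    // "Nimbus" side: returns the location the client must upload to
    public String beginFileUpload() {
        String loc = "/nimbus/inbox/stormjar-upload.jar"; // illustrative path
        uploads.put(loc, new ByteArrayOutputStream());
        return loc;
    }

    public void uploadChunk(String loc, byte[] chunk) {
        uploads.get(loc).write(chunk, 0, chunk.length);
    }

    public byte[] finishFileUpload(String loc) {
        return uploads.get(loc).toByteArray();
    }

    // Client side: same loop shape as submitJar above -- read segments
    // until empty, uploading each one to the announced location.
    public byte[] submitJar(byte[] jar) {
        String loc = beginFileUpload();
        ByteArrayInputStream is = new ByteArrayInputStream(jar);
        byte[] buf = new byte[CHUNK];
        while (true) {
            int n = is.read(buf, 0, buf.length);
            if (n <= 0) break;
            uploadChunk(loc, Arrays.copyOf(buf, n));
        }
        return finishFileUpload(loc);
    }
}
```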
At this point, the jar containing the topology has been uploaded and submitted to Nimbus; the remaining work is for Nimbus to distribute the topology to the supervisors for execution. Stay tuned for the follow-up analysis.