WordCount: based on Kafka + Storm + HBase
Description
A WordCount program that integrates Kafka, Storm and HBase.
Data source: Kafka, topic "logs"
Word-frequency counting: Storm
Storage: the counting results are stored in HBase
1. Analysis
1.1 Storm topology
In the topology, a KafkaSpout receives data from Kafka; each received message is one line of text, i.e. a sentence.
A SentenceSplitBolt splits each sentence into words, a CountBolt counts how many times each word has appeared, and finally an HBase bolt stores the result in HBase.
Kafka -> KafkaSpout -> SentenceSplitBolt -> CountBolt -> HBase bolt
For example, after the sentence "storm kafka storm" has been processed, the result table holds two rows: row key "storm" with count 2 and row key "kafka" with count 1.
2. Implementation
Test environment
Two servers: hadoop1 and hadoop2

| Software | Node(s) |
| --- | --- |
| CentOS-6.4 | hadoop1, hadoop2 |
| Hadoop-2.5-cdh-5.3.6 | hadoop1 |
| kafka-2.10-0.8.1.1 | hadoop2 |
| hbase-0.98.6-cdh-5.3.6 | hadoop2 (HMaster), hadoop1 (RegionServer) |
| storm-0.9.6 | hadoop2 |
| zookeeper-3.4.5-cdh5.3.6 | hadoop2 |
SentenceSplitBolt
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class SentenceSplitBolt extends BaseRichBolt {

    static final Logger LOGGER = LoggerFactory.getLogger(SentenceSplitBolt.class);

    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context,
            OutputCollector collector) {
        this.collector = collector; // keep the collector for use in execute()
    }

    @Override
    public void execute(Tuple input) {
        // KafkaSpout's StringScheme uses "str" as the field name of the message
        String sentence = input.getStringByField("str");
        String[] words = sentence.split(" ");
        if (words.length > 0) {
            for (String word : words) {
                collector.emit(new Values(word)); // emit one tuple per word
            }
        }
        // acknowledge that the tuple was processed successfully
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}
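SentenceSplitBolt extends BaseRichBolt, so it has to ack every tuple by hand. As a point of comparison (not part of the original topology), the same splitter could be written on top of BaseBasicBolt, which acks the input tuple automatically after execute() returns. A minimal sketch, with the hypothetical class name BasicSentenceSplitBolt:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Alternative splitter sketch: BaseBasicBolt acks the input tuple automatically
// once execute() returns, so no explicit collector.ack() call is needed.
public class BasicSentenceSplitBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // "str" is the field name used by KafkaSpout's StringScheme
        String sentence = input.getStringByField("str");
        for (String word : sentence.split(" ")) {
            collector.emit(new Values(word)); // emit one tuple per word
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word"));
    }
}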
CountBolt
import java.util.Hashtable;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class CountBolt extends BaseRichBolt {

    static final Logger LOGGER = LoggerFactory.getLogger(CountBolt.class);

    private OutputCollector collector;

    // running count of each word, kept in the memory of this bolt task
    private Map<String, Integer> wordMap = new Hashtable<String, Integer>();

    @Override
    public void prepare(Map stormConf, TopologyContext context,
            OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple input) {
        String word = input.getStringByField("word");
        if (!wordMap.containsKey(word)) {
            wordMap.put(word, 0);
        }
        int count = wordMap.get(word);
        count++;
        wordMap.put(word, count);
        // For easier testing, emit the count as a String so the value stored
        // in HBase is readable when browsing the table in Hue.
        collector.emit(new Values(word, String.valueOf(count)));
        // acknowledge that the tuple was processed successfully
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("word", "count"));
    }
}
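Note that wordMap lives in the memory of each CountBolt task, so the counts start again from zero whenever the topology is restarted. The topology relies on fieldsGrouping on the "word" field (see WCTopology below) to route every occurrence of a given word to the same task, which is what keeps each word's running count consistent.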
WCTopology
import java.util.Map;
import java.util.UUID;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.common.collect.Maps;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.StormSubmitter;
import backtype.storm.generated.AlreadyAliveException;
import backtype.storm.generated.InvalidTopologyException;
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.tuple.Fields;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import org.apache.storm.hbase.bolt.HBaseBolt;
import org.apache.storm.hbase.bolt.mapper.SimpleHBaseMapper;

public class WCTopology {

    static Logger logger = LoggerFactory.getLogger(WCTopology.class);

    public static void main(String[] args) throws AlreadyAliveException, InvalidTopologyException, InterruptedException {
        TopologyBuilder builder = new TopologyBuilder();

        // KafkaSpout: consume the "test" topic via the ZooKeeper on hadoop2, keeping offsets under the "/test" zkRoot
        SpoutConfig spoutConf = new SpoutConfig(new ZkHosts("hadoop2"), "test", "/test", UUID.randomUUID().toString());
        spoutConf.forceFromStart = true;
        spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme()); // each message becomes a tuple with a single "str" field
        KafkaSpout kafkaSpout = new KafkaSpout(spoutConf);

        builder.setSpout("spout", kafkaSpout, 5);
        builder.setBolt("split", new SentenceSplitBolt(), 8).shuffleGrouping("spout");
        // fieldsGrouping on "word": every occurrence of the same word goes to the same CountBolt task
        builder.setBolt("count", new CountBolt(), 12).fieldsGrouping("split", new Fields("word"));

        // map "word" to the row key and "count" to a column in the "result" column family
        SimpleHBaseMapper mapper = new SimpleHBaseMapper()
                .withRowKeyField("word")
                .withColumnFields(new Fields("count"))
                .withColumnFamily("result");

        Map<String, Object> map = Maps.newTreeMap();
        map.put("hbase.rootdir", "hdfs://hadoop1:9000/hbase");
        map.put("hbase.zookeeper.quorum", "hadoop2:2181");

        // hbase-bolt: writes to the "wordcount" table, reading HBase settings from the "hbase.conf" config key
        HBaseBolt hBaseBolt = new HBaseBolt("wordcount", mapper).withConfigKey("hbase.conf");
        builder.setBolt("hbase", hBaseBolt, 6).shuffleGrouping("count");

        Config conf = new Config();
        conf.setDebug(true);
        conf.put("hbase.conf", map);

        // remote Nimbus host, for submitting to the cluster from an IDE
        // conf.put(Config.NIMBUS_HOST, "hadoop2");
        // conf.put(Config.NIMBUS_THRIFT_PORT, 6627);

        // cluster mode
        if (args != null && args.length > 0) {
            conf.setNumWorkers(3);
            StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
        }
        else { // local mode
            conf.setMaxTaskParallelism(3);
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", conf, builder.createTopology());
        }
    }
}
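As written above, the local-mode branch returns from main() immediately after submitting the topology; the InterruptedException in the method signature suggests the local run is meant to pause before exiting. A minimal sketch of such a tail, where the 2-minute duration is an assumption for illustration only:

            // Local mode: keep the embedded cluster alive long enough to push some
            // messages through Kafka and inspect the results, then shut it down.
            // The sleep duration is an assumption, not taken from the original code.
            cluster.submitTopology("word-count", conf, builder.createTopology());
            Thread.sleep(2 * 60 * 1000);
            cluster.shutdown();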
PrepareHbase: creates the wordcount table in HBase
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.MasterNotRunningException;
import org.apache.hadoop.hbase.ZooKeeperConnectionException;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class PrepareHbase {
    public static void main(String[] args) throws MasterNotRunningException, ZooKeeperConnectionException, IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir", "hdfs://hadoop1:9000/hbase");
        conf.set("hbase.zookeeper.quorum", "hadoop2:2181");
        HBaseAdmin admin = new HBaseAdmin(conf);
        // "wordcount" table with a single column family "result"
        HTableDescriptor tableDescriptor = new HTableDescriptor("wordcount");
        tableDescriptor.addFamily(new HColumnDescriptor("result"));
        admin.createTable(tableDescriptor);
        admin.close();
    }
}
3. Testing
- Run PrepareHbase to create the wordcount table
- Run WCTopology
- Start kafka-console-producer and type in some sentences as test input
- In Hue, check how many times the word "storm" has been counted
- Type "storm" into kafka-console-producer again and check that the count for "storm" has increased
- The stored count can also be read back programmatically, as sketched below
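Besides browsing the table in Hue, the result can be checked with a small HBase client program. A sketch under the mapper configuration above (row key = word, column family "result", qualifier "count"); the class name VerifyCount is just for illustration:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical helper: read the current count of the word "storm" from the
// "wordcount" table written by the topology.
public class VerifyCount {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.rootdir", "hdfs://hadoop1:9000/hbase");
        conf.set("hbase.zookeeper.quorum", "hadoop2:2181");

        HTable table = new HTable(conf, "wordcount");
        try {
            // row key = word; the CountBolt stored the count as a String
            Result result = table.get(new Get(Bytes.toBytes("storm")));
            byte[] value = result.getValue(Bytes.toBytes("result"), Bytes.toBytes("count"));
            System.out.println("storm -> " + (value == null ? "not found" : Bytes.toString(value)));
        } finally {
            table.close();
        }
    }
}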
4. Summary
Storm is a real-time stream processing system. In this experiment, Storm processes messages coming from Kafka and saves the processed results to HBase.