Hadoop in Action: Streaming

Using Streaming with Unix Commands

When using Streaming with command-line utilities, the input must be text, and each line is treated as one record. If the input format is TextInputFormat (whose keys are byte offsets into the file), Streaming passes only each record's value to the mapper.

Extracting the second column

  • input/output: specify the input and output directories
  • cut -f 2: keep only the second column
  • -d ,: use "," as the field delimiter
  • uniq: drop duplicate lines (adjacent ones only, which suffices here because reducer input arrives sorted)
# Remove the output directory
hadoop fs -rm -r /data-for-learn/out/hadoop-practice/streamingOut/

# Run the Streaming job (note the backslash line continuations, without which the command breaks)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
	-input /data-for-learn/hadoop-practice/cite75_99.txt \
	-output /data-for-learn/out/hadoop-practice/streamingOut \
	-mapper 'cut -f 2 -d ,' \
	-reducer 'uniq'

# View the output. Streaming processes everything as text, so the sort order is lexicographic, not numeric
hadoop fs -text  /data-for-learn/out/hadoop-practice/streamingOut/part-00000
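
The same job can be approximated locally as a plain Unix pipeline, which is handy for debugging before submitting to the cluster (this assumes a local copy of cite75_99.txt; sort stands in for the shuffle phase):
cat cite75_99.txt | cut -f 2 -d , | sort | uniq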

Counting the number of lines

  • No reducer is needed; configuration properties are set with -D (handled by GenericOptionsParser)
  • The -D options must come before the other arguments (otherwise: ERROR streaming.StreamJob: Unrecognized option: -D)
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
	-D mapred.reduce.tasks=0 \
	-input /data-for-learn/out/hadoop-practice/streamingOut \
	-output /data-for-learn/out/hadoop-practice/lineCount \
	-mapper 'wc -l'

# View the output
hadoop fs -text /data-for-learn/out/hadoop-practice/lineCount/part-00000

Using Streaming with Scripts

  • The script reads its data from Unix standard input (STDIN) and writes to standard output (STDOUT)

A Python script that randomly prints lines from STDIN

  • This script can be turned into a sampling program; sampling yields a smaller dataset, at the cost of some precision
    • With mapred.reduce.tasks=1, you get a single sample file
    • With mapred.reduce.tasks=0, you get one sample file per mapper, which can be merged afterwards with getmerge (see the example after the run below)
#!/usr/bin/env python
import sys, random

# print each input line with a probability of sys.argv[1] percent
for line in sys.stdin:
	if random.randint(1, 100) <= int(sys.argv[1]):
		print line.strip()
  • The -file option ships the script as part of the job submission, so every node gets a copy
  • The default reducer is IdentityReducer
# The mapper argument 10 samples at a rate of 10/100
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
	-D mapred.reduce.tasks=1 \
	-input /data-for-learn/hadoop-practice/cite75_99.txt \
	-output /data-for-learn/out/hadoop-practice/randomSample \
	-mapper 'RandomSample.py 10' \
	-file RandomSample.py

# View the results
hadoop fs -text /data-for-learn/out/hadoop-practice/randomSample/*
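
When mapred.reduce.tasks=0 leaves one part file per mapper, getmerge can pull them into a single local file, e.g. (the local filename here is just an example):
hadoop fs -getmerge /data-for-learn/out/hadoop-practice/randomSample sample_local.txt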

A Python script that finds the maximum value of an attribute

  • AttributeMax.py
#!/usr/bin/env python
#-*- coding:UTF-8 -*-
import sys

index = int(sys.argv[1])
max = 0
for line in sys.stdin:
	fields = line.strip().split(",")
	if fields[index].isdigit():
		val = int(fields[index])
		if val > max:
			max = val
else:	# for-else: the else clause runs once the loop finishes without a break (there is no break here, so it always runs)
	print max

  • Per-split maximum (one value per mapper; the identity reducer just collects them)
# The mapper argument 8 selects the ninth column
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
	-D mapred.reduce.tasks=1 \
	-input /data-for-learn/hadoop-practice/apat63_99.txt \
	-output /data-for-learn/out/hadoop-practice/AttributeMaxMapper \
	-mapper 'AttributeMax.py 8' \
	-file AttributeMax.py

# View the results
hadoop fs -text /data-for-learn/out/hadoop-practice/AttributeMaxMapper/*
  • Global maximum (across all splits)
# Mapper: maximum of the ninth column within each split
# Reducer: maximum of the first column of its input, i.e. of the per-split maxima
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
	-D mapred.reduce.tasks=1 \
	-input /data-for-learn/hadoop-practice/apat63_99.txt \
	-output /data-for-learn/out/hadoop-practice/AttributeMaxReducer \
	-mapper 'AttributeMax.py 8' \
	-reducer 'AttributeMax.py 0' \
	-file AttributeMax.py

# View the results
hadoop fs -text /data-for-learn/out/hadoop-practice/AttributeMaxReducer/*

Processing Key/Value Pairs with Streaming

Streaming uses a tab character ('\t') to separate a record's key from its value. If a record contains no '\t', the entire record is treated as the key and the value is empty.

A Python mapper script that outputs key/value pairs

  • AverageByAttributeMapper.py
  • Because the script's output contains '\t' separators, the shuffle phase treats each line as a key/value pair
#!/usr/bin/env python

import sys

for line in sys.stdin:
	fields = line.strip().split(",")
	if fields[8] and fields[8].isdigit():
		# fields[4][1:-1] strips the surrounding quotes from the country code
		print fields[4][1:-1] + "\t" + fields[8]
  • With mapred.reduce.tasks=0
    • keys appear in input order and are not grouped
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
	-D mapred.reduce.tasks=0 \
	-input /data-for-learn/hadoop-practice/apat63_99.txt \
	-output /data-for-learn/out/hadoop-practice/AverageByAttributeMapper0 \
	-mapper 'AverageByAttributeMapper.py' \
	-file AverageByAttributeMapper.py

# View the results
hadoop fs -text /data-for-learn/out/hadoop-practice/AverageByAttributeMapper0/*
  • With mapred.reduce.tasks=1
    • keys are sorted and grouped
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
	-D mapred.reduce.tasks=1 \
	-input /data-for-learn/hadoop-practice/apat63_99.txt \
	-output /data-for-learn/out/hadoop-practice/AverageByAttributeMapper1 \
	-mapper 'AverageByAttributeMapper.py' \
	-file AverageByAttributeMapper.py

# View the results
hadoop fs -text /data-for-learn/out/hadoop-practice/AverageByAttributeMapper1/* | head -n 24

A Python reducer script that computes averages

  • AverageByAttributeReducer.py
  • It processes input line by line; because the keys arrive sorted, it can detect group boundaries and compute a per-key average
#!/usr/bin/env python

import sys

(last_key, sum, count) = (None, 0.0, 0)

for line in sys.stdin:
	(key, val) = line.split("\t")
	# a new key marks the end of the previous group: emit that group's average and reset
	if last_key and last_key != key:
		print last_key + "\t" + str(sum / count)
		(sum, count) = (0.0, 0)

	last_key = key
	sum += float(val)
	count += 1

# emit the average for the final group
print last_key + "\t" + str(sum / count)
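
The whole chain can be simulated locally with an ordinary pipe before submitting to the cluster, with sort standing in for the shuffle phase (this assumes a local copy of apat63_99.txt):
cat apat63_99.txt | python AverageByAttributeMapper.py | sort | python AverageByAttributeReducer.py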
  • Run the job
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
	-D mapred.reduce.tasks=1 \
	-input /data-for-learn/hadoop-practice/apat63_99.txt \
	-output /data-for-learn/out/hadoop-practice/AverageByAttributeReducer \
	-mapper 'AverageByAttributeMapper.py' \
	-reducer 'AverageByAttributeReducer.py' \
	-file AverageByAttributeMapper.py \
	-file AverageByAttributeReducer.py

# View the results
hadoop fs -text /data-for-learn/out/hadoop-practice/AverageByAttributeReducer/*

Using Streaming with the Aggregate Package

Aggregation functions generally fall into three classes (see the sketch after this list):

  • Distributive: maximum, minimum, sum, and count (they obey a distributive law, so partial results combine directly into the global result)
  • Algebraic: average and variance (not distributive themselves, but computable from a fixed number of distributive parts)
  • Holistic: k smallest/largest, median
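
A minimal sketch of the distinction (the data here is purely illustrative): a distributive function such as sum combines directly across splits, while an algebraic one such as average must be carried as a (sum, count) pair and divided only at the end:
#!/usr/bin/env python
# partial results from two hypothetical input splits
split1, split2 = [3, 1, 4], [1, 5, 9]

# distributive: the sum of per-split sums is the global sum
total = sum(split1) + sum(split2)

# algebraic: averaging the per-split averages would be wrong in general;
# carry (sum, count) pairs instead and divide once at the end
pairs = [(sum(s), len(s)) for s in (split1, split2)]
print total, float(sum(p[0] for p in pairs)) / sum(p[1] for p in pairs)   # 23 3.8333...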

Mapper output format

Each output line names a value aggregator function: ValueAggregator:K\tV (for example, LongValueSum:key\t1)

Value aggregator functions supported by the Aggregate package

Value aggregator	Description
DoubleValueSum	sum of a sequence of double values
LongValueMax / LongValueMin / LongValueSum	maximum / minimum / sum of a sequence of long values
StringValueMax / StringValueMin	lexicographic maximum / minimum of a sequence of strings
UniqValueCount	count of unique values per key
ValueHistogram	per key: the number of unique values, plus the minimum, median, maximum, average, and standard deviation of their counts

An example using the LongValueSum aggregator

  • AttributeCount.py
  • The mapper emits lines of the form 'LongValueSum:' + key + '\t' + value
#!/usr/bin/env python

import sys

index = int(sys.argv[1])
for line in sys.stdin:
	fields = line.split(",")
	# each record contributes a count of 1; the aggregate reducer sums these per key
	print "LongValueSum:" + fields[index] + '\t' + '1'
  • Run
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
	-input /data-for-learn/hadoop-practice/apat63_99.txt \
	-output /data-for-learn/out/hadoop-practice/AttributeCountSum \
	-mapper 'AttributeCount.py 1' \
	-reducer aggregate \
	-file AttributeCount.py

# View the results
hadoop fs -text /data-for-learn/out/hadoop-practice/AttributeCountSum/*
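
For intuition, the aggregate reducer's handling of LongValueSum behaves roughly like the sketch below (a simplification, not the actual Java implementation): it strips the function prefix from each sorted "LongValueSum:key\tvalue" line and sums the values per key:
#!/usr/bin/env python
import sys

last_key, total = None, 0
for line in sys.stdin:
	(key, val) = line.strip().split("\t")
	key = key.split(":", 1)[1]        # drop the "LongValueSum:" prefix
	if last_key is not None and key != last_key:
		print last_key + "\t" + str(total)
		total = 0
	last_key = key
	total += long(val)
if last_key is not None:
	print last_key + "\t" + str(total)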

An example using the UniqValueCount aggregator

  • UniqueCount.py
  • Group by the field at index1 and count the distinct values of the field at index2 within each group (here, the number of distinct countries per year)
#!/usr/bin/env python

import sys

index1 = int(sys.argv[1])
index2 = int(sys.argv[2])
for line in sys.stdin:
	fields = line.split(",")
	# the aggregate reducer counts the distinct index2 values seen for each index1 key
	print "UniqValueCount:" + fields[index1] + '\t' + fields[index2]
  • Run
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
	-input /data-for-learn/hadoop-practice/apat63_99.txt \
	-output /data-for-learn/out/hadoop-practice/UniqueCountByCountry \
	-mapper 'UniqueCount.py 1 4' \
	-reducer aggregate \
	-file UniqueCount.py

# View the results
hadoop fs -text /data-for-learn/out/hadoop-practice/UniqueCountByCountry/*

An example using the ValueHistogram aggregator

For each key, ValueHistogram outputs, in order: the number of unique values, and the minimum, median, maximum, average, and standard deviation of those values' counts.

The mapper emits lines of the form ValueHistogram:K\tV\tCount
  • ValueHistogram.py
  • Group by the field at index1 and compute the statistics above over the field at index2
#!/usr/bin/env python
#-*- coding:UTF-8 -*-

import sys

index1 = int(sys.argv[1])
index2 = int(sys.argv[2])
for line in sys.stdin:
	fields = line.split(",")
	print "ValueHistogram:" + fields[index1] + '\t' + fields[index2]  # 最后的个数Count可以省略,默认为1
  • Run
hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-2.7.2.jar \
	-input /data-for-learn/hadoop-practice/apat63_99.txt \
	-output /data-for-learn/out/hadoop-practice/ValueHistogram \
	-mapper 'ValueHistogram.py 1 4' \
	-reducer aggregate \
	-file ValueHistogram.py

# View the results
hadoop fs -text /data-for-learn/out/hadoop-practice/ValueHistogram/*