MapReduce
2016-12-21 16:53:49
mapred-default.xml
2016-12-11 18:56:44
JSON parsing
1. Create Java classes that mirror the key structure of the JSON objects;
2. Deserialize the JSON to read the values of those keys.
// Deserialization
// Assumption: JSON here is com.alibaba.fastjson.JSON; the notes do not name the library.
/*
[{ "value": "1", "key": "a" }, { "value": "2", "key": "b" }]
*/
String propsStr = "[{\"value\":\"1\",\"key\":\"a\"},{\"value\":\"2\",\"key\":\"b\"}]";
List<PropsObject> propsList = JSON.parseArray(propsStr, PropsObject.class);

/*
[{ "action_name": "begin", "current_time": 1481221047146 },
 { "action_name": "end", "current_time": 1481221058263,
   "props": [{ "value": "1", "key": "a" }, { "value": "2", "key": "b" }] }]
*/
String actionsStr = "[{\"action_name\":\"begin\",\"current_time\":1481221047146},{\"action_name\":\"end\",\"current_time\":1481221058263,\"props\":[{\"value\":\"1\",\"key\":\"a\"},{\"value\":\"2\",\"key\":\"b\"}]}]";
List<ActionsObject> actionsList = JSON.parseArray(actionsStr, ActionsObject.class);
public class PropsObject {
    private String key;
    private String value;
    //...
}

import java.util.List;

public class ActionsObject {
    private String action_name;
    private String current_time;
    private List<PropsObject> props;
    //...
}
2016-11-15 16:37:05
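How many Maps?
The number of maps is usually driven by the total size of the input, that is, the total number of blocks of the input files.
The right level of parallelism for maps is around 10 to 100 maps per node, although it can be set as high as about 300 for map tasks that use very little CPU. Task setup takes some time, so it is best if each map takes at least a minute to execute.
Thus, if you have 10TB of input data and a block size of 128MB (conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 134217728L);), you will need roughly 82,000 maps to complete the job.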
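The arithmetic behind the 82,000 figure can be sketched as follows. This is a rough estimate only: it assumes binary units (1 TB = 2^40 bytes) and treats the input as one contiguous file, whereas in practice splits are computed per file. The class and method names are illustrative, not part of Hadoop.

```java
public class MapCountEstimate {
    // Estimated number of map tasks = ceil(totalBytes / splitBytes).
    static long estimateMaps(long totalBytes, long splitBytes) {
        if (splitBytes <= 0) {
            throw new IllegalArgumentException("splitBytes must be positive");
        }
        return (totalBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long tenTb = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB in bytes (binary units)
        long splitSize = 134217728L;                  // 128 MB, as in the setting above
        System.out.println(estimateMaps(tenTb, splitSize)); // prints 81920
    }
}
```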
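How many Reduces?
The recommended number of reduces is 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum).
With 0.95, all of the reduces can launch immediately and start transferring map outputs as the maps finish. With 1.75, the faster nodes finish their first round of reduces and then start a second round, which gives better load balancing.
Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures.
The scaling factors above are slightly less than whole numbers to reserve a few reduce slots in the framework for speculative tasks and failed tasks.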
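As a worked example of the rule of thumb above, the sketch below computes both suggested reduce counts. The cluster size (20 nodes, 4 reduce slots per node) is a made-up figure for illustration, and the helper name is hypothetical.

```java
public class ReduceCountEstimate {
    // Suggested reduces = factor * (nodes * reduce slots per node), rounded to the nearest integer.
    static int estimateReduces(double factor, int nodes, int reduceSlotsPerNode) {
        return (int) Math.round(factor * (nodes * reduceSlotsPerNode));
    }

    public static void main(String[] args) {
        int nodes = 20;        // hypothetical cluster size
        int slotsPerNode = 4;  // hypothetical value of mapred.tasktracker.reduce.tasks.maximum
        System.out.println(estimateReduces(0.95, nodes, slotsPerNode)); // prints 76
        System.out.println(estimateReduces(1.75, nodes, slotsPerNode)); // prints 140
    }
}
```

With 0.95 the job fits in a single wave of reduces with a few slots to spare; with 1.75 it needs two waves, the second of which goes mostly to the faster nodes.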
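Reducer NONE
It is legal to set the number of reduce tasks to zero if no reduction is needed.
In this case the outputs of the map tasks go directly to the output path set by setOutputPath(Path). The framework does not sort the map outputs before writing them to the FileSystem.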
20161020
1. Garbled text when parsing RCFile data
1.0 Code
Text columnText = new Text();
BytesRefWritable brw = cols.get(3);
columnText.set(brw.getData(), brw.getStart(), brw.getLength());
String columnValue = columnText.toString();
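1.1 Cause
The field being parsed has data type int, so converting its bytes to a String produces garbled text;
1.2 Fix
Cast the field from int to string beforehand, e.g. cast(columnName as string) as columnName.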
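The effect can be reproduced without Hadoop. The sketch below assumes, purely for illustration, that the int is stored as 4 big-endian bytes; the actual on-disk encoding depends on the SerDe used for the table. It decodes the bytes as UTF-8, which is how Text.toString() interprets its buffer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BinaryIntAsTextSketch {
    // Decodes raw bytes as UTF-8 text, the same interpretation Text.toString() applies.
    static String decodeAsText(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        int value = 20161020;
        // Assumed binary layout: 4 big-endian bytes (illustrative only).
        byte[] binary = ByteBuffer.allocate(4).putInt(value).array();
        // The same value stored as text, which is what the cast to string yields.
        byte[] textual = Integer.toString(value).getBytes(StandardCharsets.UTF_8);

        System.out.println(decodeAsText(textual));                   // prints 20161020
        System.out.println(decodeAsText(binary).equals("20161020")); // prints false
    }
}
```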
20161021
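1. DistributedCache
1.1 Adding files to the DistributedCache
Via the command-line option -files: the specified HDFS files are distributed to the working directory of each task, without any processing of the files;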
Applications can specify a comma separated list of paths which would be present in the current working directory of the task using the option -files.
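1.2 Problem
In practice, files under the Hive warehouse directory (e.g. hdfs:///user/hive/warehouse/db_name.db/tb_name/file_name) could not be added to the DistributedCache,
whereas other HDFS files (e.g. hdfs:///dir_name1/dir_name2/file_name) could be added to the DistributedCache.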
DISTRIBUTED_CACHE_FILE=hdfs:///dirname/filename
hadoop jar ${SHELL_DIR}/${JAR_NAME}.jar ${MAIN_CLASS} \
    -files ${DISTRIBUTED_CACHE_FILE}
2016-10-20 16:52:09
hadoop -libjars
The -libjars option allows applications to add jars to the classpaths of the maps and reduces.
HIVE_LIB=${HIVE_HOME}/lib
HCATALOG_LIB=${HIVE_HOME}/hcatalog/share/hcatalog
export HADOOP_CLASSPATH=${HIVE_LIB}/*:${HCATALOG_LIB}/*:${HADOOP_CLASSPATH}
LIB_JARS=""
for j in `ls -1 ${HIVE_LIB}/*.jar`
do
    LIB_JARS=${LIB_JARS},$j
done
for j in `ls -1 ${HCATALOG_LIB}/*.jar`
do
    LIB_JARS=${LIB_JARS},$j
done
LIB_JARS=${LIB_JARS:1}
hadoop jar ${SHELL_DIR}/${JAR_NAME}.jar ${MAIN_CLASS} \
    -libjars ${LIB_JARS}
2016-10-20 17:23:40
MapReduce Tutorial
Input and Output types of a MapReduce job:
(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
The key and value classes have to be serializable by the framework and hence need to implement the Writable interface.
Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
WritableComparables can be compared to each other, typically via Comparators.
All intermediate values associated with a given output key are subsequently grouped by the framework, and passed to the Reducer(s) to determine the final output.
Users can control the grouping by specifying a Comparator via Job.setGroupingComparatorClass(Class).
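The grouping step can be illustrated with a small in-memory simulation that uses no Hadoop classes. It is only a sketch of the idea: intermediate <key, value> pairs are grouped by key in sorted key order, and a word-count style "reduce" sums each group. The class and method names are made up for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

public class GroupByKeySketch {
    // Groups intermediate pairs by key (sorted), then sums the values of each group.
    static SortedMap<String, Integer> groupAndSum(List<Map.Entry<String, Integer>> mapOutput) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        SortedMap<String, Integer> reduced = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            int sum = 0;
            for (int v : group.getValue()) {
                sum += v;
            }
            reduced.put(group.getKey(), sum);
        }
        return reduced;
    }

    public static void main(String[] args) {
        // Simulated map output: one <word, 1> pair per input word.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String word : "b a b c a b".split(" ")) {
            mapOutput.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        System.out.println(groupAndSum(mapOutput)); // prints {a=2, b=3, c=1}
    }
}
```

A custom grouping comparator changes which keys count as "the same" at this step; here the natural String ordering plays that role.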
The Mapper outputs are sorted and then partitioned per Reducer. The total number of partitions is the same as the number of reduce tasks for the job.
Users can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
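For reference, the default hash-based partitioning boils down to a single expression, shown here as a standalone sketch without the Hadoop Partitioner base class. The class and method names are illustrative; a custom Partitioner would replace this logic with its own key-to-partition rule.

```java
public class HashPartitionSketch {
    // Masks off the sign bit so the result is non-negative, then takes the
    // remainder modulo the number of reduce tasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        for (String key : new String[] {"a", "b", "a"}) {
            // Equal keys always map to the same partition, hence the same Reducer.
            System.out.println(key + " -> " + partitionFor(key, numReduceTasks));
        }
    }
}
```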
The MapReduce framework relies on the InputFormat of the job to:
- Validate the input-specification of the job.
- Split-up the input file(s) into logical InputSplit instances, each of which is then assigned to an individual Mapper.
- Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.
The MapReduce framework relies on the OutputFormat of the job to:
- Validate the output-specification of the job; for example, check that the output directory doesn’t already exist.
- Provide the RecordWriter implementation used to write the output files of the job. Output files are stored in a FileSystem.
You can use CombineFileInputFormat when the input consists of many small files.
If a maxSplitSize is specified, then blocks on the same node are combined to form a single split.
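The effect of maxSplitSize can be pictured with a simplified packing sketch. It ignores node and rack locality entirely and is not the algorithm CombineFileInputFormat actually uses; it only shows how many small inputs collapse into a few splits once a size cap is set. Sizes are in arbitrary units and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class CombineSplitSketch {
    // Greedily packs chunk sizes into splits without exceeding maxSplitSize,
    // except that a single chunk larger than the cap gets a split of its own.
    static List<List<Long>> pack(long[] chunkSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : chunkSizes) {
            if (!current.isEmpty() && currentSize + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(size);
            currentSize += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] sizes = {40, 50, 30, 100, 10};
        System.out.println(pack(sizes, 100)); // prints [[40, 50], [30], [100], [10]]
    }
}
```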
CombineFileRecordReader
A generic RecordReader that can hand out different recordReaders for each chunk in a CombineFileSplit.
A CombineFileSplit can combine data chunks from multiple files.
This class allows using different RecordReaders for processing these data chunks from different files.
GenericWritable
A wrapper for Writable instances.
When two sequence files that have the same Key type but different Value types are mapped out to reduce, multiple Value types are not allowed. In this case, this class can help you wrap instances of different types.
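The wrapping idea can be sketched with plain java.io streams: a one-byte type tag is written before the payload so the reader knows which concrete type to decode. This is a standalone illustration of the tagging approach, not the GenericWritable API itself; the tag values and the class name are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class TaggedValueSketch {
    // Hypothetical tags: 0 = Integer payload, 1 = String payload.
    static byte[] write(Object value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            if (value instanceof Integer) {
                out.writeByte(0);
                out.writeInt((Integer) value);
            } else if (value instanceof String) {
                out.writeByte(1);
                out.writeUTF((String) value);
            } else {
                throw new IllegalArgumentException("unsupported type: " + value.getClass());
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads the tag first, then decodes the payload with the matching type.
    static Object read(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            byte tag = in.readByte();
            switch (tag) {
                case 0: return in.readInt();
                case 1: return in.readUTF();
                default: throw new IOException("unknown type tag: " + tag);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(read(write(42)));      // prints 42
        System.out.println(read(write("begin"))); // prints begin
    }
}
```

Both payloads travel as the same declared value type (here a byte array), which is what lets a single reduce input carry values of different concrete types.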