



  1. HDFS: a distributed file system for storing massive amounts of data




















      3. Secondary NameNode



















        - Job & Task

         job → tasks (map tasks, reduce tasks)

        - JobTracker




        - TaskTracker
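
The job → task relationship above can be illustrated with a small in-process sketch. It assumes one map task per input split and that map output keys are assigned to reduce tasks by hash partitioning. The names `map_task`, `partition`, `reduce_task` and `run_job` are hypothetical, and nothing here calls Hadoop itself.

```python
from collections import defaultdict
import zlib

def map_task(split):
    """One map task: emit (word, 1) for every word in its input split."""
    return [(word, 1) for line in split for word in line.split()]

def partition(key, num_reduce_tasks):
    """Pick the reduce task for a key (a stable hash, assumed for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks

def reduce_task(pairs):
    """One reduce task: sum the counts of every key it received."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def run_job(lines, split_size=2, num_reduce_tasks=2):
    """A job is several map tasks followed by several reduce tasks."""
    splits = [lines[i:i + split_size] for i in range(0, len(lines), split_size)]
    buckets = [[] for _ in range(num_reduce_tasks)]
    for split in splits:                      # each split is handled by one map task
        for key, value in map_task(split):
            buckets[partition(key, num_reduce_tasks)].append((key, value))
    result = {}
    for bucket in buckets:                    # each bucket is handled by one reduce task
        result.update(reduce_task(bucket))
    return result

counts = run_job(["a b a", "b c", "a"])
print(counts)
```

Because every occurrence of a key lands in the same bucket, each reduce task can total its keys without talking to the others.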

















    e.g. Hive, HBase



    Common HDFS shell commands:

      Linux-like: ls, cat, mkdir, rm, chmod, chown
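
As a rough illustration of how these commands are invoked, the sketch below builds `hdfs dfs -<command>` argument lists in Python. The helper `hdfs_dfs` is hypothetical, the paths and user name are placeholders, and the commands are only printed here; running them needs a machine with Hadoop installed.

```python
def hdfs_dfs(action, *args):
    """Build the argv list for `hdfs dfs -<action> <args...>`."""
    return ["hdfs", "dfs", "-" + action] + list(args)

commands = [
    hdfs_dfs("mkdir", "-p", "/test"),            # create a directory
    hdfs_dfs("ls", "/test"),                     # list its contents
    hdfs_dfs("cat", "/test/mk.txt"),             # print a file
    hdfs_dfs("chmod", "644", "/test/mk.txt"),    # change permissions
    hdfs_dfs("chown", "zzf", "/test/mk.txt"),    # change owner (placeholder user)
    hdfs_dfs("rm", "/test/mk.txt"),              # delete the file
]

for argv in commands:
    print(" ".join(argv))
    # On a machine with Hadoop installed, this would execute the command:
    # subprocess.run(argv, check=True)
```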








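# hdfs_map.py
import sys

def read_input(file):
    # Yield the list of whitespace-separated words on each input line.
    for line in file:
        yield line.split()

def main():
    data = read_input(sys.stdin)

    for words in data:
        for word in words:
            # Emit "word<TAB>1"; the reducer splits each record on the tab.
            print('{}\t{}'.format(word, 1))

if __name__ == '__main__':
    main()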

# hdfs_reduce.py

import sys
from operator import itemgetter
from itertools import groupby

def read_mapper_output(file, separator='\t'):
    # Split each "word<TAB>count" record into a [word, count] pair.
    for line in file:
        yield line.rstrip().split(separator, 1)

def main():
    data = read_mapper_output(sys.stdin)

    # Hadoop sorts the mapper output by key, so equal words arrive consecutively.
    for current_word, group in groupby(data, itemgetter(0)):
        total_count = sum(int(count) for _, count in group)

        print('{} {}'.format(current_word, total_count))

if __name__ == '__main__':
    main()


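Before submitting the job, the mapper and reducer logic can be checked locally. The sketch below restates both as plain functions and uses `sorted()` as a stand-in for Hadoop's sort/shuffle phase. The function names and the sample text are assumptions made for illustration, not part of the scripts above.

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Emit one 'word<TAB>1' record per word, like the mapper script."""
    for line in lines:
        for word in line.split():
            yield "{}\t{}".format(word, 1)

def reducer(records, separator="\t"):
    """Sum the counts of consecutive records that share the same word."""
    pairs = (record.rstrip().split(separator, 1) for record in records)
    for word, group in groupby(pairs, itemgetter(0)):
        yield word, sum(int(count) for _, count in group)

text = ["the cave the team", "the cave"]
shuffled = sorted(mapper(text))        # stands in for Hadoop's sort/shuffle phase
result = list(reducer(shuffled))
print(result)
```

The sort step matters: `groupby` only merges adjacent records, so without it the same word could be reported more than once.
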

hadoop jar /opt/hadoop-2.9.1/share/hadoop/tools/lib/hadoop-streaming-2.9.1.jar -files '/home/zzf/Git/Data_analysis/Hadoop/hdfs_map.py,/home/zzf/Git/Data_analysis/Hadoop/hdfs_reduce.py' -input /test/mk.txt -output /output/wordcount -mapper 'python3 hdfs_map.py' -reducer 'python3 hdfs_reduce.py'


➜  Documents hadoop jar /opt/hadoop-2.9.1/share/hadoop/tools/lib/hadoop-streaming-2.9.1.jar -files '/home/zzf/Git/Data_analysis/Hadoop/hdfs_map.py,/home/zzf/Git/Data_analysis/Hadoop/hdfs_reduce.py' -input /test/mk.txt -output /output/wordcount -mapper 'python3 hdfs_map.py' -reducer 'python3 hdfs_reduce.py'
# Result
18/06/26 16:22:45 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
18/06/26 16:22:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
18/06/26 16:22:45 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
18/06/26 16:22:46 INFO mapred.FileInputFormat: Total input files to process : 1
18/06/26 16:22:46 INFO mapreduce.JobSubmitter: number of splits:1
18/06/26 16:22:46 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local49685846_0001
18/06/26 16:22:46 INFO mapred.LocalDistributedCacheManager: Creating symlink: /home/zzf/hadoop_tmp/mapred/local/1530001366609/hdfs_map.py <- /home/zzf/Documents/hdfs_map.py
18/06/26 16:22:46 INFO mapred.LocalDistributedCacheManager: Localized file:/home/zzf/Git/Data_analysis/Hadoop/hdfs_map.py as file:/home/zzf/hadoop_tmp/mapred/local/1530001366609/hdfs_map.py
18/06/26 16:22:47 INFO mapred.LocalDistributedCacheManager: Creating symlink: /home/zzf/hadoop_tmp/mapred/local/1530001366610/hdfs_reduce.py <- /home/zzf/Documents/hdfs_reduce.py
18/06/26 16:22:47 INFO mapred.LocalDistributedCacheManager: Localized file:/home/zzf/Git/Data_analysis/Hadoop/hdfs_reduce.py as file:/home/zzf/hadoop_tmp/mapred/local/1530001366610/hdfs_reduce.py
18/06/26 16:22:47 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
18/06/26 16:22:47 INFO mapred.LocalJobRunner: OutputCommitter set in config null
18/06/26 16:22:47 INFO mapreduce.Job: Running job: job_local49685846_0001
18/06/26 16:22:47 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapred.FileOutputCommitter
18/06/26 16:22:47 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/06/26 16:22:47 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Waiting for map tasks
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Starting task: attempt_local49685846_0001_m_000000_0
18/06/26 16:22:47 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/06/26 16:22:47 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
18/06/26 16:22:47 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
18/06/26 16:22:47 INFO mapred.MapTask: Processing split: hdfs://localhost:9000/test/mk.txt:0+2267
18/06/26 16:22:47 INFO mapred.MapTask: numReduceTasks: 1
18/06/26 16:22:47 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
18/06/26 16:22:47 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
18/06/26 16:22:47 INFO mapred.MapTask: soft limit at 83886080
18/06/26 16:22:47 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
18/06/26 16:22:47 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
18/06/26 16:22:47 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
18/06/26 16:22:47 INFO streaming.PipeMapRed: PipeMapRed exec [/usr/bin/python3, hdfs_map.py]
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.work.output.dir is deprecated. Instead, use mapreduce.task.output.dir
18/06/26 16:22:47 INFO Configuration.deprecation: map.input.start is deprecated. Instead, use mapreduce.map.input.start
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.local.dir is deprecated. Instead, use mapreduce.cluster.local.dir
18/06/26 16:22:47 INFO Configuration.deprecation: map.input.file is deprecated. Instead, use mapreduce.map.input.file
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
18/06/26 16:22:47 INFO Configuration.deprecation: map.input.length is deprecated. Instead, use mapreduce.map.input.length
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
18/06/26 16:22:47 INFO Configuration.deprecation: user.name is deprecated. Instead, use mapreduce.job.user.name
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: Records R/W=34/1
18/06/26 16:22:47 INFO streaming.PipeMapRed: MRErrorThread done
18/06/26 16:22:47 INFO streaming.PipeMapRed: mapRedFinished
18/06/26 16:22:47 INFO mapred.LocalJobRunner:
18/06/26 16:22:47 INFO mapred.MapTask: Starting flush of map output
18/06/26 16:22:47 INFO mapred.MapTask: Spilling map output
18/06/26 16:22:47 INFO mapred.MapTask: bufstart = 0; bufend = 3013; bufvoid = 104857600
18/06/26 16:22:47 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 26212876(104851504); length = 1521/6553600
18/06/26 16:22:47 INFO mapred.MapTask: Finished spill 0
18/06/26 16:22:47 INFO mapred.Task: Task:attempt_local49685846_0001_m_000000_0 is done. And is in the process of committing
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Records R/W=34/1
18/06/26 16:22:47 INFO mapred.Task: Task 'attempt_local49685846_0001_m_000000_0' done.
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local49685846_0001_m_000000_0
18/06/26 16:22:47 INFO mapred.LocalJobRunner: map task executor complete.
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Waiting for reduce tasks
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Starting task: attempt_local49685846_0001_r_000000_0
18/06/26 16:22:47 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
18/06/26 16:22:47 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
18/06/26 16:22:47 INFO mapred.Task:  Using ResourceCalculatorProcessTree : [ ]
18/06/26 16:22:47 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@257adccd
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=334338464, maxSingleShuffleLimit=83584616, mergeThreshold=220663392, ioSortFactor=10, memToMemMergeOutputsThreshold=10
18/06/26 16:22:47 INFO reduce.EventFetcher: attempt_local49685846_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
18/06/26 16:22:47 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local49685846_0001_m_000000_0 decomp: 3777 len: 3781 to MEMORY
18/06/26 16:22:47 INFO reduce.InMemoryMapOutput: Read 3777 bytes from map-output for attempt_local49685846_0001_m_000000_0
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 3777, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->3777
18/06/26 16:22:47 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
18/06/26 16:22:47 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
18/06/26 16:22:47 INFO mapred.Merger: Merging 1 sorted segments
18/06/26 16:22:47 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3769 bytes
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: Merged 1 segments, 3777 bytes to disk to satisfy reduce memory limit
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: Merging 1 files, 3781 bytes from disk
18/06/26 16:22:47 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
18/06/26 16:22:47 INFO mapred.Merger: Merging 1 sorted segments
18/06/26 16:22:47 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 3769 bytes
18/06/26 16:22:47 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/06/26 16:22:47 INFO streaming.PipeMapRed: PipeMapRed exec [/usr/bin/python3, hdfs_reduce.py]
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
18/06/26 16:22:47 INFO Configuration.deprecation: mapred.map.tasks is deprecated. Instead, use mapreduce.job.maps
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=1/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=10/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: R/W/S=100/0/0 in:NA [rec/s] out:NA [rec/s]
18/06/26 16:22:47 INFO streaming.PipeMapRed: Records R/W=381/1
18/06/26 16:22:47 INFO streaming.PipeMapRed: MRErrorThread done
18/06/26 16:22:47 INFO streaming.PipeMapRed: mapRedFinished
18/06/26 16:22:47 INFO mapred.Task: Task:attempt_local49685846_0001_r_000000_0 is done. And is in the process of committing
18/06/26 16:22:47 INFO mapred.LocalJobRunner: 1 / 1 copied.
18/06/26 16:22:47 INFO mapred.Task: Task attempt_local49685846_0001_r_000000_0 is allowed to commit now
18/06/26 16:22:47 INFO output.FileOutputCommitter: Saved output of task 'attempt_local49685846_0001_r_000000_0' to hdfs://localhost:9000/output/wordcount/_temporary/0/task_local49685846_0001_r_000000
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Records R/W=381/1 > reduce
18/06/26 16:22:47 INFO mapred.Task: Task 'attempt_local49685846_0001_r_000000_0' done.
18/06/26 16:22:47 INFO mapred.LocalJobRunner: Finishing task: attempt_local49685846_0001_r_000000_0
18/06/26 16:22:47 INFO mapred.LocalJobRunner: reduce task executor complete.
18/06/26 16:22:48 INFO mapreduce.Job: Job job_local49685846_0001 running in uber mode : false
18/06/26 16:22:48 INFO mapreduce.Job:  map 100% reduce 100%
18/06/26 16:22:48 INFO mapreduce.Job: Job job_local49685846_0001 completed successfully
18/06/26 16:22:48 INFO mapreduce.Job: Counters: 35
    File System Counters
        FILE: Number of bytes read=279474
        FILE: Number of bytes written=1220325
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=4534
        HDFS: Number of bytes written=2287
        HDFS: Number of read operations=13
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=4
    Map-Reduce Framework
        Map input records=34
        Map output records=381
        Map output bytes=3013
        Map output materialized bytes=3781
        Input split bytes=85
        Combine input records=0
        Combine output records=0
        Reduce input groups=236
        Reduce shuffle bytes=3781
        Reduce input records=381
        Reduce output records=236
        Spilled Records=762
        Shuffled Maps =1
        Failed Shuffles=0
        Merged Map outputs=1
        GC time elapsed (ms)=0
        Total committed heap usage (bytes)=536870912
    Shuffle Errors
        BAD_ID=0
        CONNECTION=0
        IO_ERROR=0
        WRONG_LENGTH=0
        WRONG_MAP=0
        WRONG_REDUCE=0
    File Input Format Counters
        Bytes Read=2267
    File Output Format Counters
        Bytes Written=2287
18/06/26 16:22:48 INFO streaming.StreamJob: Output directory: /output/wordcount


➜  Documents hdfs dfs -cat /output/wordcount/part-00000
# Result
"Even 1
"My 1
"We 1
(16ft) 1
11 1
16, 1
17-member 1
25-year-old 1
5m 1
AFP. 1
BBC's 1
Bangkok 1
But 1
Chiang 1
Constant 1
Deputy 1
Desperate 1
Head, 1
How 1
I'm 1
Jonathan 1
June 1
Luang 2
Minister 1
Myanmar, 1
Nang 2
Navy 2
Non 2
October. 1
PM 1
Post, 1
Prawit 1
Prime 1
Rai 1
Rescue 2
Royal 1
Saturday 2
Saturday. 1
Thai 1
Thailand's 2
Tham 2
The 6
They 2
Tuesday 1
Tuesday. 2
Wongsuwon 1
a 8
able 1
according 2
after 2
afternoon. 1
aged 1
alive, 1
alive," 1
all 1
along 1
and 6
anything 1
are 5
areas 1
as 1
at 2
attraction 1
authorities 1
be 1
been 2
began 1
believed 1
between 1
bicycles 1
border 1
boys 1
boys, 1
briefly 1
bring 1
but 1
by 1
camping 1
can 1
case 1
cave 9
cave, 3
cave. 1
cave.According 1
ceremony 1
chamber 1
child, 1
coach 3
completely 1
complex, 1
correspondent. 1
cross 1
crying 1
day. 1
deputy 1
dive 1
divers 2
down. 1
drink."The 1
drones, 1
during 1
early 1
eat, 1
efforts 1
efforts, 2
enter 1
entered 2
enters 1
equipment 1
extensive 1
flood 1
floods. 1
footballers 1
footprints 1
for 4
found 1
fresh 1
from 2
gear, 1
get 1
group 1
group's 1
had 2
halted 2
hampered 1
hampering 1
has 1
have 6
he 1
here 1
holding 1
hopes 1
if 1
in 3
inaccessible 1
include 1
inside 3
into 1
is 4
it 1
kilometres 1
levels 1
lies 1
local 1
making 1
many 1
may 1
missing. 1
must 1
navy 1
near 1
network. 1
night 1
not 1
now," 1
of 4
officials. 1
on 5
one 1
optimistic 2
our 1
out 2
outside 2
parent 1
pools 1
poor 1
prayer 1
preparing 1
province 1
pumping 1
rainfall 1
rainy 1
raising 1
re-enter 1
relatives 1
reported 1
reportedly 1
rescue 1
resumed 1
return. 1
rising 2
runs 2
safe 1
safety. 1
said 3
said, 1
says 1
scene, 1
scuba 1
search 3
search. 1
searching 1
season, 1
seen 1
sent 1
should 1
small 1
sports 1
started 1
still 2
stream 2
submerged, 1
team 3
teams 1
the 23
their 5
them 1
these 1
they 5
third 1
though 1
thought 1
through 1
to 17
tourist 1
train 1
trapped 1
trapped? 1
try 1
underground. 1
underwater 1
unit 1
up 1
use 1
visibility 1
visitors 1
was 2
water 2
waters 2
were 4
which 5
who 1
with 2
workers 1
you 1
young 1


8. Question 1: How can small files be stored with Hadoop?

Sequence File / Map File
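
Both formats address the small-file problem by packing many small files into one larger file of key/value records, typically with the file name as key and the file contents as value, so the NameNode tracks one file instead of thousands. The sketch below imitates that idea with plain length-prefixed records in Python. It is not the real SequenceFile binary format, and `pack_files` and `unpack_files` are hypothetical names.

```python
import io
import struct

def pack_files(files, out):
    """Append each (name, content) pair to `out` as a length-prefixed record."""
    for name, content in files.items():
        key = name.encode("utf-8")
        # Header: key length and value length as two big-endian unsigned ints.
        out.write(struct.pack(">II", len(key), len(content)))
        out.write(key)
        out.write(content)

def unpack_files(src):
    """Read back the (name, content) records written by pack_files."""
    files = {}
    while True:
        header = src.read(8)
        if len(header) < 8:        # end of the container
            break
        key_len, value_len = struct.unpack(">II", header)
        name = src.read(key_len).decode("utf-8")
        files[name] = src.read(value_len)
    return files

small_files = {"a.txt": b"hello", "b.txt": b"hadoop", "c.txt": b""}
container = io.BytesIO()           # stands in for one large file on HDFS
pack_files(small_files, container)
container.seek(0)
restored = unpack_files(container)
print(restored == small_files)
```

A MapFile goes one step further than this sketch: it keeps the records sorted by key and adds an index, so a single small file can be looked up without scanning the whole container.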




posted @ 2018-06-26 17:09  理想几岁