My machine runs Ubuntu 10.04, and the Hadoop version installed is 0.20.2 (the February 2010 release). The first step is to install an SSH server:
sudo apt-get install openssh-server
Ubuntu will then automatically download and install the OpenSSH server, resolving all dependencies along the way. Once this is done, find another computer, open an SSH client (PuTTY is strongly recommended), and enter your server's IP address. If everything is working, the connection should come up after a moment, and you should be able to log in with an existing username and password.
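From another Linux machine the same test can be run from the command line; a minimal sketch, where username and 192.168.1.100 are placeholders for your own account and your server's IP address:
ssh username@192.168.1.100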
In fact, if you have no special requirements, the OpenSSH server is now fully installed. A little further tuning, however, can make OpenSSH logins faster and more secure. All of this is done by editing OpenSSH's configuration file, sshd_config.
First, you may have noticed during the remote-login test above that there is a long wait after entering the username before the password prompt appears. This is because sshd performs a reverse DNS lookup on the client; disabling this feature speeds up login considerably. Open the sshd_config file, find the GSSAPI options section, and comment out the following two lines:
#GSSAPIAuthentication yes
#GSSAPIDelegateCredentials no
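On Ubuntu the file lives at /etc/ssh/sshd_config, so the edit can be made with any editor, for example:
sudo nano /etc/ssh/sshd_config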
Then restart the ssh service:
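On Ubuntu 10.04 this is done through the service's init script:
sudo /etc/init.d/ssh restart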
Set the temporary file directory parameter hadoop.tmp.dir in conf/core-site.xml. By default, the master stores its metadata under this directory, and the slaves store all uploaded file data there. I chose /home/hadoop/hadoop_tmp as the data directory.
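A minimal sketch of the corresponding conf/core-site.xml, using the directory chosen above; note that the fs.default.name entry is the conventional single-node NameNode address and is an assumption here, not a value taken from this setup:
<configuration>
  <!-- master metadata and slave file data are stored under this directory -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hadoop/hadoop_tmp</value>
  </property>
  <!-- assumption: conventional single-node NameNode address -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>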
Configure the MapReduce settings: copy mapred-default.xml from the /home/hadoopor/hadoop-0.20.2/src/mapred directory into the conf directory and rename it to mapred-site.xml. The commands are exactly the same as in the core-site.xml step.
Modify the following property configuration:
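For a single-node 0.20.2 setup, the property normally set in conf/mapred-site.xml is mapred.job.tracker; a minimal sketch, where localhost:9001 is the conventional single-node value rather than one recovered from this setup:
<configuration>
  <!-- assumption: conventional single-node JobTracker address -->
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
</configuration>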
(4) Edit the masters and slaves files: write the IP addresses of the master machine and the slave machines into these files. For a single machine, just write localhost in both.
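For the single-machine case described above, both files therefore reduce to a single line:
conf/masters:
localhost
conf/slaves:
localhost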
Running a MapReduce job
We will now run your first Hadoop MapReduce job. We will use the WordCount example job, which reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab. More information about what happens behind the scenes is available on the Hadoop Wiki.
Download example input data
We will use three ebooks from Project Gutenberg for this example:
- The Outline of Science, Vol. 1 (of 4) by J. Arthur Thomson
- The Notebooks of Leonardo Da Vinci
- Ulysses by James Joyce
Download each ebook as a plain-text file in us-ascii encoding and store the uncompressed files in a temporary directory of your choice, for example /tmp/gutenberg.
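A sketch of this step; <EBOOK_URL> is a placeholder for each book's plain-text download link on gutenberg.org, repeated once per book:
hadoop@ubuntu:~$ mkdir -p /tmp/gutenberg
hadoop@ubuntu:~$ cd /tmp/gutenberg
hadoop@ubuntu:/tmp/gutenberg$ wget <EBOOK_URL>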
hadoop@ubuntu:~$ ls -l /tmp/gutenberg/
total 3592
-rw-r--r-- 1 hadoop hadoop 674425 2007-01-22 12:56 20417-8.txt
-rw-r--r-- 1 hadoop hadoop 1423808 2006-08-03 16:36 7ldvc10.txt
-rw-r--r-- 1 hadoop hadoop 1561677 2004-11-26 09:48 ulyss12.txt
hadoop@ubuntu:~$
Restart the Hadoop cluster
Restart your Hadoop cluster if it's not running already.
hadoop@ubuntu:~$ <HADOOP_INSTALL>/bin/start-all.sh
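To check that the daemons came up, one option is the jps tool that ships with the JDK; on a single-node 0.20.2 cluster you would expect it to list NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker:
hadoop@ubuntu:~$ jps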
Copy local example data to HDFS
Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop's HDFS.
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -copyFromLocal /tmp/gutenberg gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2010-05-08 17:40 /user/hadoop/gutenberg
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg
Found 3 items
-rw-r--r-- 1 hadoop supergroup 674762 2010-05-08 17:40 /user/hadoop/gutenberg/20417.txt
-rw-r--r-- 1 hadoop supergroup 1573044 2010-05-08 17:40 /user/hadoop/gutenberg/4300.txt
-rw-r--r-- 1 hadoop supergroup 1391706 2010-05-08 17:40 /user/hadoop/gutenberg/7ldvc10.txt
hadoop@ubuntu:/usr/local/hadoop$
Run the MapReduce job
Now, we actually run the WordCount example job.
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount gutenberg gutenberg-output
This command will read all the files in the HDFS directory gutenberg, process them, and store the result in the HDFS directory gutenberg-output.
Example output of the previous command in the console:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount gutenberg gutenberg-output
10/05/08 17:43:00 INFO input.FileInputFormat: Total input paths to process : 3
10/05/08 17:43:01 INFO mapred.JobClient: Running job: job_201005081732_0001
10/05/08 17:43:02 INFO mapred.JobClient: map 0% reduce 0%
10/05/08 17:43:14 INFO mapred.JobClient: map 66% reduce 0%
10/05/08 17:43:17 INFO mapred.JobClient: map 100% reduce 0%
10/05/08 17:43:26 INFO mapred.JobClient: map 100% reduce 100%
10/05/08 17:43:28 INFO mapred.JobClient: Job complete: job_201005081732_0001
10/05/08 17:43:28 INFO mapred.JobClient: Counters: 17
10/05/08 17:43:28 INFO mapred.JobClient: Job Counters
10/05/08 17:43:28 INFO mapred.JobClient: Launched reduce tasks=1
10/05/08 17:43:28 INFO mapred.JobClient: Launched map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient: Data-local map tasks=3
10/05/08 17:43:28 INFO mapred.JobClient: FileSystemCounters
10/05/08 17:43:28 INFO mapred.JobClient: FILE_BYTES_READ=2214026
10/05/08 17:43:28 INFO mapred.JobClient: HDFS_BYTES_READ=3639512
10/05/08 17:43:28 INFO mapred.JobClient: FILE_BYTES_WRITTEN=3687918
10/05/08 17:43:28 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=880330
10/05/08 17:43:28 INFO mapred.JobClient: Map-Reduce Framework
10/05/08 17:43:28 INFO mapred.JobClient: Reduce input groups=82290
10/05/08 17:43:28 INFO mapred.JobClient: Combine output records=102286
10/05/08 17:43:28 INFO mapred.JobClient: Map input records=77934
10/05/08 17:43:28 INFO mapred.JobClient: Reduce shuffle bytes=1473796
10/05/08 17:43:28 INFO mapred.JobClient: Reduce output records=82290
10/05/08 17:43:28 INFO mapred.JobClient: Spilled Records=255874
10/05/08 17:43:28 INFO mapred.JobClient: Map output bytes=6076267
10/05/08 17:43:28 INFO mapred.JobClient: Combine input records=629187
10/05/08 17:43:28 INFO mapred.JobClient: Map output records=629187
10/05/08 17:43:28 INFO mapred.JobClient: Reduce input records=102286
Check if the result is successfully stored in HDFS directory gutenberg-output:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2010-05-08 17:40 /user/hadoop/gutenberg
drwxr-xr-x - hadoop supergroup 0 2010-05-08 17:43 /user/hadoop/gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls gutenberg-output
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2010-05-08 17:43 /user/hadoop/gutenberg-output/_logs
-rw-r--r-- 1 hadoop supergroup 880330 2010-05-08 17:43 /user/hadoop/gutenberg-output/part-r-00000
hadoop@ubuntu:/usr/local/hadoop$
If you want to modify some Hadoop settings on the fly like increasing the number of Reduce tasks, you can use the "-D" option:
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop jar hadoop-0.20.2-examples.jar wordcount -D mapred.reduce.tasks=16 gutenberg gutenberg-output
An important note about mapred.map.tasks: Hadoop does not honor mapred.map.tasks beyond treating it as a hint, whereas it accepts the user-specified mapred.reduce.tasks and does not manipulate it. You cannot force mapred.map.tasks, but you can specify mapred.reduce.tasks.
Retrieve the job result from HDFS
To inspect the file, you can copy it from HDFS to the local file system. Alternatively, you can use the command
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -cat gutenberg-output/part-r-00000
to read the file directly from HDFS without copying it to the local file system. In this tutorial, we will copy the results to the local file system though.
hadoop@ubuntu:/usr/local/hadoop$ mkdir /tmp/gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -getmerge gutenberg-output /tmp/gutenberg-output
hadoop@ubuntu:/usr/local/hadoop$ head /tmp/gutenberg-output/gutenberg-output
"(Lo)cra" 1
"1490 1
"1498," 1
"35" 1
"40," 1
"A 2
"AS-IS". 1
"A_ 1
"Absoluti 1
"Alack! 1
hadoop@ubuntu:/usr/local/hadoop$
Note that the quote signs (") enclosing some of the words in the head output above have not been inserted by Hadoop. They are the result of the word tokenizer used in the WordCount example, and in this case they matched the beginning of a quote in the ebook texts. Just inspect the part-r-00000 file further to see it for yourself.
References:
http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
http://www.cppblog.com/thronds/archive/2008/11/17/67153.html
http://hi.baidu.com/pwcrab/blog/item/3cd63086fcd3733067096e95.html