

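1. Configure the Cygwin environment variable (the Cygwin bin directory must be on PATH). This step is important: if it is skipped, you will later hit the error "Failed to get the current user's information" or 'Login failed: Cannot run program "bash"'.
2. Create a new project with any name, select "Create project from existing source", and point it at your nutch-1.0 directory.
3. Click Next, switch to the "Libraries" tab, click the "Add Class Folder..." button, and select "conf" from the list. Note that many posts describe this step differently.
4. Switch to the "Order and Export" tab, find "conf", and move it to the top.
5. Switch to the "Source" tab and set the output folder to Nutch/bin/tmp_build (adjust this to your own setup), then click Finish to complete the import.
7. Download the MP3 and RTF jar files from http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-mp3/lib/ and http://nutch.cvs.sourceforge.net/nutch/nutch/src/plugin/parse-rtf/lib/, and copy them into src/plugin/parse-mp3/lib and src/plugin/parse-rtf/lib/ respectively.
8. Refresh the project a few times, right-click the project folder, and choose Build Path -> Configure Build Path... In the dialog that opens, switch to Libraries, click Add Jars..., and add the jar files you just downloaded to the project.
9. At this point the project will normally still show two errors. In the official Nutch 1.0 release these two problems were left unfixed because of licensing issues. What follows is the most critical part.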
Add import org.apache.nutch.parse.ParseResult;
Change public Parse getParse(Content content) {
to public ParseResult getParse(Content content) {
Change return new ParseStatus(ParseStatus.FAILED,
to return new ParseStatus(ParseStatus.FAILED,
              e.toString()).getEmptyParseResult(content.getUrl(), getConf());
Change return new ParseImpl(text,
                         new ParseData(ParseStatus.STATUS_SUCCESS,
                                       OutlinkExtractor.getOutlinks(text, this.conf),
to return ParseResult.createParseResult(content.getUrl(),
                             new ParseImpl(text,
                                     new ParseData(ParseStatus.STATUS_SUCCESS,
                                             OutlinkExtractor.getOutlinks(text, this.conf),

Change parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content);
to parse = new ParseUtil(conf).parseByExtensionId("parse-rtf", content).get(urlString);
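10. Choose Run -> Run As -> Java Application and, in the Select Java Application dialog, pick Crawl - org.apache.nutch.crawl. The first run does nothing useful because no arguments have been set. Next, choose Run -> Run Configurations... Under Java Application on the left there is now a Crawl entry; select it,
switch to the Arguments tab, and enter the arguments in Program arguments: urls -dir crawl -depth 3 -topN 50 (adjust these to your own setup; urls points to the links to crawl). Under VM arguments enter -Dhadoop.log.dir=logs -Dhadoop.log.file=hadoop.log
Click Run. If nothing unexpected happens, the crawl runs without trouble and you can watch the pages being fetched. I did run into one problem here, with the Java heap size. Check logs/hadoop.log: if it contains a line like java.lang.OutOfMemoryError: Java heap space, the heap is usually the cause. Go to Eclipse -> Window -> Preferences -> Java -> Installed JREs -> edit -> Default VM arguments
and set it to -Xms5m -Xmx250m, where Xms is the minimum (initial) heap size and Xmx is the maximum heap size.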
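To confirm that the VM arguments from step 10 actually reached the JVM, a small check like the one below can be run with the same Run Configuration. This is only a sketch: the class name JvmSettingsCheck is made up for illustration, and it uses nothing but standard JDK calls.

```java
public class JvmSettingsCheck {

    /** Maximum heap the JVM will attempt to use, in megabytes (reflects -Xmx). */
    public static long maxHeapMb() {
        return Runtime.getRuntime().maxMemory() / (1024L * 1024L);
    }

    /** Value of a -D system property, or a placeholder when it was not passed. */
    public static String propertyOrUnset(String name) {
        return System.getProperty(name, "(not set)");
    }

    public static void main(String[] args) {
        System.out.println("max heap (MB)   = " + maxHeapMb());
        System.out.println("hadoop.log.dir  = " + propertyOrUnset("hadoop.log.dir"));
        System.out.println("hadoop.log.file = " + propertyOrUnset("hadoop.log.file"));
    }
}
```

If the reported heap is well below the -Xmx value, or either property prints "(not set)", the arguments were probably entered in a different Run Configuration or in the Program arguments box instead of VM arguments.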
Eclipse: Cannot create project content in workspace 
The Nutch source code must be outside the workspace folder. My first attempt was to download the code with Eclipse (svn) into my workspace. When I tried to create the project from existing code, Eclipse would not let me do it from source code inside the workspace. I then used source code outside my workspace and it worked fine.
plugin dir not found 
Make sure you set the plugin.folders property correctly. Instead of a relative path you can also use an absolute one, either in nutch-default.xml or, better, in nutch-site.xml.
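Before editing the configuration, it can help to test the path you intend to put in plugin.folders. The sketch below only checks whether a value is an absolute path to an existing directory; it does not reproduce Nutch's own lookup logic, and the class name PluginFolderCheck is hypothetical.

```java
import java.io.File;

public class PluginFolderCheck {

    /** Returns a short diagnosis for one candidate plugin.folders value. */
    public static String diagnose(String entry) {
        File dir = new File(entry.trim());
        if (!dir.isAbsolute()) {
            // A relative value depends on where and how the program is started.
            return "relative: " + entry;
        }
        if (!dir.isDirectory()) {
            return "missing: " + dir.getPath();
        }
        return "ok: " + dir.getPath();
    }

    public static void main(String[] args) {
        // Pass the candidate value as the first argument; "plugins" is just a fallback.
        String candidate = args.length > 0 ? args[0] : "plugins";
        System.out.println(diagnose(candidate));
    }
}
```

A result starting with "ok" means the value is safe to use as an absolute plugin.folders setting; "relative" or "missing" points at the likely cause of the "plugin dir not found" message.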
No plugins loaded during unit tests in Eclipse 
During unit testing, Eclipse ignored conf/nutch-site.xml in favor of src/test/nutch-site.xml, so you might need to add the plugin directory configuration to that file as well. 
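Which copy of nutch-site.xml wins depends on classpath order. Assuming the configuration files are loaded as classpath resources, the sketch below lists every location that provides a given resource, in the order the class loader reports them. The class name ResourceOrderCheck is made up for illustration; run it inside the project so that it sees the same classpath as the tests.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class ResourceOrderCheck {

    /** Lists every classpath location that provides the named resource, in lookup order. */
    public static List<URL> locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "nutch-site.xml";
        List<URL> hits = locate(name);
        if (hits.isEmpty()) {
            System.out.println(name + " is not on the classpath");
        }
        for (int i = 0; i < hits.size(); i++) {
            // The first match is the one a plain getResource() call would return.
            System.out.println((i == 0 ? "first:    " : "shadowed: ") + hits.get(i));
        }
    }
}
```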
Unit tests work in eclipse but fail when running ant in the command line 
Suppose your unit tests work perfectly in Eclipse, but each and every one fails when running ant test on the command line, including the ones you haven't modified. Check whether you defined the plugin.folders property in hadoop-site.xml. If so, try removing it from that file and adding it directly to nutch-site.xml.
Run ant test again. That should have solved the problem. 
If that didn't solve the problem, are you testing a plugin? If so, did you add the plugin to the list of packages in plugin\build.xml, on the test target? 
• open the class itself, right-click
• refresh the build dir 
Debugging Hadoop classes
• Sometimes it makes sense to also have the Hadoop classes available during debugging. To do that, you can check out the Hadoop sources on your machine and attach them to the hadoop-xxx.jar. Alternatively, you can:
o Remove the hadoopXXX.jar from your classpath libraries
o Check out the Hadoop branch that is used within Nutch
o Configure a Hadoop project within Eclipse, similar to the Nutch project
o Add the Hadoop project as a dependent project of the Nutch project
o You can now also set breakpoints within Hadoop classes, such as InputFormat implementations
Failed to get the current user's information 
On Windows, if the crawler throws an exception complaining that it "Failed to get the current user's information" or 'Login failed: Cannot run program "bash"', you most likely forgot to set the PATH to point to Cygwin. Open a new command line window (All Programs > Accessories > Command Prompt) and type "bash". This should start Cygwin. If it doesn't, type "path" to see your path. The Cygwin bin directory (e.g., C:\cygwin\bin) should appear in it. See the steps for adding it to your PATH at the top of the article under "For Windows Users". After setting the PATH, you will likely need to restart Eclipse so that it picks up the new PATH.
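The same check can be made from inside the JVM, which is useful because Eclipse only sees the PATH it was started with. The sketch below tries to launch bash the way any Java program would; when bash is not reachable, ProcessBuilder fails with an IOException whose message begins with Cannot run program "bash". The class name BashLaunchCheck is made up for illustration.

```java
import java.io.IOException;

public class BashLaunchCheck {

    /** Returns true when the JVM can start the given command through PATH. */
    public static boolean canStart(String... command) {
        try {
            Process process = new ProcessBuilder(command).start();
            process.waitFor(); // the command used below exits immediately
            return true;
        } catch (IOException e) {
            // Typical message: Cannot run program "bash": ...
            System.out.println(e.getMessage());
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        boolean ok = canStart("bash", "-c", "exit 0");
        System.out.println(ok ? "bash is reachable from this JVM"
                              : "bash is NOT reachable; check PATH and restart Eclipse");
    }
}
```

Run it as a Java Application from within Eclipse. If it reports that bash is not reachable while the Command Prompt test succeeds, Eclipse is still using the old PATH and needs a restart.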

