第一个nuthc采集Nutch-site.xml文件配置（记录用）

<configuration>
<property>
<name>http.robots.agents</name>
<value>My Nutch Spider</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>

<property>
<name>http.agent.name</name>
<value>My Nutch Spider </value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

and set their values appropriately.

</description>
</property>

<property>
<name>http.content.limit</name>
<value>655360</value>
<description>The length limit for downloaded content using the http://
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>

<property>
<name>http.redirect.max</name>
<value>2</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, fetcher won't immediately
follow redirected URLs, instead it will record them for later fetching.
</description>
</property>

<property>
<name>db.max.outlinks.per.page</name>
<value>-1</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>

</configuration>

posted @ 2015-03-09 16:20 一颗大柠檬阅读(138) 评论(0) 编辑收藏举报

会员力量，点亮园子希望

刷新页面返回顶部

一颗大柠檬

好记性，不如烂随笔

第一个nuthc采集Nutch-site.xml文件配置（记录用）

公告