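Reading nutch-default.xml
I have spent a good deal of time over the past couple of days reading nutch-default.xml, which holds Nutch's default configuration. To change any of its options, copy the corresponding entries into nutch-site.xml and edit them there; if nutch-site.xml does not exist, create it. (Note) I now understand most of this file, but not all of it, and quite a few questions remain. I am posting the HTML table generated from the XML file here for easy reference.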
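The copy-and-override step can be sketched in a few lines of Python that build the text of a nutch-site.xml from a dictionary of overrides. This is only an illustration: the function name `build_site_config` and the example values are hypothetical, and the root element name `nutch-conf` is an assumption about this Nutch version, so check the root element of your own nutch-default.xml before relying on it.

```python
import xml.etree.ElementTree as ET


def build_site_config(overrides, root_tag="nutch-conf"):
    """Return XML text with one <property> element per overridden option.

    root_tag is an assumption; use whatever root element your
    nutch-default.xml actually has.
    """
    root = ET.Element(root_tag)
    for name, value in overrides.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Illustrative values only: a custom agent name and a higher token limit.
    site_xml = build_site_config({
        "http.agent.name": "MyCrawler",
        "indexer.max.tokens": 100000,
    })
    print(site_xml)
```

Only the properties you want to change need to appear in the generated file; everything else keeps the value from nutch-default.xml.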
nutch-default.xml
name | value | description |
http.agent.name | NutchCVS | Our HTTP 'User-Agent' request header. |
http.robots.agents | NutchCVS,Nutch,* | The agent strings we'll look for in robots.txt files, comma-separated, in decreasing order of precedence. |
http.robots.403.allow | TRUE | Some servers return HTTP status 403 (Forbidden) if /robots.txt doesn't exist. This should probably mean that we are allowed to crawl the site nonetheless. If this is set to false, then such sites will be treated as forbidden. |
http.agent.description | Nutch | Further description of our bot; this text is used in the User-Agent header. It appears in parentheses after the agent name. |
http.agent.url | http://lucene.apache.org/nutch/bot.html | A URL to advertise in the User-Agent header. This will appear in parentheses after the agent name. |
http.agent.email | nutch-agent@lucene.apache.org | An email address to advertise in the HTTP 'From' request header and User-Agent header. |
http.agent.version | 0.7 | A version string to advertise in the User-Agent header. |
http.timeout | 10000 | The default network timeout, in milliseconds. |
http.max.delays | 3 | The number of times a thread will delay when trying to fetch a page. Each time it finds that a host is busy, it will wait fetcher.server.delay. After http.max.delays attempts, it will give up on the page for now. |
http.content.limit | 65536 | The length limit for downloaded content, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. (65536 bytes is 64 KB.) |
http.proxy.host | | The proxy hostname. If empty, no proxy is used. (That is, a proxy server.) |
http.proxy.port | | The proxy port. |
http.verbose | FALSE | If true, HTTP will log more verbosely. (verbose: detailed, wordy) |
http.redirect.max | 3 | The maximum number of redirects the fetcher will follow when trying to fetch a page. |
file.content.limit | 65536 | The length limit for downloaded content, in bytes. If this value is larger than zero, content longer than it will be truncated; otherwise (zero or negative), no truncation at all. |
file.content.ignored | TRUE | If true, no file content will be saved during fetch. And it is probably what we want to set most of the time, since file:// URLs are meant to be local and we can always use them directly at parsing and indexing stages. Otherwise file contents will be saved. !! NOT IMPLEMENTED YET !! (In most cases, file content is not stored.) |
ftp.username | anonymous | ftp login username. |
ftp.password | anonymous@example.com | ftp login password. |
ftp.content.limit | 65536 | The length limit for downloaded content, in bytes. If this value is larger than zero, content longer than it is truncated; otherwise (zero or negative), no truncation at all. Caution: classical ftp RFCs never define partial transfer and, in fact, some ftp servers out there do not handle client side forced close-down very well. Our implementation tries its best to handle such situations smoothly. |
ftp.timeout | 60000 | Default timeout for ftp client socket, in millisec. Please also see ftp.keep.connection below. (Applies to the ftp client.) |
ftp.server.timeout | 100000 | An estimation of ftp server idle time, in millisec. Typically it is 120000 millisec for many ftp servers out there. Better be conservative here. Together with ftp.timeout, it is used to decide if we need to delete (annihilate) current ftp.client instance and force to start another ftp.client instance anew. This is necessary because a fetcher thread may not be able to obtain next request from queue in time (due to idleness) before our ftp client times out or remote server disconnects. Used only when ftp.keep.connection is true (please see below). (Applies to the ftp server.) |
ftp.keep.connection | FALSE | Whether to keep ftp connection. Useful if crawling same host again and again. When set to true, it avoids connection, login and dir list parser setup for subsequent urls. If it is set to true, however, you must make sure (roughly): (1) ftp.timeout is less than ftp.server.timeout (2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay) Otherwise there will be too many "delete client because idled too long" messages in thread logs. (Applies to the ftp client.) |
ftp.follow.talk | FALSE | Whether to log dialogue between our client and remote server. Useful for debugging. (Applies to the ftp client.) |
db.default.fetch.interval | 30 | The default number of days between re-fetches of a page. (Pages are re-fetched about once a month.) |
db.ignore.internal.links | TRUE | If true, when adding new links to a page, links from the same host are ignored. This is an effective way to limit the size of the link database, keeping only the highest quality links. (Used to limit the growth of the link database.) |
db.score.injected | 1 | The score of new pages added by the injector. |
db.score.link.external | 1 | The score factor for new pages added due to a link from another host relative to the referencing page's score. |
db.score.link.internal | 1 | The score factor for pages added due to a link from the same host, relative to the referencing page's score. |
db.max.outlinks.per.page | 100 | The maximum number of outlinks that we'll process for a page. (At most 100 outlinks are processed per page.) |
db.max.anchor.length | 100 | The maximum number of characters permitted in an anchor. (Anchor text is limited to 100 characters.) |
db.fetch.retry.max | 3 | The maximum number of times a url that has encountered recoverable errors is generated for fetch. (A URL may hit recoverable errors at most 3 times.) |
fetchlist.score.by.link.count | TRUE | If true, set page scores on fetchlist entries based on log(number of anchors), instead of using original page scores. This results in prioritization of pages with many incoming links. (If true, the page score is derived from the incoming link count, so pages with more incoming links get higher priority.) |
fetcher.server.delay | 5 | The number of seconds the fetcher will delay between successive requests to the same server. |
fetcher.threads.fetch | 10 | The number of FetcherThreads the fetcher should use. This also determines the maximum number of requests that are made at once (each FetcherThread handles one connection). (The fetcher uses at most 10 threads.) |
fetcher.threads.per.host | 1 | This number is the maximum number of threads that should be allowed to access a host at one time. |
fetcher.verbose | FALSE | If true, fetcher will log more verbosely. |
parser.threads.parse | 10 | Number of ParserThreads ParseSegment should use. |
io.sort.factor | 100 | The number of streams to merge at once while sorting files. This determines the number of open file handles. (100 streams are merged at once.) |
io.sort.mb | 100 | The total amount of buffer memory to use while sorting files, in megabytes. By default, gives each merge stream 1MB, which should minimize seeks. (The sort buffer is 100 MB, which works out to 1 MB per stream.) |
io.file.buffer.size | 131072 | The size of buffer for use in sequence files. The size of this buffer should probably be a multiple of hardware page size (4096 on Intel x86), and it determines how much data is buffered during read and write operations. (The size is 128 KB.) |
fs.default.name | local | The name of the default file system. Either the literal string "local" or a host:port for NDFS. |
ndfs.name.dir | /tmp/nutch/ndfs/name | Determines where on the local filesystem the NDFS name node should store the name table. |
ndfs.data.dir | /tmp/nutch/ndfs/data | Determines where on the local filesystem an NDFS data node should store its blocks. |
mapred.job.tracker | localhost:8010 | The host and port that the MapReduce job tracker runs at. |
mapred.local.dir | /tmp/nutch/mapred/local | The local directory where MapReduce stores temporary files related to tasks and jobs. (Temporary files used while running MapReduce.) |
indexer.score.power | 0.5 | Determines the power of link analysis scores. Each page's boost is set to score^scorePower, where score is its link analysis score and scorePower is the value of this parameter. This is compiled into indexes, so, when this is changed, pages must be re-indexed for it to take effect. |
indexer.boost.by.link.count | TRUE | When true, scores for a page are multiplied by the log of the number of incoming links to the page. (When true, the page's score is additionally multiplied by the log of its incoming link count.) |
indexer.max.title.length | 100 | The maximum number of characters of a title that are indexed. |
indexer.max.tokens | 10000 | The maximum number of tokens that will be indexed for a single field in a document. This limits the amount of memory required for indexing, so that collections with very large files will not crash the indexing process by running out of memory. Note that this effectively truncates large documents, excluding from the index tokens that occur further in the document. If you know your source documents are large, be sure to set this value high enough to accommodate the expected size. If you set it to Integer.MAX_VALUE, then the only limit is your memory, but you should anticipate an OutOfMemoryError. (This needs to be changed.) |
indexer.mergeFactor | 50 | The factor that determines the frequency of Lucene segment merges. This must not be less than 2; higher values increase indexing speed but lead to increased RAM usage, and increase the number of open file handles (which may lead to "Too many open files" errors). NOTE: the "segments" here have nothing to do with Nutch segments, they are a low-level data unit used by Lucene. (The segment merge factor.) |
indexer.minMergeDocs | 50 | This number determines the minimum number of Lucene Documents buffered in memory between Lucene segment merges. Larger values increase indexing speed and increase RAM usage. |
indexer.maxMergeDocs | 50 | This number determines the maximum number of Lucene Documents to be merged into a new Lucene segment. Larger values increase indexing speed and reduce the number of Lucene segments, which reduces the number of open file handles; however, this also increases RAM usage during indexing. (Lucene Documents are merged into one segment.) |
indexer.termIndexInterval | 128 | Determines the fraction of terms which Lucene keeps in RAM when searching, to facilitate random-access. Smaller values use more memory but make searches somewhat faster. Larger values use less memory but make searches somewhat slower. (Lucene keeps some of the terms in RAM for searching; this controls how many.) |
analysis.common.terms.file | common-terms.utf8 | The name of a file containing a list of common terms that should be indexed in n-grams. (Terms listed in this file should be searched as n-grams.) |
searcher.dir | . | Path to root of index directories. This directory is searched (in order) for either the file search-servers.txt, containing a list of distributed search servers, or the directory "index" containing merged indexes, or the directory "segments" containing segment indexes. (The index directory holds the merged index, and the segments directory holds the per-segment indexes. When indexing, an index is first built in the index directory under each segment, and these are then merged into one large index directory, so the merged index contains no more documents than the per-segment indexes combined.) |
searcher.filter.cache.size | 16 | Maximum number of filters to cache. Filters can accelerate certain field-based queries, like language, document format, etc. Each filter requires one bit of RAM per page. So, with a 10 million page index, a cache size of 16 consumes two bytes per page, or 20MB. (Note the calculation: 16 filters at 1 bit each is 16 bits, i.e. 2 bytes per page.) |
searcher.filter.cache.threshold | 0.05 | Filters are cached when their term is matched by more than this fraction of pages. For example, with a threshold of 0.05, and 10 million pages, the term must match more than 1/20, or 50,000 pages. So, if out of 10 million pages, 50% of pages are in English, and 2% are in Finnish, then, with a threshold of 0.05, searches for "lang:en" will use a cached filter, while searches for "lang:fi" will score all 20,000 Finnish documents. (Since 50% > 0.05, the filter cache is used; since 0.02 < 0.05, it is not. The number of documents scored should be 10 million * 0.02 = 200,000, so I don't see how this comes out as 20,000. By the same arithmetic, 1/20 of 10 million is 500,000, not 50,000.) |
searcher.hostgrouping.rawhits.factor | 2 | A factor that is used to determine the number of raw hits initially fetched, before host grouping is done. (I don't quite understand this one.) |
searcher.summary.context | 5 | The number of context terms to display preceding and following matching terms in a hit summary. |
searcher.summary.length | 20 | The total number of terms to display in a hit summary. (The total number of terms shown for each hit.) |
urlnormalizer.class | org.apache.nutch.net.BasicUrlNormalizer | Name of the class used to normalize URLs. (Why is org.apache.nutch.net.RegexUrlNormalizer not used here? Doesn't the RegexUrlNormalizer JavaDoc explicitly say to use RegexUrlNormalizer?) |
urlnormalizer.regex.file | regex-normalize.xml | Name of the config file used by the RegexUrlNormalizer class. |
mime.types.file | mime-types.xml | Name of file in CLASSPATH containing filename extension and magic sequence to mime types mapping information |
mime.type.magic | TRUE | Defines if the mime content type detector uses magic resolution. (I don't quite understand what "magic resolution" means.) |
ipc.client.timeout | 10000 | Defines the timeout for IPC calls in milliseconds. (IPC: interprocess communication.) |
plugin.folders | plugins | Directories where nutch plugins are located. Each element may be a relative or absolute path. If absolute, it is used as is. If relative, it is searched for on the classpath. |
plugin.includes | protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url) | Regular expression naming plugin directory names to include. Any plugin not matching this expression is excluded. By default Nutch includes crawling just HTML and plain text via HTTP, and basic indexing and search plugins. (This needs to be changed. In fact, the full set of plugins in Nutch for plugin.includes is clustering-carrot2|creativecommons|index-(basic|more)|language-identifier|ontology|parse-(ext|html|js|msword|pdf|rss|text|file|ftp|http)|protocol-(httpclient|file|ftp|http)|query-(basic|more|site|url)|urlfilter-(prefix|regex)) |
plugin.excludes | | Regular expression naming plugin directory names to exclude. |
parser.character.encoding.default | windows-1252 | The character encoding to fall back to when no other information is available. (The encoding used when no other information is available.) |
parser.html.impl | neko | HTML Parser implementation. Currently the following keywords are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup. |
urlfilter.regex.file | regex-urlfilter.txt | Name of file on CLASSPATH containing regular expressions used by urlfilter-regex (RegexURLFilter) plugin. |
urlfilter.prefix.file | prefix-urlfilter.txt | Name of file on CLASSPATH containing url prefixes used by urlfilter-prefix (PrefixURLFilter) plugin. (I could not find this file.) |
urlfilter.order | | The order by which url filters are applied. If empty, all available url filters (as dictated by properties plugin.includes and plugin.excludes above) are loaded and applied in system defined order. If not empty, only named filters are loaded and applied in given order. For example, if this property has value: org.apache.nutch.net.RegexURLFilter org.apache.nutch.net.PrefixURLFilter then RegexURLFilter is applied first, and PrefixURLFilter second. Since all filters are AND'ed, filter ordering does not have impact on end result, but it may have performance implication, depending on relative expensiveness of filters. (The order of the URL filters.) |
extension.clustering.hits-to-cluster | 100 | Number of snippets retrieved for the clustering extension if clustering extension is available and user requested results to be clustered. |
extension.clustering.extension-name | | Use the specified online clustering extension. If empty, the first available extension will be used. The "name" here refers to an 'id' attribute of the 'implementation' element in the plugin descriptor XML file. (I don't quite understand this.) |
extension.ontology.extension-name | | Use the specified online ontology extension. If empty, the first available extension will be used. The "name" here refers to an 'id' attribute of the 'implementation' element in the plugin descriptor XML file. |
extension.ontology.urls | | Urls of owl files, separated by spaces, such as http://www.example.com/ontology/time.owl http://www.example.com/ontology/space.owl http://www.example.com/ontology/wine.owl Or file:/ontology/time.owl file:/ontology/space.owl file:/ontology/wine.owl You have to make sure each url is valid. By default, there is no owl file, so query refinement based on ontology is silently ignored. (What is OWL?) |
query.url.boost | 4 | Used as a boost for url field in Lucene query. |
query.anchor.boost | 2 | Used as a boost for anchor field in Lucene query. (How should "boost" best be translated?) |
query.title.boost | 1.5 | Used as a boost for title field in Lucene query. |
query.host.boost | 2 | Used as a boost for host field in Lucene query. |
query.phrase.boost | 1 | Used as a boost for phrase in Lucene query. Multiplied by the boost for the field the phrase is matched in. |
lang.ngram.min.length | 1 | The minimum size of ngrams to use to identify the language (must be between 1 and lang.ngram.max.length). The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is. |
lang.ngram.max.length | 4 | The maximum size of ngrams to use to identify the language (must be between lang.ngram.min.length and 4). The larger the range between lang.ngram.min.length and lang.ngram.max.length, the better the identification, but the slower it is. (The wider the range, the better the identification, but the slower it runs.) |
lang.analyze.max.length | 2048 | The maximum bytes of data to use to identify the language (0 means full content analysis). The larger this value, the better the analysis, but the slower it is. (The maximum amount of data used to identify the language.) |
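
The arithmetic behind the searcher.filter.cache.size and searcher.filter.cache.threshold rows, and the two conditions listed under ftp.keep.connection, can be checked with a short Python sketch. The function names are hypothetical and are not part of Nutch. The unit conversion assumes fetcher.server.delay is in seconds and the ftp timeouts are in milliseconds, as the table states.

```python
def filter_cache_bytes(cache_size, pages):
    """RAM used by cached filters: one bit per page per filter."""
    return cache_size * pages // 8


def is_filter_cached(matching_pages, total_pages, threshold):
    """A filter is cached when its term matches more than threshold * pages."""
    return matching_pages > threshold * total_pages


def ftp_keep_connection_ok(ftp_timeout_ms, ftp_server_timeout_ms,
                           threads, server_delay_s):
    """Check the two rough conditions given for ftp.keep.connection=true.

    Assumption: fetcher.server.delay (seconds) is converted to milliseconds
    before comparing it with ftp.timeout.
    """
    busy_ms = threads * server_delay_s * 1000
    return busy_ms < ftp_timeout_ms < ftp_server_timeout_ms


if __name__ == "__main__":
    total = 10_000_000
    # 16 filters * 10 million pages / 8 bits per byte = 20,000,000 bytes.
    print(filter_cache_bytes(16, total))
    # Threshold in pages: 0.05 * 10 million = 500,000.
    print(round(0.05 * total))
    # 50% English pages exceed the threshold; 2% Finnish pages do not.
    print(is_filter_cached(total // 2, total, 0.05))
    print(is_filter_cached(total * 2 // 100, total, 0.05))
    # Defaults from the table: 10 threads * 5 s = 50,000 ms < 60,000 < 100,000.
    print(ftp_keep_connection_ok(60000, 100000, 10, 5))
```

With the default values in the table, both ftp.keep.connection conditions hold, and the Finnish example involves 200,000 documents rather than 20,000.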