An annotated walkthrough of Nutch's nutch-default.xml and regex-urlfilter.txt
nutch-default.xml, annotated
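The listing that follows is the stock nutch-default.xml. As its header comment says, the file itself is never edited; entries you want to change are copied into conf/nutch-site.xml and overridden there. A minimal sketch of such an override file (the agent name below is only a placeholder):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>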
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<!-- Do not modify this file directly. Instead, copy entries that you -->
<!-- wish to modify from this file into nutch-site.xml and change them -->
<!-- there. If nutch-site.xml does not already exist, create it. -->

<configuration>

<!-- general properties -->

<property>
  <name>store.ip.address</name>
  <value>false</value>
  <description>Enables us to capture the specific IP address
  (InetSocketAddress) of the host which we connect to via
  the given protocol. Currently supported is protocol-ftp and
  http.
  </description>
</property>

<!-- file properties -->

<property>
  <name>file.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the file://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the http.content.limit setting.
  Note: by default Nutch fetches only the first 65536 bytes of a page and
  discards the rest. On some large sites the front page is far larger than
  65536 bytes, and the first 65536 bytes may even be nothing but layout
  markup without a single hyperlink, so it can be worth raising this value
  considerably or setting it to -1 for no limit.
  </description>
</property>

<property>
  <name>file.crawl.parent</name>
  <value>true</value>
  <description>The crawler is not restricted to the directories that you specified in the
  Urls file but it is jumping into the parent directories as well. For your own crawlings you can
  change this behavior (set to false) the way that only directories beneath the directories that you specify get
  crawled.</description>
</property>

<property>
  <name>file.crawl.redirect_noncanonical</name>
  <value>true</value>
  <description>
  If true, protocol-file treats non-canonical file names as
  redirects and does not canonicalize file names internally. A file
  name containing symbolic links as path elements is then not
  resolved and "fetched" but recorded as redirect with the
  canonical name (all links on path are resolved) as redirect
  target.
  </description>
</property>

<property>
  <name>file.content.ignored</name>
  <value>true</value>
  <description>If true, no file content will be saved during fetch.
  And it is probably what we want to set most of time, since file:// URLs
  are meant to be local and we can always use them directly at parsing
  and indexing stages. Otherwise file contents will be saved.
  !! NOT IMPLEMENTED YET !!
  </description>
</property>

<!-- HTTP properties -->

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header.
MUST NOT be empty - 92 please set this to a single word uniquely related to your organization. 93 94 NOTE: You should also check other related properties: 95 96 http.robots.agents 97 http.agent.description 98 http.agent.url 99 http.agent.email 100 http.agent.version 101 102 and set their values appropriately. 103 104 </description> 105 </property> 106 107 <property> 108 <name>http.robots.agents</name> 109 <value></value> 110 <description>Any other agents, apart from 'http.agent.name', that the robots 111 parser would look for in robots.txt. Multiple agents can be provided using 112 comma as a delimiter. eg. mybot,foo-spider,bar-crawler 113 114 The ordering of agents does NOT matter and the robots parser would make 115 decision based on the agent which matches first to the robots rules. 116 Also, there is NO need to add a wildcard (ie. "*") to this string as the 117 robots parser would smartly take care of a no-match situation. 118 119 If no value is specified, by default HTTP agent (ie. 'http.agent.name') 120 would be used for user agent matching by the robots parser. 121 </description> 122 </property> 123 124 <property> 125 <name>http.robot.rules.whitelist</name> 126 <value></value> 127 <description>Comma separated list of hostnames or IP addresses to ignore 128 robot rules parsing for. Use with care and only if you are explicitly 129 allowed by the site owner to ignore the site's robots.txt! 130 </description> 131 </property> 132 133 <property> 134 <name>http.robots.403.allow</name> 135 <value>true</value> 136 <description>Some servers return HTTP status 403 (Forbidden) if 137 /robots.txt doesn't exist. This should probably mean that we are 138 allowed to crawl the site nonetheless. If this is set to false, 139 then such sites will be treated as forbidden.</description> 140 </property> 141 142 <property> 143 <name>http.agent.description</name> 144 <value></value> 145 <description>Further description of our bot- this text is used in 146 the User-Agent header. It appears in parenthesis after the agent name. 147 </description> 148 </property> 149 150 <property> 151 <name>http.agent.url</name> 152 <value></value> 153 <description>A URL to advertise in the User-Agent header. This will 154 appear in parenthesis after the agent name. Custom dictates that this 155 should be a URL of a page explaining the purpose and behavior of this 156 crawler. 157 </description> 158 </property> 159 160 <property> 161 <name>http.agent.email</name> 162 <value></value> 163 <description>An email address to advertise in the HTTP 'From' request 164 header and User-Agent header. A good practice is to mangle this 165 address (e.g. 'info at example dot com') to avoid spamming. 166 </description> 167 </property> 168 169 <property> 170 <name>http.agent.version</name> 171 <value>Nutch-1.10</value> 172 <description>A version string to advertise in the User-Agent 173 header.</description> 174 </property> 175 176 <property> 177 <name>http.agent.rotate</name> 178 <value>false</value> 179 <description> 180 If true, instead of http.agent.name, alternating agent names are 181 chosen from a list provided via http.agent.rotate.file. 182 </description> 183 </property> 184 185 <property> 186 <name>http.agent.rotate.file</name> 187 <value>agents.txt</value> 188 <description> 189 File containing alternative user agent names to be used instead of 190 http.agent.name on a rotating basis if http.agent.rotate is true. 
Each line of the file should contain exactly one agent
specification including name, version, description, URL, etc.
</description>
</property>

<property>
  <name>http.agent.host</name>
  <value></value>
  <description>Name or IP address of the host on which the Nutch crawler
  would be running. Currently this is used by 'protocol-httpclient'
  plugin.
  </description>
</property>

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>The number of times a thread will delay when trying to
  fetch a page. Each time it finds that a host is busy, it will wait
  fetcher.server.delay. After http.max.delays attempts, it will give
  up on the page for now.
  Note: the wait between attempts is governed by fetcher.server.delay, so on a
  poor network it helps to raise fetcher.server.delay somewhat; http.timeout is
  also tied to network conditions.
  </description>
</property>

<property>
  <name>http.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  Note: this caps the length of fetched document content. With the default of
  65536 a document is truncated to roughly 64 KB and the rest is ignored; a
  crawler that needs complete documents of a particular kind (XML files, for
  example) should change it. -1 means no limit.
  </description>
</property>
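<!-- Annotation (example, not part of the stock file): the note above suggests lifting the
     content limit for large pages; in conf/nutch-site.xml the override would simply be:
     <property>
       <name>http.content.limit</name>
       <value>-1</value>
     </property>
-->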
<!-- The following properties set the proxy host and port; if the proxy requires a
     password, the username and password must be set as well. -->
<property>
  <name>http.proxy.host</name>
  <value></value>
  <description>The proxy hostname. If empty, no proxy is used.</description>
</property>

<property>
  <name>http.proxy.port</name>
  <value></value>
  <description>The proxy port.</description>
</property>

<property>
  <name>http.proxy.username</name>
  <value></value>
  <description>Username for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  NOTE: For NTLM authentication, do not prefix the username with the
  domain, i.e. 'susam' is correct whereas 'DOMAIN\susam' is incorrect.
  </description>
</property>

<property>
  <name>http.proxy.password</name>
  <value></value>
  <description>Password for proxy. This will be used by
  'protocol-httpclient', if the proxy server requests basic, digest
  and/or NTLM authentication. To use this, 'protocol-httpclient' must
  be present in the value of 'plugin.includes' property.
  </description>
</property>

<property>
  <name>http.proxy.realm</name>
  <value></value>
  <description>Authentication realm for proxy. Do not define a value
  if realm is not required or authentication should take place for any
  realm. NTLM does not use the notion of realms. Specify the domain name
  of NTLM authentication as the value for this property. To use this,
  'protocol-httpclient' must be present in the value of
  'plugin.includes' property.
  </description>
</property>
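<!-- Annotation (example, not part of the stock file): a typical proxy setup in
     conf/nutch-site.xml needs only the host and port; the hostname here is a placeholder.
     <property>
       <name>http.proxy.host</name>
       <value>proxy.example.com</value>
     </property>
     <property>
       <name>http.proxy.port</name>
       <value>8080</value>
     </property>
-->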
<property>
  <name>http.auth.file</name>
  <value>httpclient-auth.xml</value>
  <description>Authentication configuration file for
  'protocol-httpclient' plugin.
  </description>
</property>

<property>
  <name>http.verbose</name>
  <value>false</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>

<property>
  <name>http.redirect.max</name>
  <value>0</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

<property>
  <name>http.useHttp11</name>
  <value>false</value>
  <description>NOTE: at the moment this works only for protocol-httpclient.
  If true, use HTTP 1.1, if false use HTTP 1.0 .
  </description>
</property>

<property>
  <name>http.accept.language</name>
  <value>en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field.
  This allows selecting non-English language as default one to retrieve.
  It is a useful setting for search engines built for a certain national group.
  </description>
</property>

<property>
  <name>http.accept</name>
  <value>text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8</value>
  <description>Value of the "Accept" request header field.
  </description>
</property>

<property>
  <name>http.store.responsetime</name>
  <value>true</value>
  <description>Enables us to record the response time of the
  host which is the time period between start connection to end
  connection of a pages host. The response time in milliseconds
  is stored in CrawlDb in CrawlDatum's meta data under key "_rs_"
  </description>
</property>

<property>
  <name>http.enable.if.modified.since.header</name>
  <value>true</value>
  <description>Whether Nutch sends an HTTP If-Modified-Since header. It reduces
  bandwidth when enabled by not downloading pages that respond with an HTTP
  Not-Modified header. URL's that are not downloaded are not passed through
  parse or indexing filters. If you regularly modify filters, you should force
  Nutch to also download unmodified pages by disabling this feature.
  </description>
</property>

<!-- FTP properties -->

<property>
  <name>ftp.username</name>
  <value>anonymous</value>
  <description>ftp login username.</description>
</property>

<property>
  <name>ftp.password</name>
  <value>anonymous@example.com</value>
  <description>ftp login password.</description>
</property>

<property>
  <name>ftp.content.limit</name>
  <value>65536</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be truncated;
  otherwise, no truncation at all.
  Caution: classical ftp RFCs never defines partial transfer and, in fact,
  some ftp servers out there do not handle client side forced close-down very
  well. Our implementation tries its best to handle such situations smoothly.
  Note: as with the HTTP limit, only the first 65536 bytes are fetched by default
  and the rest is discarded; on large sites the useful content and hyperlinks may
  lie beyond that point. Set to -1 for no limit.
  </description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>60000</value>
  <description>Default timeout for ftp client socket, in millisec.
  Please also see ftp.keep.connection below.</description>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>100000</value>
  <description>An estimation of ftp server idle time, in millisec.
  Typically it is 120000 millisec for many ftp servers out there.
  Better be conservative here. Together with ftp.timeout, it is used to
  decide if we need to delete (annihilate) current ftp.client instance and
  force to start another ftp.client instance anew. This is necessary because
  a fetcher thread may not be able to obtain next request from queue in time
  (due to idleness) before our ftp client times out or remote server
  disconnects. Used only when ftp.keep.connection is true (please see below).
  </description>
</property>

<property>
  <name>ftp.keep.connection</name>
  <value>false</value>
  <description>Whether to keep ftp connection. Useful if crawling same host
  again and again. When set to true, it avoids connection, login and dir list
  parser setup for subsequent urls. If it is set to true, however, you must
  make sure (roughly):
  (1) ftp.timeout is less than ftp.server.timeout
  (2) ftp.timeout is larger than (fetcher.threads.fetch * fetcher.server.delay)
  Otherwise there will be too many "delete client because idled too long"
  messages in thread logs.</description>
</property>

<property>
  <name>ftp.follow.talk</name>
  <value>false</value>
  <description>Whether to log dialogue between our client and remote
  server. Useful for debugging.</description>
</property>

<!-- web db properties -->

<property>
  <name>db.fetch.interval.default</name>
  <value>2592000</value>
  <description>The default number of seconds between re-fetches of a page (30 days).
  Note: useful when building periodic, automated re-crawls; it sets how long before a
  page is fetched again. The default is 2592000 seconds, i.e. 30 days.
  </description>
</property>
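<!-- Annotation (example, not part of the stock file): for a weekly re-crawl the interval
     would be overridden in conf/nutch-site.xml; 7 days is 7 * 24 * 3600 = 604800 seconds.
     <property>
       <name>db.fetch.interval.default</name>
       <value>604800</value>
     </property>
-->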
<property>
  <name>db.fetch.interval.max</name>
  <value>7776000</value>
  <description>The maximum number of seconds between re-fetches of a page
  (90 days). After this period every page in the db will be re-tried, no
  matter what is its status.
  </description>
</property>

<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.DefaultFetchSchedule</value>
  <description>The implementation of fetch schedule. DefaultFetchSchedule simply
  adds the original fetchInterval to the last fetch time, regardless of
  page changes.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.inc_rate</name>
  <value>0.4</value>
  <description>If a page is unmodified, its fetchInterval will be
  increased by this rate. This value should not
  exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.dec_rate</name>
  <value>0.2</value>
  <description>If a page is modified, its fetchInterval will be
  decreased by this rate. This value should not
  exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.min_interval</name>
  <value>60.0</value>
  <description>Minimum fetchInterval, in seconds.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.max_interval</name>
  <value>31536000.0</value>
  <description>Maximum fetchInterval, in seconds (365 days).
  NOTE: this is limited by db.fetch.interval.max. Pages with
  fetchInterval larger than db.fetch.interval.max
  will be fetched anyway.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.sync_delta</name>
  <value>true</value>
  <description>If true, try to synchronize with the time of page change,
  by shifting the next fetchTime by a fraction (sync_rate) of the difference
  between the last modification time, and the last fetch time.</description>
</property>

<property>
  <name>db.fetch.schedule.adaptive.sync_delta_rate</name>
  <value>0.3</value>
  <description>See sync_delta for description. This value should not
  exceed 0.5, otherwise the algorithm becomes unstable.</description>
</property>

<property>
  <name>db.fetch.schedule.mime.file</name>
  <value>adaptive-mimetypes.txt</value>
  <description>The configuration file for the MimeAdaptiveFetchSchedule.
  </description>
</property>

<property>
  <name>db.update.additions.allowed</name>
  <value>true</value>
  <description>If true, updatedb will add newly discovered URLs, if false
  only already existing URLs in the CrawlDb will be updated and no new
  URLs will be added.
  </description>
</property>

<property>
  <name>db.preserve.backup</name>
  <value>true</value>
  <description>If true, updatedb will keep a backup of the previous CrawlDB
  version in the old directory. In case of disaster, one can rename old to
  current and restore the CrawlDB to its previous state.
  </description>
</property>

<property>
  <name>db.update.purge.404</name>
  <value>false</value>
  <description>If true, updatedb will purge records with status DB_GONE
  from the CrawlDB.
  </description>
</property>

<property>
  <name>db.url.normalizers</name>
  <value>false</value>
  <description>Normalize urls when updating crawldb</description>
</property>

<property>
  <name>db.url.filters</name>
  <value>false</value>
  <description>Filter urls when updating crawldb</description>
</property>

<property>
  <name>db.update.max.inlinks</name>
  <value>10000</value>
  <description>Maximum number of inlinks to take into account when updating
  a URL score in the crawlDB. Only the best scoring inlinks are kept.
  </description>
</property>

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored. This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
  <description>If true, outlinks leading from a page to external hosts
  will be ignored. This is an effective way to limit the crawl to include
  only initially injected hosts, without creating complex URLFilters.
  Note: when true, only pages on the original host are crawled and external
  links are ignored. The same effect can be had with filters in
  regex-urlfilter.txt, but a very large number of filters (thousands, say)
  will hurt Nutch's performance badly.
  </description>
</property>
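<!-- Annotation (example, not part of the stock file): to confine a crawl to the injected
     hosts without writing thousands of regex filters, override this in conf/nutch-site.xml:
     <property>
       <name>db.ignore.external.links</name>
       <value>true</value>
     </property>
-->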
<property>
  <name>db.injector.overwrite</name>
  <value>false</value>
  <description>Whether existing records in the CrawlDB will be overwritten
  by injected records.
  </description>
</property>

<property>
  <name>db.injector.update</name>
  <value>false</value>
  <description>If true, existing records in the CrawlDB will be updated with
  injected records. Old meta data is preserved. The db.injector.overwrite
  parameter has precedence.
  </description>
</property>

<property>
  <name>db.score.injected</name>
  <value>1.0</value>
  <description>The score of new pages added by the injector.
  Note: the default page score (a measure of importance) given to URLs at inject time.
  </description>
</property>

<property>
  <name>db.score.link.external</name>
  <value>1.0</value>
  <description>The score factor for new pages added due to a link from
  another host relative to the referencing page's score. Scoring plugins
  may use this value to affect initial scores of external links.
  </description>
</property>

<property>
  <name>db.score.link.internal</name>
  <value>1.0</value>
  <description>The score factor for pages added due to a link from the
  same host, relative to the referencing page's score. Scoring plugins
  may use this value to affect initial scores of internal links.
  </description>
</property>

<property>
  <name>db.score.count.filtered</name>
  <value>false</value>
  <description>The score value passed to newly discovered pages is
  calculated as a fraction of the original page score divided by the
  number of outlinks. If this option is false, only the outlinks that passed
  URLFilters will count, if it's true then all outlinks will count.
  </description>
</property>

<property>
  <name>db.max.inlinks</name>
  <value>10000</value>
  <description>Maximum number of Inlinks per URL to be kept in LinkDb.
  If "invertlinks" finds more inlinks than this number, only the first
  N inlinks will be stored, and the rest will be discarded.
  </description>
</property>

<property>
  <name>db.max.outlinks.per.page</name>
  <value>100</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  Note: by default Nutch processes only 100 outlinks per page, so some links are
  never crawled; to change that, raise this value or set it to -1 for no limit.
  </description>
</property>
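<!-- Annotation (example, not part of the stock file): following the note above, the outlink
     cap can be removed entirely in conf/nutch-site.xml:
     <property>
       <name>db.max.outlinks.per.page</name>
       <value>-1</value>
     </property>
-->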
<property>
  <name>db.max.anchor.length</name>
  <value>100</value>
  <description>The maximum number of characters permitted in an anchor.
  </description>
</property>

<property>
  <name>db.parsemeta.to.crawldb</name>
  <value></value>
  <description>Comma-separated list of parse metadata keys to transfer to the crawldb (NUTCH-779).
  Assuming for instance that the languageidentifier plugin is enabled, setting the value to 'lang'
  will copy both the key 'lang' and its value to the corresponding entry in the crawldb.
  </description>
</property>

<property>
  <name>db.fetch.retry.max</name>
  <value>3</value>
  <description>The maximum number of times a url that has encountered
  recoverable errors is generated for fetch.</description>
</property>

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.MD5Signature</value>
  <description>The default implementation of a page signature. Signatures
  created with this implementation will be used for duplicate detection
  and removal.</description>
</property>

<property>
  <name>db.signature.text_profile.min_token_len</name>
  <value>2</value>
  <description>Minimum token length to be included in the signature.
  </description>
</property>

<property>
  <name>db.signature.text_profile.quant_rate</name>
  <value>0.01</value>
  <description>Profile frequencies will be rounded down to a multiple of
  QUANT = (int)(QUANT_RATE * maxFreq), where maxFreq is a maximum token
  frequency. If maxFreq > 1 then QUANT will be at least 2, which means that
  for longer texts tokens with frequency 1 will always be discarded.
  </description>
</property>

<!-- generate properties -->

<property>
  <name>generate.max.count</name>
  <value>-1</value>
  <description>The maximum number of urls in a single
  fetchlist. -1 if unlimited. The urls are counted according
  to the value of the parameter generator.count.mode.
  Note: together with generate.count.mode this caps the number of URLs from the
  same host/domain/IP in one generated fetchlist at (generate.max.count - 1).
  -1 means there is no limit on how many URLs of the same host/domain/IP a
  fetchlist may contain.
  </description>
</property>

<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Determines how the URLs are counted for generator.max.count.
  Default value is 'host' but can be 'domain'. Note that we do not count
  per IP in the new version of the Generator.
  Note: one of byHost/byDomain/byIP, i.e. the way URLs are counted towards the
  generate.max.count limit. With byHost, the URLs in each fetchlist are counted
  per host; within one segment the URLs of a single host may not exceed
  generate.max.count, and any further URLs of that host go into another
  fetchlist (if one still has room).
  </description>
</property>

<property>
  <name>generate.update.crawldb</name>
  <value>false</value>
  <description>For highly-concurrent environments, where several
  generate/fetch/update cycles may overlap, setting this to true ensures
  that generate will create different fetchlists even without intervening
  updatedb-s, at the cost of running an additional job to update CrawlDB.
  If false, running generate twice without intervening
  updatedb will generate identical fetchlists.
  Note: whether to update the crawldb after the generator finishes, mainly by
  setting the CrawlDatum's _ngt_ field to the time of this generator run, so
  that the next run (one starting within the window given by crawl.gen.delay)
  does not select the same URLs again. Even if a later run does pick the same
  URLs nothing breaks logically; it merely wastes resources by re-fetching them.
  </description>
</property>

<property>
  <name>generate.min.score</name>
  <value>0</value>
  <description>Select only entries with a score larger than
  generate.min.score.
  Note: if, after the ScoreFilters have run, a URL's score (a measure of page
  importance, similar to PageRank) is still below generate.min.score, the URL is
  skipped and not added to the fetchlist; setting this makes the generator
  consider only the more important pages. With 0, no URL is dropped at the
  generate stage because of its score.
  </description>
</property>

<property>
  <name>generate.min.interval</name>
  <value>-1</value>
  <description>Select only entries with a retry interval lower than
  generate.min.interval. A value of -1 disables this check.
  Note: setting this makes the generator consider only URLs that need frequent
  fetching (i.e. those with a small CrawlDatum fetchInterval); URLs that do not
  need frequent fetching are left out of the fetchlist. -1 disables the check.
  </description>
</property>
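<!-- Annotation (example, not part of the stock file): a common politeness setup caps each
     fetchlist per host; the figure below is purely illustrative.
     <property>
       <name>generate.max.count</name>
       <value>200</value>
     </property>
     <property>
       <name>generate.count.mode</name>
       <value>host</value>
     </property>
-->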
<!-- urlpartitioner properties -->

<property>
  <name>partition.url.mode</name>
  <value>byHost</value>
  <description>Determines how to partition URLs. Default value is 'byHost',
  also takes 'byDomain' or 'byIP'.
  Note: this controls how URLs are hashed in the partition step after the map
  phase, so that URLs with the same host land on the same reduce node. It is
  the criterion used when splitting the generated fetchlist, one of
  byHost/byDomain/byIP.
  </description>
</property>

<property>
  <name>crawl.gen.delay</name>
  <value>604800000</value>
  <description>
  This value, expressed in milliseconds, defines how long we should keep the lock on records
  in CrawlDb that were just selected for fetching. If these records are not updated
  in the meantime, the lock is canceled, i.e. they become eligible for selecting.
  Default value of this is 7 days (604800000 ms).
  Note: when the generator runs it stores, under the key "_ngt_" (nutch generate
  time), the time the URL was last selected by a generator run, meaning the URL
  is already in some fetchlist and may still be going through the fetch and
  updatedb steps; that cycle can take a long time or fail part-way. When deciding
  whether the URL may go into the current fetchlist, the generator needs a rule
  for choosing between selecting it again and waiting for the earlier
  fetch/updatedb cycle to finish (which would refresh the URL's _ngt_ in the
  crawldb). crawl.gen.delay is that rule: if _ngt_ plus crawl.gen.delay is earlier
  than the current time, the URL may be added to the fetchlist being generated;
  otherwise it is skipped.
  </description>
</property>

<!-- fetcher properties -->

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds the fetcher will delay between
  successive requests to the same server. Note that this might get
  overridden by a Crawl-Delay from a robots.txt and is used ONLY if
  fetcher.threads.per.queue is set to 1.
  </description>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>0.0</value>
  <description>The minimum number of seconds the fetcher will delay between
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
  is turned off).</description>
</property>

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>
  If the Crawl-Delay in robots.txt is set to greater than this value (in
  seconds) then the fetcher will skip this page, generating an error report.
  If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.
  </description>
</property>

<property>
  <name>fetcher.threads.fetch</name>
  <value>10</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes as fetcher has one map task per node.
  Note: the maximum number of fetch threads.
  </description>
</property>
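<!-- Annotation (example, not part of the stock file): fetch throughput is normally tuned by
     adjusting the thread count and the per-server delay together; the values below are only
     illustrative, not recommendations.
     <property>
       <name>fetcher.threads.fetch</name>
       <value>20</value>
     </property>
     <property>
       <name>fetcher.server.delay</name>
       <value>2.0</value>
     </property>
-->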
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>This number is the maximum number of threads that
  should be allowed to access a queue at one time. Setting it to
  a value > 1 will cause the Crawl-Delay value from robots.txt to
  be ignored and the value of fetcher.server.min.delay to be used
  as a delay between successive requests to the same server instead
  of fetcher.server.delay.
  </description>
</property>

<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>
  <description>Determines how to put URLs into queues. Default value is 'byHost',
  also takes 'byDomain' or 'byIP'.
  </description>
</property>

<property>
  <name>fetcher.verbose</name>
  <value>false</value>
  <description>If true, fetcher will log more verbosely.
  Note: if true, more detailed information is logged.
  </description>
</property>

<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content. Default is false, which means
  that a separate parsing step is required after fetching is finished.
  Note: parsing while fetching is possible, but not recommended.
  </description>
</property>

<property>
  <name>fetcher.store.content</name>
  <value>true</value>
  <description>If true, fetcher will store content.</description>
</property>

<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
  <description>This is the number of minutes allocated to the fetching.
  Once this value is reached, any remaining entry from the input URL list is skipped
  and all active queues are emptied. The default value of -1 deactivates the time limit.
  </description>
</property>

<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>-1</value>
  <description>The maximum number of protocol-level exceptions (e.g. timeouts) per
  host (or IP) queue. Once this value is reached, any remaining entries from this
  queue are purged, effectively stopping the fetching from this host/IP. The default
  value of -1 deactivates this limit.
  </description>
</property>

<property>
  <name>fetcher.throughput.threshold.pages</name>
  <value>-1</value>
  <description>The threshold of minimum pages per second. If the fetcher downloads less
  pages per second than the configured threshold, the fetcher stops, preventing slow queues
  from stalling the throughput. This threshold must be an integer. This can be useful when
  fetcher.timelimit.mins is hard to determine. The default value of -1 disables this check.
  </description>
</property>

<property>
  <name>fetcher.throughput.threshold.retries</name>
  <value>5</value>
  <description>The number of times the fetcher.throughput.threshold is allowed to be exceeded.
  This setting prevents accidental slow downs from immediately killing the fetcher thread.
  </description>
</property>

<property>
  <name>fetcher.throughput.threshold.check.after</name>
  <value>5</value>
  <description>The number of minutes after which the throughput check is enabled.</description>
</property>

<property>
  <name>fetcher.threads.timeout.divisor</name>
  <value>2</value>
  <description>(EXPERT)The thread time-out divisor to use. By default threads have a time-out
  value of mapred.task.timeout / 2. Increase this setting if the fetcher waits too
  long before killing hung threads. Be careful, a too high setting (+8) will most likely kill the
  fetcher threads prematurely.
  </description>
</property>

<property>
  <name>fetcher.queue.depth.multiplier</name>
  <value>50</value>
  <description>(EXPERT)The fetcher buffers the incoming URLs into queues based on the [host|domain|IP]
  (see param fetcher.queue.mode). The depth of the queue is the number of threads times the value of this parameter.
  A large value requires more memory but can improve the performance of the fetch when the order of the URLs in the fetch list
  is not optimal.
953 </description> 954 </property> 955 956 <property> 957 <name>fetcher.follow.outlinks.depth</name> 958 <value>-1</value> 959 <description>(EXPERT)When fetcher.parse is true and this value is greater than 0 the fetcher will extract outlinks 960 and follow until the desired depth is reached. A value of 1 means all generated pages are fetched and their first degree 961 outlinks are fetched and parsed too. Be careful, this feature is in itself agnostic of the state of the CrawlDB and does not 962 know about already fetched pages. A setting larger than 2 will most likely fetch home pages twice in the same fetch cycle. 963 It is highly recommended to set db.ignore.external.links to true to restrict the outlink follower to URL's within the same 964 domain. When disabled (false) the feature is likely to follow duplicates even when depth=1. 965 A value of -1 of 0 disables this feature. 966 </description> 967 </property> 968 969 <property> 970 <name>fetcher.follow.outlinks.num.links</name> 971 <value>4</value> 972 <description>(EXPERT)The number of outlinks to follow when fetcher.follow.outlinks.depth is enabled. Be careful, this can multiply 973 the total number of pages to fetch. This works with fetcher.follow.outlinks.depth.divisor, by default settings the followed outlinks 974 at depth 1 is 8, not 4. 975 </description> 976 </property> 977 978 <property> 979 <name>fetcher.follow.outlinks.depth.divisor</name> 980 <value>2</value> 981 <description>(EXPERT)The divisor of fetcher.follow.outlinks.num.links per fetcher.follow.outlinks.depth. This decreases the number 982 of outlinks to follow by increasing depth. The formula used is: outlinks = floor(divisor / depth * num.links). This prevents 983 exponential growth of the fetch list. 984 </description> 985 </property> 986 987 <property> 988 <name>fetcher.follow.outlinks.ignore.external</name> 989 <value>true</value> 990 <description>Whether to ignore or follow external links. Set db.ignore.external.links to false and this to true to store outlinks 991 in the output but not follow them. If db.ignore.external.links is true this directive is ignored. 992 </description> 993 </property> 994 995 <property> 996 <name>fetcher.bandwidth.target</name> 997 <value>-1</value> 998 <description>Target bandwidth in kilobits per sec for each mapper instance. This is used to adjust the number of 999 fetching threads automatically (up to fetcher.maxNum.threads). A value of -1 deactivates the functionality, in which case 1000 the number of fetching threads is fixed (see fetcher.threads.fetch).</description> 1001 </property> 1002 1003 <property> 1004 <name>fetcher.maxNum.threads</name> 1005 <value>25</value> 1006 <description>Max number of fetch threads allowed when using fetcher.bandwidth.target. Defaults to fetcher.threads.fetch if unspecified or 1007 set to a value lower than it. </description> 1008 </property> 1009 1010 <property> 1011 <name>fetcher.bandwidth.target.check.everyNSecs</name> 1012 <value>30</value> 1013 <description>(EXPERT) Value in seconds which determines how frequently we should reassess the optimal number of fetch threads when using 1014 fetcher.bandwidth.target. Defaults to 30 and must be at least 1.</description> 1015 </property> 1016 1017 <!-- moreindexingfilter plugin properties --> 1018 1019 <property> 1020 <name>moreIndexingFilter.indexMimeTypeParts</name> 1021 <value>true</value> 1022 <description>Determines whether the index-more plugin will split the mime-type 1023 in sub parts, this requires the type field to be multi valued. 
Set to true for backward 1024 compatibility. False will not split the mime-type. 1025 </description> 1026 </property> 1027 1028 <property> 1029 <name>moreIndexingFilter.mapMimeTypes</name> 1030 <value>false</value> 1031 <description>Determines whether MIME-type mapping is enabled. It takes a 1032 plain text file with mapped MIME-types. With it the user can map both 1033 application/xhtml+xml and text/html to the same target MIME-type so it 1034 can be treated equally in an index. See conf/contenttype-mapping.txt. 1035 </description> 1036 </property> 1037 1038 <!-- AnchorIndexing filter plugin properties --> 1039 1040 <property> 1041 <name>anchorIndexingFilter.deduplicate</name> 1042 <value>false</value> 1043 <description>With this enabled the indexer will case-insensitive deduplicate anchors 1044 before indexing. This prevents possible hundreds or thousands of identical anchors for 1045 a given page to be indexed but will affect the search scoring (i.e. tf=1.0f). 1046 </description> 1047 </property> 1048 1049 <!-- indexingfilter plugin properties --> 1050 1051 <property> 1052 <name>indexingfilter.order</name> 1053 <value></value> 1054 <description>The order by which index filters are applied. 1055 If empty, all available index filters (as dictated by properties 1056 plugin-includes and plugin-excludes above) are loaded and applied in system 1057 defined order. If not empty, only named filters are loaded and applied 1058 in given order. For example, if this property has value: 1059 org.apache.nutch.indexer.basic.BasicIndexingFilter org.apache.nutch.indexer.more.MoreIndexingFilter 1060 then BasicIndexingFilter is applied first, and MoreIndexingFilter second. 1061 1062 Filter ordering might have impact on result if one filter depends on output of 1063 another filter. 1064 </description> 1065 </property> 1066 1067 <property> 1068 <name>indexer.score.power</name> 1069 <value>0.5</value> 1070 <description>Determines the power of link analyis scores. Each 1071 pages's boost is set to <i>score<sup>scorePower</sup></i> where 1072 <i>score</i> is its link analysis score and <i>scorePower</i> is the 1073 value of this parameter. This is compiled into indexes, so, when 1074 this is changed, pages must be re-indexed for it to take 1075 effect.</description> 1076 </property> 1077 1078 <property> 1079 <name>indexer.max.title.length</name> 1080 <value>100</value> 1081 <description>The maximum number of characters of a title that are indexed. A value of -1 disables this check. 1082 </description> 1083 </property> 1084 1085 <property> 1086 <name>indexer.max.content.length</name> 1087 <value>-1</value> 1088 <description>The maximum number of characters of a content that are indexed. 1089 Content beyond the limit is truncated. A value of -1 disables this check. 1090 </description> 1091 </property> 1092 1093 <property> 1094 <name>indexer.add.domain</name> 1095 <value>false</value> 1096 <description>Whether to add the domain field to a NutchDocument.</description> 1097 </property> 1098 1099 <property> 1100 <name>indexer.skip.notmodified</name> 1101 <value>false</value> 1102 <description>Whether the indexer will skip records with a db_notmodified status. 1103 </description> 1104 </property> 1105 1106 <!-- URL normalizer properties --> 1107 1108 <property> 1109 <name>urlnormalizer.order</name> 1110 <value>org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer</value> 1111 <description>Order in which normalizers will run. 
If any of these isn't
activated it will be silently skipped. If other normalizers not on the
list are activated, they will run in random order after the ones
specified here are run.
</description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the RegexUrlNormalizer class.
  </description>
</property>

<property>
  <name>urlnormalizer.loop.count</name>
  <value>1</value>
  <description>Optionally loop through normalizers several times, to make
  sure that all transformations have been performed.
  </description>
</property>

<!-- mime properties -->

<!--
<property>
  <name>mime.types.file</name>
  <value>tika-mimetypes.xml</value>
  <description>Name of file in CLASSPATH containing filename extension and
  magic sequence to mime types mapping information. Overrides the default Tika config
  if specified.
  </description>
</property>
-->

<property>
  <name>mime.type.magic</name>
  <value>true</value>
  <description>Defines if the mime content type detector uses magic resolution.
  </description>
</property>

<!-- plugin properties -->

<property>
  <name>plugin.folders</name>
  <value>plugins</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.
  Note: this points at the plugin directory. When running inside Eclipse it
  needs to be changed to ./src/plugin, while for the packaged job file running
  on a distributed cluster it should stay plugins.
  </description>
</property>

<property>
  <name>plugin.auto-activation</name>
  <value>true</value>
  <description>Defines if some plugins that are not activated regarding
  the plugin.includes and plugin.excludes properties must be automatically
  activated if they are needed by some activated plugins.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  Note: this is where the plugin set is configured; plugin.includes lists the
  plugins to load.
  </description>
</property>
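<!-- Annotation (example, not part of the stock file): following the description above, an
     override that swaps in protocol-httpclient for HTTPS support could look like this in
     conf/nutch-site.xml:
     <property>
       <name>plugin.includes</name>
       <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
     </property>
-->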
<property>
  <name>plugin.excludes</name>
  <value></value>
  <description>Regular expression naming plugin directory names to exclude.
  </description>
</property>

<property>
  <name>urlmeta.tags</name>
  <value></value>
  <description>
  To be used in conjunction with features introduced in NUTCH-655, which allows
  for custom metatags to be injected alongside your crawl URLs. Specifying those
  custom tags here will allow for their propagation into a page's outlinks, as
  well as allow for them to be included as part of an index.
  Values should be comma-delimited. ("tag1,tag2,tag3") Do not pad the tags with
  white-space at their boundaries, if you are using anything earlier than Hadoop-0.21.
  </description>
</property>

<!-- parser properties -->

<property>
  <name>parse.plugin.file</name>
  <value>parse-plugins.xml</value>
  <description>The name of the file that defines the associations between
  content-types and parsers.</description>
</property>

<property>
  <name>parser.character.encoding.default</name>
  <value>windows-1252</value>
  <description>The character encoding to fall back to when no other information
  is available.
  Note: the default encoding (windows-1252) used when parsing documents; it is
  applied whenever no encoding can be detected from the document itself.
  </description>
</property>

<property>
  <name>encodingdetector.charset.min.confidence</name>
  <value>-1</value>
  <description>An integer between 0-100 indicating minimum confidence value
  for charset auto-detection. Any negative value disables auto-detection.
  </description>
</property>

<property>
  <name>parser.caching.forbidden.policy</name>
  <value>content</value>
  <description>If a site (or a page) requests through its robot metatags
  that it should not be shown as cached content, apply this policy. Currently
  three keywords are recognized: "none" ignores any "noarchive" directives.
  "content" doesn't show the content, but shows summaries (snippets).
  "all" doesn't show either content or summaries.</description>
</property>

<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML Parser implementation. Currently the following keywords
  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
  Note: selects the parser used for HTML documents. NekoHTML is the more capable
  of the two; a later article will cover Neko's HTML-to-text conversion and the
  parsing of HTML fragments in more detail.
  </description>
</property>

<property>
  <name>parser.html.form.use_action</name>
  <value>false</value>
  <description>If true, HTML parser will collect URLs from form action
  attributes. This may lead to undesirable behavior (submitting empty
  forms during next fetch cycle). If false, form action attribute will
  be ignored.</description>
</property>

<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <value></value>
  <description>Comma separated list of HTML tags, from which outlinks
  shouldn't be extracted. Nutch takes links from: a, area, form, frame,
  iframe, script, link, img. If you add any of those tags here, it
  won't be taken. Default is empty list. Probably reasonable value
  for most people would be "img,script,link".</description>
</property>

<property>
  <name>htmlparsefilter.order</name>
  <value></value>
  <description>The order by which HTMLParse filters are applied.
  If empty, all available HTMLParse filters (as dictated by properties
  plugin-includes and plugin-excludes above) are loaded and applied in system
  defined order. If not empty, only named filters are loaded and applied
  in given order.
  HTMLParse filter ordering MAY have an impact
  on end result, as some filters could rely on the metadata generated by a previous filter.
  </description>
</property>

<property>
  <name>parser.timeout</name>
  <value>30</value>
  <description>Timeout in seconds for the parsing of a document, otherwise treats it as an exception and
  moves on to the following documents.
This parameter is applied to any Parser implementation. 1292 Set to -1 to deactivate, bearing in mind that this could cause 1293 the parsing to crash because of a very long or corrupted document. 1294 </description> 1295 </property> 1296 1297 <property> 1298 <name>parse.filter.urls</name> 1299 <value>true</value> 1300 <description>Whether the parser will filter URLs (with the configured URL filters).</description> 1301 </property> 1302 1303 <property> 1304 <name>parse.normalize.urls</name> 1305 <value>true</value> 1306 <description>Whether the parser will normalize URLs (with the configured URL normalizers).</description> 1307 </property> 1308 1309 <property> 1310 <name>parser.skip.truncated</name> 1311 <value>true</value> 1312 <description>Boolean value for whether we should skip parsing for truncated documents. By default this 1313 property is activated due to extremely high levels of CPU which parsing can sometimes take. 1314 </description> 1315 </property> 1316 1317 <!-- 1318 <property> 1319 <name>tika.htmlmapper.classname</name> 1320 <value>org.apache.tika.parser.html.IdentityHtmlMapper</value> 1321 <description>Classname of Tika HTMLMapper to use. Influences the elements included in the DOM and hence 1322 the behaviour of the HTMLParseFilters. 1323 </description> 1324 </property> 1325 --> 1326 1327 <property> 1328 <name>tika.uppercase.element.names</name> 1329 <value>true</value> 1330 <description>Determines whether TikaParser should uppercase the element name while generating the DOM 1331 for a page, as done by Neko (used per default by parse-html)(see NUTCH-1592). 1332 </description> 1333 </property> 1334 1335 1336 <!-- urlfilter plugin properties --> 1337 1338 <property> 1339 <name>urlfilter.domain.file</name> 1340 <value>domain-urlfilter.txt</value> 1341 <description>Name of file on CLASSPATH containing either top level domains or 1342 hostnames used by urlfilter-domain (DomainURLFilter) plugin.</description> 1343 </property> 1344 1345 <property> 1346 <name>urlfilter.regex.file</name> 1347 <value>regex-urlfilter.txt</value> 1348 <description>Name of file on CLASSPATH containing regular expressions 1349 used by urlfilter-regex (RegexURLFilter) plugin.</description> 1350 </property> 1351 1352 <property> 1353 <name>urlfilter.automaton.file</name> 1354 <value>automaton-urlfilter.txt</value> 1355 <description>Name of file on CLASSPATH containing regular expressions 1356 used by urlfilter-automaton (AutomatonURLFilter) plugin.</description> 1357 </property> 1358 1359 <property> 1360 <name>urlfilter.prefix.file</name> 1361 <value>prefix-urlfilter.txt</value> 1362 <description>Name of file on CLASSPATH containing url prefixes 1363 used by urlfilter-prefix (PrefixURLFilter) plugin.</description> 1364 </property> 1365 1366 <property> 1367 <name>urlfilter.suffix.file</name> 1368 <value>suffix-urlfilter.txt</value> 1369 <description>Name of file on CLASSPATH containing url suffixes 1370 used by urlfilter-suffix (SuffixURLFilter) plugin.</description> 1371 </property> 1372 1373 <property> 1374 <name>urlfilter.order</name> 1375 <value></value> 1376 <description>The order by which url filters are applied. 1377 If empty, all available url filters (as dictated by properties 1378 plugin-includes and plugin-excludes above) are loaded and applied in system 1379 defined order. If not empty, only named filters are loaded and applied 1380 in given order. 
For example, if this property has value:
org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
then RegexURLFilter is applied first, and PrefixURLFilter second.
Since all filters are AND'ed, filter ordering does not have impact
on end result, but it may have performance implication, depending
on relative expensiveness of filters.
</description>
</property>

<!-- scoring filters properties -->

<property>
  <name>scoring.filter.order</name>
  <value></value>
  <description>The order in which scoring filters are applied. This
  may be left empty (in which case all available scoring filters will
  be applied in system defined order), or a space separated list of
  implementation classes.
  </description>
</property>

<!-- scoring-depth properties
 Add 'scoring-depth' to the list of active plugins
 in the parameter 'plugin.includes' in order to use it.
-->

<property>
  <name>scoring.depth.max</name>
  <value>1000</value>
  <description>Max depth value from seed allowed by default.
  Can be overridden on a per-seed basis by specifying "_maxdepth_=VALUE"
  as a seed metadata. This plugin adds a "_depth_" metadatum to the pages
  to track the distance from the seed it was found from.
  The depth is used to prioritise URLs in the generation step so that
  shallower pages are fetched first.
  </description>
</property>

<!-- language-identifier plugin properties -->

<property>
  <name>lang.analyze.max.length</name>
  <value>2048</value>
  <description>The maximum bytes of data to use to identify
  the language (0 means full content analysis).
  The larger this value, the better the analysis, but the slower it is.
  Note: language related; it also comes into play during tokenization.
  </description>
</property>

<property>
  <name>lang.extraction.policy</name>
  <value>detect,identify</value>
  <description>This determines when the plugin uses detection and
  statistical identification mechanisms. The order in which the
  detect and identify are written will determine the extraction
  policy. Default case (detect,identify) means the plugin will
  first try to extract language info from page headers and metadata,
  if this is not successful it will try using tika language
  identification. Possible values are:
    detect
    identify
    detect,identify
    identify,detect
  </description>
</property>

<property>
  <name>lang.identification.only.certain</name>
  <value>false</value>
  <description>If set to true with lang.extraction.policy containing identify,
  the language code returned by Tika will be assigned to the document ONLY
  if it is deemed certain by Tika.
  </description>
</property>

<!-- index-static plugin properties -->

<property>
  <name>index.static</name>
  <value></value>
  <description>
  Used by plugin index-static to add fields with static data at indexing time.
  You can specify a comma-separated list of fieldname:fieldcontent per Nutch job.
  Each fieldcontent can have multiple values separated by space, e.g.,
  field1:value1.1 value1.2 value1.3,field2:value2.1 value2.2 ...
  It can be useful when collections can't be created by URL patterns,
  like in subcollection, but on a job-basis.
1469 </description> 1470 </property> 1471 1472 <!-- index-metadata plugin properties --> 1473 1474 <property> 1475 <name>index.parse.md</name> 1476 <value>metatag.description,metatag.keywords</value> 1477 <description> 1478 Comma-separated list of keys to be taken from the parse metadata to generate fields. 1479 Can be used e.g. for 'description' or 'keywords' provided that these values are generated 1480 by a parser (see parse-metatags plugin) 1481 </description> 1482 </property> 1483 1484 <property> 1485 <name>index.content.md</name> 1486 <value></value> 1487 <description> 1488 Comma-separated list of keys to be taken from the content metadata to generate fields. 1489 </description> 1490 </property> 1491 1492 <property> 1493 <name>index.db.md</name> 1494 <value></value> 1495 <description> 1496 Comma-separated list of keys to be taken from the crawldb metadata to generate fields. 1497 Can be used to index values propagated from the seeds with the plugin urlmeta 1498 </description> 1499 </property> 1500 1501 <!-- index-geoip plugin properties --> 1502 <property> 1503 <name>index.geoip.usage</name> 1504 <value>insightsService</value> 1505 <description> 1506 A string representing the information source to be used for GeoIP information 1507 association. Either enter 'cityDatabase', 'connectionTypeDatabase', 1508 'domainDatabase', 'ispDatabase' or 'insightsService'. If you wish to use any one of the 1509 Database options, you should make one of GeoIP2-City.mmdb, GeoIP2-Connection-Type.mmdb, 1510 GeoIP2-Domain.mmdb or GeoIP2-ISP.mmdb files respectively available on the classpath and 1511 available at runtime. 1512 </description> 1513 </property> 1514 1515 <property> 1516 <name>index.geoip.userid</name> 1517 <value></value> 1518 <description> 1519 The userId associated with the GeoIP2 Precision Services account. 1520 </description> 1521 </property> 1522 1523 <property> 1524 <name>index.geoip.licensekey</name> 1525 <value></value> 1526 <description> 1527 The license key associated with the GeoIP2 Precision Services account. 1528 </description> 1529 </property> 1530 1531 <!-- parse-metatags plugin properties --> 1532 <property> 1533 <name>metatags.names</name> 1534 <value>description,keywords</value> 1535 <description> Names of the metatags to extract, separated by ','. 1536 Use '*' to extract all metatags. Prefixes the names with 'metatag.' 1537 in the parse-metadata. For instance to index description and keywords, 1538 you need to activate the plugin index-metadata and set the value of the 1539 parameter 'index.parse.md' to 'metatag.description,metatag.keywords'. 1540 </description> 1541 </property> 1542 1543 <!-- Temporary Hadoop 0.17.x workaround. --> 1544 1545 <property> 1546 <name>hadoop.job.history.user.location</name> 1547 <value>${hadoop.log.dir}/history/user</value> 1548 <description>Hadoop 0.17.x comes with a default setting to create 1549 user logs inside the output path of the job. This breaks some 1550 Hadoop classes, which expect the output to contain only 1551 part-XXXXX files. This setting changes the output to a 1552 subdirectory of the regular log directory. 

<!-- linkrank scoring properties -->

<property>
  <name>link.ignore.internal.host</name>
  <value>true</value>
  <description>Ignore outlinks to the same hostname.</description>
</property>

<property>
  <name>link.ignore.internal.domain</name>
  <value>true</value>
  <description>Ignore outlinks to the same domain.</description>
</property>

<property>
  <name>link.ignore.limit.page</name>
  <value>true</value>
  <description>Limit to only a single outlink to the same page.</description>
</property>

<property>
  <name>link.ignore.limit.domain</name>
  <value>true</value>
  <description>Limit to only a single outlink to the same domain.</description>
</property>

<property>
  <name>link.analyze.num.iterations</name>
  <value>10</value>
  <description>The number of LinkRank iterations to run.</description>
</property>

<property>
  <name>link.analyze.initial.score</name>
  <value>1.0f</value>
  <description>The initial score.</description>
</property>

<property>
  <name>link.analyze.damping.factor</name>
  <value>0.85f</value>
  <description>The damping factor.</description>
</property>

<property>
  <name>link.delete.gone</name>
  <value>false</value>
  <description>Whether to delete gone pages from the web graph.</description>
</property>

<property>
  <name>link.loops.depth</name>
  <value>2</value>
  <description>The depth for the loops algorithm.</description>
</property>

<property>
  <name>link.score.updater.clear.score</name>
  <value>0.0f</value>
  <description>The default score for URLs that are not in the web graph.</description>
</property>

<property>
  <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
  <value>false</value>
  <description>Hadoop >= 0.21 generates SUCCESS files in the output which can crash
  the readers. This should not be an issue once Nutch is ported to the new MapReduce API
  but for now this parameter should prevent such cases.
  </description>
</property>

<!-- solr index properties -->

<property>
  <name>solr.server.url</name>
  <value>http://127.0.0.1:8983/solr/</value>
  <description>
  Defines the Solr URL into which data should be indexed using the
  indexer-solr plugin.
  </description>
</property>

<property>
  <name>solr.mapping.file</name>
  <value>solrindex-mapping.xml</value>
  <description>
  Defines the name of the file that will be used in the mapping of internal
  nutch field names to solr index fields as specified in the target Solr schema.
  </description>
</property>

<property>
  <name>solr.commit.size</name>
  <value>250</value>
  <description>
  Defines the number of documents to send to Solr in a single update batch.
  Decrease when handling very large documents to prevent Nutch from running
  out of memory. NOTE: It does not explicitly trigger a server side commit.
  </description>
</property>

<property>
  <name>solr.commit.index</name>
  <value>true</value>
  <description>
  When closing the indexer, trigger a commit to the Solr server.
  </description>
</property>
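<!--
  Illustrative sketch (not part of nutch-default.xml): when indexing into a remote
  Solr instance, the Solr properties above are normally overridden in nutch-site.xml.
  The host and core name below are placeholders, not values from the file above.

  <property>
    <name>solr.server.url</name>
    <value>http://solr.example.com:8983/solr/nutch</value>
  </property>
  <property>
    <name>solr.commit.size</name>
    <value>100</value>
  </property>
  <property>
    <name>solr.commit.index</name>
    <value>true</value>
  </property>

  Lowering solr.commit.size mainly trades indexing throughput for memory: each batch
  sent to Solr is smaller, which helps when individual documents are very large.
-->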

<property>
  <name>solr.auth</name>
  <value>false</value>
  <description>
  Whether to enable HTTP basic authentication for communicating with Solr.
  Use the solr.auth.username and solr.auth.password properties to configure
  your credentials.
  </description>
</property>

<!-- Elasticsearch properties -->

<property>
  <name>elastic.host</name>
  <value></value>
  <description>The hostname to send documents to using TransportClient. Either host
  and port must be defined or cluster.</description>
</property>

<property>
  <name>elastic.port</name>
  <value>9300</value>
  <description>The port to connect to using TransportClient.</description>
</property>

<property>
  <name>elastic.cluster</name>
  <value></value>
  <description>The cluster name to discover. Either host and port must be defined
  or cluster.</description>
</property>

<property>
  <name>elastic.index</name>
  <value>nutch</value>
  <description>Default index to send documents to.</description>
</property>

<property>
  <name>elastic.max.bulk.docs</name>
  <value>250</value>
  <description>Maximum size of the bulk in number of documents.</description>
</property>

<property>
  <name>elastic.max.bulk.size</name>
  <value>2500500</value>
  <description>Maximum size of the bulk in bytes.</description>
</property>

<!-- subcollection properties -->

<property>
  <name>subcollection.default.fieldname</name>
  <value>subcollection</value>
  <description>
  The default field name for the subcollections.
  </description>
</property>

<!-- Headings plugin properties -->

<property>
  <name>headings</name>
  <value>h1,h2</value>
  <description>Comma separated list of headings to retrieve from the document</description>
</property>

<property>
  <name>headings.multivalued</name>
  <value>false</value>
  <description>Whether to support multivalued headings.</description>
</property>

<!-- mimetype-filter plugin properties -->

<property>
  <name>mimetype.filter.file</name>
  <value>mimetype-filter.txt</value>
  <description>
  The configuration file for the mimetype-filter plugin. This file contains
  the rules used to allow or deny the indexing of certain documents.
  </description>
</property>

</configuration>
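A side note on the Elasticsearch block near the end of the file: either elastic.host plus elastic.port or elastic.cluster has to be set, and the remaining properties can stay at their defaults. A rough nutch-site.xml sketch (the host name below is a placeholder, and the corresponding Elasticsearch indexer plugin also has to be enabled in plugin.includes):

<property>
  <name>elastic.host</name>
  <value>es1.example.com</value>
</property>
<property>
  <name>elastic.port</name>
  <value>9300</value>
</property>
<property>
  <name>elastic.index</name>
  <value>nutch</value>
</property>

If you prefer cluster discovery instead, leave elastic.host empty and set elastic.cluster to your cluster name.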
regex-urlfilter.txt explained
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# ('-' means exclude, '+' means include)
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
# This rule filters out URLs that contain the listed characters; with the
# default rule, URLs containing ?, *, !, @ or = can never be crawled, so it is
# recommended to change it to: -[~]
#-[?*!@=]
-[~]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
#+.

# Filter regular expressions: ([a-z0-9]*\.)* matches any sequence of letters and
# digits (subdomain labels), [\s\S]* matches any characters.
+^http://([a-z0-9]*\.)*bbs.superwu.cn/[\s\S]*

# crawl the data of the Discuz forum
+^http://bbs.superwu.cn/forum.php$
+^http://bbs.superwu.cn/forum.php\?mod=forumdisplay&fid=\d+$
+^http://bbs.superwu.cn/forum.php\?mod=forumdisplay&fid=\d+&page=\d+$
+^http://bbs.superwu.cn/forum.php\?mod=viewthread&tid=\d+&extra=page%3D\d+$
+^http://bbs.superwu.cn/forum.php\?mod=viewthread&tid=\d+&extra=page%3D\d+&page=\d+$
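The rules rely on first-match-wins semantics: each URL is tested against the patterns from top to bottom, the first '+' or '-' pattern that matches decides whether it is kept or dropped, and a URL that matches nothing is dropped as well. As a minimal sketch (the domain example.com is a placeholder, not part of the original file), a filter that restricts a crawl to a single site could look like this:

# skip non-web schemes
-^(file|ftp|mailto):
# skip binary and image suffixes
-\.(gif|jpg|png|zip|exe)$
# keep everything under the example host
+^http://([a-z0-9]*\.)*example\.com/
# anything not matched above is ignored

Note that in the Discuz rules the literal '?' of forum.php is escaped as \? and digits are matched with \d+; the unescaped forms (? and /d+) would not match the intended URLs.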
Author: SummerChill. Source: http://www.cnblogs.com/DreamDrive/ This blog contains my own notes as well as reposts of technical articles found on the web. If there are any mistakes in the text, please point them out, so that more people are not misled.