


  • Linux, Ubuntu 20.04 LTS
  • IDEA
  • Nutch 1.18
  • Solr 8.11

Please credit the source when reposting. By 鸭梨的药丸哥




  • IndexWriter – Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
  • IndexingFilter – Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
  • Parser – Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.
  • HtmlParseFilter – Permits one to add additional metadata to HTML parses (from javadoc).
  • Protocol – Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.
  • URLFilter – URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over which URLs Nutch crawls; however, if you have very complicated rules about which URLs you want to crawl, you can write your own implementation.
  • URLNormalizer – Interface used to convert URLs to normal form and optionally perform substitutions.
  • ScoringFilter – A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.
  • SegmentMergeFilter – Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular, it allows filtering based on metadata collected while parsing a page.
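
To make the idea of an extension point concrete, here is a minimal sketch of how a chain of URL filters could behave. `UrlFilterSketch` is a hypothetical stand-in written for this example, not the real `org.apache.nutch.net.URLFilter` interface; it only assumes the same convention of returning the URL to keep it and `null` to reject it. The two rules (host suffix, no query string) are made up for illustration.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a URL filter extension point:
// return the URL to keep it, or null to reject it.
interface UrlFilterSketch {
    String filter(String url);
}

class UrlFilterChainDemo {
    // Keeps only URLs whose host ends with the given suffix (illustrative rule).
    static UrlFilterSketch hostSuffix(String suffix) {
        return url -> {
            try {
                String host = URI.create(url).getHost();
                return host != null && host.endsWith(suffix) ? url : null;
            } catch (IllegalArgumentException e) {
                return null; // unparseable URLs are rejected
            }
        };
    }

    // Rejects URLs that carry a query string (illustrative rule).
    static UrlFilterSketch noQuery() {
        return url -> url.contains("?") ? null : url;
    }

    // Runs the filters in order and stops at the first rejection.
    static String applyAll(List<UrlFilterSketch> filters, String url) {
        String current = url;
        for (UrlFilterSketch f : filters) {
            if (current == null) {
                return null;
            }
            current = f.filter(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<UrlFilterSketch> chain =
            Arrays.asList(hostSuffix("example.org"), noQuery());
        System.out.println(applyAll(chain, "https://www.example.org/docs"));
        System.out.println(applyAll(chain, "https://www.example.org/search?q=1"));
        System.out.println(applyAll(chain, "https://other.test/"));
    }
}
```

Only the first URL survives both filters; the other two are rejected and come back as `null`.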



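/src        #source directory
build.xml   #tells Ant how to build this plugin (for example, where the compiled jar goes)
ivy.xml     #the plugin's Ivy configuration (dependency management, the counterpart of Maven's pom.xml)
plugin.xml  #describes the plugin to Nutch (for example, which extension points it implements and the names of the implementing classes)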

3. build.xml


<project name="parse-html" default="jar-core">

  <import file="../build-plugin.xml"/>

  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-nekohtml"/>
  </target>

  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-nekohtml/*.jar" />
    </fileset>
  </path>

  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <ant target="deploy" inheritall="false" dir="../lib-nekohtml"/>
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
  </target>

</project>


4. ivy.xml


<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url=""/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../../ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <dependencies>
    <dependency org="org.ccil.cowan.tagsoup" name="tagsoup" rev="1.2.1"/>
  </dependencies>
</ivy-module>


5. plugin.xml

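<plugin id="parse-html" name="Html Parse Plug-in" version="1.0.0" provider-name="nutch.org">
   <runtime>
      <library name="parse-html.jar">
         <export name="*"/>
      </library>
      <library name="tagsoup-1.2.1.jar"/>
   </runtime>
   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-nekohtml"/>
   </requires>
   <extension id="org.apache.nutch.parse.html" name="HtmlParse"
              point="org.apache.nutch.parse.Parser">
      <implementation id="org.apache.nutch.parse.html.HtmlParser"
                      class="org.apache.nutch.parse.html.HtmlParser">
        <parameter name="contentType" value="text/html|application/xhtml+xml"/>
        <parameter name="pathSuffix" value=""/>
      </implementation>
   </extension>
</plugin>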



6. Walking through the parse-html plugin



public class HtmlParser implements Parser


  • public ParseResult getParse(Content c) // parses the fetched content
  • public void setConf(Configuration configuration) // receives the settings loaded from the Nutch configuration files
  • public Configuration getConf()
setConf(Configuration conf)

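This method reads the plugin's settings from the Nutch configuration (nutch-default.xml, overridden by nutch-site.xml): when Nutch invokes a plugin, it passes the configuration in through setConf(Configuration conf).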

public void setConf(Configuration conf) {
    this.conf = conf;
    // HtmlParseFilters keeps the loaded filter plugins in an array, HtmlParseFilter[] htmlParseFilters
    this.htmlParseFilters = new HtmlParseFilters(getConf());
    this.parserImpl = getConf().get("parser.html.impl", "neko");
    this.defaultCharEncoding = getConf().get(
        "parser.character.encoding.default", "windows-1252");
    this.utils = new DOMContentUtils(conf);
    this.cachingPolicy = getConf().get("parser.caching.forbidden.policy",
        Nutch.CACHING_FORBIDDEN_CONTENT);
}
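
The pattern in setConf (cache each setting once, falling back to a default when the key is absent) can be tried out without Nutch or Hadoop on the classpath. The sketch below is only an approximation: `java.util.Properties` stands in for Hadoop's `Configuration`, and `ConfigurableParserSketch` is a hypothetical class invented for this example. The two keys and their defaults are the ones read in the listing above.

```java
import java.util.Properties;

// Hypothetical stand-in for a configurable plugin; java.util.Properties
// plays the role of Hadoop's Configuration in this sketch.
class ConfigurableParserSketch {
    private Properties conf;
    private String parserImpl;
    private String defaultCharEncoding;

    // Called once by the host with the loaded settings; caches what the plugin needs.
    void setConf(Properties conf) {
        this.conf = conf;
        // Fall back to the default value when a key is absent.
        this.parserImpl = conf.getProperty("parser.html.impl", "neko");
        this.defaultCharEncoding =
            conf.getProperty("parser.character.encoding.default", "windows-1252");
    }

    Properties getConf() {
        return conf;
    }

    String parserImpl() {
        return parserImpl;
    }

    String defaultCharEncoding() {
        return defaultCharEncoding;
    }

    public static void main(String[] args) {
        Properties site = new Properties();
        site.setProperty("parser.html.impl", "tagsoup"); // override one key only

        ConfigurableParserSketch parser = new ConfigurableParserSketch();
        parser.setConf(site);
        System.out.println(parser.parserImpl());          // overridden value
        System.out.println(parser.defaultCharEncoding()); // default value
    }
}
```

Reading the values once in setConf, rather than on every parse, keeps the per-document code path free of configuration lookups.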


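  • The build.xml shown earlier pulls in the lib-nekohtml plugin, which provides NekoHTML.
  • ivy.xml declares an Ivy dependency on tagsoup, which provides TagSoup. Both libraries can parse HTML pages.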
  <description>HTML Parser implementation. Currently the following keywords
  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
  </description>
parse(InputSource input)


private DocumentFragment parse(InputSource input) throws Exception {
    if ("tagsoup".equalsIgnoreCase(parserImpl))
        return parseTagSoup(input);
    else
        return parseNeko(input);
}
getParse(Content content)

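Note: the call ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult, metaTags, root); runs every plugin that implements the HtmlParseFilter extension point. So when we need data from additional tags in the HTML, we can implement the HtmlParseFilter extension point and write our own parsing for those tags.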

public ParseResult getParse(Content content) {
    // HTML meta tags
    HTMLMetaTags metaTags = new HTMLMetaTags();
    URL base;
    try {
      base = new URL(content.getBaseUrl());
    } catch (MalformedURLException e) {
      return new ParseStatus(e)
          .getEmptyParseResult(content.getUrl(), getConf());
    }
    String text = "";
    String title = "";
    Outlink[] outlinks = new Outlink[0];
    Metadata metadata = new Metadata();
    // parse the content
    DocumentFragment root;
    try {
      byte[] contentInOctets = content.getContent();
      InputSource input = new InputSource(new ByteArrayInputStream(
          contentInOctets));
      EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(content, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(content, defaultCharEncoding);

      metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
      metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

      input.setEncoding(encoding);
      if (LOG.isTraceEnabled()) {
        LOG.trace("Parsing...");
      }
      root = parse(input);
    } catch (IOException e) {
      return new ParseStatus(e)
          .getEmptyParseResult(content.getUrl(), getConf());
    } catch (DOMException e) {
      return new ParseStatus(e)
          .getEmptyParseResult(content.getUrl(), getConf());
    } catch (SAXException e) {
      return new ParseStatus(e)
          .getEmptyParseResult(content.getUrl(), getConf());
    } catch (Exception e) {
      LOG.error("Error: ", e);
      return new ParseStatus(e)
          .getEmptyParseResult(content.getUrl(), getConf());
    }
    // get meta directives
    HTMLMetaProcessor.getMetaTags(metaTags, root, base);
    // populate Nutch metadata with HTML meta directives
    metadata.addAll(metaTags.getGeneralTags());

    if (LOG.isTraceEnabled()) {
      LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
    }
    // check meta directives
    if (!metaTags.getNoIndex()) { // okay to index
      StringBuffer sb = new StringBuffer();
      if (LOG.isTraceEnabled()) {
        LOG.trace("Getting text...");
      }
      utils.getText(sb, root); // extract text
      text = sb.toString();
      sb.setLength(0); // reuse the buffer for the title
      if (LOG.isTraceEnabled()) {
        LOG.trace("Getting title...");
      }
      utils.getTitle(sb, root); // extract title
      title = sb.toString().trim();
    }

    if (!metaTags.getNoFollow()) { // okay to follow links
      ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
      URL baseTag = base;
      String baseTagHref = utils.getBase(root);
      if (baseTagHref != null) {
        try {
          baseTag = new URL(base, baseTagHref);
        } catch (MalformedURLException e) {
          baseTag = base;
        }
      }
      if (LOG.isTraceEnabled()) {
        LOG.trace("Getting links...");
      }
      utils.getOutlinks(baseTag, l, root);
      outlinks = l.toArray(new Outlink[l.size()]);
      if (LOG.isTraceEnabled()) {
        LOG.trace("found " + outlinks.length + " outlinks in "
            + content.getUrl());
      }
    }
    ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
    if (metaTags.getRefresh()) {
      status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
      status.setArgs(new String[] { metaTags.getRefreshHref().toString(),
          Integer.toString(metaTags.getRefreshTime()) });
    }
    ParseData parseData = new ParseData(status, title, outlinks,
        content.getMetadata(), metadata);
    ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
        new ParseImpl(text, parseData));
    // run filters on parse
    ParseResult filteredParse = this.htmlParseFilters.filter(content,
        parseResult, metaTags, root);
    if (metaTags.getNoCache()) { // not okay to cache
      for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse)
        entry.getValue().getData().getParseMeta()
            .set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);
    }
    return filteredParse;
}
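
The note about HtmlParseFilter comes down to this: a filter is handed the parsed DOM as a DocumentFragment and may walk it to pull out whatever extra tags it cares about. The sketch below shows only that DOM walk, using the JDK's own XML parser so it runs without Nutch. `HeadingCollectorSketch`, `collect` and `fragmentOf` are hypothetical names made up for this example, and the input is assumed to be well-formed XHTML (real pages go through NekoHTML or TagSoup instead).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class HeadingCollectorSketch {
    // Depth-first walk collecting the trimmed text of every element with the given tag name.
    static void collect(Node node, String tag, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && tag.equalsIgnoreCase(node.getNodeName())) {
            out.add(node.getTextContent().trim());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collect(c, tag, out);
        }
    }

    // Parses well-formed XHTML with the JDK parser and wraps the root element
    // in a DocumentFragment, the node type an HTML parse filter receives.
    static DocumentFragment fragmentOf(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            DocumentFragment fragment = doc.createDocumentFragment();
            fragment.appendChild(doc.getDocumentElement());
            return fragment;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page =
            "<html><body><h1> First </h1><p>text</p><h1>Second</h1></body></html>";
        List<String> headings = new ArrayList<>();
        collect(fragmentOf(page), "h1", headings);
        System.out.println(headings);
    }
}
```

In a real plugin the collected values would be written into the parse metadata inside filter(...), in the same way MetaTagsParser does.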




public class MetaTagsParser implements HtmlParseFilter

public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    Parse parse = parseResult.get(content.getUrl());
    Metadata metadata = parse.getData().getParseMeta();

    /*
     * NUTCH-1559: do not extract meta values from ParseData's metadata to avoid
     * duplicate metatag values
     */
    Metadata generalMetaTags = metaTags.getGeneralTags();
    for (String tagName : generalMetaTags.names()) {
      addIndexedMetatags(metadata, tagName, generalMetaTags.getValues(tagName));
    }

    Properties httpequiv = metaTags.getHttpEquivTags();
    for (Enumeration<?> tagNames = httpequiv.propertyNames(); tagNames
        .hasMoreElements();) {
      String name = (String) tagNames.nextElement();
      String value = httpequiv.getProperty(name);
      addIndexedMetatags(metadata, name, value);
    }

    return parseResult;
}

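Look at the method below and you will see why, when the parse-metatags plugin is used together with index-metadata, the field names configured for indexing must carry the metatag. prefix.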

  private void addIndexedMetatags(Metadata metadata, String metatag,
      String value) {
    String lcMetatag = metatag.toLowerCase(Locale.ROOT);
    if (metatagset.contains("*") || metatagset.contains(lcMetatag)) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Found meta tag: {}\t{}", lcMetatag, value);
      }
      metadata.add("metatag." + lcMetatag, value);
    }
  }
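
The naming rule in addIndexedMetatags is easy to check in isolation. Below is a minimal sketch, assuming a plain `Map<String, List<String>>` in place of Nutch's `Metadata` and a plain `Set<String>` in place of the plugin's `metatagset`; `MetatagPrefixSketch` is a hypothetical class written for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class MetatagPrefixSketch {
    // Stores the value under "metatag.<lowercased name>" when the name is
    // selected, either explicitly or through the "*" wildcard.
    static void addIndexedMetatag(Map<String, List<String>> metadata,
            Set<String> selected, String name, String value) {
        String lc = name.toLowerCase(Locale.ROOT);
        if (selected.contains("*") || selected.contains(lc)) {
            metadata.computeIfAbsent("metatag." + lc, k -> new ArrayList<>()).add(value);
        }
    }

    public static void main(String[] args) {
        Set<String> selected = new HashSet<>(Arrays.asList("description", "keywords"));
        Map<String, List<String>> metadata = new TreeMap<>();

        addIndexedMetatag(metadata, selected, "Description", "A demo page");
        addIndexedMetatag(metadata, selected, "KEYWORDS", "nutch");
        addIndexedMetatag(metadata, selected, "author", "someone"); // not selected, dropped

        System.out.println(metadata.keySet());
    }
}
```

Whatever the casing in the page, the stored key is always lowercase and always starts with `metatag.`, so that is the name any later indexing step has to ask for.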
Configuring the metadata plugins


<description> Names of the metatags to extract, separated by ','.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter '' to 'metatag.description,metatag.keywords'.
</description>

<description>
  Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
</description>
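
To see why the prefix matters on the indexing side, here is a small sketch of a comma-separated key list being matched against the parse metadata. It is not the index-metadata plugin's code: `IndexFieldSelectionSketch` is a hypothetical class, the parse metadata is a plain map, and the sketch assumes that an index field keeps the same name as the metadata key it was copied from.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IndexFieldSelectionSketch {
    // Splits a comma-separated key list, trimming blanks and skipping empty items.
    static List<String> parseKeys(String csv) {
        List<String> keys = new ArrayList<>();
        for (String part : csv.split(",")) {
            String key = part.trim();
            if (!key.isEmpty()) {
                keys.add(key);
            }
        }
        return keys;
    }

    // Copies only the configured keys from the parse metadata into index fields.
    // Assumption: the index field keeps the same name as the metadata key.
    static Map<String, String> selectFields(Map<String, String> parseMeta, String csv) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String key : parseKeys(csv)) {
            String value = parseMeta.get(key);
            if (value != null) {
                fields.put(key, value);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        parseMeta.put("metatag.description", "A demo page");
        parseMeta.put("metatag.keywords", "nutch,plugin");

        // Keys written with the prefix match what the parser stored.
        System.out.println(selectFields(parseMeta, "metatag.description, metatag.keywords"));
        // Keys written without the prefix match nothing.
        System.out.println(selectFields(parseMeta, "description,keywords"));
    }
}
```

The second call returns an empty map, which is the symptom you get in practice when the prefix is left out of the configured key list.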
