Nutch Development (Part 4)

Development Environment

  • Linux, Ubuntu 20.04 LTS
  • IDEA
  • Nutch 1.18
  • Solr 8.11

Please credit the source when reposting! By 鸭梨的药丸哥

1. Introduction to the Nutch Plugin Design

Nutch is highly extensible; its plugin system is modeled on the Eclipse 2.x plugin system.

Nutch exposes a number of extension points, each of which is an interface; a plugin is developed by implementing the interface of the extension point it targets. Nutch provides the following extension points, and implementing the corresponding interface is all we need to do to develop a Nutch plugin (a minimal sketch follows the list below):

  • IndexWriter – Writes crawled data to a specific indexing backend (Solr, Elasticsearch, a CSV file, etc.).
  • IndexingFilter – Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
  • Parser – Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.
  • HtmlParseFilter – Permits one to add additional metadata to HTML parses (from javadoc).
  • Protocol – Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.
  • URLFilter – URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls, however if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.
  • URLNormalizer – Interface used to convert URLs to normal form and optionally perform substitutions.
  • ScoringFilter – A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.
  • SegmentMergeFilter – Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing the page.
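
As a quick illustration of what implementing an extension point looks like, below is a minimal sketch of a custom URLFilter. The class name, package, and the "logout" rule are made up for this example; the filter(String) method plus setConf/getConf from Configurable are what the URLFilter interface requires.

package org.example.nutch.urlfilter; // hypothetical package, for illustration only

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

/**
 * Sketch of a URLFilter plugin: it drops every URL containing "logout".
 * Returning null tells Nutch not to fetch the URL; returning the URL keeps it.
 */
public class SkipLogoutUrlFilter implements URLFilter {

  private Configuration conf;

  @Override
  public String filter(String urlString) {
    // keep the URL unless it matches the (made-up) unwanted pattern
    return urlString.contains("logout") ? null : urlString;
  }

  @Override
  public void setConf(Configuration conf) {
    // Nutch passes the configuration in here, just as described for HtmlParser below
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

Such a plugin still needs its own build.xml, ivy.xml and plugin.xml (explained in the following sections) and has to be activated through the plugin.includes property in nutch-site.xml.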

2. Understanding the Plugin Directory Structure

The directory layout is the same for all Nutch plugins, so walking through the parse-html directory is enough here.

/src        # source directory
build.xml   # tells Ant how to compile this plugin (where the built jar goes, etc.)
ivy.xml     # the plugin's Ivy configuration (dependency management, the same role as Maven's pom.xml)
plugin.xml  # Nutch's description of the plugin (e.g. which extension points it implements and the names of the implementation classes)

3. build.xml

build.xml tells Ant how to compile the plugin.

<project name="parse-html" default="jar-core">

  <import file="../build-plugin.xml"/>

  <!-- Build compilation dependencies -->
  <target name="deps-jar">
      <!-- this plugin depends on another plugin at build time -->
    <ant target="jar" inheritall="false" dir="../lib-nekohtml"/>
  </target>

  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-nekohtml/*.jar" />
    </fileset>
  </path>

  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
      <!-- dependency plugins needed for the tests -->
    <ant target="deploy" inheritall="false" dir="../lib-nekohtml"/>
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
  </target>

</project>

4. ivy.xml

The equivalent of Maven's pom.xml: external dependencies can be declared and imported here.

<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="https://nutch.apache.org/"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../../ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>
	<!-- external dependencies are declared here -->
  <dependencies>
   <dependency org="org.ccil.cowan.tagsoup" name="tagsoup" rev="1.2.1"/>
  </dependencies>

</ivy-module>

5. plugin.xml

<!-- descriptive information about the plugin -->
<plugin
   id="parse-html"
   name="Html Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">
    
   <runtime>
      <library name="parse-html.jar">
         <export name="*"/>
      </library>
      <library name="tagsoup-1.2.1.jar"/>
   </runtime>
	
    <!-- plugins this plugin depends on -->
   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-nekohtml"/>
   </requires>
	
    <!-- the extension declaration: which extension point this plugin implements -->
   <extension id="org.apache.nutch.parse.html"
              name="HtmlParse"
              point="org.apache.nutch.parse.Parser">
      <!-- id is the unique identifier; class is the implementation class -->
      <implementation id="org.apache.nutch.parse.html.HtmlParser"
                      class="org.apache.nutch.parse.html.HtmlParser">
          <!-- parameters -->
        <parameter name="contentType" value="text/html|application/xhtml+xml"/>
        <parameter name="pathSuffix" value=""/>
      </implementation>

   </extension>

</plugin>
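
To see how the information in plugin.xml is used at runtime, here is a rough sketch of looking up the Parser extension point through Nutch's PluginRepository. The class ListParserExtensions and its main method are invented for illustration, and which extensions actually show up depends on the plugin.includes setting in nutch-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.plugin.Extension;
import org.apache.nutch.plugin.ExtensionPoint;
import org.apache.nutch.plugin.PluginRepository;
import org.apache.nutch.util.NutchConfiguration;

/** Illustrative sketch: list the Parser implementations registered via plugin.xml. */
public class ListParserExtensions {

  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // Parser.X_POINT_ID is "org.apache.nutch.parse.Parser", the point attribute in plugin.xml above
    ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(Parser.X_POINT_ID);
    for (Extension extension : point.getExtensions()) {
      // instantiate the class named in the implementation element, e.g. org.apache.nutch.parse.html.HtmlParser
      Parser parser = (Parser) extension.getExtensionInstance();
      System.out.println("Parser implementation: " + parser.getClass().getName());
    }
  }
}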

6. Understanding the parse-html Plugin

HtmlParser

HtmlParser implements the Parser extension point.

public class HtmlParser implements Parser

The Parser interface methods:

  • public ParseResult getParse(Content c) // parses the fetched content
  • public void setConf(Configuration configuration) // receives the configuration from nutch-default.xml / nutch-site.xml
  • public Configuration getConf()

setConf(Configuration conf)

The configuration values come from nutch-default.xml / nutch-site.xml: when Nutch invokes a plugin, it passes the configuration in by calling setConf(Configuration conf).

@Override
public void setConf(Configuration conf) {
    this.conf = conf;
    // create HtmlParseFilters, which collects the plugins implementing HtmlParseFilter
    // internally in a HtmlParseFilter[] htmlParseFilters array
    this.htmlParseFilters = new HtmlParseFilters(getConf());
	// name of the HTML parser implementation; defaults to "neko" (NekoHTML) if unset
    this.parserImpl = getConf().get("parser.html.impl", "neko");
    // default character encoding
    this.defaultCharEncoding = getConf().get(
        "parser.character.encoding.default", "windows-1252");
    // a DOM helper utility
    this.utils = new DOMContentUtils(conf);
    // caching policy
    this.cachingPolicy = getConf().get("parser.caching.forbidden.policy",
                                       Nutch.CACHING_FORBIDDEN_CONTENT);
}

Looking at nutch-default.xml, the parser.html.impl property is indeed defined there; even if it were not defined, NekoHTML would still be used to parse HTML pages by default.

  • The build.xml shown earlier builds the lib-nekohtml plugin, which provides NekoHTML.
  • The ivy.xml declares an Ivy dependency on tagsoup, which provides TagSoup; both libraries can parse HTML pages.
<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML Parser implementation. Currently the following keywords
  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
  </description>
</property>

parse(InputSource input)

Now look at the parse method:

private DocumentFragment parse(InputSource input) throws Exception {
    // use TagSoup to parse the HTML if configured, otherwise NekoHTML
    if ("tagsoup".equalsIgnoreCase(parserImpl))
    	return parseTagSoup(input);
    else
    	return parseNeko(input);
}

getParse(Content content)

Note: the line ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult, metaTags, root) runs every plugin that implements the HtmlParseFilter extension point. So when we need to extract data from additional HTML tags, we can do so by implementing the HtmlParseFilter extension point ourselves (a minimal sketch follows the getParse code below).

public ParseResult getParse(Content content) {
    // HTML meta tags
    HTMLMetaTags metaTags = new HTMLMetaTags();
    // get the base URL
    URL base;
    try {
      base = new URL(content.getBaseUrl());
    } catch (MalformedURLException e) {
      return new ParseStatus(e)
          .getEmptyParseResult(content.getUrl(), getConf());
    }
    // extracted text
    String text = "";
    // page title
    String title = "";
    // extracted outlinks
    Outlink[] outlinks = new Outlink[0];
    // metadata
    Metadata metadata = new Metadata();
    // the DOM tree produced by parsing the content
    DocumentFragment root;
    try {
      // wrap the raw content bytes in an InputSource
      byte[] contentInOctets = content.getContent();
      InputSource input = new InputSource(new ByteArrayInputStream(
          contentInOctets));
      // detect the character encoding
      EncodingDetector detector = new EncodingDetector(conf);
      detector.autoDetectClues(content, true);
      detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
      String encoding = detector.guessEncoding(content, defaultCharEncoding);

      metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
      metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);

      input.setEncoding(encoding);
      if (LOG.isTraceEnabled()) {
        LOG.trace("Parsing...");
      }
      root = parse(input);
    } catch (IOException e) {
      return new ParseStatus(e)
          .getEmptyParseResult(content.getUrl(), getConf());
    } catch (DOMException e) {
      return new ParseStatus(e)
          .getEmptyParseResult(content.getUrl(), getConf());
    } catch (SAXException e) {
      return new ParseStatus(e)
          .getEmptyParseResult(content.getUrl(), getConf());
    } catch (Exception e) {
      LOG.error("Error: ", e);
      return new ParseStatus(e)
          .getEmptyParseResult(content.getUrl(), getConf());
    }
    // get the meta directives (meta tags)
    HTMLMetaProcessor.getMetaTags(metaTags, root, base);
    // populate the Nutch metadata with the HTML meta directives
    metadata.addAll(metaTags.getGeneralTags());

    if (LOG.isTraceEnabled()) {
      LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
    }
    // check meta directives
    if (!metaTags.getNoIndex()) { // okay to index
      StringBuffer sb = new StringBuffer();
      if (LOG.isTraceEnabled()) {
        LOG.trace("Getting text...");
      }
      // extract the text content of the tags
      utils.getText(sb, root); // extract text
      text = sb.toString();
      sb.setLength(0);
      if (LOG.isTraceEnabled()) {
        LOG.trace("Getting title...");
      }
      // extract the text of the title tag
      utils.getTitle(sb, root); // extract title
      title = sb.toString().trim();
    }

    if (!metaTags.getNoFollow()) { // okay to follow links
      ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
      URL baseTag = base;
      String baseTagHref = utils.getBase(root);
      if (baseTagHref != null) {
        try {
          baseTag = new URL(base, baseTagHref);
        } catch (MalformedURLException e) {
          baseTag = base;
        }
      }
      if (LOG.isTraceEnabled()) {
        LOG.trace("Getting links...");
      }
      // extract the outlinks
      utils.getOutlinks(baseTag, l, root);
      outlinks = l.toArray(new Outlink[l.size()]);
      if (LOG.isTraceEnabled()) {
        LOG.trace("found " + outlinks.length + " outlinks in "
            + content.getUrl());
      }
    }
    // create the ParseStatus
    ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
    if (metaTags.getRefresh()) {
      status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
      status.setArgs(new String[] { metaTags.getRefreshHref().toString(),
          Integer.toString(metaTags.getRefreshTime()) });
    }
    // package the parsed data
    ParseData parseData = new ParseData(status, title, outlinks,
        content.getMetadata(), metadata);
    // the parse result
    ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
        new ParseImpl(text, parseData));
    
    // run the HtmlParseFilter plugins (e.g. parse-metatags) on the parse; which ones run is controlled by configuration
    ParseResult filteredParse = this.htmlParseFilters.filter(content,
        parseResult, metaTags, root);
    if (metaTags.getNoCache()) { // not okay to cache
      for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse)
        entry.getValue().getData().getParseMeta()
            .set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);
    }
    return filteredParse;
  }
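
As mentioned above, extracting extra data from HTML tags is done by implementing the HtmlParseFilter extension point. Below is a minimal sketch of such a filter: the class name, the "linkcount" metadata key, and the idea of counting <a> elements are invented for the example, while the filter signature and the ParseResult/Metadata calls are the same ones used by MetaTagsParser in the next section.

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Sketch of a custom HtmlParseFilter that records how many <a> tags a page contains. */
public class LinkCountParseFilter implements HtmlParseFilter {

  private Configuration conf;

  @Override
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    // the Parse produced by parse-html for this URL
    Parse parse = parseResult.get(content.getUrl());
    Metadata parseMeta = parse.getData().getParseMeta();
    // walk the DOM handed over by parse-html and store the link count in the parse metadata
    parseMeta.add("linkcount", Integer.toString(countLinks(doc)));
    return parseResult;
  }

  // recursively count <a> elements in the DOM tree
  private int countLinks(Node node) {
    int count = "a".equalsIgnoreCase(node.getNodeName()) ? 1 : 0;
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      count += countLinks(children.item(i));
    }
    return count;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}

To be picked up by the filter loop in getParse, such a plugin would also need a plugin.xml declaring an extension for the point org.apache.nutch.parse.HtmlParseFilter, analogous to the one shown in section 5.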

7. Understanding the parse-metatags Plugin

MetaTagsParser

MetaTagsParser implements the HtmlParseFilter extension point.

public class MetaTagsParser implements HtmlParseFilter

The filter method
public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
	// get the Parse for this URL
    Parse parse = parseResult.get(content.getUrl());
    // get the parse metadata
    Metadata metadata = parse.getData().getParseMeta();
    /*
     * NUTCH-1559: do not extract meta values from ParseData's metadata to avoid
     * duplicate metatag values
     */
	// the (key, value) metadata from the general meta tags
    Metadata generalMetaTags = metaTags.getGeneralTags();
    for (String tagName : generalMetaTags.names()) {
        // added to the parse metadata according to the configuration
      addIndexedMetatags(metadata, tagName, generalMetaTags.getValues(tagName));
    }

    Properties httpequiv = metaTags.getHttpEquivTags();
    for (Enumeration<?> tagNames = httpequiv.propertyNames(); tagNames
        .hasMoreElements();) {
      String name = (String) tagNames.nextElement();
      String value = httpequiv.getProperty(name);
        // likewise added to the parse metadata
      addIndexedMetatags(metadata, name, value);
    }

    return parseResult;
  }

The addIndexedMetatags method

Looking at this method makes it clear why, when the metadata plugins are used together with index-metadata, the field names to be indexed have to carry the metatag. prefix in the configuration.

  private void addIndexedMetatags(Metadata metadata, String metatag,
      String value) {
    String lcMetatag = metatag.toLowerCase(Locale.ROOT);
    if (metatagset.contains("*") || metatagset.contains(lcMetatag)) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Found meta tag: {}\t{}", lcMetatag, value);
      }
      metadata.add("metatag." + lcMetatag, value);
    }
  }

Configuration of the metadata plugins

Compare the configuration below with addIndexedMetatags and it becomes clear why the values of the plugin's index.parse.md property need the metatag. prefix.

<property>
<name>metatags.names</name>
<value>description,keywords</value>
<description> Names of the metatags to extract, separated by ','.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
</description>
</property>
 
 <property>
  <name>index.parse.md</name>
     <!-- the metadata produced by addIndexedMetatags carries the metatag. prefix -->
  <value>metatag.description,metatag.keywords</value>
  <description>
  Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
  </description>
</property>