Nutch Development (Part 4)
Development environment
- Linux, Ubuntu 20.04 LTS
- IDEA
- Nutch 1.18
- Solr 8.11
Please credit the source when reposting. By 鸭梨的药丸哥
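1. Introduction to the Nutch plugin design
Nutch is highly extensible. Its plugin system is based on the Eclipse 2.x plugin system.
Nutch exposes several extension points. Each extension point is an interface, and a plugin is written by implementing that interface. Nutch provides the following extension points: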
- IndexWriter – Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
- IndexingFilter – Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).
- Parser – Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.
- HtmlParseFilter – Permits one to add additional metadata to HTML parses (from javadoc).
- Protocol – Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.
- URLFilter – URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls, however if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.
- URLNormalizer – Interface used to convert URLs to normal form and optionally perform substitutions.
- ScoringFilter – A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.
- SegmentMergeFilter – Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing page.
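To make "an extension point is an interface" concrete, here is a minimal sketch in plain Java. It does not use the Nutch jars. `UrlFilterPoint` is a hypothetical stand-in that only mirrors the shape of the URLFilter contract (return the URL to keep it, return null to drop it). The two implementations and `runChain` are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ExtensionPointSketch {

    /** Hypothetical stand-in for an extension point: it is just an interface. */
    interface UrlFilterPoint {
        /** Returns the URL to keep it, or null to reject it. */
        String filter(String url);
    }

    /** A "plugin": one implementation of the extension point. */
    static class PdfSkippingFilter implements UrlFilterPoint {
        @Override
        public String filter(String url) {
            return url.toLowerCase(Locale.ROOT).endsWith(".pdf") ? null : url;
        }
    }

    /** A second implementation that keeps only http/https URLs. */
    static class HttpOnlyFilter implements UrlFilterPoint {
        @Override
        public String filter(String url) {
            boolean web = url.startsWith("http://") || url.startsWith("https://");
            return web ? url : null;
        }
    }

    /** Runs every implementation in order and stops once one rejects the URL. */
    static String runChain(List<UrlFilterPoint> filters, String url) {
        String current = url;
        for (UrlFilterPoint f : filters) {
            current = f.filter(current);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<UrlFilterPoint> filters =
            Arrays.asList(new HttpOnlyFilter(), new PdfSkippingFilter());
        System.out.println(runChain(filters, "https://example.org/index.html"));
        System.out.println(runChain(filters, "https://example.org/paper.pdf"));
        System.out.println(runChain(filters, "ftp://example.org/file.txt"));
    }
}
```

The host code (`runChain` here) only knows the interface. Which implementations take part is decided elsewhere, which in Nutch is the job of the plugin descriptors and configuration.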
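2. Plugin directory structure
Nutch plugins all share a similar directory layout, so looking at parse-html is enough.
/src # source directory
build.xml # tells Ant how to build this plugin (for example, where the compiled jar goes)
ivy.xml # the plugin's Ivy configuration (dependency management, the counterpart of Maven's pom.xml)
plugin.xml # describes the plugin to Nutch (for example, which extension points it implements and the names of the implementing classes)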
3. build.xml
build.xml tells Ant how to build this plugin.
<project name="parse-html" default="jar-core">
  <import file="../build-plugin.xml"/>
  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <!-- building this plugin depends on another plugin -->
    <ant target="jar" inheritall="false" dir="../lib-nekohtml"/>
  </target>
  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-nekohtml/*.jar" />
    </fileset>
  </path>
  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <!-- plugins the unit tests depend on -->
    <ant target="deploy" inheritall="false" dir="../lib-nekohtml"/>
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
  </target>
</project>
4. ivy.xml
ivy.xml plays the same role as Maven's pom.xml: external dependencies are declared here.
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="https://nutch.apache.org/"/>
    <description>
      Apache Nutch
    </description>
  </info>
  <configurations>
    <include file="../../../ivy/ivy-configurations.xml"/>
  </configurations>
  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>
  <!-- add external dependencies here -->
  <dependencies>
    <dependency org="org.ccil.cowan.tagsoup" name="tagsoup" rev="1.2.1"/>
  </dependencies>
</ivy-module>
5. plugin.xml
<!-- plugin description -->
<plugin
   id="parse-html"
   name="Html Parse Plug-in"
   version="1.0.0"
   provider-name="nutch.org">
  <runtime>
    <library name="parse-html.jar">
      <export name="*"/>
    </library>
    <library name="tagsoup-1.2.1.jar"/>
  </runtime>
  <!-- plugins this plugin imports -->
  <requires>
    <import plugin="nutch-extensionpoints"/>
    <import plugin="lib-nekohtml"/>
  </requires>
  <!-- description of the extension -->
  <extension id="org.apache.nutch.parse.html"
             name="HtmlParse"
             point="org.apache.nutch.parse.Parser">
    <!-- id is the unique identifier; class is the implementing class -->
    <implementation id="org.apache.nutch.parse.html.HtmlParser"
                    class="org.apache.nutch.parse.html.HtmlParser">
      <!-- parameters -->
      <parameter name="contentType" value="text/html|application/xhtml+xml"/>
      <parameter name="pathSuffix" value=""/>
    </implementation>
  </extension>
</plugin>
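6. Walking through the parse-html plugin
HtmlParser
HtmlParser implements the Parser extension point:
public class HtmlParser implements Parser
Methods of the Parser interface:
- public ParseResult getParse(Content c) // parses the fetched content
- public void setConf(Configuration configuration) // receives the Nutch configuration (nutch-default.xml / nutch-site.xml)
- public Configuration getConf()
setConf(Configuration conf)
This method reads settings from the Nutch configuration files (nutch-default.xml, overridden by nutch-site.xml). Nutch passes the configuration into the plugin through setConf(Configuration conf) before using it.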
@Override
public void setConf(Configuration conf) {
  this.conf = conf;
  // create HtmlParseFilters, which holds the active HtmlParseFilter plugins
  // in an array (HtmlParseFilter[] htmlParseFilters)
  this.htmlParseFilters = new HtmlParseFilters(getConf());
  // name of the parser implementation; falls back to "neko" (NekoHTML) when unset
  this.parserImpl = getConf().get("parser.html.impl", "neko");
  // default character encoding
  this.defaultCharEncoding = getConf().get(
      "parser.character.encoding.default", "windows-1252");
  // DOM helper
  this.utils = new DOMContentUtils(conf);
  // caching policy
  this.cachingPolicy = getConf().get("parser.caching.forbidden.policy",
      Nutch.CACHING_FORBIDDEN_CONTENT);
}
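nutch-default.xml does define the parser.html.impl property. Even if it were not defined there, the default passed to get() means NekoHTML would still be used to parse HTML pages.
- The build.xml shown earlier pulls in the lib-nekohtml plugin, which is NekoHTML.
- ivy.xml declares the tagsoup Ivy dependency, which is TagSoup. Both libraries can parse HTML pages.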
<property>
  <name>parser.html.impl</name>
  <value>neko</value>
  <description>HTML Parser implementation. Currently the following keywords
  are recognized: "neko" uses NekoHTML, "tagsoup" uses TagSoup.
  </description>
</property>
parse(InputSource input)
Now look at the parse method:
private DocumentFragment parse(InputSource input) throws Exception {
  // use TagSoup only if it was configured explicitly; otherwise use NekoHTML
  if ("tagsoup".equalsIgnoreCase(parserImpl))
    return parseTagSoup(input);
  else
    return parseNeko(input);
}
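The interplay between the default in setConf and the branch in parse can be tried out in isolation. The sketch below is an assumption-laden stand-in: it uses java.util.Properties in place of Hadoop's Configuration, and the helper names `readParserImpl` and `chooseParser` are hypothetical. Only the key `parser.html.impl`, the default `neko` and the `equalsIgnoreCase` comparison are taken from the code above.

```java
import java.util.Properties;

public class ParserSelectionSketch {

    /** Mirrors the lookup in setConf: a missing key yields the default "neko". */
    static String readParserImpl(Properties conf) {
        return conf.getProperty("parser.html.impl", "neko");
    }

    /** Mirrors the branch in parse(InputSource): only "tagsoup" selects TagSoup. */
    static String chooseParser(String parserImpl) {
        return "tagsoup".equalsIgnoreCase(parserImpl) ? "TagSoup" : "NekoHTML";
    }

    public static void main(String[] args) {
        Properties unset = new Properties();

        Properties tagsoup = new Properties();
        tagsoup.setProperty("parser.html.impl", "TagSoup"); // case does not matter

        Properties unknown = new Properties();
        unknown.setProperty("parser.html.impl", "jsoup");   // not a recognized keyword

        System.out.println(chooseParser(readParserImpl(unset)));   // NekoHTML
        System.out.println(chooseParser(readParserImpl(tagsoup))); // TagSoup
        System.out.println(chooseParser(readParserImpl(unknown))); // NekoHTML
    }
}
```

One consequence of the `else` branch is worth noting: any value other than "tagsoup", including a typo, silently selects NekoHTML instead of raising an error.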
getParse(Content content)
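Note: the statement ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult, metaTags, root); runs every plugin that implements the HtmlParseFilter extension point. So when we need to extract data from additional tags in the HTML, we can implement HtmlParseFilter and write our own parsing for those tags.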
public ParseResult getParse(Content content) {
  // HTML meta tags
  HTMLMetaTags metaTags = new HTMLMetaTags();
  // base URL of the page
  URL base;
  try {
    base = new URL(content.getBaseUrl());
  } catch (MalformedURLException e) {
    return new ParseStatus(e)
        .getEmptyParseResult(content.getUrl(), getConf());
  }
  // extracted plain text
  String text = "";
  // page title
  String title = "";
  // extracted outlinks
  Outlink[] outlinks = new Outlink[0];
  // parse metadata
  Metadata metadata = new Metadata();
  // DOM tree produced by the parser
  // parse the content
  DocumentFragment root;
  try {
    // wrap the raw content bytes in a stream
    byte[] contentInOctets = content.getContent();
    InputSource input = new InputSource(new ByteArrayInputStream(
        contentInOctets));
    // detect the character encoding
    EncodingDetector detector = new EncodingDetector(conf);
    detector.autoDetectClues(content, true);
    detector.addClue(sniffCharacterEncoding(contentInOctets), "sniffed");
    String encoding = detector.guessEncoding(content, defaultCharEncoding);
    metadata.set(Metadata.ORIGINAL_CHAR_ENCODING, encoding);
    metadata.set(Metadata.CHAR_ENCODING_FOR_CONVERSION, encoding);
    input.setEncoding(encoding);
    if (LOG.isTraceEnabled()) {
      LOG.trace("Parsing...");
    }
    root = parse(input);
  } catch (IOException e) {
    return new ParseStatus(e)
        .getEmptyParseResult(content.getUrl(), getConf());
  } catch (DOMException e) {
    return new ParseStatus(e)
        .getEmptyParseResult(content.getUrl(), getConf());
  } catch (SAXException e) {
    return new ParseStatus(e)
        .getEmptyParseResult(content.getUrl(), getConf());
  } catch (Exception e) {
    LOG.error("Error: ", e);
    return new ParseStatus(e)
        .getEmptyParseResult(content.getUrl(), getConf());
  }
  // extract the meta tags
  // get meta directives
  HTMLMetaProcessor.getMetaTags(metaTags, root, base);
  // copy the meta tag values into metadata
  // populate Nutch metadata with HTML meta directives
  metadata.addAll(metaTags.getGeneralTags());
  if (LOG.isTraceEnabled()) {
    LOG.trace("Meta tags for " + base + ": " + metaTags.toString());
  }
  // check meta directives
  if (!metaTags.getNoIndex()) { // okay to index
    StringBuffer sb = new StringBuffer();
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting text...");
    }
    // extract the text content of the tags
    utils.getText(sb, root); // extract text
    text = sb.toString();
    sb.setLength(0);
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting title...");
    }
    // extract the text of the title tag
    utils.getTitle(sb, root); // extract title
    title = sb.toString().trim();
  }
  if (!metaTags.getNoFollow()) { // okay to follow links
    ArrayList<Outlink> l = new ArrayList<Outlink>(); // extract outlinks
    URL baseTag = base;
    String baseTagHref = utils.getBase(root);
    if (baseTagHref != null) {
      try {
        baseTag = new URL(base, baseTagHref);
      } catch (MalformedURLException e) {
        baseTag = base;
      }
    }
    if (LOG.isTraceEnabled()) {
      LOG.trace("Getting links...");
    }
    // extract the outlinks
    utils.getOutlinks(baseTag, l, root);
    outlinks = l.toArray(new Outlink[l.size()]);
    if (LOG.isTraceEnabled()) {
      LOG.trace("found " + outlinks.length + " outlinks in "
          + content.getUrl());
    }
  }
  // create the ParseStatus
  ParseStatus status = new ParseStatus(ParseStatus.SUCCESS);
  if (metaTags.getRefresh()) {
    status.setMinorCode(ParseStatus.SUCCESS_REDIRECT);
    status.setArgs(new String[] { metaTags.getRefreshHref().toString(),
        Integer.toString(metaTags.getRefreshTime()) });
  }
  // bundle the parsed data
  ParseData parseData = new ParseData(status, title, outlinks,
      content.getMetadata(), metadata);
  // the parse result
  ParseResult parseResult = ParseResult.createParseResult(content.getUrl(),
      new ParseImpl(text, parseData));
  // run the HtmlParseFilter plugins (e.g. parse-metatags); which ones run is set by configuration
  // run filters on parse
  ParseResult filteredParse = this.htmlParseFilters.filter(content,
      parseResult, metaTags, root);
  if (metaTags.getNoCache()) { // not okay to cache
    for (Map.Entry<org.apache.hadoop.io.Text, Parse> entry : filteredParse)
      entry.getValue().getData().getParseMeta()
          .set(Nutch.CACHING_FORBIDDEN_KEY, cachingPolicy);
  }
  return filteredParse;
}
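What a custom filter does at the htmlParseFilters.filter(...) step can be sketched without the Nutch jars. Everything below is illustrative: `DomFilter` is a hypothetical, simplified stand-in for HtmlParseFilter (the real signature also receives Content, ParseResult and HTMLMetaTags, as MetaTagsParser in the next section shows), a plain Map stands in for the parse metadata, and the key "h1" is an arbitrary choice. Only the JDK's org.w3c.dom API is used.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class HtmlParseFilterSketch {

    /** Hypothetical stand-in for HtmlParseFilter: sees the DOM, may add metadata. */
    interface DomFilter {
        void filter(DocumentFragment root, Map<String, String> parseMeta);
    }

    /** Example filter: stores the text of the first h1 element under the key "h1". */
    static class FirstHeadingFilter implements DomFilter {
        @Override
        public void filter(DocumentFragment root, Map<String, String> parseMeta) {
            Node h1 = findFirst(root, "h1");
            if (h1 != null) {
                parseMeta.put("h1", h1.getTextContent().trim());
            }
        }

        /** Depth-first search for the first element with the given tag name. */
        private static Node findFirst(Node node, String tag) {
            if (node.getNodeType() == Node.ELEMENT_NODE
                    && tag.equalsIgnoreCase(node.getNodeName())) {
                return node;
            }
            for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                Node hit = findFirst(c, tag);
                if (hit != null) {
                    return hit;
                }
            }
            return null;
        }
    }

    /** Builds a tiny DOM by hand so the sketch needs no HTML parser. */
    static DocumentFragment sampleDom() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        DocumentFragment root = doc.createDocumentFragment();
        Element body = doc.createElement("body");
        Element h1 = doc.createElement("h1");
        h1.setTextContent(" Nutch plugins ");
        Element p = doc.createElement("p");
        p.setTextContent("Some paragraph text.");
        body.appendChild(h1);
        body.appendChild(p);
        root.appendChild(body);
        return root;
    }

    public static void main(String[] args) {
        Map<String, String> parseMeta = new LinkedHashMap<>();
        new FirstHeadingFilter().filter(sampleDom(), parseMeta);
        System.out.println(parseMeta); // {h1=Nutch plugins}
    }
}
```

A real plugin would do the same DOM walk inside its filter method and write the value into the Metadata obtained from parse.getData().getParseMeta().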
7. Walking through the parse-metatags plugin
MetaTagsParser
MetaTagsParser implements the HtmlParseFilter extension point:
public class MetaTagsParser implements HtmlParseFilter
The filter method
public ParseResult filter(Content content, ParseResult parseResult,
    HTMLMetaTags metaTags, DocumentFragment doc) {
  // get the parse for this URL
  Parse parse = parseResult.get(content.getUrl());
  // get the parse metadata
  Metadata metadata = parse.getData().getParseMeta();
  /*
   * NUTCH-1559: do not extract meta values from ParseData's metadata to avoid
   * duplicate metatag values
   */
  // name/value pairs of the general meta tags
  Metadata generalMetaTags = metaTags.getGeneralTags();
  for (String tagName : generalMetaTags.names()) {
    // add to the parse metadata if the configuration selects this tag
    addIndexedMetatags(metadata, tagName, generalMetaTags.getValues(tagName));
  }
  Properties httpequiv = metaTags.getHttpEquivTags();
  for (Enumeration<?> tagNames = httpequiv.propertyNames(); tagNames
      .hasMoreElements();) {
    String name = (String) tagNames.nextElement();
    String value = httpequiv.getProperty(name);
    // http-equiv tags are added to the parse metadata the same way
    addIndexedMetatags(metadata, name, value);
  }
  return parseResult;
}
The addIndexedMetatags method
This method explains why, when parse-metatags is used together with index-metadata, the field names configured for indexing must carry the metatag. prefix.
private void addIndexedMetatags(Metadata metadata, String metatag,
    String value) {
  String lcMetatag = metatag.toLowerCase(Locale.ROOT);
  if (metatagset.contains("*") || metatagset.contains(lcMetatag)) {
    if (LOG.isDebugEnabled()) {
      LOG.debug("Found meta tag: {}\t{}", lcMetatag, value);
    }
    metadata.add("metatag." + lcMetatag, value);
  }
}
Plugin configuration
Compare the configuration below with addIndexedMetatags and it becomes clear why the values of index.parse.md need the metatag. prefix.
<property>
  <name>metatags.names</name>
  <value>description,keywords</value>
  <description> Names of the metatags to extract, separated by ','.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
  </description>
</property>
<property>
  <name>index.parse.md</name>
  <!-- the keys written by addIndexedMetatags carry the prefix "metatag." -->
  <value>metatag.description,metatag.keywords</value>
  <description>
  Comma-separated list of keys to be taken from the parse metadata to generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these values are generated
  by a parser (see parse-metatags plugin)
  </description>
</property>
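The effect of the prefix on both sides can be checked with a small stand-alone sketch. It makes several simplifying assumptions: a plain Map replaces Nutch's Metadata (which can hold several values per key), `addIndexedMetatag` imitates the method shown above, and `selectForIndex` is a hypothetical model of the index side that simply looks up each key listed in index.parse.md. It is not the index-metadata plugin's actual code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class MetatagPrefixSketch {

    /** Parse side: imitates addIndexedMetatags, with a Map as the parse metadata. */
    static void addIndexedMetatag(Set<String> metatagNames,
            Map<String, String> parseMeta, String tag, String value) {
        String lc = tag.toLowerCase(Locale.ROOT);
        if (metatagNames.contains("*") || metatagNames.contains(lc)) {
            parseMeta.put("metatag." + lc, value); // key is stored with the prefix
        }
    }

    /** Index side (assumed behaviour): copy the keys listed in index.parse.md. */
    static Map<String, String> selectForIndex(String indexParseMd,
            Map<String, String> parseMeta) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String key : indexParseMd.split(",")) {
            String k = key.trim();
            if (parseMeta.containsKey(k)) {
                fields.put(k, parseMeta.get(k));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // metatags.names = description,keywords
        Set<String> names = new HashSet<>(Arrays.asList("description", "keywords"));
        Map<String, String> parseMeta = new LinkedHashMap<>();
        addIndexedMetatag(names, parseMeta, "Description", "A page about Nutch");
        addIndexedMetatag(names, parseMeta, "author", "skipped: not in metatags.names");

        // Without the prefix nothing matches: prints {}
        System.out.println(selectForIndex("description,keywords", parseMeta));
        // With the prefix: prints {metatag.description=A page about Nutch}
        System.out.println(
            selectForIndex("metatag.description,metatag.keywords", parseMeta));
    }
}
```

The parse side never stores a bare "description" key, so an index.parse.md value without the prefix finds nothing to index.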