Three Ways to Compile and Run the Stanford CoreNLP Open-Source Project
1. About Stanford CoreNLP
"Stanford CoreNLP, integrating our NER, POS tagger, and parser with a new coreference system"
That is how the official site introduces Stanford CoreNLP. It is an open-source project in which Stanford's NLP group combines several of its natural language processing components: the Stanford NER, the POS tagger, and the parser, together with a new coreference resolution system, forming a complete natural language processing platform.
It is worth considering if you plan to do natural language processing.
Official site: http://nlp.stanford.edu/software/corenlp.shtml
Download: http://nlp.stanford.edu/software/stanford-corenlp-v1.1.0.tgz
2. Simple Usage of Stanford CoreNLP
A typical Java developer who wants to use Stanford CoreNLP usually only needs to download the Stanford CoreNLP jars from the official site and put them on the project's classpath.
1. Download and unpack the distribution.
2. Create a new Eclipse project named simpleNLP.
3. Create a lib folder and copy the following jars from the unpacked Stanford CoreNLP folder into it: fastutil.jar, jgraph.jar, jgrapht.jar, stanford-corenlp-2011-06-19.jar, stanford-corenlp-models-2011-06-19.jar, and xom.jar. Then add them all to the build path.
4. Test: write TestCoreNLP.java as follows:
import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.trees.semgraph.SemanticGraph;
import edu.stanford.nlp.trees.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
import edu.stanford.nlp.util.CoreMap;

public class TestCoreNLP {
    public static void main(String[] args) {
        // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // read some text in the text variable
        String text = "Add your text here";

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);

        // run all Annotators on this text
        pipeline.annotate(document);

        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(NamedEntityTagAnnotation.class);

                System.out.println(word + "," + pos + "," + ne);
            }

            // this is the parse tree of the current sentence
            Tree tree = sentence.get(TreeAnnotation.class);

            // this is the Stanford dependency graph of the current sentence
            SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
        }

        // This is the coreference link graph
        // Each chain stores a set of mentions that link to each other,
        // along with a method for getting the most representative mention
        // Both sentence and token offsets start at 1!
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
    }
}
This code hands the text to StanfordCoreNLP, whose annotators process it in the order "tokenize, ssplit, pos, lemma, ner, parse, dcoref": tokenization, sentence splitting, POS tagging, lemmatization, named entity recognition, syntactic parsing, and coreference resolution, seven components in all.
After processing, List<CoreMap> sentences = document.get(SentencesAnnotation.class); holds the per-sentence analysis results; iterating over it retrieves them.
Here we simply print each token's text, POS tag, and NER label.
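If you also need the lemma computed by the lemma annotator, it can be read the same way inside the token loop (a one-line sketch; LemmaAnnotation is another CoreAnnotations key and needs its own import):

    // base form of the token, e.g. "add" for "Add"
    // requires: import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
    String lemma = token.get(LemmaAnnotation.class);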
Output:
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Loading POS Model [edu/stanford/nlp/models/pos-tagger/wsj3t0-18-left3words/left3words-distsim-wsj-0-18.tagger] ...
Loading default properties from trained tagger edu/stanford/nlp/models/pos-tagger/wsj3t0-18-left3words/left3words-distsim-wsj-0-18.tagger
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/wsj3t0-18-left3words/left3words-distsim-wsj-0-18.tagger ... done [4.4 sec].
done [4.5 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/all.3class.distsim.crf.ser.gz ... done [38.4 sec].
Loading classifier from edu/stanford/nlp/models/ner/muc.distsim.crf.ser.gz ... done [10.9 sec].
Loading classifier from edu/stanford/nlp/models/ner/conll.distsim.crf.ser.gz ... done [18.2 sec].
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [11.5 sec].
Adding annotator dcoref
Add,VB,O
your,PRP$,O
text,NN,O
here,RB,O
Remember to pass the -Xmx512m flag when running, or it will fail with a java Heap Space… exception. Stanford actually recommends 1800m on 32-bit machines and 3g on 64-bit machines, a startling memory requirement.
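For reference, a minimal launch sketch with the heap flag applied (assuming the project layout from steps 2 and 3 above, with classes compiled to bin; on Windows use ';' instead of ':' as the classpath separator):

    java -Xmx3g -cp bin:lib/fastutil.jar:lib/jgraph.jar:lib/jgrapht.jar:lib/stanford-corenlp-2011-06-19.jar:lib/stanford-corenlp-models-2011-06-19.jar:lib/xom.jar TestCoreNLP

In Eclipse, the same flag goes under Run Configurations > Arguments > VM arguments.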
3. Compiling and Running the Project Source in Eclipse
The jar-based approach of Section 2 is not the only way to use open-source code. More often, we download the source, pull it into our own project with an IDE, modify it, recompile it, and finally package it as a jar. Although this is more involved than the approach in Section 2, it lets us adapt the open-source framework to our own needs, which the first approach cannot do.
1. Download and unpack. The source lives in stanford-corenlp-2011-06-19-sources.jar; extract its contents into a src directory.
2. Create a new Eclipse project named NLPTest with a lib directory; copy fastutil.jar, jgraph.jar, jgrapht.jar, stanford-corenlp-models-2011-06-19.jar, and xom.jar from the distribution into lib and add them to the build path.
Note that, compared with step 3 of Section 2, stanford-corenlp-2011-06-19.jar is left out: that jar is nothing more than the compiled class files of the CoreNLP source, and it is exactly what we are replacing with the source code.
3. Copy all the source files from the src directory extracted in step 1 into the project's src directory, then wait for Eclipse to finish compiling.
4. Copy input.txt from the unpacked CoreNLP distribution into the project root.
The resulting project directory is as shown above.
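Roughly, the layout implied by steps 1 through 4 (a textual sketch; the original shows it as a screenshot):

    NLPTest/
        src/        CoreNLP sources (edu.stanford.nlp.*)
        lib/        fastutil.jar, jgraph.jar, jgrapht.jar,
                    stanford-corenlp-models-2011-06-19.jar, xom.jar
        input.txt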
5. Test: the edu.stanford.nlp.pipeline package in the project contains StanfordCoreNLP.java, the test/entry class of the whole project; the same package also holds StanfordCoreNLP.properties, the project's configuration file.
Once the run arguments for the StanfordCoreNLP class are set, the whole project runs. The arguments are set as follows:
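The original presents these settings as an Eclipse screenshot. A configuration consistent with the result described next, using the standard command-line options of the StanfordCoreNLP class (-props and -file are its documented flags; the file names here come from the steps above), might be:

    Program arguments:  -props StanfordCoreNLP.properties -file input.txt
    VM arguments:       -Xmx3g

-file names the text file to annotate; with no -outputDirectory given, the result is written to the working directory (here the project root) as input.txt.xml.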
After it runs, an input.txt.xml file is generated in the project root; it contains all of the analysis results.
It looks like this:
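Abridged, the XML typically has this shape (element names follow CoreNLP's standard XML output; exact tags and attributes may differ in v1.1.0):

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
      <document>
        <sentences>
          <sentence id="1">
            <tokens>
              <token id="1">
                <word>Add</word>
                <lemma>add</lemma>
                <POS>VB</POS>
                <NER>O</NER>
              </token>
              ...
            </tokens>
            <parse>(ROOT (S (VP (VB Add) ...)))</parse>
          </sentence>
        </sentences>
        <coreference>...</coreference>
      </document>
    </root>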