Three Ways to Compile and Run the Stanford CoreNLP Open-Source Project
1. About Stanford CoreNLP
"Stanford CoreNLP, integrating our NER, POS tagger, and parser with a new coreference system"
That is how the official site introduces Stanford CoreNLP. It is an open-source project in which Stanford's NLP group combines several of its natural language processing components: the Stanford NER, the POS tagger, and the parser, together with a new coreference resolution system, forming a complete natural language processing platform.
It is worth considering if you plan to do natural language processing.
Official site: http://nlp.stanford.edu/software/corenlp.shtml
Download: http://nlp.stanford.edu/software/stanford-corenlp-v1.1.0.tgz
2. Simple Usage of Stanford CoreNLP
A typical Java developer who wants to use Stanford CoreNLP usually only needs to download the Stanford CoreNLP jars from the official site and put them on the project's classpath.
1. Download and unpack the distribution.
2. Create a new Eclipse project named simpleNLP.
3. Create a lib folder and copy the following jars from the unpacked Stanford CoreNLP folder into it: fastutil.jar, jgraph.jar, jgrapht.jar, stanford-corenlp-2011-06-19.jar, stanford-corenlp-models-2011-06-19.jar, and xom.jar. Then add them all to the build path.
4. Test: write TestCoreNLP.java as follows:
import java.util.List;
import java.util.Map;
import java.util.Properties;

import edu.stanford.nlp.dcoref.CorefChain;
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.trees.semgraph.SemanticGraph;
import edu.stanford.nlp.trees.semgraph.SemanticGraphCoreAnnotations.CollapsedCCProcessedDependenciesAnnotation;
import edu.stanford.nlp.util.CoreMap;

public class TestCoreNLP {
    public static void main(String[] args) {
        // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        // read some text in the text variable
        String text = "Add your text here";

        // create an empty Annotation just with the given text
        Annotation document = new Annotation(text);

        // run all Annotators on this text
        pipeline.annotate(document);

        // these are all the sentences in this document
        // a CoreMap is essentially a Map that uses class objects as keys and has values with custom types
        List<CoreMap> sentences = document.get(SentencesAnnotation.class);

        for (CoreMap sentence : sentences) {
            // traversing the words in the current sentence
            // a CoreLabel is a CoreMap with additional token-specific methods
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // this is the text of the token
                String word = token.get(TextAnnotation.class);
                // this is the POS tag of the token
                String pos = token.get(PartOfSpeechAnnotation.class);
                // this is the NER label of the token
                String ne = token.get(NamedEntityTagAnnotation.class);

                System.out.println(word + "," + pos + "," + ne);
            }

            // this is the parse tree of the current sentence
            Tree tree = sentence.get(TreeAnnotation.class);

            // this is the Stanford dependency graph of the current sentence
            SemanticGraph dependencies = sentence.get(CollapsedCCProcessedDependenciesAnnotation.class);
        }

        // This is the coreference link graph
        // Each chain stores a set of mentions that link to each other,
        // along with a method for getting the most representative mention
        // Both sentence and token offsets start at 1!
        Map<Integer, CorefChain> graph = document.get(CorefChainAnnotation.class);
    }
}
This code hands the text to StanfordCoreNLP, whose annotators process it in the order "tokenize, ssplit, pos, lemma, ner, parse, dcoref": tokenization, sentence splitting, POS tagging, lemmatization, named entity recognition, syntactic parsing, and coreference resolution, seven components in all.
After processing, List<CoreMap> sentences = document.get(SentencesAnnotation.class); holds the per-sentence analysis results; iterating over it retrieves them.
Here we simply print each token's text, POS tag, and NER label.
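If you also need the lemma computed by the lemma annotator, it can be read the same way inside the token loop (a one-line sketch; LemmaAnnotation is another CoreAnnotations key and needs its own import):

    // base form of the token, e.g. "add" for "Add"
    // requires: import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
    String lemma = token.get(LemmaAnnotation.class);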
Output:
Adding annotator tokenize
Adding annotator ssplit
Adding annotator pos
Loading POS Model [edu/stanford/nlp/models/pos-tagger/wsj3t0-18-left3words/left3words-distsim-wsj-0-18.tagger] ...
Loading default properties from trained tagger edu/stanford/nlp/models/pos-tagger/wsj3t0-18-left3words/left3words-distsim-wsj-0-18.tagger
Reading POS tagger model from edu/stanford/nlp/models/pos-tagger/wsj3t0-18-left3words/left3words-distsim-wsj-0-18.tagger ... done [4.4 sec].
done [4.5 sec].
Adding annotator lemma
Adding annotator ner
Loading classifier from edu/stanford/nlp/models/ner/all.3class.distsim.crf.ser.gz ... done [38.4 sec].
Loading classifier from edu/stanford/nlp/models/ner/muc.distsim.crf.ser.gz ... done [10.9 sec].
Loading classifier from edu/stanford/nlp/models/ner/conll.distsim.crf.ser.gz ... done [18.2 sec].
Adding annotator parse
Loading parser from serialized file edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz ... done [11.5 sec].
Adding annotator dcoref
Add,VB,O
your,PRP$,O
text,NN,O
here,RB,O
Remember to pass the -Xmx512m flag when running, or it will fail with a java Heap Space… exception. Stanford actually recommends 1800m on 32-bit machines and 3g on 64-bit machines, a startling memory requirement.
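For reference, a minimal launch sketch with the heap flag applied (assuming the project layout from steps 2 and 3 above, with classes compiled to bin; on Windows use ';' instead of ':' as the classpath separator):

    java -Xmx3g -cp bin:lib/fastutil.jar:lib/jgraph.jar:lib/jgrapht.jar:lib/stanford-corenlp-2011-06-19.jar:lib/stanford-corenlp-models-2011-06-19.jar:lib/xom.jar TestCoreNLP

In Eclipse, the same flag goes under Run Configurations > Arguments > VM arguments.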
3. Compiling and Running the Project Source in Eclipse
The jar-based approach of Section 2 is not the only way to use open-source code. More often, we download the source, pull it into our own project with an IDE, modify it, recompile it, and finally package it as a jar. Although this is more involved than the approach in Section 2, it lets us adapt the open-source framework to our own needs, which the first approach cannot do.
1. Download and unpack. The source lives in stanford-corenlp-2011-06-19-sources.jar; extract its contents into a src directory.
2. Create a new Eclipse project named NLPTest with a lib directory; copy fastutil.jar, jgraph.jar, jgrapht.jar, stanford-corenlp-models-2011-06-19.jar, and xom.jar from the distribution into lib and add them to the build path.
Note that, compared with step 3 of Section 2, stanford-corenlp-2011-06-19.jar is left out: that jar is nothing more than the compiled class files of the CoreNLP source, and it is exactly what we are replacing with the source code.
3. Copy all the source files from the src directory extracted in step 1 into the project's src directory, then wait for Eclipse to finish compiling.
4. Copy input.txt from the unpacked CoreNLP distribution into the project root.
The resulting project directory is as shown above.
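Roughly, the layout implied by steps 1 through 4 (a textual sketch; the original shows it as a screenshot):

    NLPTest/
        src/        CoreNLP sources (edu.stanford.nlp.*)
        lib/        fastutil.jar, jgraph.jar, jgrapht.jar,
                    stanford-corenlp-models-2011-06-19.jar, xom.jar
        input.txt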
5. Test: the edu.stanford.nlp.pipeline package in the project contains StanfordCoreNLP.java, the test/entry class of the whole project; the same package also holds StanfordCoreNLP.properties, the project's configuration file.
Once the run arguments for the StanfordCoreNLP class are set, the whole project runs. The arguments are set as follows:
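The original presents these settings as an Eclipse screenshot. A configuration consistent with the result described next, using the standard command-line options of the StanfordCoreNLP class (-props and -file are its documented flags; the file names here come from the steps above), might be:

    Program arguments:  -props StanfordCoreNLP.properties -file input.txt
    VM arguments:       -Xmx3g

-file names the text file to annotate; with no -outputDirectory given, the result is written to the working directory (here the project root) as input.txt.xml.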
After it runs, an input.txt.xml file is generated in the project root; it contains all of the analysis results.
It looks like this:
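Abridged, the XML typically has this shape (element names follow CoreNLP's standard XML output; exact tags and attributes may differ in v1.1.0):

    <?xml version="1.0" encoding="UTF-8"?>
    <root>
      <document>
        <sentences>
          <sentence id="1">
            <tokens>
              <token id="1">
                <word>Add</word>
                <lemma>add</lemma>
                <POS>VB</POS>
                <NER>O</NER>
              </token>
              ...
            </tokens>
            <parse>(ROOT (S (VP (VB Add) ...)))</parse>
          </sentence>
        </sentences>
        <coreference>...</coreference>
      </document>
    </root>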