犀利小胖

 

stanford-parser usage guide

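Building on the word segmenter covered earlier, we can now use a syntactic parser to analyze Chinese sentences in more depth. Below is a brief introduction to getting it running.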

First, import the parser into an existing project, or create a new Java project for it.

Then adjust the demo code to match. Two places mainly need to change:

1: new LexicalizedParser("grammar/chinesePCFG.ser.gz");

2: String[] sent = { "我", "是", "一名", "好", "学生", "。" };

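Because the input is Chinese, the grammar that gets loaded is the Chinese one, and the sentence must already be segmented into words. The full code is as follows: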

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.io.StringReader;

import edu.stanford.nlp.objectbank.TokenizerFactory;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.ling.CoreLabel;  
import edu.stanford.nlp.ling.HasWord;  
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {

  public static void main(String[] args) {
    LexicalizedParser lp = 
      new LexicalizedParser("grammar/chinesePCFG.ser.gz");
    if (args.length > 0) {
      demoDP(lp, args[0]);
    } else {
      demoAPI(lp);
    }
  }

  public static void demoDP(LexicalizedParser lp, String filename) {
    // This option shows loading a file, splitting it into sentences,
    // and tokenizing it using DocumentPreprocessor
    // Note: PennTreebankLanguagePack is the English language pack; with a
    // Chinese grammar, the typed dependencies below may not be meaningful
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    // You could also create a tokenizer here (as below) and pass it
    // to DocumentPreprocessor
    for (List<HasWord> sentence : new DocumentPreprocessor(filename)) {
      Tree parse = lp.apply(sentence);
      parse.pennPrint();
      System.out.println();
      
      GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
      Collection<TypedDependency> tdl = gs.typedDependenciesCCprocessed(true);
      System.out.println(tdl);
      System.out.println();
    }
  }

  public static void demoAPI(LexicalizedParser lp) {
    // This option shows parsing a list of correctly tokenized words
    String[] sent = { "我", "是", "一名", "好", "学生", "。" };
    List<CoreLabel> rawWords = new ArrayList<CoreLabel>();
    for (String word : sent) {
      CoreLabel l = new CoreLabel();
      l.setWord(word);
      rawWords.add(l);
    }
    Tree parse = lp.apply(rawWords);
    parse.pennPrint();
    System.out.println();

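    // This option shows loading and using an explicit tokenizer.
    // Note: PTBTokenizer is an English tokenizer and does not perform
    // Chinese word segmentation; the sent2 lines are left commented out.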
    //String sent2 = "今天是个晴朗的天气。";
    TokenizerFactory<CoreLabel> tokenizerFactory = 
      PTBTokenizer.factory(new CoreLabelTokenFactory(), "");
    //List<CoreLabel> rawWords2 = 
      //tokenizerFactory.getTokenizer(new StringReader(sent2)).tokenize();
    //parse = lp.apply(rawWords2);

    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
    System.out.println();

    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }

  private ParserDemo() {} // static methods only

}
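
The demo above expects a sentence that has already been segmented into words. As a minimal sketch, assuming the segmenter writes each sentence as words separated by spaces (an assumption; this post does not show the segmenter's output format), the `sent` array could be built like this. The names `SegmentedInput` and `toTokens` are hypothetical and are not part of stanford-parser.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentedInput {

  // Splits an already-segmented sentence into its words.
  // Assumption: words are separated by ASCII whitespace or by the
  // full-width space (U+3000) that often appears in Chinese text.
  static String[] toTokens(String segmented) {
    List<String> tokens = new ArrayList<String>();
    for (String piece : segmented.split("[\\s\\u3000]+")) {
      // A leading separator yields one empty piece; skip it
      if (!piece.isEmpty()) {
        tokens.add(piece);
      }
    }
    return tokens.toArray(new String[0]);
  }

  public static void main(String[] args) {
    String[] sent = toTokens("我 是 一名 好 学生 。");
    System.out.println(sent.length + " tokens: " + String.join(" | ", sent));
  }
}
```

The resulting array can be used in place of the hard-coded `sent` in `demoAPI`.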

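Once the code compiles, the output is displayed as shown below. The labels follow the LDC Chinese Treebank (the University of Pennsylvania's Chinese corpus, with its part-of-speech and phrase-structure tag sets).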

 

posted on 2012-05-01 15:00  犀利小胖
