Trying Out the Stanford Word Segmenter
Download link: click here
After downloading and unpacking, the folder looks like this:
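Next, open Eclipse, create a new project, copy the source file SegDemo.java into it, and add all of the jar files to the build path (right-click the project, then Properties, Java Build Path, Add External JARs).
Import the data folder into the project and update the paths in the source code to point at it, as shown in the figure: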
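A wrong data path is an easy mistake at this step, so it can help to confirm the files are where the code expects them before loading the model. The following is a minimal sketch, not part of the Stanford distribution: the class name SegDataCheck and the method missingFiles are made up for illustration, and the two file names are simply the ones the demo code in this post loads.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the Stanford distribution.
final class SegDataCheck {

    // File names taken from the demo code in this post.
    private static final String[] REQUIRED = {"dict-chris6.ser.gz", "ctb.gz"};

    /** Returns the names of the required files that are missing under dataDir. */
    static List<String> missingFiles(Path dataDir) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            if (!Files.isRegularFile(dataDir.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Same lookup as the demo: system property "SegDemo", defaulting to "data".
        Path dataDir = Paths.get(System.getProperty("SegDemo", "data"));
        List<String> missing = missingFiles(dataDir);
        if (missing.isEmpty()) {
            System.out.println("All model files found in " + dataDir.toAbsolutePath());
        } else {
            System.out.println("Missing in " + dataDir.toAbsolutePath() + ": " + missing);
        }
    }
}
```

Run it from the project root (the working directory Eclipse uses by default), since the relative path "data" is resolved against the current directory.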
Then edit SegDemo.java and run it as a test:
package test;

import java.io.*;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.crf.CRFClassifier;
import edu.stanford.nlp.ling.CoreLabel;


/** This is a very simple demo of calling the Chinese Word Segmenter
 *  programmatically.  It assumes an input file in UTF8.
 *  <p/>
 *  <code>
 *  Usage: java -mx1g -cp seg.jar SegDemo fileName
 *  </code>
 *  This will run correctly in the distribution home directory.  To
 *  run in general, the properties for where to find dictionaries or
 *  normalizations have to be set.
 *
 *  @author Christopher Manning
 */

public class SegDemo {

  private static final String basedir = System.getProperty("SegDemo", "data");

  public static void main(String[] args) throws Exception {
    System.setOut(new PrintStream(System.out, true, "utf-8"));

    Properties props = new Properties();
    props.setProperty("sighanCorporaDict", basedir);
    // props.setProperty("NormalizationTable", "data/norm.simp.utf8");
    // props.setProperty("normTableEncoding", "UTF-8");
    // below is needed because CTBSegDocumentIteratorFactory accesses it
    props.setProperty("serDictionary", basedir + "/dict-chris6.ser.gz");
    if (args.length > 0) {
      props.setProperty("testFile", args[0]);
    }
    props.setProperty("inputEncoding", "UTF-8");
    props.setProperty("sighanPostProcessing", "true");

    CRFClassifier<CoreLabel> segmenter = new CRFClassifier<>(props);
    segmenter.loadClassifierNoExceptions(basedir + "/ctb.gz", props);
    for (String filename : args) {
      segmenter.classifyAndWriteAnswers(filename);
    }

    String sample = "我住在美国。";
    List<String> segmented = segmenter.segmentString(sample);
    System.out.println(segmented);
  }

}
Output: [我, 住在, 美国, 。]
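The segmenter returns a plain List<String>, so turning the result into space-delimited text (a common format for segmented Chinese) needs only the standard library. Here is a small sketch under that assumption: the class SegOutputDemo and the method joinTokens are hypothetical names, and the token list is hard-coded from the output above so the example runs without the Stanford jars.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; uses only the Java standard library.
final class SegOutputDemo {

    /** Joins segmented tokens with single spaces. */
    static String joinTokens(List<String> tokens) {
        return String.join(" ", tokens);
    }

    public static void main(String[] args) throws Exception {
        // Force UTF-8 console output, as the demo does.
        System.setOut(new PrintStream(System.out, true, "UTF-8"));

        // Tokens copied from the output shown in this post.
        List<String> segmented = Arrays.asList("我", "住在", "美国", "。");
        System.out.println(joinTokens(segmented)); // prints: 我 住在 美国 。
    }
}
```

In the real program, the list passed to joinTokens would be the value returned by segmenter.segmentString(sample).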
From here, feel free to experiment on your own.