A Custom Tokenizer Class for Stanford CoreNLP
Stanford CoreNLP's Chinese word segmentation is sometimes unsatisfactory, so we need to implement a custom tokenizer class that fully meets our own requirements (with various kinds of dictionary intervention). The previous article, 《IKAnalyzer》, introduced the flexibility that IKAnalyzer offers; this article explains how to use IKAnalyzer as CoreNLP's tokenizer.
The article 《stanford corenlp的TokensRegex》 mentioned CoreNLP's configuration file CoreNLP-chinese.properties, in which customAnnotatorClass.segment specifies the tokenizer class. All we need to do here is imitate ChineseSegmenterAnnotator, implement an Annotator of our own, and register it in the configuration file:
```properties
customAnnotatorClass.segment = edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator
```
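Once the custom class below is on the classpath, this property just needs to point at it instead. If you build the pipeline programmatically, the swap can also be done in code. A minimal sketch, where the package com.example.nlp and the PipelineBuilder class are hypothetical names used for illustration:

```java
import java.io.FileInputStream;
import java.util.Properties;

import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PipelineBuilder {
    public static StanfordCoreNLP build() throws Exception {
        Properties props = new Properties();
        props.load(new FileInputStream("CoreNLP-chinese.properties"));
        // Replace the built-in segmenter with our custom annotator
        // (com.example.nlp is a placeholder package name).
        props.setProperty("customAnnotatorClass.segment", "com.example.nlp.IKSegmenterAnnotator");
        return new StanfordCoreNLP(props);
    }
}
```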
Here is my implementation:
```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.ChineseSegmenterAnnotator;
import edu.stanford.nlp.util.CoreMap;
// ChineseCoreAnnotations ships with CoreNLP; its package may differ
// across CoreNLP versions, so adjust the import to match yours.

public class IKSegmenterAnnotator extends ChineseSegmenterAnnotator {

    public IKSegmenterAnnotator() {
        super();
    }

    public IKSegmenterAnnotator(boolean verbose) {
        super(verbose);
    }

    public IKSegmenterAnnotator(String segLoc, boolean verbose) {
        super(segLoc, verbose);
    }

    public IKSegmenterAnnotator(String segLoc, boolean verbose, String serDictionary, String sighanCorporaDict) {
        super(segLoc, verbose, serDictionary, sighanCorporaDict);
    }

    public IKSegmenterAnnotator(String name, Properties props) {
        super(name, props);
    }

    // Segment the text with IKAnalyzer in smart mode; if segmentation
    // fails, fall back to returning the whole string as a single token.
    private List<String> splitWords(String str) {
        try {
            List<String> words = new ArrayList<String>();
            IKSegmenter ik = new IKSegmenter(new StringReader(str), true);
            Lexeme lex = null;
            while ((lex = ik.next()) != null) {
                words.add(lex.getLexemeText());
            }
            return words;
        } catch (IOException e) {
            System.out.println(e);
            List<String> words = new ArrayList<String>();
            words.add(str);
            return words;
        }
    }

    @Override
    public void runSegmentation(CoreMap annotation) {
        String text = annotation.get(CoreAnnotations.TextAnnotation.class);
        List<CoreLabel> sentChars = annotation.get(ChineseCoreAnnotations.CharactersAnnotation.class);
        List<CoreLabel> tokens = new ArrayList<CoreLabel>();
        annotation.set(CoreAnnotations.TokensAnnotation.class, tokens);

        // Use IKAnalyzer instead of the built-in segmenter:
        // List<String> words = segmenter.segmentString(text);
        List<String> words = splitWords(text);
        System.err.println(text);
        System.err.println("--->");
        System.err.println(words);

        // Walk the per-character CoreLabels to assign each token its
        // character offsets in the original text.
        int pos = 0;
        for (String w : words) {
            CoreLabel fl = sentChars.get(pos);
            fl.set(CoreAnnotations.ChineseSegAnnotation.class, "1");
            if (w.length() == 0) {
                continue;
            }
            CoreLabel token = new CoreLabel();
            token.setWord(w);
            token.set(CoreAnnotations.CharacterOffsetBeginAnnotation.class,
                    fl.get(CoreAnnotations.CharacterOffsetBeginAnnotation.class));
            pos += w.length();
            fl = sentChars.get(pos - 1);
            token.set(CoreAnnotations.CharacterOffsetEndAnnotation.class,
                    fl.get(CoreAnnotations.CharacterOffsetEndAnnotation.class));
            tokens.add(token);
        }
    }
}
```
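To check that the custom annotator is actually picked up, you can run a sentence through the pipeline and print each token with the character offsets computed in runSegmentation. A minimal sketch, reusing the hypothetical PipelineBuilder from above and assuming the CoreNLP Chinese models are on the classpath:

```java
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class SegmentDemo {
    public static void main(String[] args) throws Exception {
        StanfordCoreNLP pipeline = PipelineBuilder.build();
        Annotation ann = new Annotation("我爱北京天安门");
        pipeline.annotate(ann);
        // Each token carries the offsets set by runSegmentation.
        for (CoreLabel tok : ann.get(CoreAnnotations.TokensAnnotation.class)) {
            System.out.println(tok.word() + " [" + tok.beginPosition() + ", " + tok.endPosition() + ")");
        }
    }
}
```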
Outside the annotator, initialize the dictionary for IKAnalyzer, specifying the extension dictionary and the deletion dictionary:
```java
// Initialize the IK dictionary and remove interfering words
// (try-with-resources added so the reader is always closed).
Dictionary.initial(DefaultConfig.getInstance());
String delDic = System.getProperty(READ_IK_DEL_DIC, null);
List<String> delWords = new ArrayList<>();
try (BufferedReader reader = new BufferedReader(new FileReader(delDic))) {
    String line = null;
    while ((line = reader.readLine()) != null) {
        delWords.add(line);
    }
}
Dictionary.getSingleton().disableWords(delWords);
```
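The extension side works the same way: IK's Dictionary singleton also exposes addWords, so custom terms can be injected at startup. A minimal sketch, assuming the word list path comes from a hypothetical READ_IK_EXT_DIC system property:

```java
// Load extension words and add them to IK's in-memory dictionary
// (READ_IK_EXT_DIC is a hypothetical property name for illustration).
String extDic = System.getProperty("READ_IK_EXT_DIC", null);
if (extDic != null) {
    List<String> extWords = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(extDic))) {
        String line;
        while ((line = reader.readLine()) != null) {
            extWords.add(line);
        }
    }
    Dictionary.getSingleton().addWords(extWords);
}
```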