An optimization for enumerating all word combinations in SharpICTCLAS

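Segment.GenerateWordNet() loops over every atom produced by atom segmentation and looks for all possible word combinations that start at that atom.
Take the segmentation of the following sentence as an example: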

string sSentence = @"三星SHX-132型号的(手机)1元钱２５６.８９元12.14%百分比12％";

Atom segmentation result (atom, followed by its type code):

始##始 1
三 7
星 7
SHX-132 5
型 7
号 7
的 7
( 6
手 7
机 7
) 6
1 5
元 7
钱 7
２５６.８９ 9
元 7
12.14% 5
百 7
分 7
比 7
12 5
％ 6
末##末 4

The word-combination check is done in GenerateWordNet() by the following loop:

// Store every possible word combination in m_segGraph
for (int i = 0; i < atomSegment.Count; i++) // every atom
{
    sWord = atomSegment[i].sWord; // the current atom
    int j = i + 1;

    // Filter here? Only consider Chinese characters??
    //WordInfo wordInfo = coreDict.GetMatchedWordInfo(sWord);
    while (j < atomSegment.Count && coreDict.GetMaxMatch(sWord, out sMaxMatchWord, out nPOSRet))
    {
        if (sMaxMatchWord == sWord) // exactly the word we are looking for
        {
            WordInfo info = coreDict.GetWordInfo(sWord); // the word may have several parts of speech
        }
        sWord += atomSegment[j++].sWord;
    }
}

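Inside the while loop, candidates such as the following are checked against the dictionary:
"三星SHX-132", "星SHX-132", "SHX-132", "SHX-132型", "的(", "(", "手机)", "机)", ")", "1", "钱２５６.８９", "２５６.８９", "元12.14%", and so on.
For this sentence the check runs 44 times in total, and 17 of those checks are pointless, so 38.6% of the lookups (17/44) are wasted.
In my opinion, digits, symbols, delimiters and letters do not need to go through the word-combination check at all; skipping them would save some time.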
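The suggestion above could be sketched roughly as follows. This is only an illustration, not SharpICTCLAS code: `Atom`, `WordNetSketch`, `HasPrefix` and `CountLookups` are hypothetical names, the dictionary is a plain set of strings standing in for `coreDict`, and treating type code 7 as "Chinese character" is an assumption taken from the atom listing, where every Chinese character carries that code. The method counts dictionary probes so the filtered and unfiltered loops can be compared.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for an atom; SharpICTCLAS has its own atom type.
public sealed class Atom
{
    public string Word { get; }
    public int Type { get; }
    public Atom(string word, int type) { Word = word; Type = type; }
}

public static class WordNetSketch
{
    // Assumption: in the atom listing every Chinese character has type code 7.
    public const int ChineseType = 7;

    // Toy replacement for coreDict.GetMaxMatch: true when at least one
    // dictionary entry starts with the given prefix.
    public static bool HasPrefix(ISet<string> dict, string prefix)
    {
        foreach (string entry in dict)
        {
            if (entry.StartsWith(prefix, StringComparison.Ordinal)) return true;
        }
        return false;
    }

    // Walks the atoms like the loop in GenerateWordNet(), appends every
    // dictionary word it finds to 'found', and returns the number of
    // dictionary probes. With chineseOnly set, non-Chinese atoms neither
    // start a candidate nor extend one.
    public static int CountLookups(IList<Atom> atoms, ISet<string> dict,
                                   bool chineseOnly, List<string> found)
    {
        int lookups = 0;
        for (int i = 0; i < atoms.Count; i++)
        {
            if (chineseOnly && atoms[i].Type != ChineseType) continue;

            string word = atoms[i].Word;
            int j = i + 1;
            while (j < atoms.Count)
            {
                lookups++;                                  // one dictionary probe
                if (!HasPrefix(dict, word)) break;          // no entry starts with this candidate
                if (dict.Contains(word)) found.Add(word);   // exact dictionary word
                if (chineseOnly && atoms[j].Type != ChineseType) break;
                word += atoms[j++].Word;                    // extend by the next atom
            }
        }
        return lookups;
    }
}
```

Calling `CountLookups` twice on the same atoms, once with `chineseOnly` set to `false` and once with `true`, shows how many probes the filter saves. One caveat of this design: if the dictionary contains entries that mix Chinese characters with letters or digits, the filtered loop would no longer find them, so the filter is only safe when such entries are handled elsewhere.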

posted on 2007-08-16 13:02 by amber lee zhao
