Alex He


 

Introduction to MMSEG

Contents

1. Overview

2. Algorithm

a) Matching Algorithm

b) Ambiguity Resolution Rules

3. Example

Overview

MMSEG is a Chinese word segmentation algorithm whose basic principle is maximum matching (described below). Furthermore, to resolve ambiguities that the complex maximum matching algorithm leaves unresolved, four additional ambiguity resolution rules have been implemented.

Algorithm

a) Matching Algorithm

There are two kinds of maximum matching algorithm: simple maximum matching and complex maximum matching.

Simple Matching: The basic form resolves the ambiguity of a single word. For example, suppose C1, C2, ..., Cn represent the characters in a string (sentence). We are at the beginning of the string and want to know where the words are. We first search the lexicon to see if _C1_ is a one-character word, then search _C1C2_ to see if it is a two-character word, and so on, until the combination is longer than the longest word in the lexicon. The most plausible word is the longest match. We take this word, then repeat the process on the remaining text until the last word of the string is identified.
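The simple maximum matching procedure above can be sketched in a few lines of Python. The toy lexicon and the function name are illustrative assumptions, not the original MMSEG implementation:

```python
# Simple (forward) maximum matching: at each position, greedily take
# the longest dictionary word. Toy lexicon for illustration only.
LEXICON = {"研", "究", "生", "命", "起", "的", "研究", "生命", "起源", "研究生", "文章"}
MAX_WORD_LEN = 3  # length of the longest word in this toy lexicon

def simple_maximum_match(text):
    words = []
    i = 0
    while i < len(text):
        match = text[i]  # fall back to a single character if nothing matches
        # try candidates from longest to shortest
        for j in range(min(len(text), i + MAX_WORD_LEN), i, -1):
            if text[i:j] in LEXICON:
                match = text[i:j]
                break
        words.append(match)
        i += len(match)
    return words

print(simple_maximum_match("研究生命起源的文章"))
# greedy matching takes "研究生" first, yielding a wrong segmentation:
# ['研究生', '命', '起源', '的', '文章']
```

Note how the greedy form mis-segments this sentence; this is exactly the kind of ambiguity the complex variant and the resolution rules address.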

Complex Matching: Another variant of maximum matching is more complex than the basic form. Its maximum matching rule says that the most plausible segmentation is the three-word chunk with the maximum length. Again, we are at the beginning of a string (sentence) and want to know where the words are. If there are ambiguous segmentations (e.g., _C1_ is a word, but _C1C2_ is also a word, and so on), then we look ahead two more words to find all possible three-word chunks beginning with _C1_ or _C1C2_.

As a result, we obtain a set of chunks beginning with "C1", "C1C2", or possibly "C1C2C3", each consisting of up to three words. The key step is then to choose the best of these chunks according to certain principles; that is, we apply rules that filter out the other chunks. The ambiguity resolution rules are described later.

Once the best chunk is chosen, we take its first word as the next word of the segmentation result for our string (Chinese sentence), remove that word from the string, and repeat the process until the last word of the string is identified.
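The chunk enumeration step can be sketched as a small recursive function. Again the lexicon and names are illustrative assumptions:

```python
# Enumerate all chunks of up to three words starting at the beginning
# of the text. Toy lexicon for illustration only.
LEXICON = {"研", "究", "生", "命", "起", "研究", "生命", "起源", "研究生"}
MAX_WORD_LEN = 3

def chunks(text, i=0, depth=3):
    """All chunks of at most `depth` words starting at position i."""
    if depth == 0 or i == len(text):
        return [[]]
    # dictionary words starting at i; fall back to the bare character
    words = [text[i:j] for j in range(i + 1, min(len(text), i + MAX_WORD_LEN) + 1)
             if text[i:j] in LEXICON] or [text[i]]
    return [[w] + rest
            for w in words
            for rest in chunks(text, i + len(w), depth - 1)]

for c in chunks("研究生命起源的文章"):
    print("_".join(c))
# with this toy lexicon, prints the 7 chunks of the Example section
```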

b) Ambiguity Resolution Rules

Now we introduce four rules for resolving ambiguous chunks in the matching algorithm.

Chunk1: C1_C2_C3

Chunk2: C1C2_C3_C4

Rule 1: (a) Simple maximum matching: pick the word with the maximum length. (b) Complex maximum matching: pick the first word of the chunk with the maximum total length. In the above example, it picks "C1C2" from the second chunk "C1C2_C3_C4". If more than one chunk has the maximum length, apply the next rule.
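Rule 1 amounts to a filter over the candidate chunks; a minimal sketch (the function name is illustrative):

```python
def rule1_max_total_length(cands):
    """Rule 1: keep only the chunks with maximum total character length."""
    def total(chunk):
        return sum(len(w) for w in chunk)
    best = max(total(c) for c in cands)
    return [c for c in cands if total(c) == best]

cands = [["研", "究", "生"], ["研究", "生命", "起源"], ["研究生", "命", "起源"]]
print(rule1_max_total_length(cands))
# keeps the two chunks of total length 6
```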

Rule 2: After applying Rule 1, more than one chunk may remain. For example, the following chunks have the same length:

Chunk1: C1_C2_C3C4

Chunk2: C1C2_C3C4

Rule 2 picks the first word from the chunk with the largest average word length. In the above example, it picks "C1C2" from the second chunk "C1C2_C3C4". The assumption behind this rule is that multi-character words are more likely to be encountered than one-character words.

This rule is useful only when one or more word positions in the chunks are empty. When the chunks are genuine three-word chunks, the rule does not help, because three-word chunks with the same total length necessarily have the same average word length. Therefore we need another solution.
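A sketch of Rule 2 in the same style (function name illustrative):

```python
def rule2_max_avg_length(cands):
    """Rule 2: keep only the chunks with the largest average word length."""
    def avg(chunk):
        return sum(len(w) for w in chunk) / len(chunk)
    best = max(avg(c) for c in cands)
    return [c for c in cands if avg(c) == best]

# mirrors the C1C2_C3C4 example: averages 4/3 vs 2
print(rule2_max_avg_length([["a", "b", "cd"], ["ab", "cd"]]))
# → [['ab', 'cd']]
```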

Rule 3: There are still quite a few ambiguous cases that Rules 1 and 2 cannot resolve. For example, these two chunks have the same length:

Chunk1: C1C2_C3C4_C5C6

Chunk2: C1C2C3_C4C5_C6

Rule 3 picks the first word of the chunk with the smallest variance of word lengths. In the above example, it picks "C1C2" from the first chunk "C1C2_C3C4_C5C6". The assumption behind this rule is that word lengths are usually evenly distributed. If more than one chunk shares the smallest variance of word lengths, apply the next rule.
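Rule 3 as a filter, computing the population variance of the word lengths in each chunk (names illustrative):

```python
def rule3_min_variance(cands):
    """Rule 3: keep only the chunks with the smallest variance of word lengths."""
    def variance(chunk):
        lengths = [len(w) for w in chunk]
        mean = sum(lengths) / len(lengths)
        return sum((n - mean) ** 2 for n in lengths) / len(lengths)
    best = min(variance(c) for c in cands)
    return [c for c in cands if variance(c) == best]

print(rule3_min_variance([["研究", "生命", "起源"], ["研究生", "命", "起源"]]))
# the evenly distributed chunk (variance 0) wins: [['研究', '生命', '起源']]
```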

Rule 4: This example shows two chunks with the same length, variance, and average word length:

Chunk1: C1_C2_C3C4

Chunk2: C1_C2C3_C4

Both chunks have two one-character words and one two-character word. Which one is more likely to be correct? Here we focus on the one-character words. Chinese characters differ in their degree of morphemic freedom: some characters are rarely used as free morphemes, while others have a larger degree of freedom. The frequency of occurrence of a character can serve as an index of its degree of morphemic freedom; a high-frequency character is more likely to be a one-character word, and vice versa.

The sum of the degrees of morphemic freedom of a chunk is computed by summing log(frequency) over all one-character words in the chunk. In the above example, the sums for Chunk1 and Chunk2 are log(freq("C1")) + log(freq("C2")) and log(freq("C1")) + log(freq("C4")), respectively. The rationale for the logarithmic transformation is that the same amount of frequency difference does not have a consistent effect across all frequency ranges.

Rule 4 then picks the first word of the chunk with the largest sum of log(frequency). Since it is very unlikely that two characters have exactly the same frequency, there should be no ambiguity after this rule has been applied.
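Rule 4 as a filter, using a made-up frequency table for illustration (a real system would use corpus counts):

```python
import math

# Illustrative character frequencies; not real corpus data.
FREQ = {"主": 100, "是": 5000}

def rule4_max_morphemic_freedom(cands):
    """Rule 4: keep the chunks whose one-character words have the
    largest summed log(frequency)."""
    def freedom(chunk):
        return sum(math.log(FREQ.get(w, 1)) for w in chunk if len(w) == 1)
    best = max(freedom(c) for c in cands)
    return [c for c in cands if freedom(c) == best]

print(rule4_max_morphemic_freedom([["主要", "是", "因为"], ["主", "要是", "因为"]]))
# "是" is far more frequent than "主", so the first chunk wins
```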

Example

Now we give an example to illustrate the algorithm. Suppose we want to segment the Chinese sentence "研究生命起源的文章" ("an article on the origin of life"). Using complex matching (each chunk has up to three words) and the words in the dictionary, we obtain the following 7 chunks:

Chunk1:研_究_生

Chunk2:研_究_生命

Chunk3:研究_生_命

Chunk4:研究_生命_起

Chunk5:研究_生命_起源

Chunk6:研究生_命_起

Chunk7:研究生_命_起源

Then the four ambiguity resolution rules are applied to choose the best matching chunk.

First, we apply Rule 1: maximum matching.

Number | Chunk | Length
1 | 研_究_生 | 3
2 | 研_究_生命 | 4
3 | 研究_生_命 | 4
4 | 研究_生命_起 | 5
5 | 研究_生命_起源 | 6
6 | 研究生_命_起 | 5
7 | 研究生_命_起源 | 6

If only one chunk remains at this step, return it; if more than one chunk has the maximum length, apply the next rule. Here two chunks share the maximum length: Chunk5 and Chunk7.

Second, we apply Rule 2: largest average word length.

Number | Chunk | Average word length
5 | 研究_生命_起源 | 6/3 = 2
7 | 研究生_命_起源 | 6/3 = 2

In the same way, if only one chunk remains at this step, return it; if more than one chunk has the largest average word length, apply the next rule. Here Chunk5 and Chunk7 both remain.

Third, we apply Rule 3: smallest variance of word lengths.

Variance is the mean of the squared deviations of a variable from its mean. Take Chunk7 for example: the mean word length is 2 and the word lengths are 3, 1, and 2, so the variance is ((3-2)² + (1-2)² + (2-2)²)/3 = 2/3.

Number | Chunk | Variance
5 | 研究_生命_起源 | 0
7 | 研究生_命_起源 | 2/3

Rule 3 picks the first word of the chunk with the smallest variance of word lengths; here it picks "研究" from Chunk5 (研究_生命_起源). If more than one chunk shared the smallest variance, the next rule would be applied.

Fourth, we apply Rule 4: largest sum of degrees of morphemic freedom of one-character words. (Rule 3 has already produced a unique chunk in our running example, so Rule 4 is illustrated with a different sentence.)

This example shows two chunks with the same length, variance, and average word length:

Number | Chunk | Sum of degrees of morphemic freedom
1 | 主要_是_因为 | log(freq("是"))
2 | 主_要是_因为 | log(freq("主"))

As before, the sum of the degrees of morphemic freedom is computed by summing log(frequency) over the one-character words in each chunk.

Rule 4 then picks the first word of the chunk with the largest sum of log(frequency). Since it is very unlikely that two characters have exactly the same frequency, there should be no ambiguity after this rule has been applied.

After each round of computing and filtering, only one chunk remains. We take the first word of that chunk as a segmented word, and the remaining words are re-segmented in later rounds. For example, given Chunk5 ("研究_生命_起源"), we take the first word "研究", remove it from the sentence, and continue by segmenting "生命起源的文章", repeating this process until the last word of the string is identified.
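Putting the pieces together, the whole procedure can be sketched as one small program. The lexicon, frequency table, and names are illustrative assumptions; a real implementation would load a full dictionary and corpus frequencies:

```python
import math

# Toy lexicon and frequencies for illustration only.
LEXICON = {"研", "究", "生", "命", "起", "的", "研究", "生命", "起源", "研究生", "文章"}
FREQ = {"研": 20, "究": 5, "生": 500, "命": 100, "起": 300, "的": 9000}
MAX_WORD_LEN = 3

def chunks(text, i=0, depth=3):
    """All chunks of at most `depth` words starting at position i."""
    if depth == 0 or i == len(text):
        return [[]]
    words = [text[i:j] for j in range(i + 1, min(len(text), i + MAX_WORD_LEN) + 1)
             if text[i:j] in LEXICON] or [text[i]]
    return [[w] + rest for w in words for rest in chunks(text, i + len(w), depth - 1)]

def pick(cands):
    """Apply Rules 1-4 in order until a single chunk survives."""
    total = lambda c: sum(len(w) for w in c)
    avg = lambda c: total(c) / len(c)
    var = lambda c: sum((len(w) - avg(c)) ** 2 for w in c) / len(c)
    freedom = lambda c: sum(math.log(FREQ.get(w, 1)) for w in c if len(w) == 1)
    for key, best in ((total, max), (avg, max), (var, min), (freedom, max)):
        target = best(key(c) for c in cands)
        cands = [c for c in cands if key(c) == target]
        if len(cands) == 1:
            break
    return cands[0]

def mmseg(text):
    words = []
    while text:
        word = pick(chunks(text))[0]  # keep only the first word of the best chunk
        words.append(word)
        text = text[len(word):]
    return words

print(mmseg("研究生命起源的文章"))
# → ['研究', '生命', '起源', '的', '文章']
```

With this toy dictionary the program reproduces the segmentation derived step by step above, where the greedy simple matcher would have failed.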

References

1. C.-H. Tsai. MMSEG: A Word Identification System for Mandarin Chinese Text Based on Two Variants of the Maximum Matching Algorithm [Online]. Available: http://technology.chtsai.org/mmseg/

2. K.-J. Chen and S.-H. Liu. Word Identification for Mandarin Chinese Sentences. Proceedings of the Fifteenth International Conference on Computational Linguistics (COLING-92), Nantes, 1992.

posted on 2012-08-03 10:28 by Alex木头
