Well, today I learned about word normalization and stemming.

After word tokenization, we should stem the tokens to map them to a normal form. For example, "am", "are", and "is" should all map to "be", "windows" should map to "window", and so on. Afterwards, we can implement a rough version of this with standard Linux tools.

First, you know, put every word onto its own line and display the result.
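A minimal sketch with tr, assuming the input lives in a hypothetical file text.txt:

    # Replace every run of non-letter characters with a newline, so
    # each word lands on its own line (-c complements the letter set,
    # -s squeezes repeated newlines into one).
    tr -sc 'A-Za-z' '\n' < text.txt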


Then translate the capitals to lowercase.
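Also tr, chained onto the previous step (same hypothetical text.txt):

    # Map every uppercase letter to its lowercase counterpart.
    tr -sc 'A-Za-z' '\n' < text.txt | tr 'A-Z' 'a-z'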


grep, which stands for "globally search for a regular expression and print the matching lines", allows you to use regular expressions to filter the words.
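For instance, to keep only the words ending in "ing" (the pattern here is my own example):

    # '$' anchors the pattern at the end of the line, so only words
    # whose last three letters are "ing" survive.
    tr -sc 'A-Za-z' '\n' < text.txt | grep 'ing$'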

Then, you should sort before using uniq -c, since uniq only collapses adjacent duplicate lines; afterwards, you can sort by the count with sort -n, which is increasing by default unless you add '-r'.
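Put together, a word-frequency pipeline could look like this (still assuming text.txt):

    # Sort so duplicates become adjacent, count them with uniq -c,
    # then sort numerically by the count, largest first.
    tr -sc 'A-Za-z' '\n' < text.txt | tr 'A-Z' 'a-z' | sort | uniq -c | sort -rn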

Note: when paging through the output in less, you can press 'G' to go to the end and 'g' to go back to the beginning.
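For example, to browse a long frequency list page by page (my own sketch):

    # Pipe the counts into less and navigate with g / G.
    tr -sc 'A-Za-z' '\n' < text.txt | sort | uniq -c | sort -n | less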


You know, some of the words ending with "ing" (one-syllable ones like "king" or "thing") are not what we want to find, so we can tighten the regular expression.
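One possible refinement, though the exact pattern is my guess, is to require a vowel somewhere before the suffix:

    # '[aeiou].*ing$' demands a vowel before the final "ing", which
    # drops one-syllable words like "king", "sing", and "thing".
    tr -sc 'A-Za-z' '\n' < text.txt | grep '[aeiou].*ing$'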

Well, the result is better, even though some unwanted words are still included. There is a long road ahead of me.


In conclusion, we should first segment the text into words, and then normalize those words.
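The whole flow in one pipeline, from raw text to a ranked list of normalized words (assuming the hypothetical text.txt as before):

    # Tokenize -> lowercase -> count -> rank -> browse.
    tr -sc 'A-Za-z' '\n' < text.txt |
      tr 'A-Z' 'a-z' |
      sort | uniq -c | sort -rn |
      less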


posted on 2013-04-01 10:22 MrMission