The following definitions are from Wikipedia.
Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes input for further processing such as parsing or text mining.
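As a minimal illustration (a sketch using Python's `re` module; the pattern is a simple choice for demonstration, not the only way to tokenize), a word-and-punctuation tokenizer might look like:

```python
import re

def tokenize(text):
    """Split raw text into word and punctuation tokens."""
    # \w+ matches runs of word characters; [^\w\s] matches single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Tokens become input for parsing."))
# ['Tokens', 'become', 'input', 'for', 'parsing', '.']
```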
Parsing or syntactic analysis is the process of analysing a string of symbols, either in natural language or in computer languages, according to the rules of a formal grammar. It is closely related to, but not the same as, part-of-speech (POS) tagging, which labels each token with a grammatical category rather than analysing the structure of the whole string.
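For instance, a token sequence can be parsed against a toy context-free grammar with NLTK's chart parser (a sketch assuming the `nltk` package is installed; the grammar below is made up for illustration):

```python
import nltk  # assumes the NLTK library is available

# A toy formal grammar; parsing checks the token sequence against these rules.
grammar = nltk.CFG.fromstring("""
  S   -> NP VP
  NP  -> Det N
  VP  -> V NP
  Det -> 'the'
  N   -> 'dog' | 'ball'
  V   -> 'chases'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['the', 'dog', 'chases', 'the', 'ball']):
    print(tree)
# (S (NP (Det the) (N dog)) (VP (V chases) (NP (Det the) (N ball))))
```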
Text segmentation is the process of dividing written text into meaningful units, such as words, sentences, or topics.
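A naive sentence segmenter can be sketched with a regular expression (the splitting rule below is an illustrative assumption and will break on abbreviations such as "e.g."):

```python
import re

def split_sentences(text):
    """Naive sentence segmentation: split after ., !, or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

print(split_sentences("Segmentation splits text. Words, sentences, or topics are units."))
# ['Segmentation splits text.', 'Words, sentences, or topics are units.']
```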
In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens.
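A hand-rolled lexer for a tiny expression language (the token names and pattern set here are illustrative assumptions, not a real compiler's specification) shows the character-to-token conversion:

```python
import re

# Token specification for a tiny expression language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),          # whitespace is matched but not emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source):
    """Convert a sequence of characters into (kind, text) tokens."""
    for match in MASTER.finditer(source):
        if match.lastgroup != "SKIP":
            yield match.lastgroup, match.group()

print(list(lex("x = 42 + y")))
# [('IDENT', 'x'), ('OP', '='), ('NUMBER', '42'), ('OP', '+'), ('IDENT', 'y')]
```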