Python Natural Language Processing Study Notes (26): 3.10 Summary

3.10 Summary

• In this book we view a text as a list of words. A “raw text” is a potentially long string containing words and whitespace formatting, and is how we typically store and visualize a text.

• A string is specified in Python using single or double quotes: 'Monty Python', "Monty Python".

• The characters of a string are accessed using indexes, counting from zero: 'Monty Python'[0] gives the value M. The length of a string is found using len().

• Substrings are accessed using slice notation: 'Monty Python'[1:5] gives the value onty. If the start index is omitted, the substring begins at the start of the string; if the end index is omitted, the slice continues to the end of the string.

• Strings can be split into lists: 'Monty Python'.split() gives ['Monty', 'Python']. Lists can be joined into strings: '/'.join(['Monty', 'Python']) gives 'Monty/Python'.

• We can read text from a file f using text = open(f).read(). We can read text from a URL u using text = urlopen(u).read(). We can iterate over the lines of a text file using for line in open(f).

• Texts found on the Web may contain unwanted material (such as headers, footers, and markup) that needs to be removed before we do any linguistic processing.

• Tokenization is the segmentation of a text into basic units, or tokens, such as words and punctuation. Tokenization based on whitespace is inadequate for many applications because it bundles punctuation together with words. NLTK provides an off-the-shelf tokenizer, nltk.word_tokenize().

• Lemmatization is a process that maps the various forms of a word (such as appeared, appears) to the canonical or citation form of the word, also known as the lexeme or lemma (e.g., appear).

• Regular expressions are a powerful and flexible method of specifying patterns. Once we have imported the re module, we can use re.findall() to find all substrings in a string that match a pattern.

• If a regular expression string includes a backslash, you should tell Python not to preprocess the string, by using a raw string with an r prefix: r'regexp'.

• When backslash is used before certain characters, e.g., \n, this takes on a special meaning (newline character); however, when backslash is used before regular expression wildcards and operators, e.g., \., \|, \$, these characters lose their special meaning and are matched literally.

• A string formatting expression template % arg_tuple consists of a format string template that contains conversion specifiers like %-6s and %0.2d.
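The string points in this summary (indexing, slicing, splitting, and joining) can be tried out with a short sketch like the one below. The variable names are chosen here for illustration only.

```python
# A short sketch of the basic string operations from the summary.
s = 'Monty Python'

first_char = s[0]   # indexing counts from zero
length = len(s)     # number of characters, including the space
middle = s[1:5]     # slice from index 1 up to, but not including, index 5
head = s[:5]        # start index omitted: begins at the start of the string
tail = s[6:]        # end index omitted: continues to the end of the string

words = s.split()          # split on whitespace into a list of strings
joined = '/'.join(words)   # join the list items with '/' between them

print(first_char, length, middle, head, tail)
print(words, joined)
```

Note that the separator passed to join() appears only between items, so no space is introduced around the slash.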
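For the point about reading text from files, here is a minimal sketch, assuming Python 3. The temporary file and its contents are made up so that the example is self-contained. The URL case is left out because it needs network access; in Python 3, urlopen lives in urllib.request and its read() returns bytes, which need to be decoded before they can be treated as text.

```python
import os
import tempfile

# Create a small throwaway file; its name and contents are illustrative only.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('first line\nsecond line\n')
    path = tmp.name

# Read the whole file into a single string.
with open(path, encoding='utf-8') as f:
    text = f.read()

# Iterate over the file one line at a time, stripping the trailing newline.
with open(path, encoding='utf-8') as f:
    lines = [line.rstrip('\n') for line in f]

os.remove(path)  # clean up the temporary file

print(repr(text))
print(lines)
```

Using a with block is a design choice beyond the one-liner open(f).read() in the summary: it closes the file promptly even if an error occurs.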
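The tokenization point can be illustrated by comparing whitespace splitting with a simple regular-expression tokenizer. The pattern below is a rough stand-in written for this note; it is not how nltk.word_tokenize() works internally, and it handles far fewer cases.

```python
import re

raw = 'Hello, world! How are you?'

# Whitespace tokenization leaves punctuation attached to the words.
whitespace_tokens = raw.split()

# A simple alternative: runs of word characters, or single characters
# that are neither word characters nor whitespace (i.e. punctuation).
regex_tokens = re.findall(r'\w+|[^\w\s]', raw)

print(whitespace_tokens)
print(regex_tokens)
```

In the first list, tokens such as 'Hello,' and 'you?' still carry their punctuation, which is the problem the summary describes; the second list separates them.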
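The three regular-expression points (re.findall(), raw strings, and backslash escapes) can be checked with a small sketch. The sample strings are invented for illustration.

```python
import re

# re.findall() returns every non-overlapping match of the pattern.
# \$ and \. match a literal dollar sign and a literal dot.
text = 'Price: $3.50 | total $12.00'
prices = re.findall(r'\$\d+\.\d+', text)

# An unescaped dot is a wildcard; an escaped dot matches only '.'.
sample = 'abc a.c a-c'
loose = re.findall(r'a.c', sample)
strict = re.findall(r'a\.c', sample)

# Without the r prefix, Python turns '\b' into a backspace character
# before the re module ever sees it, so the word-boundary meaning is lost.
phrase = 'the other theme the'
with_raw = re.findall(r'\bthe\b', phrase)
without_raw = re.findall('\bthe\b', phrase)

print(prices, loose, strict, with_raw, without_raw)
```

The last pair shows why the r prefix matters: the same characters typed in the source produce different patterns depending on whether Python preprocesses the backslash.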
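The string formatting point can be sketched with the two conversion specifiers named in the summary. The values being formatted are made up for illustration.

```python
# %-6s: a string, left-justified in a field at least 6 characters wide.
# %0.2d: an integer shown with at least 2 digits, zero-padded if shorter.
left_justified = '%-6s' % 'dog'
two_digits = '%0.2d' % 5

# Several values are supplied as a tuple, one per conversion specifier.
row = '%-6s|%0.2d' % ('dog', 5)

print(repr(left_justified))
print(repr(two_digits))
print(row)
```

Specifiers with a fixed width like these are handy for lining up columns when printing tabular data.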

posted @ 2011-08-11 22:20 牛皮糖NewPtone
