Python Natural Language Processing Study Notes (27): 3.11 Further Reading

3.11 Further Reading

 

Extra materials for this chapter are posted at http://www.nltk.org/, including links to freely available resources on the Web. Remember to consult the Python reference materials at http://docs.python.org/. (For example, this documentation covers “universal newline support,” explaining how to work with the different newline conventions used by various operating systems.)
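As a small illustration of the universal newline support mentioned above, the following sketch decodes bytes that mix Unix, Windows, and old Mac line endings. The helper name read_lines_universal and the sample bytes are made up for this note; the only library behavior relied on is that io.TextIOWrapper with newline=None translates "\r\n" and "\r" to "\n" when reading.

```python
import io

def read_lines_universal(raw_bytes, encoding="utf-8"):
    """Decode bytes in universal newline mode and split into lines.

    Hypothetical helper for illustration; newline=None asks TextIOWrapper
    to translate '\r\n' and '\r' into '\n' while reading.
    """
    stream = io.TextIOWrapper(io.BytesIO(raw_bytes), encoding=encoding, newline=None)
    return stream.read().split("\n")

# Sample data mixing three newline conventions (invented for this example).
mixed = b"unix line\nwindows line\r\nold mac line\rlast"
lines = read_lines_universal(mixed)
print(lines)
```

Opening a real file in text mode with the built-in open() uses the same translation by default, so explicit handling like this is mostly needed when the data arrives as raw bytes.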

For more examples of processing words with NLTK, see the tokenization, stemming, and corpus HOWTOs at http://www.nltk.org/howto. Chapters 2 and 3 of (Jurafsky & Martin, 2008) contain more advanced material on regular expressions and morphology.

For more extensive discussion of text processing with Python, see (Mertz, 2003). For information about normalizing non-standard words, see (Sproat et al., 2001).

There are many references for regular expressions, both practical and theoretical. For an introductory tutorial on using regular expressions in Python, see Kuchling’s Regular Expression HOWTO, http://www.amk.ca/python/howto/regex/. For a comprehensive and detailed manual on using regular expressions, covering their syntax in most major programming languages, including Python, see (Friedl, 2002). Other presentations include Section 2.1 of (Jurafsky & Martin, 2008), and Chapter 3 of (Mertz, 2003).

There are many online resources for Unicode. Useful discussions of Python’s facilities for handling Unicode are:

PEP-100, http://www.python.org/dev/peps/pep-0100/

Jason Orendorff, Unicode for Programmers, http://www.jorendorff.com/articles/unicode/

A. M. Kuchling, Unicode HOWTO, http://www.amk.ca/python/howto/unicode

Frederik Lundh, Python Unicode Objects, http://effbot.org/zone/unicode-objects.htm

Joel Spolsky, The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), http://www.joelonsoftware.com/articles/Unicode.html

 

The problem of tokenizing Chinese text is a major focus of SIGHAN, the ACL Special Interest Group on Chinese Language Processing (http://sighan.org/). Our method for segmenting English text follows (Brent & Cartwright, 1995); this work falls in the area of language acquisition (Niyogi, 2006).

Collocations are a special case of multiword expressions. A multiword expression is a small phrase whose meaning and other properties (e.g., its part of speech) cannot be predicted from its words alone (Baldwin & Kim, 2010).

Simulated annealing is a heuristic for finding a good approximation to the optimum value of a function in a large, discrete search space, based on an analogy with annealing in metallurgy. The technique is described in many Artificial Intelligence texts.
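The idea can be sketched in a few lines of Python. This is a generic, minimal version written for this note, not the segmentation code from the chapter: the function names, the geometric cooling schedule, the parameter values, and the toy cost function are all assumptions chosen for illustration.

```python
import math
import random

def simulated_annealing(initial, cost, neighbor, rng,
                        start_temp=10.0, cooling=0.95, steps=500):
    """Minimal simulated annealing sketch over a discrete search space."""
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    temp = start_temp
    for _ in range(steps):
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept a worse candidate with
        # probability exp(-delta / temp), which shrinks as temp cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
        temp *= cooling  # geometric cooling schedule (an assumption)
    return best, best_cost

# Toy problem (invented): find an integer in 0..100 minimizing (x - 7)^2.
def toy_cost(x):
    return (x - 7) ** 2

def toy_neighbor(x, rng):
    # Move one step left or right, staying inside the search space.
    return min(100, max(0, x + rng.choice([-1, 1])))

start = 50
best, best_cost = simulated_annealing(start, toy_cost, toy_neighbor, random.Random(0))
print(best, best_cost)
```

Early on, the high temperature lets the search accept some worse moves and escape local optima; as the temperature falls it behaves more and more like plain hill climbing.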

The approach to discovering hyponyms in text using search patterns like "x and other ys" is described by (Hearst, 1992).
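A rough sketch of such a search pattern, using only Python's re module, might look like the following. The pattern is deliberately simplified (single-word x and ys, no part-of-speech information), and the function name and sample sentence are invented for this note; it is not Hearst's full method.

```python
import re

# Simplified, hypothetical pattern: one word (x), the literal "and other",
# then one word (ys). Real patterns would work over noun phrases.
AND_OTHER = re.compile(r"\b(\w+) and other (\w+)\b")

def find_hyponym_candidates(text):
    """Return (hyponym, hypernym) candidate pairs matching 'x and other ys'."""
    return [(m.group(1), m.group(2)) for m in AND_OTHER.finditer(text)]

# Invented sample sentence.
sample = "We saw robins and other birds, then bought apples and other fruits."
pairs = find_hyponym_candidates(sample)
print(pairs)
```

The matches are only candidates: a phrase such as "this and other reasons" fits the same pattern without expressing a hyponym relation, so results usually need filtering.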

posted @ 2011-08-11 22:21  牛皮糖NewPtone