Python Natural Language Processing Study Notes (31): 4.3 Questions of Style

4.3   Questions of Style

 

Programming is as much an art as a science. The undisputed "bible" of programming, a 2,500 page multi-volume work by Donald Knuth, is called The Art of Computer Programming. Many books have been written on Literate Programming, recognizing that humans, not just computers, must read and understand programs. Here we pick up on some issues of programming style that have important ramifications for the readability of your code, including code layout, procedural vs declarative style, and the use of loop variables.

 

Python Coding Style

 

When writing programs you make many subtle choices about names, spacing, comments, and so on. When you look at code written by other people, needless differences in style make it harder to interpret the code. Therefore, the designers of the Python language have published a style guide for Python code, available at http://www.python.org/dev/peps/pep-0008/. The underlying value presented in the style guide is consistency, for the purpose of maximizing the readability of code. We briefly review some of its key recommendations here, and refer readers to the full guide for detailed discussion with examples.

Code layout should use four spaces per indentation level. You should make sure that when you write Python code in a file, you avoid tabs for indentation, since these can be misinterpreted by different text editors and the indentation can be messed up. (Different editors expand a tab to a different number of spaces; I have been bitten by this myself.) Lines should be less than 80 characters long; if necessary you can break a line inside parentheses, brackets, or braces, because Python is able to detect that the line continues over to the next line, e.g.:

 

>>> cv_word_pairs = [(cv, w) for w in rotokas_words
...                          for cv in re.findall('[ptksvr][aeiou]', w)]
>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
>>> ha_words = ['aaahhhh', 'ah', 'ahah', 'ahahah', 'ahh', 'ahhahahaha',
...             'ahhh', 'ahhhh', 'ahhhhhh', 'ahhhhhhhhhhhhhh', 'ha',
...             'haaa', 'hah', 'haha', 'hahaaa', 'hahah', 'hahaha']

If you need to break a line outside parentheses, brackets, or braces, you can often add extra parentheses, and you can always add a backslash at the end of the line that is broken:

 

>>> if (len(syllables) > 4 and len(syllables[2]) == 3 and
...    syllables[2][2] in 'aeiou' and syllables[2][3] == syllables[1][3]):
...     process(syllables)
>>> if len(syllables) > 4 and len(syllables[2]) == 3 and \
...    syllables[2][2] in 'aeiou' and syllables[2][3] == syllables[1][3]:
...     process(syllables)

Note

Typing spaces instead of tabs soon becomes a chore. Many programming editors have built-in support for Python, and can automatically indent code and highlight any syntax errors (including indentation errors). For a list of Python-aware editors, please see http://wiki.python.org/moin/PythonEditors.
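The two layout recommendations above (no tabs in indentation, lines shorter than 80 characters) can also be checked mechanically. Here is a minimal sketch, not a replacement for a real style checker; the function name style_problems and the sample text are made up for illustration.

```python
def style_problems(source, max_len=79):
    """Return (line_number, problem) pairs for tab indentation and long lines."""
    problems = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        # The indentation is whatever precedes the first non-blank character.
        indent = line[:len(line) - len(line.lstrip())]
        if '\t' in indent:
            problems.append((lineno, 'tab in indentation'))
        if len(line) > max_len:
            problems.append((lineno, 'line too long'))
    return problems

# Hypothetical sample: line 2 is tab-indented, line 3 is 125 characters long.
sample = "def f(x):\n\treturn x\n" + "y = " + "1 + " * 30 + "1\n"
print(style_problems(sample))
```

The check only looks at leading whitespace, so a tab inside a string literal is not reported.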

 

Procedural vs Declarative Style

 

We have just seen how the same task can be performed in different ways, with implications for efficiency. Another factor influencing program development is programming style. Consider the following program to compute the average length of words in the Brown Corpus:

 

>>> from __future__ import division  # true division, needed for the output below under Python 2
>>> tokens = nltk.corpus.brown.words(categories='news')
>>> count = 0
>>> total = 0
>>> for token in tokens:
...     count += 1
...     total += len(token)
>>> print total / count
4.2765382469

In this program we use the variable count to keep track of the number of tokens seen, and total to store the combined length of all words. This is a low-level style, not far removed from machine code, the primitive operations performed by the computer's CPU. The two variables are just like a CPU's registers, accumulating values at many intermediate stages, values that are meaningless until the end. We say that this program is written in a procedural style, dictating the machine operations step by step. Now consider the following program that computes the same thing:

 

>>> total = sum(len(t) for t in tokens)
>>> print total / len(tokens)
4.2765382469

The first line uses a generator expression to sum the token lengths, while the second line computes the average as before. Each line of code performs a complete, meaningful task, which can be understood in terms of high-level properties like: "total is the sum of the lengths of the tokens". Implementation details are left to the Python interpreter. The second program uses a built-in function, and constitutes programming at a more abstract level; the resulting code is more declarative. Let's look at an extreme example:

 

>>> word_list = []
>>> len_word_list = len(word_list)
>>> i = 0
>>> while i < len(tokens):
...     j = 0
...     while j < len_word_list and word_list[j] < tokens[i]:
...         j += 1
...     # insert only if we ran off the end or the token is not already present
...     if j == len_word_list or tokens[i] != word_list[j]:
...         word_list.insert(j, tokens[i])
...         len_word_list += 1
...     i += 1

The equivalent declarative version uses familiar built-in functions, and its purpose is instantly recognizable:

 

>>> word_list = sorted(set(tokens))
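The equivalence of the two versions can be checked on a small sample. The sketch below uses a made-up token list in place of the Brown Corpus, and a slightly simplified hand-written loop (a for loop with an explicit end-of-list check), so it illustrates the idea and is not the exact program above.

```python
# Hypothetical sample data standing in for the corpus tokens.
tokens = ['the', 'dog', 'gave', 'John', 'the', 'newspaper', 'dog']

# Procedural: maintain a sorted, duplicate-free list by hand.
word_list = []
for token in tokens:
    j = 0
    # Walk forward to the first entry that is not smaller than the token.
    while j < len(word_list) and word_list[j] < token:
        j += 1
    # Insert unless the token is already at that position.
    if j == len(word_list) or word_list[j] != token:
        word_list.insert(j, token)

# Declarative: state the result we want.
declarative = sorted(set(tokens))

print(word_list == declarative)
print(declarative)
```

Uppercase letters sort before lowercase ones in Python string comparison, which is why 'John' comes first in both results.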

Another case where a loop counter seems to be necessary is for printing a counter with each line of output. Instead, we can use enumerate(), which processes a sequence s and produces a tuple of the form (i, s[i]) for each item in s, starting with (0, s[0]). Here we enumerate the keys of the frequency distribution, and capture the integer-string pair in the variables rank and word. We print rank+1 so that the counting appears to start from 1, as required when producing a list of ranked items.

 

>>> fd = nltk.FreqDist(nltk.corpus.brown.words())
>>> cumulative = 0.0
>>> for rank, word in enumerate(fd):
...     cumulative += fd[word] * 100 / fd.N()
...     print "%3d %6.2f%% %s" % (rank+1, cumulative, word)
...     if cumulative > 25:
...         break
...
  1   5.40% the
  2  10.42% ,
  3  14.67% .
  4  17.78% of
  5  20.19% and
  6  22.40% to
  7  24.29% a
  8  25.97% in
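As a side note, enumerate() also accepts a start argument, which removes the need for rank+1. The following is a minimal sketch that assumes a small made-up word list and uses collections.Counter from the standard library in place of nltk.FreqDist, so it runs without NLTK or the corpus.

```python
from collections import Counter

# Hypothetical sample data; Counter stands in for nltk.FreqDist here.
words = ['the', 'dog', 'saw', 'the', 'cat', 'and', 'the', 'dog', 'ran']
counts = Counter(words)
total = sum(counts.values())

cumulative = 0.0
ranked = []
# most_common() yields (word, count) pairs from most to least frequent;
# start=1 makes the ranks begin at 1.
for rank, (word, count) in enumerate(counts.most_common(), start=1):
    cumulative += count * 100.0 / total
    ranked.append((rank, word))
    print("%3d %6.2f%% %s" % (rank, cumulative, word))
    if cumulative > 50:
        break
```

Asking for most_common() makes the frequency ordering explicit instead of relying on the iteration order of the distribution.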

It's sometimes tempting to use loop variables to store a maximum or minimum value seen so far. Let's use this method to find the longest word in a text.

 

>>> text = nltk.corpus.gutenberg.words('milton-paradise.txt')
>>> longest = ''
>>> for word in text:
...     if len(word) > len(longest):
...         longest = word
>>> longest
'unextinguishable'

However, a more transparent solution uses a generator expression and a list comprehension, both having forms that should be familiar by now:

 

>>> maxlen = max(len(word) for word in text)
>>> [word for word in text if len(word) == maxlen]
['unextinguishable', 'transubstantiate', 'inextinguishable', 'incomprehensible']

Note that our first solution found the first word having the longest length, while the second solution found all of the longest words (which is usually what we would want). Although there's a theoretical efficiency difference between the two solutions, the main overhead is reading the data into main memory; once it's there, a second pass through the data is effectively instantaneous. We also need to balance our concerns about program efficiency with programmer efficiency. A fast but cryptic solution will be harder to understand and maintain.
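If only the first longest word is wanted, the loop can also be replaced by a single call to max() with a key function. This is a small sketch on a made-up word list rather than the Milton text; it shows the two behaviours side by side.

```python
# Hypothetical sample data with two words tied for the greatest length.
text = ['sing', 'heavenly', 'muse', 'forbidden', 'tree', 'celestial']

# max() with key=len returns the first item of maximal length,
# matching the behaviour of the explicit loop.
first_longest = max(text, key=len)

# Two passes: find the maximum length, then collect every word that has it.
maxlen = max(len(word) for word in text)
all_longest = [word for word in text if len(word) == maxlen]

print(first_longest)
print(all_longest)
```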

 

Some Legitimate Uses for Counters

 

There are cases where we still want to use loop variables in a list comprehension. For example, we need to use a loop variable to extract successive overlapping n-grams from a list:

 

>>> sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
>>> n = 3
>>> [sent[i:i+n] for i in range(len(sent)-n+1)]
[['The', 'dog', 'gave'],
 ['dog', 'gave', 'John'],
 ['gave', 'John', 'the'],
 ['John', 'the', 'newspaper']]

It is quite tricky to get the range of the loop variable right. Since this is a common operation in NLP, NLTK supports it with functions bigrams(text) and trigrams(text), and a general purpose ngrams(text, n).
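One way to avoid computing the range by hand is to zip the list against shifted copies of itself. The helper below is a sketch with a made-up name, ngrams_sketch; it is not NLTK's implementation and returns tuples rather than lists.

```python
def ngrams_sketch(tokens, n):
    """Return successive overlapping n-grams of tokens as a list of tuples."""
    # zip() stops at the shortest input, so the last complete n-gram
    # is found without any index arithmetic.
    return list(zip(*(tokens[i:] for i in range(n))))

sent = ['The', 'dog', 'gave', 'John', 'the', 'newspaper']
trigrams = ngrams_sketch(sent, 3)
for gram in trigrams:
    print(gram)
```

When n is larger than the number of tokens, the result is simply an empty list.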

We can also use loop variables in building multidimensional structures. For example, to build an array with m rows and n columns, where each cell is a set, we could use a nested list comprehension:

 

>>> import pprint
>>> m, n = 3, 7
>>> array = [[set() for i in range(n)] for j in range(m)]
>>> array[2][5].add('Alice')
>>> pprint.pprint(array)
[[set([]), set([]), set([]), set([]), set([]), set([]), set([])],
 [set([]), set([]), set([]), set([]), set([]), set([]), set([])],
 [set([]), set([]), set([]), set([]), set([]), set(['Alice']), set([])]]

Observe that the loop variables i and j are not used anywhere in the resulting object; they are just needed for a syntactically correct for statement. As another example of this usage, observe that the expression ['very' for i in range(3)] produces a list containing three instances of 'very', with no integers in sight.

Note that it would be incorrect to do this work using multiplication, for reasons concerning object copying that were discussed earlier in this section.

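(If we instead assign array[2][5] = 7, a whole column becomes 7. Does this mean that every row of the array points to the same list object, so that modifying one element changes all the others?)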

 

>>> array = [[set()] * n] * m
>>> array[2][5].add(7)
>>> pprint.pprint(array)
[[set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])],
 [set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])],
 [set([7]), set([7]), set([7]), set([7]), set([7]), set([7]), set([7])]]
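The sharing behind this result can be made visible with the is operator, which tests object identity. The sketch below uses the made-up names shared and independent for the two ways of building the array.

```python
m, n = 3, 7

# Multiplication copies references: one set object, one row object.
shared = [[set()] * n] * m
# The nested comprehension calls set() afresh for every cell.
independent = [[set() for _ in range(n)] for _ in range(m)]

shared[2][5].add(7)
independent[2][5].add(7)

print(shared[0] is shared[1])                  # all rows are the same list
print(shared[0][0] is shared[2][5])            # all cells are the same set
print(independent[0][0] is independent[2][5])  # distinct sets

# Count how many cells report the value 7 in each structure.
shared_hits = sum(7 in cell for row in shared for cell in row)
independent_hits = sum(7 in cell for row in independent for cell in row)
print(shared_hits)
print(independent_hits)
```

Because the outer multiplication repeats a reference to a single row, even an item assignment such as shared[2][5] = 7 would show up in every row, which is why a whole column appears to change.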

Iteration is an important programming device. It is tempting to adopt idioms from other languages. However, Python offers some elegant and highly readable alternatives, as we have seen.

posted @ 2011-08-12 23:12  牛皮糖NewPtone