Python自然语言处理学习笔记(10):2.2 条件频率分布
转载请注明出处“一块努力的牛皮糖”:http://www.cnblogs.com/yuxc/
新手上路,翻译不恰之处,恳请指出,不胜感谢
Updated 1st 2011.8.5
2.2 Conditional Frequency Distributions 条件频率分布
We introduced frequency distributions in Section 1.3. We saw that given some list mylist of words or other items, FreqDist(mylist)would compute the number of occurrences of each item in the list. Here we will generalize this idea(这里我们将泛化这个想法).
When the texts of a corpus are divided into several categories (by genre, topic, author,etc.), we can maintain separate frequency distributions for each category. This will allow us to study systematic differences between the categories. In the previous section, we achieved this using NLTK’s ConditionalFreqDist data type. A conditional frequency distribution(条件频率分布)is a collection of frequency distributions, each one for a different"condition."(所谓“条件”其实就是类型嘛) The condition will often be the category of the text. Figure 2-4 depicts a fragment of a conditional frequency distribution having just two conditions, one for news text and one for romance text.
Figure 2-4. Counting words appearing in a text collection (文本搜集)(a conditional frequency distribution).
Conditions and Events 条件和事件
A frequency distribution counts observable events, such as the appearance of words in a text. A conditional frequency distribution needs to pair each event with a condition(用条件来对每一个事件配对).So instead of processing a sequence of words①, we have to process a sequence of pairs②:
>>> pairs = [('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
Each pair has the form (condition, event). If we were processing the entire Brown Corpus by genre, there would be 15 conditions (one per genre) and 1,161,192 events (one per word)
Counting Words by Genre 按类型统计单词
In Section 2.1, we saw a conditional frequency distribution where the condition was the section of the Brown Corpus, and for each condition we counted words. Whereas FreqDist() takes a simple list as input, ConditionalFreqDist()takes a list of pairs.
>>> cfd = nltk.ConditionalFreqDist(
... (genre, word)
... for genre in brown.categories()
... for word in brown.words(categories=genre))
Let’s break this down(分解), and look at just two genres, news and romance. For each genre②, we loop over every word in the genre③, producing pairs consisting of the genre and the word①:
... for genre in ['news', 'romance'] ②
... for word in brown.words(categories=genre)] ③
>>> len(genre_word)
170576
So, as we can see in the following code, pairs at the beginning of the list genre_word will be of the form ('news', word) ①, whereas those at the end will be of the form ('romance', word)②.
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')] ①
>>> genre_word[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')] ②
We can now use this list of pairs to create a ConditionalFreqDist, and save it in a variable cfd. As usual, we can type the name of the variable to inspect it①, and verify it has two conditions②:
>>> cfd ①
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance'] ②
Let’s access the two conditions, and satisfy ourselves that each is just a frequency distribution:
<FreqDist with 100554 outcomes>
>>> cfd['romance']
<FreqDist with 70022 outcomes>
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'I', 'in', 'he', 'had',
'?', 'her', 'that', 'it', 'his', 'she', 'with', 'you', 'for', 'at', 'He', 'on', 'him'
'said', '!', '--', 'be', 'as', ';', 'have', 'but', 'not', 'would', 'She', 'The', ...]
>>> cfd['romance']['could']
193
Plotting and Tabulating Distributions 绘图和制表分布
Apart from combining two or more frequency distributions, and being easy to initialize, a ConditionalFreqDistprovides some useful methods for tabulation and plotting.
The plot in Figure 2-1 was based on a conditional frequency distribution reproduced in the following code. The condition is either of the words america or citizen②, and the counts being plotted are the number of times the word occurred in a particular speech. It exploits the fact that the filename for each speech—for example, 1865-Lincoln.txt—contains the year as the first four characters①. This code generates the pair ('america', '1865') for every instance of a word whose lowercased form starts with america—such as Americans—in the file 1865-Lincoln.txt.
>>> cfd = nltk.ConditionalFreqDist(
... (target, fileid[:4])
... for fileid in inaugural.fileids()
... for w in inaugural.words(fileid)
... for target in ['america', 'citizen']
... if w.lower().startswith(target))
The plot in Figure 2-2 was also based on a conditional frequency distribution, reproduced in the following code. This time, the condition is the name of the language, and the counts being plotted are derived from(来源于) word lengths①. It exploits the fact that the filename for each language is the language name followed by '-Latin1' (the character encoding字符编码).
>>> languages = ['Chickasaw', 'English', 'German_Deutsch',
... 'Greenlandic_Inuktikut', 'Hungarian_Magyar', 'Ibibio_Efik']
>>> cfd = nltk.ConditionalFreqDist(
... (lang, len(word)) ①
... for lang in languages
... for word in udhr.words(lang + '-Latin1'))
In the plot() and tabulate() methods, we can optionally specify which conditions to display with a conditions= parameter. When we omit it, we get all the conditions. Similarly, we can limit the samples to display with a samples= parameter. This makes it possible to load a large quantity of data into a conditional frequency distribution, and then to explore it by plotting or tabulating selected conditions and samples. It also gives us full control over the order of conditions and samples in any displays. For example, we can tabulatethe cumulative frequency data just for two languages, and for words less than 10 characters long, as shown next. We interpret the last cell on the top row to mean that 1,638 words of the English text have nine or fewer letters.
... samples=range(10), cumulative=True)
0 1 2 3 4 5 6 7 8 9
English 0 185 525 883 997 1166 1283 1440 1558 1638
German_Deutsch 0 171 263 614 717 894 1013 1110 1213 1275
如果关了cumulative=False,结果就如下:
In [8]: cfd.tabulate(conditions=['English', 'German_Deutsch'],
...: samples=range(10), cumulative=False)
0 1 2 3 4 5 6 7 8 9
English 0 185 340 358 114 169 117 157 118 80
German_Deutsch 0 171 92 351 103 177 119 97 103 62
Your Turn: Working with the news and romance genres from the Brown Corpus, find out which days of the week are most newsworthy(有报道价值的), and which are most romantic. Define a variable called days containing a list of days of the week, i.e., ['Monday', ...]. Now tabulate the counts for these words using cfd.tabulate(samples=days). Now try the same thing using plot in place of tabulate. You may control the output order of days with the help of an extra parameter: conditions=['Monday', ...].
You may have noticed that the multiline expressions we have been using with conditional frequency distributions look like list comprehensions, but without the brackets. In general, when we use a list comprehension as a parameter to a function, like set([w.lower for w in t]), we are permitted to omit the square brackets and just write set(w.lower() for w in t). (See the discussion of “generator expressions” in Section 4.2 for more about this.)
Generating Random Text with Bigrams 用双连词产生随机文本
We can use a conditional frequency distribution to create a table of bigrams (wordpairs, introduced in Section 1.3). The bigrams() function takes a list of words and builds a list of consecutive word pairs:
... 'and', 'the', 'earth', '.']
>>> nltk.bigrams(sent)
[('In', 'the'), ('the', 'beginning'), ('beginning', 'God'), ('God', 'created'),
('created', 'the'), ('the', 'heaven'), ('heaven', 'and'), ('and', 'the'),
('the', 'earth'), ('earth', '.')]
In Example 2-1, we treat each word as a condition, and for each one we effectively create a frequency distribution over the following words. The function generate_model() contains a simple loop to generate text. When we call the function, we choose a word (such as 'living') as our initial context. Then, once inside the loop, we print the current value of the variable word, and reset word to be the most likely token in that context (using max()); next time through the loop, we use that word as our new context. As you can see by inspecting the output, this simple approach to text generation tends to get stuck(陷入僵局) in loops. Another method would be to randomly choose the next word from among the available words.
Example 2-1. Generating random text: This program obtains all bigrams from the text of the book of Genesis, then constructs a conditional frequency distribution to record which words are most likely to follow a given word; e.g., after the word living, the most likely word is creature; the generate_model() function uses this data, and a seed(种子) word, to generate random text.
for i in range(num):
print word,
word = cfdist[word].max()
text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)①
>>> print cfd['living']
<FreqDist: 'creature': 7, 'thing': 4, 'substance': 2, ',': 1, '.': 1, 'soul': 1>
>>> generate_model(cfd, 'living')
living creature that he said , and the land of the land of the land
Conditional frequency distributions are a useful data structure for many NLP tasks.Their commonly used methods are summarized in Table 2-4.
Table 2-4. NLTK’s conditional frequency distributions: Commonly used methods and idioms for defining, accessing, and visualizing a conditional frequency distribution of counters
Example |
Description |
cfdist = ConditionalFreqDist(pairs) |
Create a conditional frequency distribution from a list of pairs pair是一个二元组 |
cfdist.conditions() |
Alphabetically sorted list of conditions 按字母排序的 |
cfdist[condition] |
The frequency distribution for this condition |
cfdist[condition][sample] |
Frequency for the given sample for this condition |
cfdist.tabulate() |
Tabulate the conditional frequency distribution |
cfdist.tabulate(samples, conditions) |
Tabulation limited to the specified samples and conditions |
cfdist.plot() |
Graphical plot of the conditional frequency distribution |
cfdist.plot(samples, conditions) |
Graphical plot limited to the specified samples and conditions |
cfdist1 < cfdist2 |
Test if samples in cfdist1 occur less frequently than in cfdist2 |