Python Natural Language Processing Study Notes (41): 5.2 Tagged Corpora
Representing Tagged Tokens
By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag. We can create one of these special tuples from the standard string representation of a tagged token, using the function str2tuple():
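A rough plain-Python sketch of this conversion follows. The helper name `split_tagged_token` is hypothetical and introduced only for illustration; it assumes, as NLTK's own `str2tuple()` does by default, that the tag follows the last `/` and is converted to uppercase.

```python
def split_tagged_token(s, sep="/"):
    """Split 'word/tag' at the last separator into a (word, TAG) tuple."""
    word, found, tag = s.rpartition(sep)
    if not found:
        # No separator present: treat the whole string as an untagged word.
        return (s, None)
    return (word, tag.upper())


tagged_token = split_tagged_token("fly/NN")
print(tagged_token)     # ('fly', 'NN')
print(tagged_token[0])  # fly
print(tagged_token[1])  # NN
```

Splitting at the last separator rather than the first keeps tokens such as `./.` intact.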
We can construct a list of tagged tokens directly from a string. The first step is to tokenize the string to access the individual word/tag strings, and then to convert each of these into a tuple (using str2tuple()).
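The two steps can be sketched without NLTK as follows. The sample sentence and the helper `to_pair` are illustrative assumptions; with NLTK available, `nltk.tag.str2tuple()` would take the place of `to_pair`.

```python
sent = """
The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
other/AP topics/NNS ./.
"""


def to_pair(token, sep="/"):
    """Turn one 'word/tag' string into a (word, TAG) tuple."""
    word, _, tag = token.rpartition(sep)
    return (word, tag.upper())


# Step 1: split on whitespace to get the individual word/tag strings.
# Step 2: convert each of them into a tuple.
tagged = [to_pair(t) for t in sent.split()]
print(tagged[:3])  # [('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN')]
```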
Reading Tagged Corpora
Several of the corpora included with NLTK have been tagged for their part-of-speech. Here's an example of what you might see if you opened a file from the Brown Corpus with a text editor:
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
Other corpora use a variety of formats for storing part-of-speech tags. NLTK's corpus readers provide a uniform interface so that you don't have to be concerned with the different file formats. In contrast with the file extract shown above, the corpus reader for the Brown Corpus represents the data as shown below. Note that part-of-speech tags have been converted to uppercase, since this has become standard practice since the Brown Corpus was published.
Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method. Here are some more examples, again using the output format illustrated for the Brown Corpus:
Not all corpora employ the same set of tags; see the tagset help functionality and the readme() methods mentioned above for documentation. Initially we want to avoid the complications of these tagsets, so we use a built-in mapping to a simplified tagset:
Tagged corpora for several other languages are distributed with NLTK, including Chinese, Hindi, Portuguese, Spanish, Dutch and Catalan. These usually contain non-ASCII text, which Python 2 displays as hexadecimal escape sequences when printing a larger structure such as a list; Python 3 prints the characters directly.
If your environment is set up correctly, with appropriate editors and fonts, you should be able to display individual strings in a human-readable way. For example, Figure 5.1 shows data accessed using nltk.corpus.indian.
Figure 5.1: POS-Tagged Data from Four Indian Languages: Bangla, Hindi, Marathi, and Telugu
If the corpus is also segmented into sentences, it will have a tagged_sents() method that divides up the tagged words into sentences rather than presenting them as one big list. This will be useful when we come to developing automatic taggers, as they are trained and tested on lists of sentences, not words.
Tagged corpora use many different conventions for tagging words. To help us get started, we will be looking at a simplified tagset (shown in Table 5.1).
Tag | Meaning | Examples |
ADJ | adjective | new, good, high, special, big, local |
ADV | adverb | really, already, still, early, now |
CNJ | conjunction | and, or, but, if, while, although |
DET | determiner | the, a, some, most, every, no |
EX | existential | there, there's |
FW | foreign word | dolce, ersatz, esprit, quo, maitre |
MOD | modal verb | will, can, would, may, must, should |
N | noun | year, home, costs, time, education |
NP | proper noun | Alison, Africa, April, Washington |
NUM | number | twenty-four, fourth, 1991, 14:24 |
PRO | pronoun | he, their, her, its, my, I, us |
P | preposition | on, of, at, with, by, into, under |
TO | the word to | to |
UH | interjection | ah, bang, ha, whee, hmpf, oops |
V | verb | is, has, get, do, make, see, run |
VD | past tense | said, took, told, made, asked |
VG | present participle | making, going, playing, working |
VN | past participle | given, taken, begun, sung |
WH | wh determiner | who, which, when, what, where, how |
Table 5.1: Simplified Part-of-Speech Tagset
Let's see which of these tags are the most common in the news category of the Brown corpus:
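As a minimal sketch of the counting involved, the snippet below tallies tags over a small hand-made sample instead of the Brown news category. The sample data and the helper `top_share` are assumptions made for illustration; `collections.Counter` stands in for NLTK's `FreqDist`.

```python
from collections import Counter

# Hand-made (word, tag) sample; real data would come from a corpus reader.
tagged_words = [
    ("The", "DET"), ("jury", "N"), ("said", "VD"), ("the", "DET"),
    ("recent", "ADJ"), ("election", "N"), ("produced", "VD"), ("no", "DET"),
    ("evidence", "N"), ("of", "P"), ("irregularities", "N"), (".", "."),
]

# Count how often each tag occurs, ignoring the words themselves.
tag_fd = Counter(tag for (word, tag) in tagged_words)
print(tag_fd.most_common(3))  # [('N', 4), ('DET', 3), ('VD', 2)]


def top_share(fd, k):
    """Fraction of all tokens covered by the k most frequent tags."""
    total = sum(fd.values())
    return sum(count for _, count in fd.most_common(k)) / total


print(f"{top_share(tag_fd, 2):.0%}")  # 58%
```

The same `top_share` calculation is what a cumulative frequency plot shows graphically.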
Note
Your Turn: Plot the above frequency distribution using tag_fd.plot(cumulative=True). What percentage of words are tagged using the first five tags of the above list? (Answer: 60%.)
We can use these tags to do powerful searches using a graphical POS-concordance tool nltk.app.concordance(). Use it to search for any combination of words and POS tags, e.g. N N N N, hit/VD, hit/VN, or the ADJ man.
Nouns
Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb, as shown in Table 5.2.
Word | After a determiner | Subject of the verb |
woman | the woman who I saw yesterday ... | the woman sat down |
Scotland | the Scotland I remember as a child ... | Scotland has five million people |
book | the book I bought yesterday ... | this book recounts the colonization of Australia |
intelligence | the intelligence displayed by the child ... | Mary's intelligence impressed her teachers |
Table 5.2: Syntactic Patterns involving some Nouns
The simplified noun tags are N for common nouns like book, and NP for proper nouns like Scotland.
Let's inspect some tagged text to see what parts of speech occur before a noun, with the most frequent ones first. To begin with, we construct a list of bigrams whose members are themselves word-tag pairs such as (('The', 'DET'), ('Fulton', 'NP')) and (('Fulton', 'NP'), ('County', 'N')). Then we construct a FreqDist from the tag parts of the bigrams.
Here (a, b) is a bigram such as (('The', 'DET'), ('Fulton', 'NP')); if b[1] == 'N', we take a[1], the tag of the preceding word.
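The bigram step can be tried on a small hand-made sample. This is only a sketch: the sample sentence is invented, `zip` over two offset copies of the list plays the role of a bigram function, and `Counter` stands in for `FreqDist`.

```python
from collections import Counter

tagged_words = [
    ("The", "DET"), ("jury", "N"), ("said", "VD"), ("the", "DET"),
    ("recent", "ADJ"), ("election", "N"), ("produced", "VD"), ("no", "DET"),
    ("evidence", "N"), ("of", "P"), ("two", "NUM"), ("irregularities", "N"),
    (".", "."),
]

# Adjacent pairs of (word, tag) tuples, i.e. bigrams over tagged tokens.
word_tag_pairs = list(zip(tagged_words, tagged_words[1:]))

# For every bigram (a, b) whose second member is a noun, count the tag of a.
tags_before_nouns = Counter(a[1] for (a, b) in word_tag_pairs if b[1] == "N")
print(tags_before_nouns.most_common(1))  # [('DET', 2)]
```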
This confirms our assertion that nouns occur after determiners and adjectives, including numeral adjectives (tagged as NUM).
Verbs
Verbs are words that describe events and actions, e.g. fall, eat in Table 5.3. In the context of a sentence, verbs typically express a relation involving the referents of one or more noun phrases.
Word | Simple | With modifiers and adjuncts (italicized) |
fall | Rome fell | Dot com stocks suddenly fell like a stone |
eat | Mice eat cheese | John ate the pizza with gusto |
Table 5.3: Syntactic Patterns involving some Verbs
What are the most common verbs in news text? Let's sort all the verbs by frequency:
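One way to sketch this ranking is to count whole (word, tag) pairs and then keep only those whose tag starts with V. The sample below is invented for illustration and is far too small to say anything about real news text.

```python
from collections import Counter

tagged_words = [
    ("He", "PRO"), ("said", "VD"), ("it", "PRO"), ("is", "V"), ("late", "ADJ"),
    ("She", "PRO"), ("said", "VD"), ("he", "PRO"), ("is", "V"), ("right", "ADJ"),
    ("They", "PRO"), ("said", "VD"), ("no", "DET"), ("one", "N"),
    ("told", "VD"), ("him", "PRO"),
]

# The items being counted are (word, tag) pairs, not bare words.
word_tag_fd = Counter(tagged_words)

# Walk the pairs from most to least frequent and keep the verbs.
verbs = [
    word + "/" + tag
    for ((word, tag), count) in word_tag_fd.most_common()
    if tag.startswith("V")
]
print(verbs)  # ['said/VD', 'is/V', 'told/VD']
```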
Note that the items being counted in the frequency distribution are word-tag pairs. Since words and tags are paired, we can treat the word as a condition and the tag as an event, and initialize a conditional frequency distribution with a list of condition-event pairs. This lets us see a frequency-ordered list of tags given a word:
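A conditional frequency distribution can be approximated in plain Python as a dictionary of counters, one per condition. The sketch below uses an invented sample and the name `cfd1` purely for illustration; NLTK's `ConditionalFreqDist` offers the same idea with more conveniences.

```python
from collections import Counter, defaultdict

tagged_words = [
    ("the", "DET"), ("yield", "N"), ("fell", "VD"),
    ("they", "PRO"), ("yield", "V"), ("ground", "N"),
    ("a", "DET"), ("higher", "ADJ"), ("yield", "N"),
]

# Condition = word, event = tag: one Counter of tags per word.
cfd1 = defaultdict(Counter)
for word, tag in tagged_words:
    cfd1[word][tag] += 1

# Frequency-ordered list of tags for a given word.
print(cfd1["yield"].most_common())  # [('N', 2), ('V', 1)]
```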
We can reverse the order of the pairs, so that the tags are the conditions, and the words are the events. Now we can see likely words for a given tag:
To clarify the distinction between VD (past tense) and VN (past participle), let's find words which can be both VD and VN, and see some surrounding text:
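The search can be sketched as a set intersection over a tag-to-words mapping, followed by a slice of the token list to show context. The sample sentences and the variable names are assumptions made for this illustration.

```python
from collections import defaultdict

tagged_words = [
    ("He", "PRO"), ("kicked", "VD"), ("the", "DET"), ("ball", "N"), (".", "."),
    ("She", "PRO"), ("has", "V"), ("kicked", "VN"), ("it", "PRO"), (".", "."),
    ("They", "PRO"), ("took", "VD"), ("it", "PRO"), (".", "."),
]

# Tags as conditions, words as events: tag -> set of words seen with that tag.
words_by_tag = defaultdict(set)
for word, tag in tagged_words:
    words_by_tag[tag].add(word)

# Words attested both as past tense (VD) and as past participle (VN).
both = sorted(words_by_tag["VD"] & words_by_tag["VN"])
print(both)  # ['kicked']

# Surrounding text: the two tokens before the first VN use of "kicked".
idx = tagged_words.index(("kicked", "VN"))
context = tagged_words[idx - 2:idx + 1]
print(context)  # [('She', 'PRO'), ('has', 'V'), ('kicked', 'VN')]
```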
In this case, we see that the past participle of kicked is preceded by a form of the auxiliary verb have. Is this generally true?
Note
Your Turn: Given the list of past participles specified by cfd2['VN'].keys(), try to collect a list of all the word-tag pairs that immediately precede items in that list.
Adjectives and Adverbs
Two other important word classes are adjectives and adverbs. Adjectives describe nouns, and can be used as modifiers (e.g. large in the large pizza), or in predicates (e.g. the pizza is large). English adjectives can have internal structure (e.g. fall+ing in the falling stocks). Adverbs modify verbs to specify the time, manner, place or direction of the event described by the verb (e.g. quickly in the stocks fell quickly). Adverbs may also modify adjectives (e.g. really in Mary's teacher was really nice).
English has several categories of closed class words in addition to prepositions, such as articles (also often called determiners) (e.g., the, a), modals (e.g., should, may), and personal pronouns (e.g., she, they). Each dictionary and grammar classifies these words differently.
Note
Your Turn: If you are uncertain about some of these parts of speech, study them using nltk.app.concordance(), or watch some of the Schoolhouse Rock! grammar videos available at YouTube, or consult the Further Reading section at the end of this chapter.
Unsimplified Tags
Let's find the most frequent nouns of each noun part-of-speech type. The program in Example 5.2 finds all tags starting with NN, and provides a few example words for each one. You will see that there are many variants of NN; the most important contain $ for possessive nouns, S for plural nouns (since plural nouns typically end in s) and P for proper nouns. In addition, most of the tags have suffix modifiers: -NC for citations, -HL for words in headlines and -TL for titles (a feature of Brown tags).
Example 5.2 (code_findtags.py): Program to Find the Most Frequent Noun Tags
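A plain-Python sketch of the same idea, run on a small hand-made sample rather than the Brown Corpus. The function name `find_tags`, its `top_n` parameter, and the sample data are illustrative assumptions.

```python
from collections import Counter, defaultdict


def find_tags(tag_prefix, tagged_text, top_n=5):
    """Map each tag starting with tag_prefix to its most frequent words."""
    words_by_tag = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            words_by_tag[tag][word] += 1
    return {
        tag: [word for word, _ in fd.most_common(top_n)]
        for tag, fd in words_by_tag.items()
    }


sample = [
    ("year", "NN"), ("time", "NN"), ("year", "NN"),
    ("years", "NNS"), ("members", "NNS"),
    ("year's", "NN$"), ("President", "NN-TL"), ("County", "NN-TL"),
    ("said", "VBD"),
]

tagdict = find_tags("NN", sample)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```

Even this toy sample shows the pattern described above: the plain NN tag sits alongside variants carrying the `$`, `S` and `-TL` modifiers.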
When we come to constructing part-of-speech taggers later in this chapter, we will use the unsimplified tags.
Exploring Tagged Corpora
Let's briefly return to the kinds of exploration of corpora we saw in previous chapters, this time exploiting POS tags.
Suppose we're studying the word often and want to see how it is used in text. We could ask to see the words that follow often:
However, it's probably more instructive to use the tagged_words() method to look at the part-of-speech tag of the following words:
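Both views, the following words and their tags, can be sketched over one list of adjacent pairs. The sample is invented for illustration, so the counts say nothing about real usage of often.

```python
from collections import Counter

tagged_words = [
    ("He", "PRO"), ("often", "ADV"), ("said", "VD"), ("so", "ADV"), (".", "."),
    ("It", "PRO"), ("is", "V"), ("often", "ADV"), ("used", "VN"), (".", "."),
    ("They", "PRO"), ("often", "ADV"), ("said", "VD"), ("it", "PRO"), (".", "."),
]

# Adjacent pairs of tagged tokens.
pairs = list(zip(tagged_words, tagged_words[1:]))

# Words that immediately follow "often" ...
next_words = Counter(b[0] for (a, b) in pairs if a[0] == "often")
# ... and the part-of-speech tags of those following words.
next_tags = Counter(b[1] for (a, b) in pairs if a[0] == "often")

print(next_words.most_common())  # [('said', 2), ('used', 1)]
print(next_tags.most_common())   # [('VD', 2), ('VN', 1)]
```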
Notice that the highest-frequency parts of speech following often are verbs. Nouns never appear in this position (in this particular corpus).
Next, let's look at some larger context, and find words involving particular sequences of tags and words (in this case "<Verb> to <Verb>"). In Example 5.3 we consider each three-word window in the sentence and check whether it meets our criterion. If the tags match, we print the corresponding words.
Example 5.3 (code_three_word_phrase.py): Searching for Three-Word Phrases Using POS Tags
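A minimal sketch of the three-word window search, assuming sentences are lists of (word, tag) tuples with Brown-style tags. The function name `find_verb_to_verb` and the sample sentences are hypothetical.

```python
def find_verb_to_verb(tagged_sent):
    """Return word triples whose tags match the pattern <V*> TO <V*>."""
    matches = []
    # Slide a three-token window over the sentence.
    windows = zip(tagged_sent, tagged_sent[1:], tagged_sent[2:])
    for (w1, t1), (w2, t2), (w3, t3) in windows:
        if t1.startswith("V") and t2 == "TO" and t3.startswith("V"):
            matches.append((w1, w2, w3))
    return matches


tagged_sents = [
    [("They", "PPSS"), ("want", "VB"), ("to", "TO"), ("leave", "VB"), (".", ".")],
    [("He", "PPS"), ("went", "VBD"), ("to", "IN"), ("town", "NN"), (".", ".")],
    [("She", "PPS"), ("decided", "VBD"), ("to", "TO"), ("stay", "VB"), (".", ".")],
]

found = [triple for sent in tagged_sents for triple in find_verb_to_verb(sent)]
for triple in found:
    print(" ".join(triple))
```

The second sentence is skipped because its to is tagged as a preposition (IN), not as the infinitive marker TO.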
Finally, let's look for words that are highly ambiguous as to their part of speech tag. Understanding why such words are tagged as they are in each context can help us clarify the distinctions between the tags.
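One way to sketch this search is to collect the set of tags seen for each lower-cased word and keep the words with many distinct tags. The sample data and the threshold `min_tags` are assumptions chosen for this small illustration.

```python
from collections import defaultdict

tagged_words = [
    ("that", "CNJ"), ("That", "DET"), ("that", "WH"), ("that", "PRO"),
    ("near", "P"), ("near", "ADJ"), ("near", "ADV"),
    ("the", "DET"), ("The", "DET"),
]

# Word (lower-cased) -> set of distinct tags it was seen with.
tags_by_word = defaultdict(set)
for word, tag in tagged_words:
    tags_by_word[word.lower()].add(tag)

min_tags = 3  # threshold picked for this toy sample
ambiguous = {
    word: sorted(tags)
    for word, tags in tags_by_word.items()
    if len(tags) >= min_tags
}
for word in sorted(ambiguous):
    print(word, " ".join(ambiguous[word]))
```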
Note
Your Turn: Open the POS concordance tool nltk.app.concordance() and load the complete Brown Corpus (simplified tagset). Now pick some of the above words and see how the tag of the word correlates with the context of the word. E.g. search for near to see all forms mixed together, near/ADJ to see it used as an adjective, near N to see just those cases where a noun follows, and so forth.