Python自然语言处理学习笔记(41):5.2 标注语料库

5.2   Tagged Corpora 标注语料库

 

Representing Tagged Tokens 表示标注的语言符号

By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag. We can create one of these special tuples from the standard string representation of a tagged token, using the function str2tuple():

 

>>> tagged_token = nltk.tag.str2tuple('fly/NN')

>>> tagged_token

('fly', 'NN')

>>> tagged_token[0]

'fly'

>>> tagged_token[1]

'NN'

We can construct a list of tagged tokens directly from a string. The first step is to tokenize the string to access the individual word/tag strings, and then to convert each of these into a tuple (using str2tuple()).

 

>>> sent = '''

... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN

... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC

... Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS which/WDT it/PPS

... said/VBD ``/`` ARE/BER well/QL operated/VBN and/CC follow/VB generally/RB

... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT

... interest/NN of/IN both/ABX governments/NNS ''/'' ./.

... '''

>>> [nltk.tag.str2tuple(t) for t in sent.split()]

[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),

('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

 

Reading Tagged Corpora 读取已标注的语料库

 

Several of the corpora included with NLTK have been tagged for their part-of-speech. Here's an example of what you might see if you opened a file from the Brown Corpus with a text editor:

The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd / no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.

Other corpora use a variety of formats for storing part-of-speech tags. NLTK's corpus readers provide a uniform interface so that you don't have to be concerned with the different file formats. In contrast with the file extract shown above, the corpus reader for the Brown Corpus represents the data as shown below. Note that part-of-speech tags have been converted to uppercase, since this has become standard practice(标准惯例) since the Brown Corpus was published.

 

>>> nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]

>>> nltk.corpus.brown.tagged_words(simplify_tags=True)

[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]

Whenever a corpus contains tagged text, the NLTK corpus interface will have a tagged_words() method. Here are some more examples, again using the output format illustrated for the Brown Corpus:

 

>>> print nltk.corpus.nps_chat.tagged_words()

[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]

>>> nltk.corpus.conll2000.tagged_words()

[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]

>>> nltk.corpus.treebank.tagged_words()

[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

Not all corpora employ the same set of tags; see the tagset help functionality and the readme() methods mentioned above for documentation. Initially we want to avoid the complications of these tagsets, so we use a built-in mapping to a simplified tagset:

 

>>> nltk.corpus.brown.tagged_words(simplify_tags=True)

[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]

>>> nltk.corpus.treebank.tagged_words(simplify_tags=True)

[('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]

Tagged corpora for several other languages are distributed with NLTK, including Chinese, Hindi, Portuguese, Spanish, Dutch and Catalan. These usually contain non-ASCII text, and Python always displays this in hexadecimal when printing a larger structure such as a list.

 

>>> nltk.corpus.sinica_treebank.tagged_words()

[('\xe4\xb8\x80', 'Neu'), ('\xe5\x8f\x8b\xe6\x83\x85', 'Nad'), ...]

>>> nltk.corpus.indian.tagged_words()

[('\xe0\xa6\xae\xe0\xa6\xb9\xe0\xa6\xbf\xe0\xa6\xb7\xe0\xa7\x87\xe0\xa6\xb0', 'NN'),

('\xe0\xa6\xb8\xe0\xa6\xa8\xe0\xa7\x8d\xe0\xa6\xa4\xe0\xa6\xbe\xe0\xa6\xa8', 'NN'),

...]

>>> nltk.corpus.mac_morpho.tagged_words()

[('Jersei', 'N'), ('atinge', 'V'), ('m\xe9dia', 'N'), ...]

>>> nltk.corpus.conll2002.tagged_words()

[('Sao', 'NC'), ('Paulo', 'VMI'), ('(', 'Fpa'), ...]

>>> nltk.corpus.cess_cat.tagged_words()

[('El', 'da0ms0'), ('Tribunal_Suprem', 'np0000o'), ...]

If your environment is set up correctly, with appropriate editors and fonts, you should be able to display individual strings in a human-readable way. For example, Figure 5.1 shows data accessed using nltk.corpus.indian.

 

 

Figure 5.1: POS-Tagged Data from Four Indian Languages: Bangla, Hindi, Marathi, and Telugu

If the corpus is also segmented into sentences, it will have a tagged_sents() method that divides up the tagged words into sentences rather than presenting them as one big list. This will be useful when we come to developing automatic taggers, as they are trained and tested on lists of sentences, not words.

 

A Simplified Part-of-Speech Tagset   简化的词性标记集合

 

Tagged corpora use many different conventions for tagging words. To help us get started, we will be looking at a simplified tagset (shown in Table 5.1).

Tag

Meaning

Examples

ADJ

adjective

new, good, high, special, big, local

ADV

adverb

really, already, still, early, now

CNJ

conjunction

and, or, but, if, while, although

DET

determiner

the, a, some, most, every, no

EX

existential

there, there's

FW

foreign word

dolce, ersatz, esprit, quo, maitre

MOD

modal verb

will, can, would, may, must, should

N

noun

year, home, costs, time, education

NP

proper noun

Alison, Africa, April, Washington

NUM

number

twenty-four, fourth, 1991, 14:24

PRO

pronoun

he, their, her, its, my, I, us

P

preposition

on, of, at, with, by, into, under

TO

the word to

to

UH

interjection

ah, bang, ha, whee, hmpf, oops

V

verb

is, has, get, do, make, see, run

VD

past tense

said, took, told, made, asked

VG

present participle

making, going, playing, working

VN

past participle

given, taken, begun, sung

WH

wh determiner

who, which, when, what, where, howTable 5.1:

Simplified Part-of-Speech Tagset

Let's see which of these tags are the most common in the news category of the Brown corpus:

 

>>> from nltk.corpus import brown

>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)

>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)

>>> tag_fd.keys()

['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]

Note

Your Turn: Plot the above frequency distribution using tag_fd.plot(cumulative=True). What percentage of words are tagged using the first five tags of the above list?   60%

We can use these tags to do powerful searches using a graphical POS-concordance tool nltk.app.concordance(). Use it to search for any combination of words and POS tags, e.g. N N N N, hit/VD, hit/VN, or the ADJ man.

 

Nouns 名词

Nouns generally refer to people, places, things, or concepts, e.g.: woman, Scotland, book, intelligence. Nouns can appear after determiners and adjectives, and can be the subject or object of the verb, (名词可以出现在限定词和形容词之后,并且可以做动词的主语或宾语)as shown in Table 5.2.

Word

After a determiner

Subject of the verb

woman

the woman who I saw yesterday ...

the woman sat down

Scotland

the Scotland I remember as a child ...

Scotland has five million people

book

the book I bought yesterday ...

this book recounts the colonization of Australia

intelligence

the intelligence displayed by the child ...

Mary's intelligence impressed her teachersTable 5.2:

Syntactic Patterns involving some Nouns

The simplified noun tags are N for common nouns like book, and NP for proper nouns like Scotland.

Let's inspect some tagged text to see what parts of speech occur before a noun, with the most frequent ones first. To begin with, we construct a list of bigrams whose members are themselves word-tag pairs such as (('The', 'DET'), ('Fulton', 'NP')) and (('Fulton', 'NP'), ('County', 'N')). Then we construct a FreqDist from the tag parts of the bigrams.

 

>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)

>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))

['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]

 

  (a,b)也就是(('The', 'DET'), ('Fulton', 'NP')),如果b[1]==’N’,则给出前面这个词的词性a[1]

This confirms our assertion that nouns occur after determiners and adjectives, including numeral adjectives (tagged as NUM).

 

Verbs 动词

Verbs are words that describe events and actions, e.g. fall, eat in Table 5.3. In the context of a sentence, verbs typically express a relation involving the referents of one or more noun phrases.

Word

Simple

With modifiers and adjuncts (italicized)

fall

Rome fell

Dot com stocks suddenly fell like a stone

eat

Mice eat cheese

John ate the pizza with gustoTable 5.3:

Syntactic Patterns involving some Verbs

What are the most common verbs in news text? Let's sort all the verbs by frequency:

 

>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)

>>> word_tag_fd = nltk.FreqDist(wsj)

>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]

['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V',

'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD',

'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V',

'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]

Note that the items being counted in the frequency distribution are word-tag pairs. Since words and tags are paired, we can treat the word as a condition and the tag as an event, and initialize a conditional frequency distribution with a list of condition-event pairs. This lets us see a frequency-ordered list of tags given a word:

 

>>> cfd1 = nltk.ConditionalFreqDist(wsj)

>>> cfd1['yield'].keys()

['V', 'N']

>>> cfd1['cut'].keys()

['V', 'VD', 'N', 'VN']

We can reverse the order of the pairs, so that the tags are the conditions, and the words are the events(词作为条件,标签作为事件). Now we can see likely words for a given tag:

 

>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)

>>> cfd2['VN'].keys()

['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold',

'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]

To clarify the distinction between VD (past tense) and VN (past participle), let's find words which can be both VD and VN, and see some surrounding text:

 

>>> [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]

['Asked', 'accelerated', 'accepted', 'accused', 'acquired', 'added', 'adopted', ...]

>>> idx1 = wsj.index(('kicked', 'VD'))

>>> wsj[idx1-4:idx1+1]

[('While', 'P'), ('program', 'N'), ('trades', 'N'), ('swiftly', 'ADV'),

('kicked', 'VD')]

>>> idx2 = wsj.index(('kicked', 'VN'))

>>> wsj[idx2-4:idx2+1]

[('head', 'N'), ('of', 'P'), ('state', 'N'), ('has', 'V'), ('kicked', 'VN')]

In this case, we see that the past participle of kicked is preceded by a form of the auxiliary verb have. Is this generally true?

Note

Your Turn: Given the list of past participles specified by cfd2['VN'].keys(), try to collect a list of all the word-tag pairs that immediately precede items in that list.

 

Adjectives and Adverbs  形容词和副词

Two other important word classes are adjectives and adverbs. Adjectives describe nouns, and can be used as modifiers (e.g. large in the large pizza), or in predicates (e.g. the pizza is large). English adjectives can have internal structure (e.g. fall+ing in the falling stocks). Adverbs modify verbs to specify the time, manner, place or direction of the event described by the verb (e.g. quickly in the stocks fell quickly). Adverbs may also modify adjectives (e.g. really in Mary's teacher was really nice).

English has several categories of closed class words in addition to prepositions, such as articles (also often called determiners) (e.g., the, a), modals (e.g., should, may), and personal pronouns (e.g., she, they). Each dictionary and grammar classifies these words differently.

Note

Your Turn: If you are uncertain about some of these parts of speech, study them using nltk.app.concordance(), or watch some of the Schoolhouse Rock! grammar videos available at YouTube, or consult the Further Reading section at the end of this chapter.

 

Unsimplified Tags  未简化的标签

Let's find the most frequent nouns of each noun part-of-speech type. The program in Example 5.2 finds all tags starting with NN, and provides a few example words for each one. You will see that there are many variants of NN; the most important contain $ for possessive nouns, S for plural nouns (since plural nouns typically end in s) and P for proper nouns. In addition, most of the tags have suffix modifiers: -NC for citations, -HL for words in headlines and -TL for titles (a feature of Brown tabs).

 

def findtags(tag_prefix, tagged_text):

    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text

                                  if tag.startswith(tag_prefix))

    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

 

>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))

>>> for tag in sorted(tagdict):

...     print tag, tagdict[tag]

...

NN ['year', 'time', 'state', 'week', 'man']

NN$ ["year's", "world's", "state's", "nation's", "company's"]

NN$-HL ["Golf's", "Navy's"]

NN$-TL ["President's", "University's", "League's", "Gallery's", "Army's"]

NN-HL ['cut', 'Salary', 'condition', 'Question', 'business']

NN-NC ['eva', 'ova', 'aya']

NN-TL ['President', 'House', 'State', 'University', 'City']

NN-TL-HL ['Fort', 'City', 'Commissioner', 'Grove', 'House']

NNS ['years', 'members', 'people', 'sales', 'men']

NNS$ ["children's", "women's", "men's", "janitors'", "taxpayers'"]

NNS$-HL ["Dealers'", "Idols'"]

NNS$-TL ["Women's", "States'", "Giants'", "Officers'", "Bombers'"]

NNS-HL ['years', 'idols', 'Creations', 'thanks', 'centers']

NNS-TL ['States', 'Nations', 'Masters', 'Rules', 'Communists']

NNS-TL-HL ['Nations']

Example 5.2 (code_findtags.py):  Program to Find the Most Frequent Noun Tags

When we come to constructing part-of-speech taggers later in this chapter, we will use the unsimplified tags.

 

Exploring Tagged Corpora  探索标注的语料库

Let's briefly return to the kinds of exploration of corpora we saw in previous chapters, this time exploiting POS tags.

Suppose we're studying the word often and want to see how it is used in text. We could ask to see the words that follow often

 

>>> brown_learned_text = brown.words(categories='learned')

>>> sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))

[',', '.', 'accomplished', 'analytically', 'appear', 'apt', 'associated', 'assuming',

'became', 'become', 'been', 'began', 'call', 'called', 'carefully', 'chose', ...]

However, it's probably more instructive use the tagged_words() method to look at the part-of-speech tag of the following words:

 

>>> brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)

>>> tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often']

>>> fd = nltk.FreqDist(tags)

>>> fd.tabulate()

 VN    V   VD DET ADJ ADV    P CNJ    ,   TO   VG   WH VBZ    .

 15   12    8    5    5    4    4    3    3    1    1    1    1    1

Notice that the most high-frequency parts of speech following often are verbs. Nouns never appear in this position (in this particular corpus).

Next, let's look at some larger context, and find words involving particular sequences of tags and words (in this case "<Verb> to <Verb>"). In code-three-word-phrase we consider each three-word window in the sentence , and check if they meet our criterion . If the tags match, we print the corresponding words .

 

from nltk.corpus import brown

def process(sentence):

    for (w1,t1), (w2,t2), (w3,t3) in nltk.trigrams(sentence):

        if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):

            print w1, w2, w3

 

>>> for tagged_sent in brown.tagged_sents():

...     process(tagged_sent)

...

combined to achieve

continue to place

serve to protect

wanted to wait

allowed to place

expected to become

...

Example 5.3 (code_three_word_phrase.py): Figure 5.3: Searching for Three-Word Phrases Using POS Tags

Finally, let's look for words that are highly ambiguous as to their part of speech tag. Understanding why such words are tagged as they are in each context can help us clarify the distinctions between the tags.

 

>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)

>>> data = nltk.ConditionalFreqDist((word.lower(), tag)

...                                 for (word, tag) in brown_news_tagged)

>>> for word in data.conditions():

...     if len(data[word]) > 3:

...         tags = data[word].keys()

...         print word, ' '.join(tags)

...

best ADJ ADV NP V

better ADJ ADV V DET

close ADV ADJ V N

cut V N VN VD

even ADV DET ADJ V

grant NP N V -

hit V VD VN N

lay ADJ V NP VD

left VD ADJ N VN

like CNJ V ADJ P -

near P ADV ADJ DET

open ADJ V N ADV

past N ADJ DET P

present ADJ ADV V N

read V VN VD NP

right ADJ N DET ADV

second NUM ADV DET N

set VN V VD N -

that CNJ V WH DET

Note

Your Turn: Open the POS concordance tool nltk.app.concordance() and load the complete Brown Corpus (simplified tagset). Now pick some of the above words and see how the tag of the word correlates with the context of the word. E.g. search for near to see all forms mixed together, near/ADJ to see it used as an adjective, near N to see just those cases where a noun follows, and so forth.

posted @ 2011-08-24 23:22  牛皮糖NewPtone  阅读(3561)  评论(0编辑  收藏  举报