Python Natural Language Processing Study Notes (51): Further Examples of Supervised Classification

6.2   Further Examples of Supervised Classification

Sentence Segmentation

Sentence segmentation can be viewed as a classification task for punctuation: whenever we encounter a symbol that could possibly end a sentence, such as a period or a question mark, we have to decide whether it terminates the preceding sentence.

The first step is to obtain some data that has already been segmented into sentences and convert it into a form that is suitable for extracting features:

 

>>> sents = nltk.corpus.treebank_raw.sents()
>>> tokens = []
>>> boundaries = set()
>>> offset = 0
>>> for sent in sents:
...     tokens.extend(sent)
...     offset += len(sent)
...     boundaries.add(offset-1)

Here, tokens is a merged list of tokens from the individual sentences, and boundaries is a set containing the indexes of all sentence-boundary tokens.
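To see how this bookkeeping works, here is a tiny made-up example (the two "sentences" below are hypothetical, not drawn from the corpus):

>>> demo_sents = [['Hello', '.'], ['How', 'are', 'you', '?']]
>>> demo_tokens = []
>>> demo_boundaries = set()
>>> off = 0
>>> for s in demo_sents:
...     demo_tokens.extend(s)
...     off += len(s)
...     demo_boundaries.add(off-1)
>>> demo_tokens
['Hello', '.', 'How', 'are', 'you', '?']
>>> demo_boundaries
set([1, 5])

Next, we need to specify the features of the data that will be used in order to decide whether punctuation indicates a sentence-boundary: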

 

>>> def punct_features(tokens, i):
...     return {'next-word-capitalized': tokens[i+1][0].isupper(),
...             'prevword': tokens[i-1].lower(),
...             'punct': tokens[i],
...             'prev-word-is-one-char': len(tokens[i-1]) == 1}
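To get a feel for what this extractor returns, we can apply it to a small made-up token list (the list below is purely illustrative, and dictionary key order may differ on your system):

>>> demo = ['The', 'cat', 'sat', '.', 'It', 'purred', '.']
>>> punct_features(demo, 3)
{'next-word-capitalized': True, 'prevword': 'sat', 'punct': '.', 'prev-word-is-one-char': False}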

Based on this feature extractor, we can create a list of labeled featuresets by selecting all the punctuation tokens, and tagging whether they are boundary tokens or not:

 

>>> featuresets = [(punct_features(tokens, i), (i in boundaries))
...                for i in range(1, len(tokens)-1)
...                if tokens[i] in '.?!']

Using these featuresets, we can train and evaluate a punctuation classifier:

 

>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> nltk.classify.accuracy(classifier, test_set)
0.97419354838709682
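It can also be instructive to ask the classifier which features it found most useful; the output is omitted here because it depends on the corpus and on the train/test split:

>>> classifier.show_most_informative_features(5)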

To use this classifier to perform sentence segmentation, we simply check each punctuation mark to see whether it's labeled as a boundary; and divide the list of words at the boundary marks. The listing in Example 6.8 shows how this can be done.

 

def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # Skip the final token: punct_features would look one token past the
        # end of the list, and the end of the text closes the last sentence anyway.
        if word in '.?!' and i < len(words)-1 and classifier.classify(punct_features(words, i)):
            sents.append(words[start:i+1])
            start = i+1
    if start < len(words):
        sents.append(words[start:])
    return sents
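A quick, hypothetical run of the segmenter on a made-up token list (the exact split depends on the classifier trained above):

>>> demo = ['The', 'cat', 'sat', '.', 'It', 'purred', '.']
>>> segment_sentences(demo)   # e.g. [['The', 'cat', 'sat', '.'], ['It', 'purred', '.']]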


Identifying Dialogue Act Types

When processing dialogue, it can be useful to think of utterances as a type of action performed by the speaker. This interpretation is most straightforward for performative statements such as "I forgive you" or "I bet you can't climb that hill." But greetings, questions, answers, assertions, and clarifications can all be thought of as types of speech-based actions. Recognizing the dialogue acts underlying the utterances in a dialogue can be an important first step in understanding the conversation.

The NPS Chat Corpus, which was demonstrated in Section 2.1, consists of over 10,000 posts from instant messaging sessions. These posts have all been labeled with one of 15 dialogue act types, such as "Statement", "Emotion", "ynQuestion", and "Continuer". We can therefore use this data to build a classifier that can identify the dialogue act types for new instant messaging posts. The first step is to extract the basic messaging data. We will call xml_posts() to get a data structure representing the XML annotation for each post:

 

>>> posts = nltk.corpus.nps_chat.xml_posts()[:10000]

Next, we'll define a simple feature extractor that checks what words the post contains:

 

>>> def dialogue_act_features(post):
...     features = {}
...     for word in nltk.word_tokenize(post):
...         features['contains(%s)' % word.lower()] = True
...     return features
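For example, applied to a toy post (not from the corpus; the exact tokens come from nltk.word_tokenize, and dictionary key order may vary):

>>> dialogue_act_features('Is anyone here from Texas?')
{'contains(is)': True, 'contains(anyone)': True, 'contains(here)': True,
 'contains(from)': True, 'contains(texas)': True, 'contains(?)': True}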

Finally, we construct the training and testing data by applying the feature extractor to each post (using post.get('class') to get a post's dialogue act type), and create a new classifier:

 

>>> featuresets = [(dialogue_act_features(post.text), post.get('class'))
...                for post in posts]
>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)
0.66
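Once trained, the classifier can label new posts directly; the class shown in the comment is only an illustration, since the actual prediction depends on the training data:

>>> classifier.classify(dialogue_act_features('hi everyone'))   # e.g. 'Greet'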


Recognizing Textual Entailment

Recognizing textual entailment (RTE) is the task of determining whether a given piece of text T entails another text called the "hypothesis" (as already discussed in Section 1.5). To date, there have been four RTE Challenges, where shared development and test data is made available to competing teams. Here are a couple of examples of text/hypothesis pairs from the Challenge 3 development dataset. The label True indicates that the entailment holds, and False, that it fails to hold.

Challenge 3, Pair 34 (True)

T: Parviz Davudi was representing Iran at a meeting of the Shanghai Co-operation Organisation (SCO), the fledgling association that binds Russia, China and four former Soviet republics of central Asia together to fight terrorism.

H: China is a member of SCO.

Challenge 3, Pair 81 (False)

T: According to NC Articles of Organization, the members of LLC company are H. Nelson Beavers, III, H. Chester Beavers and Jennie Beavers Stewart.

H: Jennie Beavers Stewart is a share-holder of Carolina Analytical Laboratory.

It should be emphasized that the relationship between text and hypothesis is not intended to be logical entailment, but rather whether a human would conclude that the text provides reasonable evidence for taking the hypothesis to be true.

We can treat RTE as a classification task, in which we try to predict the True/False label for each pair. Although it seems likely that successful approaches to this task will involve a combination of parsing, semantics and real world knowledge, many early attempts at RTE achieved reasonably good results with shallow analysis, based on similarity between the text and hypothesis at the word level. In the ideal case, we would expect that if there is an entailment, then all the information expressed by the hypothesis should also be present in the text. Conversely, if there is information found in the hypothesis that is absent from the text, then there will be no entailment.

In our RTE feature detector (Example 6.9), we let words (i.e., word types) serve as proxies for information, and our features count the degree of word overlap, and the degree to which there are words in the hypothesis but not in the text (captured by the method hyp_extra()). Not all words are equally important: Named Entity mentions such as the names of people, organizations and places are likely to be more significant, which motivates us to extract distinct information for words and for named entities (NEs). In addition, some high frequency function words are filtered out as "stopwords".

The features themselves are computed by NLTK's RTEFeatureExtractor class, which builds a bag of words for both the text and the hypothesis (discarding some stopwords) and provides methods for measuring their overlap and difference:

 

def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features

Example 6.9 (code_rte_features.py): "Recognizing Text Entailment" Feature Extractor. The RTEFeatureExtractor class builds a bag of words for both the text and the hypothesis after throwing away some stopwords, then calculates overlap and difference.

To illustrate the content of these features, we examine some attributes of the text/hypothesis Pair 34 shown earlier:

 

>>> rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
>>> extractor = nltk.RTEFeatureExtractor(rtepair)
>>> print extractor.text_words
set(['Russia', 'Organisation', 'Shanghai', 'Asia', 'four', 'at',
'operation', 'SCO', ...])
>>> print extractor.hyp_words
set(['member', 'SCO', 'China'])
>>> print extractor.overlap('word')
set([])
>>> print extractor.overlap('ne')
set(['SCO', 'China'])
>>> print extractor.hyp_extra('word')
set(['member'])

These features indicate that all important words in the hypothesis are contained in the text, and thus there is some evidence for labeling this as True.

The module nltk.classify.rte_classify reaches just over 58% accuracy on the combined RTE test data using methods like these. Although this figure is not very impressive, it requires significant effort, and more linguistic processing, to achieve much better results.
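As a rough sketch of how such a classifier can be assembled from the feature extractor above (this follows the general approach of nltk.classify.rte_classify rather than reproducing its exact code, and assumes that each RTE pair exposes its gold label through the value attribute):

>>> dev_pairs = nltk.corpus.rte.pairs(['rte3_dev.xml'])
>>> featuresets = [(rte_features(pair), pair.value) for pair in dev_pairs]
>>> size = int(len(featuresets) * 0.1)
>>> train_set, test_set = featuresets[size:], featuresets[:size]
>>> classifier = nltk.NaiveBayesClassifier.train(train_set)
>>> print nltk.classify.accuracy(classifier, test_set)   # the figure depends on the split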

 

Scaling Up to Large Datasets

 

Python provides an excellent environment for performing basic text processing and feature extraction. However, it is not able to perform the numerically intensive calculations required by machine learning methods nearly as quickly as lower-level languages such as C (I really should learn C properly...). Thus, if you attempt to use the pure-Python machine learning implementations (such as nltk.NaiveBayesClassifier) on large datasets, you may find that the learning algorithm takes an unreasonable amount of time and memory to complete.

If you plan to train classifiers with large amounts of training data or a large number of features, we recommend that you explore NLTK's facilities for interfacing with external machine learning packages. Once these packages have been installed, NLTK can transparently invoke them (via system calls) to train classifier models significantly faster than the pure-Python classifier implementations. See the NLTK webpage for a list of recommended machine learning packages that are supported by NLTK.
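As one hedged illustration (it assumes the external MEGAM maximum entropy package is installed, and the path passed to config_megam below is hypothetical), NLTK's maxent interface can hand the numerical optimization off to MEGAM:

>>> nltk.classify.megam.config_megam('/usr/local/bin/megam')   # hypothetical install path
>>> classifier = nltk.MaxentClassifier.train(train_set, algorithm='megam', trace=0)
>>> nltk.classify.accuracy(classifier, test_set)   # result depends on the data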

posted @ 2011-08-31 23:16 by 牛皮糖NewPtone