Python Natural Language Processing Study Notes (47): 5.8 Summary

5.8 Summary

• Words can be grouped into classes, such as nouns, verbs, adjectives, and adverbs. These classes are known as lexical categories or parts-of-speech. Parts-of-speech are assigned short labels, or tags, such as NN and VB.

• The process of automatically assigning parts-of-speech to words in text is called part-of-speech tagging, POS tagging, or just tagging.

• Automatic tagging is an important step in the NLP pipeline, and is useful in a variety of situations, including predicting the behavior of previously unseen words, analyzing word usage in corpora, and text-to-speech systems.

• Some linguistic corpora, such as the Brown Corpus, have been POS tagged.

• A variety of tagging methods are possible, e.g., default tagger, regular expression tagger, unigram tagger, and n-gram taggers. These can be combined using a technique known as backoff.
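To make the two simplest methods concrete, here is a minimal pure-Python sketch of a default tagger and a regular expression tagger. It does not use NLTK; the pattern list, the function names `default_tag` and `regexp_tag`, and the sample sentence are all made up for illustration.

```python
import re

# Hypothetical toy patterns, tried in order; the first match wins.
PATTERNS = [
    (r".*ing$", "VBG"),                # gerunds, e.g. "singing"
    (r".*ed$", "VBD"),                 # simple past, e.g. "walked"
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers, e.g. "3"
    (r".*s$", "NNS"),                  # plural nouns, e.g. "dogs"
]

def default_tag(words, tag="NN"):
    """Default tagger: assign the same tag to every word."""
    return [(w, tag) for w in words]

def regexp_tag(words, patterns=PATTERNS, fallback="NN"):
    """Tag each word with the first matching pattern, else the fallback tag."""
    tagged = []
    for w in words:
        for pattern, tag in patterns:
            if re.match(pattern, w):
                tagged.append((w, tag))
                break
        else:
            # No pattern matched this word.
            tagged.append((w, fallback))
    return tagged

sentence = "the dogs walked 3 miles while singing".split()
print(default_tag(sentence))
print(regexp_tag(sentence))
```

Both taggers ignore context entirely, which is why they are mostly useful as the last link in a backoff chain rather than on their own.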

• Taggers can be trained and evaluated using tagged corpora.
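A rough sketch of what "train" and "evaluate" mean for a unigram tagger, in plain Python rather than NLTK. The two tiny tagged corpora and the helper names `train_unigram` and `accuracy` are invented for this note; a real experiment would split a tagged corpus such as Brown into training and test portions.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def accuracy(model, tagged_sents, default="NN"):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for sent in tagged_sents:
        for word, gold in sent:
            # Unseen words fall back to the default tag.
            correct += model.get(word, default) == gold
            total += 1
    return correct / total

# Toy training data (hypothetical).
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
# Toy held-out data; "soundly" never appears in training.
held_out = [[("the", "DT"), ("dog", "NN"), ("sleeps", "VBZ"), ("soundly", "RB")]]

model = train_unigram(train_sents)
print(accuracy(model, held_out))
```

Scoring on held-out sentences rather than on the training sentences matters: a tagger that has memorized its training data would look perfect on it and tell us nothing about unseen text.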

• Backoff is a method for combining models: when a more specialized model (such as a bigram tagger) cannot assign a tag in a given context, we back off to a more general model (such as a unigram tagger).
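The backoff idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not NLTK's implementation: the classes `DefaultTagger` and `NgramTagger` below are written for this note, the context is taken to be the previous n-1 tags plus the current word, and the training sentences are made up. (NLTK's own tagger classes take a `backoff` argument that plays the same role.)

```python
from collections import Counter, defaultdict

class DefaultTagger:
    """Most general model: always proposes the same tag."""
    def __init__(self, tag):
        self.default = tag

    def choose(self, history, word):
        return self.default

class NgramTagger:
    """Toy n-gram tagger; context = (previous n-1 tags, current word)."""
    def __init__(self, n, tagged_sents, backoff=None):
        self.n = n
        self.backoff = backoff
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            history = []
            for word, tag in sent:
                counts[self._context(history, word)][tag] += 1
                history.append(tag)
        # Keep only the most frequent tag for each context seen in training.
        self.table = {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

    def _context(self, history, word):
        prev = tuple(history[-(self.n - 1):]) if self.n > 1 else ()
        return (prev, word)

    def choose(self, history, word):
        tag = self.table.get(self._context(history, word))
        if tag is None and self.backoff is not None:
            # Context never seen in training: ask the more general model.
            return self.backoff.choose(history, word)
        return tag

    def tag(self, words):
        history = []
        for word in words:
            history.append(self.choose(history, word))
        return list(zip(words, history))

# Toy training data (hypothetical).
train_sents = [
    [("they", "PRP"), ("wind", "VBP"), ("the", "DT"), ("clock", "NN")],
    [("the", "DT"), ("wind", "NN"), ("blows", "VBZ")],
    [("the", "DT"), ("wind", "NN"), ("howls", "VBZ")],
]

t0 = DefaultTagger("NN")
t1 = NgramTagger(1, train_sents, backoff=t0)   # unigram
t2 = NgramTagger(2, train_sents, backoff=t1)   # bigram

print(t2.tag("they wind the wind".split()))   # bigram contexts all seen
print(t2.tag("the clock wind".split()))       # last word needs the unigram model
print(t2.tag(["zebra"]))                      # falls through to the default
```

In the first sentence the bigram model separates the verb "wind" from the noun "wind" by the preceding tag. In the second, "wind" after a noun was never seen, so the chain backs off to the unigram model; "zebra" is unknown to both and gets the default tag.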

• Part-of-speech tagging is an important, early example of a sequence classification task in NLP: a classification decision at any one point in the sequence makes use of words and tags in the local context.

• A dictionary is used to map between arbitrary types of information, such as a string and a number: freq['cat'] = 12. We create dictionaries using the brace notation: pos = {}, pos = {'furiously': 'adv', 'ideas': 'n', 'colorless': 'adj'}.
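A short runnable version of the dictionary examples above, plus two common follow-ups. The names `words_by_tag` and `counts`, the extra entry for "sleep", and the sample sentence are additions for illustration only.

```python
from collections import defaultdict

# Map a string to a number.
freq = {}
freq["cat"] = 12

# Brace notation with initial contents.
pos = {"furiously": "adv", "ideas": "n", "colorless": "adj"}
pos["sleep"] = "v"                    # add a new entry
print(pos.get("green", "unknown"))    # look up with a fallback for missing keys

# Invert the mapping: tag -> list of words carrying that tag.
words_by_tag = defaultdict(list)
for word, tag in pos.items():
    words_by_tag[tag].append(word)

# Count word frequencies; defaultdict(int) starts every new key at 0.
counts = defaultdict(int)
for w in "the cat sat on the mat".split():
    counts[w] += 1

print(dict(words_by_tag))
print(dict(counts))
```

`defaultdict` saves the "is this key already present?" check that a plain `dict` would need before appending or incrementing.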

• N-gram taggers can be defined for large values of n, but once n is larger than 3, we usually encounter the sparse data problem; even with a large quantity of training data, we see only a tiny fraction of possible contexts.
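A back-of-the-envelope calculation shows why the contexts run out so quickly. The sketch assumes a context is the previous n-1 tags plus the current word, and the figures used (45 tags, a 50,000-word vocabulary, one million training tokens) are illustrative assumptions, not measurements. Many of the counted contexts could never occur in real text, so this is only a crude upper bound on coverage.

```python
def possible_contexts(n, num_tags, vocab_size):
    """Number of distinct (previous n-1 tags, current word) combinations."""
    return num_tags ** (n - 1) * vocab_size

def max_coverage(n, num_tags, vocab_size, num_tokens):
    """Upper bound on the fraction of contexts a corpus could contain,
    since each training token contributes at most one distinct context."""
    return min(1.0, num_tokens / possible_contexts(n, num_tags, vocab_size))

# Illustrative assumptions only.
NUM_TAGS, VOCAB_SIZE, NUM_TOKENS = 45, 50_000, 1_000_000

for n in range(1, 6):
    total = possible_contexts(n, NUM_TAGS, VOCAB_SIZE)
    bound = max_coverage(n, NUM_TAGS, VOCAB_SIZE, NUM_TOKENS)
    print(f"n={n}: {total:,} possible contexts, coverage <= {bound:.6f}")
```

Each step up in n multiplies the number of possible contexts by the size of the tagset, while the training corpus stays the same size, which is the reason backoff to a smaller n is needed in practice.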

• Transformation-based tagging involves learning a series of repair rules of the form “change tag s to tag t in context c,” where each rule fixes mistakes and possibly introduces a (smaller) number of errors.
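A minimal sketch of a single repair rule and how it might be scored. Only the rule application and scoring step is shown, not the learning loop that searches for good rules. The `Rule` type, the helper names, and the toy tags are invented for this note, and the context c is restricted to "the previous tag is X" to keep things short.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    from_tag: str   # tag s
    to_tag: str     # tag t
    prev_tag: str   # context c: the tag of the preceding word

def apply_rule(rule, tags):
    """Return a new tag list with the rule applied left to right."""
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == rule.from_tag and out[i - 1] == rule.prev_tag:
            out[i] = rule.to_tag
    return out

def errors(tags, gold):
    """Number of positions where the tags disagree with the gold standard."""
    return sum(t != g for t, g in zip(tags, gold))

def net_gain(rule, tags, gold):
    """Errors removed minus errors introduced by applying the rule."""
    return errors(tags, gold) - errors(apply_rule(rule, tags), gold)

# Toy data (hypothetical): output of an initial tagger and the gold tags.
words   = ["to", "increase", "to", "run", "to", "school"]
initial = ["TO", "NN",       "TO", "NN",  "TO", "NN"]
gold    = ["TO", "VB",       "TO", "VB",  "TO", "NN"]

rule = Rule(from_tag="NN", to_tag="VB", prev_tag="TO")
print(apply_rule(rule, initial))
print(net_gain(rule, initial, gold))
```

Here the rule repairs "increase" and "run" but wrongly retags "school", so it fixes two mistakes and introduces one. A learner would keep the rule because the net gain is positive, then look for a further rule to repair the new error.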

posted @ 2011-08-30 22:46 牛皮糖NewPtone
