
Python + NLTK Natural Language Processing Study Notes 6: Categorizing and Tagging Words (Part 1)

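A sentence is made up of words of different kinds: nouns, verbs, adjectives, and adverbs. To understand a sentence, we first need to identify these word classes. Classifying words by their parts of speech (POS) and labeling them accordingly is called part-of-speech tagging.

Part-of-speech tagging requires a part-of-speech tagger. The code is as follows: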

import nltk

text = nltk.word_tokenize("customer found there are abnormal issue")

print(nltk.pos_tag(text))

This raises an error, because the tagger resource cannot be found:

LookupError:

**********************************************************************

Resource averaged_perceptron_tagger not found.

Please use the NLTK Downloader to obtain the resource:

>>> import nltk

>>> nltk.download('averaged_perceptron_tagger')

Searched in:

- '/home/zhf/nltk_data'

- '/usr/share/nltk_data'

- '/usr/local/share/nltk_data'

- '/usr/lib/nltk_data'

- '/usr/local/lib/nltk_data'

- '/usr/nltk_data'

- '/usr/lib/nltk_data'

**********************************************************************
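The error means that none of the listed directories contains the resource. As a rough illustration of that kind of lookup, here is a standard-library sketch; it is not NLTK's actual implementation, and the helper name `first_dir_with` and the temporary directories are hypothetical stand-ins for real `nltk_data` folders.

```python
import os
import tempfile

def first_dir_with(resource, search_dirs):
    """Return the first directory in search_dirs that contains resource, else None."""
    for directory in search_dirs:
        if os.path.exists(os.path.join(directory, resource)):
            return directory
    return None

# Temporary directories stand in for nltk_data folders (assumption for the demo).
with tempfile.TemporaryDirectory() as empty_dir, tempfile.TemporaryDirectory() as data_dir:
    resource = os.path.join("taggers", "averaged_perceptron_tagger")
    os.makedirs(os.path.join(data_dir, resource))

    found = first_dir_with(resource, [empty_dir, data_dir])
    missing = first_dir_with(resource, [empty_dir])
    found_in_data_dir = (found == data_dir)

print(found_in_data_dir, missing)
```

In NLTK itself the search list is exposed as `nltk.data.path`, so appending a directory to that list is an alternative to copying files around.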

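Run nltk.download to fetch the resource. If it is saved outside the directories listed in the error message (here it went to /root/nltk_data), copy the files into one of those search paths.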

>>> import nltk

>>> nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to

[nltk_data] /root/nltk_data...

[nltk_data] Unzipping taggers/averaged_perceptron_tagger.zip.

True

Also download the corresponding tag set help documentation:

>>> nltk.download('tagsets')

[nltk_data] Downloading package tagsets to /root/nltk_data...

[nltk_data] Unzipping help/tagsets.zip.

True

With the resources in place, the tagging code prints:

[('customer', 'NN'), ('found', 'VBD'), ('there', 'EX'), ('are', 'VBP'), ('abnormal', 'JJ'), ('issue', 'NN')]

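This gives each word together with its part of speech. The tags in this output (NN, VBD, and so on) are Penn Treebank tags, which are finer-grained than the simplified part-of-speech tag set in the table below.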

Tag	Meaning	Examples
ADJ	adjective	new, good, high, special, big, local
ADV	adverb	really, already, still, early, now
CNJ	conjunction	and, or, but, if, while, although
DET	determiner	the, a, some, most, every, no
EX	existential	there, there’s
FW	foreign word	dolce, ersatz, esprit, quo, maitre
MOD	modal verb	will, can, would, may, must, should
N	noun	year, home, costs, time, education
NP	proper noun	Alison, Africa, April, Washington
NUM	number	twenty-four, fourth, 1991, 14:24
PRO	pronoun	he, their, her, its, my, I, us
P	preposition	on, of, at, with, by, into, under
TO	the word to	to
UH	interjection	ah, bang, ha, whee, hmpf, oops
V	verb	is, has, get, do, make, see, run
VD	past tense	said, took, told, made, asked
VG	present participle	making, going, playing, working
VN	past participle	given, taken, begun, sung
WH	wh determiner	who, which, when, what, where, how
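To relate the Penn Treebank style tags printed by `nltk.pos_tag` to the simplified tags in this table, a small lookup is enough for the example sentence. The dictionary below is a hand-written, partial mapping made up for illustration; the name `PENN_TO_SIMPLE` is an assumption and not part of NLTK.

```python
# Hypothetical, partial mapping from Penn Treebank tags to the simplified tags above.
PENN_TO_SIMPLE = {
    "NN": "N",     # singular noun -> noun
    "VBD": "VD",   # past-tense verb -> past tense
    "VBP": "V",    # present-tense verb -> verb
    "JJ": "ADJ",   # adjective
    "EX": "EX",    # existential "there"
}

# The tagged sentence from the earlier output.
tagged = [("customer", "NN"), ("found", "VBD"), ("there", "EX"),
          ("are", "VBP"), ("abnormal", "JJ"), ("issue", "NN")]

# Tags missing from the mapping are kept unchanged rather than guessed.
simplified = [(word, PENN_TO_SIMPLE.get(tag, tag)) for word, tag in tagged]
print(simplified)
```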

If the input consists of individual word/tag strings, nltk.tag.str2tuple can parse each one into a (word, tag) tuple. Usage:

[nltk.tag.str2tuple(t) for t in "customer/NN found/VBD there/EX are/VBP abnormal/JJ issue/NN".split()]

Output:

[('customer', 'NN'), ('found', 'VBD'), ('there', 'EX'), ('are', 'VBP'), ('abnormal', 'JJ'), ('issue', 'NN')]
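To make the parsing rule concrete, here is a standard-library sketch of the same idea: split each token at its last `/`, and fall back to a tag of `None` when the token has no separator. The function name `split_tagged` is hypothetical, and this is an approximation of the behaviour rather than NLTK's own code.

```python
def split_tagged(token, sep="/"):
    """Split 'word/TAG' at the last separator; return (token, None) if there is none."""
    word, found_sep, tag = token.rpartition(sep)
    if not found_sep:
        # No separator at all, so there is no tag to report.
        return (token, None)
    return (word, tag)

text = "customer/NN found/VBD there/EX are/VBP abnormal/JJ issue/NN"
pairs = [split_tagged(t) for t in text.split()]
print(pairs)

# A bare word without "/TAG" yields a tag of None.
print(split_tagged("customer"))
```

A result full of `None` tags therefore usually means the input strings did not contain the `/TAG` suffix.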

The corpora that ship with NLTK also come with part-of-speech tags:

nltk.corpus.brown.tagged_words()

[('The', 'AT'), ('Fulton', 'NP-TL'), ...]

With FreqDist and its plotting support, we can then plot the frequency distribution of the tags, which helps us examine sentence structure.

brown_news_tagged=nltk.corpus.brown.tagged_words(categories='news')

tag_fd=nltk.FreqDist(tag for (word,tag) in brown_news_tagged)

tag_fd.plot(50,cumulative=True)

The result is a cumulative frequency plot of the 50 most common tags.
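The numbers behind such a cumulative plot can be sketched without a corpus or a plotting library. The example below uses `collections.Counter` in place of `FreqDist` and the small tagged sentence from earlier in place of the Brown news text, so the figures are only illustrative.

```python
from collections import Counter
from itertools import accumulate

# Small stand-in for a tagged corpus (the sentence tagged earlier).
tagged = [("customer", "NN"), ("found", "VBD"), ("there", "EX"),
          ("are", "VBP"), ("abnormal", "JJ"), ("issue", "NN")]

tag_counts = Counter(tag for _, tag in tagged)
ranked = tag_counts.most_common()   # tags sorted by descending count

# Running total of the counts: the values a cumulative plot would show.
cumulative = list(accumulate(count for _, count in ranked))

for (tag, count), running in zip(ranked, cumulative):
    print(f"{tag:4} {count:2d} {running:2d}")
```

The last running total always equals the number of tokens, which is why a cumulative curve flattens out once the rare tags are reached.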

Suppose we are studying a word and want to see how it is used in text, for example which words follow it. The approach below collects the words that follow "often":

brown_learned_text=nltk.corpus.brown.words(categories='learned')

ret=sorted(set(b for(a,b) in nltk.bigrams(brown_learned_text) if a=='often'))

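The "often" example relies on nltk.bigrams, which produces pairs of adjacent words (bigrams).

For example, generating the bigrams of the following text:

for word in nltk.bigrams("customer found there are abnormal issue".split()):
    print(word)

Output: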

('customer', 'found')

('found', 'there')

('there', 'are')

('are', 'abnormal')

('abnormal', 'issue')
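The pairing itself is simple enough to reproduce with the standard library, which may help when reading the "often" query. In the sketch below, `bigrams` and `followers` are hypothetical helper names and the second sample sentence is made up for the demonstration.

```python
def bigrams(words):
    """Return an iterator over each pair of adjacent items."""
    return zip(words, words[1:])

def followers(words, target):
    """Return the sorted set of words that directly follow target."""
    return sorted({nxt for cur, nxt in bigrams(words) if cur == target})

words = "customer found there are abnormal issue".split()
pairs = list(bigrams(words))
print(pairs[:2])

# Made-up sample sentence, used only to exercise followers().
sample = "they often say it and we often do it".split()
print(followers(sample, "often"))
```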

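Knowing which words follow is not enough; we also want to know their parts of speech.

1. First, tag the words, giving (word, tag) pairs.

2. Then form bigrams over those pairs and, wherever the first word is "often", collect the tag of the word that follows.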

brown_learned_tagged=nltk.corpus.brown.tagged_words(categories='learned')

tags=[b[1] for (a,b) in nltk.bigrams(brown_learned_tagged) if a[0]=='often']

fd=nltk.FreqDist(tags)

fd.tabulate()

Output:

VBN  VB VBD  JJ  IN  QL   ,  CS  RB  AP VBG  RP VBZ QLP BEN WRB   .  TO  HV
 15  10   8   5   4   3   3   3   3   1   1   1   1   1   1   1   1   1   1
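The two steps above can be sketched on a tiny hand-tagged sample, again with only the standard library. The sample sentence and its Brown-style tags are made up for illustration, and `tags_after` is a hypothetical helper name.

```python
from collections import Counter

# Made-up, hand-tagged sample (Brown-style tags), for illustration only.
tagged = [("it", "PPS"), ("is", "BEZ"), ("often", "RB"), ("used", "VBN"),
          ("and", "CC"), ("often", "RB"), ("seen", "VBN"),
          ("but", "CC"), ("often", "RB"), ("difficult", "JJ")]

def tags_after(tagged_words, target):
    """Count the tags of the words that immediately follow target."""
    pairs = zip(tagged_words, tagged_words[1:])
    return Counter(nxt_tag for (word, _), (_, nxt_tag) in pairs if word == target)

counts = tags_after(tagged, "often")
for tag, count in counts.most_common():
    print(tag, count)
```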

Similarly, to get three-word sequences (trigrams), use the trigrams function.
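As a sketch of what a trigram sequence looks like, the helper below generalises the adjacent-pair idea to runs of any length using only the standard library. The name `ngrams` here is a local, hypothetical helper written for this example; it is meant to illustrate the output shape, not to replace `nltk.trigrams`.

```python
def ngrams(words, n):
    """Return an iterator over every run of n adjacent items."""
    # zip stops at the shortest slice, so incomplete runs at the end are dropped.
    return zip(*(words[i:] for i in range(n)))

words = "customer found there are abnormal issue".split()
trigrams = list(ngrams(words, 3))
for gram in trigrams:
    print(gram)
```

A sequence of six words yields four trigrams, one fewer than its five bigrams.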

posted @ 2018-04-09 22:07 red_leaf_412
