NLTK Stopwords and Rare Words
I. Stopwords
Stopwords are words unrelated to the actual topic at hand; they carry essentially no meaning for NLP tasks such as information retrieval and classification. Articles and pronouns are typically classed as stopwords; they are generally unambiguous, so removing them has little impact.
A stopword list for a given language is usually a hand-crafted, cross-corpus list of that language's most common words. You can reuse an existing list found online, or generate one automatically from a given corpus.
One simple way to generate a stopword list is by frequency: treat the words that occur most often across the documents as stopwords.
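A minimal sketch of that frequency-based approach (the toy corpus, the top_n cutoff, and the helper name build_stopwords are all illustrative assumptions, not NLTK API):
from collections import Counter

def build_stopwords(documents, top_n=20):
    # treat the top_n most frequent tokens across all documents as stopwords
    counts = Counter(token.lower() for doc in documents for token in doc.split())
    return {word for word, _ in counts.most_common(top_n)}

build_stopwords(['the cat sat on the mat', 'the dog ate the bone'], top_n=2)
# {'the', 'cat'} in this toy example -- 'the' dominates the corpus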
The NLTK library ships stopword lists for 23 languages (those returned by fileids() below).
Relevant module:
- nltk.corpus.stopwords
1. Viewing the stopword lists
from nltk.corpus import stopwords # load the stopwords corpus
stopwords.readme().replace('\n', ' ') # the corpus README; it contains many \n characters, so replace them for easier reading
'''
'Stopwords Corpus This corpus contains lists of stop words for several languages. These are high-frequency grammatical words which are usually ignored in text retrieval applications. They were obtained from: http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/snowball/stopwords/ The stop words for the Romanian language were obtained from: http://arlc.ro/resources/ The English list has been augmented https://github.com/nltk/nltk_data/issues/22 The German list has been corrected https://github.com/nltk/nltk_data/pull/49 A Kazakh list has been added https://github.com/nltk/nltk_data/pull/52 A Nepali list has been added https://github.com/nltk/nltk_data/pull/83 An Azerbaijani list has been added https://github.com/nltk/nltk_data/pull/100 A Greek list has been added https://github.com/nltk/nltk_data/pull/103 An Indonesian list has been added https://github.com/nltk/nltk_data/pull/112 '
'''
# list the languages that have a stopword table; there is no built-in Chinese list
stopwords.fileids()
'''
['arabic',
'azerbaijani',
'danish',
'dutch',
'english',
'finnish',
'french',
'german',
'greek',
'hungarian',
'indonesian',
'italian',
'kazakh',
'nepali',
'norwegian',
'portuguese',
'romanian',
'russian',
'slovene',
'spanish',
'swedish',
'tajik',
'turkish']
'''
# view the English stopword list as raw text
stopwords.raw('english').replace('\n', ' ')
'''
"i me my myself we our ours ourselves you you're you've you'll you'd your yours yourself yourselves he him his himself she she's her hers herself it it's its itself they them their theirs themselves what which who whom this that that'll these those am is are was were be been being have has had having do does did doing a an the and but if or because as until while of at by for with about against between into through during before after above below to from up down in out on off over under again further then once here there when where why how all any both each few more most other some such no nor not only own same so than too very s t can will just don don't should should've now d ll m o re ve y ain aren aren't couldn couldn't didn didn't doesn doesn't hadn hadn't hasn hasn't haven haven't isn isn't ma mightn mightn't mustn mustn't needn needn't shan shan't shouldn shouldn't wasn wasn't weren weren't won won't wouldn wouldn't "
'''
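Note that raw() returns the file contents as a single string; for filtering you normally want the list form, stopwords.words('english'), which the next subsection converts to a set (the variable name below is just for illustration):
english_stopwords = stopwords.words('english')  # a Python list, one stopword per entry
len(english_stopwords)  # the exact count varies across NLTK versions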
2. Filtering stopwords
# `tokens` comes from an earlier tokenization step that is not shown; the sentence
# below is reconstructed from the output that follows (an assumption, for illustration)
from nltk.tokenize import word_tokenize
tokens = word_tokenize('Browse the latest developer documentation, including tutorials, sample code, articles, and API reference.')
test_words = [word.lower() for word in tokens]
# convert to a set, which makes intersecting with the stopword list convenient
test_words_set = set(test_words)
test_words_set
'''
{',',
'.',
'and',
'api',
'articles',
'browse',
'code',
'developer',
'documentation',
'including',
'latest',
'reference',
'sample',
'the',
'tutorials'}
'''
# check the overlap with the English stopword list
stopwords_english = set(stopwords.words('english'))
test_words_set.intersection(stopwords_english)
# {'and', 'the'}
# filter the stopwords out
filtered = [w for w in test_words_set if w not in stopwords_english]
filtered
'''
['documentation',
'api',
'tutorials',
'articles',
'.',
'including',
'latest',
'code',
'sample',
'developer',
',',
'reference',
'browse']
'''
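Filtering the set loses the original word order and any duplicates. When the sequence matters, filter the token list itself; a minimal sketch reusing the tokens and stopwords_english defined above:
# keep order and duplicates by filtering the token list rather than the set
filtered_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_english]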
II. Rare Words
Rare-word removal strips tokens that are essentially noise.
The rules depend on the scenario: HTML tags, overly long names, and so on (a sketch follows below).
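A minimal rule-based sketch (the regex, the length threshold, and the helper name drop_noise are illustrative assumptions, not NLTK API):
import re

def drop_noise(tokens, max_len=20):
    # drop HTML-tag-like tokens and overly long tokens
    html_tag = re.compile(r'^</?\w+[^>]*>$')  # e.g. '<br>', '</div>'
    return [t for t in tokens if not html_tag.match(t) and len(t) <= max_len]

drop_noise(['<br>', 'arXiv', 'x' * 30, 'articles'])
# ['arXiv', 'articles']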
import nltk
from nltk.tokenize import word_tokenize

# name the variable `text` rather than `str`, which would shadow the Python built-in
text = 'arXiv is a free distribution service and an open-access archive for 1,812,439 scholarly articles. Materials on this site are not peer-reviewed by arXiv.'
tokens = word_tokenize(text)
# frequency distribution of the terms in this corpus
freq_dist = nltk.FreqDist(tokens)
freq_dist
'''
FreqDist({'arXiv': 2, '.': 2, 'is': 1, 'a': 1, 'free': 1, 'distribution': 1, 'service': 1, 'and': 1, 'an': 1, 'open-access': 1, ...})
'''
# collect the rarest words into a list, then use it to filter the original corpus;
# most_common() sorts by descending frequency, so the tail holds the rarest words
# (freq_dist.keys() is not subscriptable in Python 3 and is not frequency-ordered)
rarewords = [w for w, _ in freq_dist.most_common()[-5:]]
rarewords
after_rare_words = [w for w in tokens if w not in rarewords]
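NLTK's FreqDist also provides hapaxes(), which returns the words that occur only once; that is often exactly what "rare words" means in practice:
# words with frequency 1 in this corpus
freq_dist.hapaxes()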