nltk (Part 2)
1. collocations module
Counts how many times n words co-occur within each window of window_size words in a sequence of words.
from nltk.collocations import *

sent = 'this this is is a a test test'.split()
# Find bigrams within a sliding window of 2 words and inspect their counts.
b = BigramCollocationFinder.from_words(sent, window_size=2)
b.ngram_fd.items()
BigramCollocationFinder counts co-occurrences of two words.
TrigramCollocationFinder counts co-occurrences of three words.
QuadgramCollocationFinder counts co-occurrences of four words.
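A minimal sketch of the trigram finder on the same toy sentence (the quadgram finder is used the same way):

from nltk.collocations import TrigramCollocationFinder

sent = 'this this is is a a test test'.split()
t = TrigramCollocationFinder.from_words(sent)
# ngram_fd maps each trigram to the number of times it occurs.
t.ngram_fd.items()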
2. data module
Manages path information for NLTK data packages.
nltk.data.path is the list of directories NLTK searches for data packages.
nltk.data.PathPointer is the base class for path pointers.
Its subclasses FileSystemPathPointer and GzipFileSystemPathPointer handle ordinary files and gzip-compressed files respectively.
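A small sketch of looking up a resource through nltk.data; the 'corpora/gutenberg' resource name assumes that corpus has already been downloaded.

import nltk.data

# Directories searched for data packages.
print(nltk.data.path)

# find() returns a PathPointer (here a FileSystemPathPointer) to the resource.
ptr = nltk.data.find('corpora/gutenberg')
print(type(ptr), ptr)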
3. featstruct module
Represents feature structures; behaves much like dict and list.
Feature stores a single feature and has a name attribute and a value; it has two subclasses, SlashFeature and RangeFeature.
FeatStruct holds a set of features; it has two subclasses, FeatDict and FeatList.
from nltk.featstruct import FeatStruct

FeatStruct('[a=?x]').unify(FeatStruct('[b=?x]'))
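A short sketch of the dict-like behaviour; the feature names are made up for illustration.

from nltk.featstruct import FeatStruct

# A FeatStruct parsed from a bracketed string acts like a (possibly nested) dict.
fs = FeatStruct("[TENSE='past', AGR=[NUM='sg', PER=3]]")
print(fs['AGR']['NUM'])
# Unification merges compatible structures.
print(fs.unify(FeatStruct("[AGR=[NUM='sg']]")))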
4. grammar module
Handles user-defined grammars.
import nltk
from nltk import CFG

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | V NP PP
PP -> P NP
V -> "saw" | "ate" | "walked"
NP -> "John" | "Mary" | "Bob" | Det N | Det N PP
Det -> "a" | "an" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
P -> "in" | "on" | "by" | "with"
""")
sent = 'Mary saw Bob'.split()
rd_parser = nltk.RecursiveDescentParser(grammar)
for i in rd_parser.parse(sent):
    print(i)
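With this grammar the recursive-descent parser finds exactly one tree for the sentence: (S (NP Mary) (VP (V saw) (NP Bob))).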
5. probability module
Mainly provides frequency distributions (FreqDist), conditional frequency distributions (ConditionalFreqDist), and probability distributions such as ELEProbDist.
from nltk.probability import ConditionalFreqDist
from nltk.tokenize import word_tokenize

sent = "the the the dog dog some other words that we do not care about"

# Build the conditional frequency distribution by hand,
# conditioning each word on its length.
cfdist = ConditionalFreqDist()
for word in word_tokenize(sent):
    print(word)
    condition = len(word)
    cfdist[condition][word] += 1

# Equivalent: pass (condition, sample) pairs directly to the constructor.
cfdist2 = ConditionalFreqDist((len(word), word) for word in word_tokenize(sent))
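Since ELEProbDist is mentioned above, here is a minimal sketch of building one from a frequency distribution; the sample sentence and the bins value are made up for illustration.

from nltk.probability import FreqDist, ELEProbDist

fd = FreqDist("the the the dog dog some other words".split())
# ELE = Expected Likelihood Estimate (add-0.5 smoothing over `bins` sample types).
pd = ELEProbDist(fd, bins=10)
print(pd.prob('the'))
print(pd.prob('unseen'))   # unseen samples still receive a small probability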
6. text module
Works with text: word lookup, word splitting, and text wrappers such as TextCollection.
import nltk.corpus
from nltk.text import TextCollection
from nltk.book import text1, text2, text3

# Wrap a whole corpus, or a list of Text objects, in a TextCollection.
gutenberg = TextCollection(nltk.corpus.gutenberg)
mytexts = TextCollection([text1, text2, text3])
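As a follow-up sketch, a TextCollection supports collection-level statistics such as tf_idf, and each Text supports word lookup via concordance; 'monstrous' and 'whale' are just illustrative query words.

from nltk.text import TextCollection
from nltk.book import text1, text2, text3

mytexts = TextCollection([text1, text2, text3])
text1.concordance('monstrous')           # word lookup within a single text
print(mytexts.tf_idf('whale', text1))    # tf-idf of a term relative to the collection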
7. tree module
Generates and prints syntax trees.
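A minimal sketch of building and printing a tree with nltk.tree.Tree, reusing the 'Mary saw Bob' parse from section 4.

from nltk.tree import Tree

# Build the tree by hand, print its bracketed form, then draw it as ASCII art.
t = Tree('S', [Tree('NP', ['Mary']),
               Tree('VP', [Tree('V', ['saw']), Tree('NP', ['Bob'])])])
print(t)             # (S (NP Mary) (VP (V saw) (NP Bob)))
t.pretty_print()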