<NLP with Python> Notes: Part 2

Ch1 Language Processing with Python

1.1 Computing with Language: Texts and Words

　　Treat text as raw data to be processed.

Searching Text

　　nltk.text.Text.concordance(self, word): prints every occurrence of word in the text together with its surrounding context;

　　nltk.text.Text.similar(self, word): prints other words that appear in the same contexts as the given word;

　　nltk.text.Text.common_contexts(self, words): finds the contexts in the text that are shared by all of the words in words;

　　nltk.text.Text.dispersion_plot(self, words): plots where each of the words in words occurs across the text. (!!! bug)
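　　To make the idea of a concordance concrete, here is a minimal plain-Python sketch. It is not NLTK's implementation: the function name `concordance`, its `width` parameter, and the sample sentence are assumptions made for illustration only.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    target = word.lower()
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = tokens[max(0, i - width):i]      # context before the match
            right = tokens[i + 1:i + 1 + width]     # context after the match
            lines.append(" ".join(left + [tok] + right))
    return lines


# Hypothetical sample text, split on whitespace.
tokens = "the whale swam away and the sailors chased the whale again".split()
hits = concordance(tokens, "whale", width=2)
for line in hits:
    print(line)
```

　　Matching is case-insensitive here, a choice made for this sketch so that "Whale" and "whale" count as the same word.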

Counting Vocabulary

　　In the examples above, the most obvious difference between the texts is the vocabulary they use.

　　Token: a unit of the text; tokens include punctuation marks as well as words.

　　Vocabulary: the distinct words in a text; its size is the number of distinct words.

　　nltk.text.Text.count(self, word): counts how many times word occurs in the text;
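　　The distinction between tokens and vocabulary can be checked on a plain list of tokens. This is a small sketch assuming Python 3; the sample sentence and variable names are made up for illustration.

```python
# A tiny hand-tokenized text: punctuation counts as a token too.
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

num_tokens = len(tokens)             # every token, repeats included
vocabulary = sorted(set(tokens))     # distinct tokens only
vocab_size = len(vocabulary)
the_count = tokens.count("the")      # occurrences of one word

# Ratio of distinct tokens to all tokens (true division in Python 3).
lexical_diversity = vocab_size / num_tokens

print(num_tokens, vocab_size, the_count)
```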

1.2 A Closer Look at Python: Texts as Lists of Words

Lists

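　　Treat a text as a sequence of words and punctuation marks. In Python this is stored as a list, which supports indexing and slicing and is a mutable object.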
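　　A short sketch of the three list properties named above (indexing, slicing, mutability); the sample sentence is only an illustration.

```python
sent = ["Call", "me", "Ishmael", "."]

first = sent[0]       # indexing starts at 0
last = sent[-1]       # negative indices count from the end
middle = sent[1:3]    # a slice includes the start index and excludes the end index

# Lists are mutable: elements can be replaced or appended in place.
sent[0] = "Phone"
sent.append("!")

print(first, last, middle, sent)
```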

Variables

　　variable = expression. This is how Python assigns a value to a variable.

Strings

　　Strings support indexing and slicing, but they are immutable objects.
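　　The following sketch shows that strings index and slice like lists but reject in-place assignment. The variable names are illustrative only.

```python
name = "Monty"

first_char = name[0]    # indexing works as with lists
prefix = name[:4]       # so does slicing

# Strings are immutable: item assignment raises TypeError.
try:
    name[0] = "m"
    mutated = True
except TypeError:
    mutated = False

# join() and split() convert between a list of words and a string.
joined = " ".join(["Monty", "Python"])
parts = joined.split(" ")

print(first_char, prefix, mutated, joined, parts)
```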

1.3 Computing with Language: Simple Statistics

Frequency Distributions

　　Frequency distribution: how often each vocabulary item occurs in the text.

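　　nltk.probability.FreqDist(samples=None): returns a frequency distribution object. It behaves like a dict (in NLTK 3 it subclasses collections.Counter) and adds many convenience methods.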

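　　Among the methods FreqDist provides are plot(), N() (total number of samples counted) and B() (number of distinct samples); hapaxes() returns the words that occur only once.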
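　　The quantities behind N(), B() and hapaxes() can be sketched with the standard library's collections.Counter. This is a stand-in for illustration, not FreqDist itself; the variable names and sample sentence are assumptions.

```python
from collections import Counter

tokens = "to be or not to be that is the question".split()
fdist = Counter(tokens)          # maps each word to its count

total = sum(fdist.values())      # what FreqDist.N() reports: all samples counted
bins = len(fdist)                # what FreqDist.B() reports: distinct samples
hapaxes = sorted(w for w, c in fdist.items() if c == 1)   # words seen exactly once
top = fdist.most_common(2)       # the two most frequent words with their counts

print(total, bins, hapaxes, top)
```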

Fine-Grained Selection of Words

　　List comprehension: used, for example, to select the words that are long enough. General form: [w for w in v if f(w)]
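　　A small instance of the pattern [w for w in v if f(w)], where the condition is a length test. The threshold of 7 characters and the sample words are arbitrary choices for this sketch.

```python
tokens = ["The", "extraordinarily", "long", "word",
          "antidisestablishmentarianism", "appears", "rarely"]

vocab = set(tokens)                              # v: the distinct words
long_words = [w for w in vocab if len(w) > 7]    # f(w): longer than 7 characters
long_words = sorted(long_words)                  # sets are unordered, so sort for display

print(long_words)
```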

Collocations and Bigrams

　　nltk.util.bigrams(sequence): generates every pair of adjacent items in sequence and returns them as an iterator.

　　Collocation: a bigram whose words occur together unusually often.

　　nltk.text.Text.collocations(self, num=20): builds the bigrams of the text, ranks them, and prints the top num (20 by default) as collocations.
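　　Generating bigrams and counting them can be sketched in plain Python. The `bigrams` helper below is a hypothetical stand-in written for this note, and ranking by raw frequency is a simplification; NLTK's own collocation ranking is not reproduced here.

```python
from collections import Counter


def bigrams(sequence):
    """Yield each pair of adjacent items, e.g. [a, b, c] -> (a, b), (b, c)."""
    items = list(sequence)
    return zip(items, items[1:])


tokens = "new york is big and new york is busy".split()
pairs = list(bigrams(tokens))        # 9 tokens give 8 adjacent pairs
bigram_counts = Counter(pairs)
top = bigram_counts.most_common(2)   # the most frequent pairs

print(top)
```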

Counting Other Things

　　Other methods of FreqDist().

1.4 Back to Python: Making Decisions and Taking Control

Conditionals

　　Comparison operators: <, >, !=, ==
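　　A brief sketch of these operators inside conditionals, applied to tokens. The `describe` function and its length thresholds are invented for illustration.

```python
def describe(word):
    """Label a word by length using <, == and an else branch."""
    if len(word) < 4:
        return "short"
    elif len(word) == 4:
        return "medium"
    else:
        return "long"


tokens = ["The", "whale", "is", "a", "fish", "."]

labels = [(w, describe(w)) for w in ["is", "fish", "whale"]]
no_period = [w for w in tokens if w != "."]        # != filters out one token
multi_char = [w for w in tokens if len(w) > 1]     # > keeps tokens longer than one character

print(labels, no_period, multi_char)
```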

1.5 Automatic Natural Language Understanding

　　Word Sense Disambiguation / Pronoun Resolution / Generating Language Output (question answering, machine translation) / Machine Translation / Dialogue Systems

1.6 Summary

posted @ 2016-06-19 20:11 Mscer
