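Text Feature Extraction: Bag-of-Words, TF-IDF, and N-gram Models
Suppose we have a piece of text: "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends." How do we extract features from it?
A simple approach is the bag-of-words model. Choose a set of words from the text and put them in the "bag", count how many times each word in the bag occurs in the text (ignoring grammar and word order), and represent the counts as a vector.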
from sklearn.feature_extraction.text import CountVectorizer
text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
CV = CountVectorizer()
words = CV.fit_transform([text1])  # note: the input must be a list of strings, so wrap the text in a list
print(words)
(0, 3) 1
(0, 4) 1
(0, 0) 1
(0, 11) 1
(0, 2) 1
(0, 10) 1
(0, 7) 2
(0, 8) 2
(0, 9) 1
(0, 6) 1
(0, 1) 1
(0, 5) 1
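Each line has the form (document index, feature index) count. For example, (0, 7) 2 means that the word with feature index 7, "Huzihu", appears 2 times in document 0.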
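To make the counting step concrete, here is a minimal sketch of a bag of words built with only the standard library. The tokenization rule (lowercase, keep tokens of two or more word characters) is an assumption chosen to resemble CountVectorizer's defaults; it is not taken from the article.

```python
import re
from collections import Counter

text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."

# Assumed tokenization: lowercase, then keep runs of two or more word characters.
tokens = re.findall(r"\b\w\w+\b", text1.lower())

counts = Counter(tokens)        # word -> number of occurrences
vocabulary = sorted(counts)     # fixed alphabetical order defines the vector dimensions
vector = [counts[word] for word in vocabulary]

for word, count in zip(vocabulary, vector):
    print(word, count)
```

Word order is lost as soon as the tokens go into the Counter; only the counts per vocabulary entry remain.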
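Text features are usually extracted for document classification, so we need to know how similar documents are to each other. One way to compare them is to compute the Euclidean distance between their feature vectors.
Text 2: "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
Text 3: "We all need to make plans for the future, otherwise we will regret when we're old."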
from sklearn.feature_extraction.text import CountVectorizer
text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."
corpus = [text1, text2, text3]  # put the three documents into a corpus
CV = CountVectorizer()
words = CV.fit_transform(corpus)
words_frequency = words.todense()  # convert the sparse result to a dense matrix with todense()
print(CV.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(words_frequency)
['all', 'and', 'are', 'cat', 'cousin', 'cute', 'dog', 'eating', 'for', 'friendly', 'friends', 'future', 'good', 'has', 'have', 'he', 'his', 'huzihu', 'is', 'likes', 'make', 'my', 'name', 'need', 'old', 'others', 'otherwise', 'plans', 're', 'really', 'regret', 'sleeping', 'the', 'to', 'we', 'when', 'will']
[[0 1 1 ..., 1 0 0]
 [0 1 0 ..., 0 0 0]
 [1 0 0 ..., 3 1 1]]
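from sklearn.metrics.pairwise import euclidean_distances
for i, j in ([0, 1], [0, 2], [1, 2]):
    # Index the sparse matrix directly: each row stays 2-D, and recent scikit-learn versions reject np.matrix input
    dist = euclidean_distances(words[i], words[j])
    print("Euclidean distance between the feature vectors of text {} and text {}: {}".format(i+1, j+1, dist))
Euclidean distance between the feature vectors of text 1 and text 2: [[ 5.19615242]]
Euclidean distance between the feature vectors of text 1 and text 3: [[ 6.08276253]]
Euclidean distance between the feature vectors of text 2 and text 3: [[ 6.164414]]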
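For readers who want to see what the distance computation does, the following is a small hand-written sketch of the Euclidean distance formula. The two count vectors are hypothetical and are not taken from the corpus above.

```python
import math

def euclidean_distance(u, v):
    # Square root of the summed squared differences, dimension by dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical count vectors over a shared four-word vocabulary.
doc_a = [1, 0, 2, 1]
doc_b = [0, 1, 2, 0]

print(euclidean_distance(doc_a, doc_b))
```

The smaller the distance, the more similar the word counts of the two documents.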
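Now consider which words should go into the bag. Some words carry little useful information, for example: the, be, you, he... These are called stop words. Because a text contains a very large number of distinct words (and every word in the bag is one dimension), we want to reduce the dimensionality as much as possible and remove this noise, which makes computation and model fitting easier.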
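You can also install NLTK (Natural Language Toolkit) and use the stop word list that comes with it.
Let's try it with NLTK below (install it first with pip install nltk, and download the stop word list once with nltk.download("stopwords")):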
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."
corpus = [text1, text2, text3]
noise = stopwords.words("english")  # NLTK's English stop word list
CV = CountVectorizer(stop_words=noise)
words = CV.fit_transform(corpus)
words_frequency = words.todense()
print(CV.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(words_frequency)
['cat', 'cousin', 'cute', 'dog', 'eating', 'friendly', 'friends', 'future', 'good', 'huzihu', 'likes', 'make', 'name', 'need', 'old', 'others', 'otherwise', 'plans', 'really', 'regret', 'sleeping']
[[1 0 1 ..., 1 0 0]
 [0 1 1 ..., 0 0 1]
 [0 0 0 ..., 0 1 0]]
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer
text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."
corpus = [text1, text2, text3]
def stemming(tokens):
    # Reduce every token to its stem with the Snowball stemmer
    stemmer = SnowballStemmer("english")
    return [stemmer.stem(each) for each in tokens]
def tokenize(text):
    tokenizer = RegexpTokenizer(r'\w+')  # regular-expression rule: runs of word characters
    tokens = tokenizer.tokenize(text)
    return stemming(tokens)
noise = stopwords.words("english")
CV = CountVectorizer(stop_words=noise, tokenizer=tokenize, lowercase=False)
words = CV.fit_transform(corpus)
words_frequency = words.todense()
print(CV.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(words_frequency)
['cat', 'cousin', 'cute', 'dog', 'eat', 'friend', 'futur', 'good', 'huzihu', 'like', 'make', 'name', 'need', 'old', 'otherwis', 'plan', 'realli', 'regret', 'sleep']
[[1 0 1 ..., 1 0 0]
 [0 1 1 ..., 0 0 1]
 [0 0 0 ..., 0 1 0]]
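The custom tokenizer above relies on stemming, which strips suffixes so that related word forms collapse into one feature (compare "friendly" and "friends" in the earlier output with "friend" here). A minimal sketch of the stemmer on its own, using a few words from the corpus:

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Words taken from the three example texts.
for word in ["friends", "friendly", "sleeping", "eating", "plans"]:
    print(word, "->", stemmer.stem(word))
```

Stems are not always dictionary words, which is why the feature list contains entries such as "futur" and "otherwis".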
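We also need to pay attention to changes in word form, such as singular and plural ("foot" and "feet"), past tense and present participle ("understood" and "understanding"), and active and passive ("eat" and "eaten"). All of these should be treated as the same feature. The remedy is lemmatization, for which NLTK provides WordNetLemmatizer (from nltk.stem.wordnet import WordNetLemmatizer).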
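Finally, long and short texts do not carry the same amount of information: a long text generally contains more keywords than a short one. We therefore normalize each text by dividing the number of times each word occurs by the total number of words in that text. This is called term frequency (note: the frequency discussed earlier was an absolute count; here it is a relative frequency).
In addition, when classifying documents, a word that appears in every text tells us little about which class a text belongs to. Common and rare words should therefore be treated differently, because they are not equally important. The remedy is to weight words by inverse document frequency (IDF). IDF weights a word by how many texts in the corpus contain it: a word that occurs in many texts gets a low weight, and a word that occurs in few texts gets a high weight. The combination of the two is called TF-IDF (term frequency, inverse document frequency), and it can be computed with sklearn's TfidfVectorizer.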
from nltk import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances
text1 = "I have a cat, his name is Huzihu. Huzihu is really cute and friendly. We are good friends."
text2 = "My cousin has a cute dog. He likes sleeping and eating. He is friendly to others."
text3 = "We all need to make plans for the future, otherwise we will regret when we're old."
corpus = [text1, text2, text3]
def stemming(tokens):
    # Reduce every token to its stem with the Snowball stemmer
    stemmer = SnowballStemmer("english")
    return [stemmer.stem(each) for each in tokens]
def tokenize(text):
    tokenizer = RegexpTokenizer(r'\w+')  # regular-expression rule: runs of word characters
    tokens = tokenizer.tokenize(text)
    return stemming(tokens)
noise = stopwords.words("english")
CV = TfidfVectorizer(stop_words=noise, tokenizer=tokenize, lowercase=False)
words = CV.fit_transform(corpus)
words_frequency = words.todense()
print(CV.get_feature_names_out().tolist())  # get_feature_names() was removed in scikit-learn 1.2
print(words_frequency)
for i, j in ([0, 1], [0, 2], [1, 2]):
    # Index the sparse matrix directly: each row stays 2-D, and recent scikit-learn versions reject np.matrix input
    dist = euclidean_distances(words[i], words[j])
    print("Euclidean distance between the feature vectors of text {} and text {}: {}".format(i+1, j+1, dist))
['cat', 'cousin', 'cute', 'dog', 'eat', 'friend', 'futur', 'good', 'huzihu', 'like', 'make', 'name', 'need', 'old', 'otherwis', 'plan', 'realli', 'regret', 'sleep']
[[ 0.30300252 0. 0.23044123 ..., 0.30300252 0. 0. ]
 [ 0. 0.40301621 0.30650422 ..., 0. 0. 0.40301621]
 [ 0. 0. 0. ..., 0. 0.37796447 0. ]]
Euclidean distance between the feature vectors of text 1 and text 2: [[ 1.25547312]]
Euclidean distance between the feature vectors of text 1 and text 3: [[ 1.41421356]]
Euclidean distance between the feature vectors of text 2 and text 3: [[ 1.41421356]]
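To make the weighting described earlier concrete, here is a hand-rolled sketch of TF-IDF using only the standard library. It assumes the plain textbook definitions (relative term frequency, and IDF as log(N / df)); the toy corpus and the function name tf_idf are hypothetical, not the article's.

```python
import math
from collections import Counter

# Tiny hypothetical corpus, already tokenized (not the article's three texts).
docs = [
    ["cat", "cute", "friend"],
    ["dog", "cute", "friend", "sleep"],
    ["plan", "future"],
]

n_docs = len(docs)
# Document frequency: the number of documents each term occurs in.
df = Counter(term for doc in docs for term in set(doc))

def tf_idf(doc):
    counts = Counter(doc)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc)              # relative term frequency
        idf = math.log(n_docs / df[term])  # terms found in fewer documents get a larger weight
        weights[term] = tf * idf
    return weights

for doc in docs:
    print(tf_idf(doc))
```

TfidfVectorizer will not reproduce these numbers exactly: by default it uses raw counts as TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and then scales every row to unit L2 length. That normalization is also why two texts with no stems in common, such as text 1 and text 3, end up exactly sqrt(2) ≈ 1.41421356 apart.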
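Drawbacks of the bag-of-words model:
1. It cannot capture relationships between words. For example, "Humans like cats." and "Cats like humans" have the same feature vector.
2. It cannot capture negation. For example, "I will not eat noodles today." and "I will eat noodles today." have opposite meanings, yet their feature vectors are very similar.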
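Both drawbacks are easy to reproduce. The sketch below uses a simplified standard-library bag of words (lowercase, split on word characters), which is an assumption for illustration rather than the exact CountVectorizer pipeline used earlier.

```python
import re
from collections import Counter

def bag_of_words(text):
    # Lowercase and split into word tokens; the Counter discards word order.
    return Counter(re.findall(r"\w+", text.lower()))

# Drawback 1: different meanings, identical bags.
a = bag_of_words("Humans like cats.")
b = bag_of_words("Cats like humans")
print(a == b)

# Drawback 2: opposite meanings, bags that differ in a single word.
pos = bag_of_words("I will eat noodles today.")
neg = bag_of_words("I will not eat noodles today.")
print(neg - pos)
```

The two negation sentences differ only in the count for "not", so any distance computed between their vectors is small even though the meanings are opposite.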
