Text Vectorization and the Bag-of-Words Model - NLP Learning (3-1)

Tokenization - NLP Learning (1)

N-gram Models, Stopwords, and Normalization - NLP Learning (2)

 

In the previous posts we looked at how to preprocess text: (1) using the NLTK text-processing library to split sentences into tokens and build N-gram models, along with regular expressions to strip out unwanted pieces of a sentence; (2) removing stopwords; (3) common normalization steps such as case folding and stemming. In this post we will see how to count the words in a text and use those counts to judge how important a word is within a particular document or across a whole corpus. The purpose of counting words is to give each word (or character) a quantitative measure: once a word carries an importance value, we can do many useful things with it, such as keyword search or deciding whether a word is positive or negative. Previously we represented English words with one-hot encoding, but that approach has an obvious drawback: when the corpus is large, the one-hot vectors become extremely high-dimensional, which is awkward for a computer to handle and computationally expensive. We would therefore like a better way to encode words, one that also reflects how important each word is to the text.
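As a quick reminder of why one-hot encoding becomes unwieldy, here is a minimal sketch (the tiny vocabulary below is made up purely for illustration): every word gets a vector as long as the whole vocabulary, with a single 1 marking its position.

# One-hot encoding sketch: each word maps to a vector of length len(vocabulary)
vocabulary = sorted(set("the faster ray can get to the school".split()))
one_hot = {word: [1 if i == idx else 0 for i in range(len(vocabulary))]
           for idx, word in enumerate(vocabulary)}
print(vocabulary)      # ['can', 'faster', 'get', 'ray', 'school', 'the', 'to']
print(one_hot['ray'])  # [0, 0, 0, 1, 0, 0, 0]
# With a real corpus the vocabulary easily reaches tens of thousands of words,
# so every one-hot vector becomes just as long and almost entirely zeros.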

The Bag-of-Words Model

The bag-of-words (BoW) model counts the words in a text; put simply, it records how often (or how many times) each word appears in a document. Why is such a simple count useful? The assumption is that if a word appears frequently enough, it is important to the text and conveys much of what the text is trying to say. Note that this is only an assumption: some words appear very frequently yet carry little meaning, such as "The" in English or "的" in Chinese. This is exactly why the previous post covered normalization techniques: we remove words that contribute nothing useful, so that what remains can represent the meaning of the text as accurately as possible. The code below shows how to count the words in a text:

# Count how often each word appears in a text: term frequency
# The more frequently a word appears, the more likely it conveys what the text is about
from nltk.tokenize import TreebankWordTokenizer
from collections import Counter


sentence = """The faster Ray get to the bus stop, the faster and the faster Ray, can get to the school."""
tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(sentence.lower())
print("The tokens are: ", tokens)

bag_of_words = Counter(tokens)
print("The frequency of each word: ", bag_of_words)

The code above tokenizes the sentence with NLTK's TreebankWordTokenizer and then uses the Counter class from Python's built-in collections module to count how many times each word appears. The output is:

The tokens are:  ['the', 'faster', 'ray', 'get', 'to', 'the', 'bus', 'stop', ',', 'the', 'faster', 'and', 'the', 'faster', 'ray', ',', 'can', 'get', 'to', 'the', 'school', '.']
The frequency of each word:  Counter({'the': 5, 'faster': 3, 'ray': 2, 'get': 2, 'to': 2, ',': 2, 'bus': 1, 'stop': 1, 'and': 1, 'can': 1, 'school': 1, '.': 1})

One thing that cannot be ignored is that simply counting words throws away the order in which they appear. Word combinations carry meaning of their own, so a plain count loses that information and may make it harder to recover what the original sentence or text was trying to express. Even so, although a bag of words that ignores order may not be very informative for long documents, it works quite well on short texts or individual sentences, for example in spam detection.
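If some of that local word order matters, one option, following the N-gram idea from the previous post, is to count word pairs instead of single tokens. Here is a minimal sketch reusing the tokens from the example above (ngrams is NLTK's utility for producing N-gram tuples):

from collections import Counter
from nltk.util import ngrams

# Count bigrams instead of single tokens so that pairs such as ('bus', 'stop') stay together
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(3))
# e.g. [(('the', 'faster'), 3), (('faster', 'ray'), 2), (('get', 'to'), 2)]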

One concept worth noting: from the Python dictionary above we obtained the number of times each word appears, which is what we call the term frequency (TF). The result also illustrates the earlier point that high-frequency words are not necessarily useful: in the bag of words above, 'the' appears most often yet tells us nothing useful, so we can drop it, leaving 'faster' and 'ray' as the two most frequent words. Next, let's compute the term frequency of 'ray':

print(bag_of_words.most_common(4))            # most_common() on the Counter returns the 4 most frequent tokens

times_harry_appears = bag_of_words['ray']     # number of times 'ray' appears in the bag of words
num_unique_words = len(bag_of_words)          # number of distinct tokens in the bag of words

tf = times_harry_appears / num_unique_words   # term frequency of 'ray'
print(round(tf, 4))

The output of this code is 0.1667: the count of 'ray', which is 2, divided by the 12 distinct tokens in the bag of words. We have now turned text into numbers, which makes further computation and analysis possible. But this is not enough: we have only converted counts into frequencies, and the result still lives in a Python dictionary. To analyze text quantitatively, especially with a computer, we need to turn these numbers into vectors. To make this clearer, the following analysis uses a passage from the Wikipedia article on kites:

'A kite is traditionally a tethered heavier-than-air craft with wing surfaces that react\nagainst the air to create lift and drag. A kite consists of wings, tethers, and anchors. Kites\noften have a bridle to guide the face of the kite at the correct angle so the wind can lift it.\nA kite’s wing also may be so designed so a bridle is not needed; when kiting a sailplane\nfor launch, the tether meets the wing at a single point. A kite may have fixed or moving\nanchors. Untraditionally in technical kiting, a kite consists of tether-set-coupled wing\nsets; even in technical kiting, though, a wing in the system is still often called the kite.\nThe lift that sustains the kite in flight is generated when air flows around the kite’s\nsurface, producing low pressure above and high pressure below the wings. The\ninteraction with the wind also generates horizontal drag along the direction of the wind.\nThe resultant force vector from the lift and drag force components is opposed by the\ntension of one or more of the lines or tethers to which the kite is attached. The anchor\npoint of the kite line may be static or moving (such as the towing of a kite by a running\nperson, boat, free-falling anchors as in paragliders and fugitive parakites or vehicle).\nThe same principles of fluid flow apply in liquids and kites are also used under water.\nA hybrid tethered craft comprising both a lighter-than-air balloon as well as a kite lifting\nsurface is called a kytoon.\nKites have a long and varied history and many different types are flown individually and\nat festivals worldwide. Kites may be flown for recreation, art or other practical uses.\nSport kites can be flown in aerial ballet, sometimes as part of a competition. Power kites\nare multi-line steerable kites designed to generate large forces which can be used to\npower activities such as kite surfing, kite landboarding, kite fishing, kite buggying and a\nnew trend snow kiting. Even Man-lifting kites have been made.'

Store the passage above in a variable named kite_text, import the relevant libraries to tokenize it into tokens, and then remove the stopwords, since these high-frequency words carry little useful information.

from collections import Counter
from nltk.tokenize import TreebankWordTokenizer


tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize(kite_text.lower())
token_counts = Counter(tokens)
print(token_counts)
import nltk

# the English stopword list requires nltk.download('stopwords') to have been run once
stopwords = nltk.corpus.stopwords.words('english')
tokens = [x for x in tokens if x not in stopwords]
kite_counts = Counter(tokens)
print(kite_counts)

# Vectorizing: turn each count into a term frequency and collect the values in a list
document_vector = []
doc_length = len(tokens)

for key, value in kite_counts.most_common():
    document_vector.append(round(value / doc_length, 4))

print(document_vector)

The output is:

[0.0727, 0.0636, 0.0364, 0.0227, 0.0182, 0.0182, 0.0136, 0.0136, 0.0136, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0091, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045, 0.0045]

The array (vector) above is what we ultimately use for computation and analysis. Analyzing a single document in isolation is of limited value; in practice we usually analyze many documents or whole corpora, which gives us a much larger set of vectors, i.e. a matrix. One thing to note is that different documents contain different numbers of words, so if we vectorize each document independently we will not get vectors of a uniform size; we therefore require every document to produce a vector of the same dimension. This involves two steps:

  • compute the TF of every token, rather than just the raw counts per document;
  • make sure every vector has the same dimension.

Let's now see how to make sure all the vectors have the same dimension.

 1 docs = ["The faster Harry got to the store, the faster and faster Harry would get home."]
 2 docs.append("Harry is hairy and faster than Jill.")
 3 docs.append("Jill is not as hairy as Harry.")
 4 
 5 doc_tokens = []
 6 for doc in docs:
 7     doc_tokens += [sorted(tokenizer.tokenize(doc.lower()))]
 8 print("文本标识符: ", doc_tokens)
 9 print("第一个文本的长度: ", len(doc_tokens[0]))
10 
11 all_doc_tokens = sum(doc_tokens, [])
12 print("三个文本的总标识符: ", len(all_doc_tokens))
13 
14 lexicon = sorted(set(all_doc_tokens))
15 print("打印出的总词汇量: ", lexicon)
16 print("总词汇量: ", len(lexicon))

The output of the above code is:

Document tokens:  [[',', '.', 'and', 'faster', 'faster', 'faster', 'get', 'got', 'harry', 'harry', 'home', 'store', 'the', 'the', 'the', 'to', 'would'], ['.', 'and', 'faster', 'hairy', 'harry', 'is', 'jill', 'than'], ['.', 'as', 'as', 'hairy', 'harry', 'is', 'jill', 'not']]
Length of the first document:  17
Total tokens across the three documents:  33
Lexicon:  [',', '.', 'and', 'as', 'faster', 'get', 'got', 'hairy', 'harry', 'home', 'is', 'jill', 'not', 'store', 'than', 'the', 'to', 'would']
Size of the lexicon:  18

The first thing printed is a two-dimensional array holding the tokens of the three documents after tokenization; next comes the number of tokens in the first document. The three documents contain 33 tokens in total, and after deduplication we get a lexicon of 18 distinct words, so every vector must have length 18. Since no single document contains all 18 words, some entries of the resulting vectors will be 0.

import copy
from collections import OrderedDict


zero_vector = OrderedDict((token, 0) for token in lexicon)
print(zero_vector, '\n\n')

doc_vectors = []
for doc in docs:
    vec = copy.copy(zero_vector)
    tokens = tokenizer.tokenize(doc.lower())
    token_counts = Counter(tokens)
    for key, value in token_counts.items():
        vec[key] = round(value / len(lexicon), 4)
    doc_vectors.append(vec)

for i, doc_vec in enumerate(doc_vectors):
    print("{} : {}".format(i + 1, doc_vec), '\n')

The output is:

OrderedDict([(',', 0), ('.', 0), ('and', 0), ('as', 0), ('faster', 0), ('get', 0), ('got', 0), ('hairy', 0), ('harry', 0), ('home', 0), ('is', 0), ('jill', 0), ('not', 0), ('store', 0), ('than', 0), ('the', 0), ('to', 0), ('would', 0)]) 


1 : OrderedDict([(',', 0.0556), ('.', 0.0556), ('and', 0.0556), ('as', 0), ('faster', 0.1667), ('get', 0.0556), ('got', 0.0556), ('hairy', 0), ('harry', 0.1111), ('home', 0.0556), ('is', 0), ('jill', 0), ('not', 0), ('store', 0.0556), ('than', 0), ('the', 0.1667), ('to', 0.0556), ('would', 0.0556)]) 

2 : OrderedDict([(',', 0), ('.', 0.0556), ('and', 0.0556), ('as', 0), ('faster', 0.0556), ('get', 0), ('got', 0), ('hairy', 0.0556), ('harry', 0.0556), ('home', 0), ('is', 0.0556), ('jill', 0.0556), ('not', 0), ('store', 0), ('than', 0.0556), ('the', 0), ('to', 0), ('would', 0)]) 

3 : OrderedDict([(',', 0), ('.', 0.0556), ('and', 0), ('as', 0.1111), ('faster', 0), ('get', 0), ('got', 0), ('hairy', 0.0556), ('harry', 0.0556), ('home', 0), ('is', 0.0556), ('jill', 0.0556), ('not', 0.0556), ('store', 0), ('than', 0), ('the', 0), ('to', 0), ('would', 0)]) 
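Because every vector above was filled in from the same zero_vector, all three share the same 18 keys in the same order, so the OrderedDicts can be flattened into plain numeric lists whenever downstream computation needs them; a minimal sketch:

# Flatten each OrderedDict into a plain list of TF values; positions line up across documents
vector_matrix = [list(vec.values()) for vec in doc_vectors]
for row in vector_matrix:
    print(row)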

In this process we first created a zero vector and then assigned the computed values to it, which gives us vectors of the same dimension. As you can see, some tokens have a TF value of 0, because different documents contain different words: a word present in one document may be absent from another. That is the whole construction process of text vectorization and the bag-of-words model. This post focused on how to represent the words of a text as vectors and gave a first look at TF. The concept of TF is simple, but it is tightly coupled with the TF-IDF computation that comes later, so it deserves to be spelled out clearly. In the next two posts (3-2 and 3-3) we will look at the role Zipf's Law plays in NLP, use it to introduce TF-IDF, and give a brief introduction to how topic modeling works.

posted @ 2019-02-11 15:57  JieLongZ