Text Feature Extraction: Term Frequency and Inverse Document Frequency

1. Term Frequency and Inverse Document Frequency

1.1 Formulas

Term frequency (tf): the term frequency \(\mathrm{tf}_{ij}\) of term \(t_i\) in document \(d_j\) is defined as:

\[\mathrm{tf}_{ij} = {\frac {n_{ij}}{\sum \limits_{t_{i'} \in d_j } n_{i'j}}} \]

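where:

\(\mathcal{D} = \{ d_1,d_2,\cdots,d_j, \cdots \}\) is the document collection, and \(D = |\mathcal{D}|\) is the number of documents
- For sentence-level sentiment analysis, each sentence can be treated as one document
\(n_{ij}\) is the number of times term \(t_i\) occurs in document \(d_j\)
\(\sum \limits_{t_{i'} \in d_j } n_{i'j}\) is the sum of the counts of all terms occurring in document \(d_j\) (i.e., the total number of terms in \(d_j\))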

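Inverse document frequency (idf): a measure of the general importance of a term across the collection (a term that occurs in fewer documents receives a larger value). It is computed as: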

\[\text{idf}_i = \lg \frac{D}{\text{df}(t_i)+1} \]

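where:

\(\text{df}(t_i)\) is the number of documents in \(\mathcal{D}\) that contain term \(t_i\).
The 1 added to the denominator avoids division by zero when \(t_i\) does not occur in any document.

Note: the TfidfTransformer class in sklearn computes idf slightly differently from the formula above.

\(\text{tf-idf}\): measures the importance of a term within a particular document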

\[\text{tf-idf}_{ij} = \text{tf}_{ij} \times \text{idf}_i \]
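
The three formulas above can be sketched in plain Python. This is a minimal illustration rather than a library implementation: the function names `term_frequency`, `inverse_document_frequency`, and `tf_idf`, as well as the pre-tokenized sample documents, are assumptions made for the example. It follows the formulas as written, including the base-10 logarithm and the +1 in the denominator, and assumes every document is non-empty.

```python
import math
from collections import Counter

def term_frequency(term, doc_tokens):
    """tf_ij: count of `term` in the document divided by the document's total token count."""
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens)

def inverse_document_frequency(term, docs):
    """idf_i = lg(D / (df(t_i) + 1)), where lg is the base-10 logarithm."""
    df = sum(1 for doc_tokens in docs if term in doc_tokens)  # documents containing the term
    return math.log10(len(docs) / (df + 1))

def tf_idf(term, doc_tokens, docs):
    """tf-idf_ij = tf_ij * idf_i."""
    return term_frequency(term, doc_tokens) * inverse_document_frequency(term, docs)

# Hypothetical, already tokenized sample documents
docs = [
    ['this', 'is', 'the', 'first', 'document'],
    ['this', 'document', 'is', 'the', 'second', 'document'],
    ['and', 'this', 'is', 'the', 'third', 'one'],
    ['is', 'this', 'the', 'first', 'document'],
]

print(term_frequency('document', docs[1]))         # 2 occurrences out of 6 tokens
print(inverse_document_frequency('second', docs))  # lg(4 / (1 + 1))
print(tf_idf('second', docs[1], docs))
```

One consequence of the +1 in the denominator is that a term occurring in every document (such as 'this' here) gets a slightly negative idf, since \(\lg \frac{4}{5} < 0\).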

1.2 Implementation in `Python`

1.2.1 Counting Term Frequencies with the `collections` Module

Introduction to the `collections` module: see blog site

Example:

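import pandas as pd
from collections import Counter

# Example token list; in practice it comes from tokenizing your own text
tokens = 'this is the first document this document is the second document'.split()
counter = Counter(tokens)              # maps each term to its absolute count
freq = dict(counter)                   # counts for every term
freq = dict(counter.most_common(100))  # keep only the 100 most frequent terms
pd.Series(freq).plot(figsize=(20, 7), kind='bar')  # bar chart of the counts (requires matplotlib)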

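1.2.2 The `NLTK` Library

NLTK computes tf, idf, and tf-idf for one word at a time through the TextCollection class; unlike the TfidfTransformer class in scikit-learn, it cannot output the results for all words at once.

Example: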

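import re
from nltk.text import TextCollection

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# TextCollection expects each document as a list of tokens; with raw strings,
# counts would be taken over characters and substrings instead of words.
# A simple lowercase regex tokenizer is used here.
docs = [re.findall(r'\w+', text.lower()) for text in corpus]

texts = TextCollection(docs)   # TextCollection object
word = 'this'
doc = docs[0]  # compute tf, idf, and tf-idf of 'this' in the first document
tf = texts.tf(word, doc)
idf = texts.idf(word)
tf_idf = texts.tf_idf(word, doc)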
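
Because TextCollection scores a single word per call, a full document-term matrix has to be assembled by looping over the vocabulary. The following is a minimal sketch under a few assumptions: the regex tokenizer and the names `docs`, `vocabulary`, and `tfidf_matrix` are illustrative choices, not part of NLTK.

```python
import re
from nltk.text import TextCollection

corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']

# Tokenize each document into lowercase word tokens (illustrative tokenizer)
docs = [re.findall(r'\w+', text.lower()) for text in corpus]
texts = TextCollection(docs)

# Sorted vocabulary across all documents
vocabulary = sorted({token for doc in docs for token in doc})

# One row per document, one column per vocabulary term
tfidf_matrix = [[texts.tf_idf(term, doc) for term in vocabulary] for doc in docs]

for doc_index, row in enumerate(tfidf_matrix):
    print(doc_index, [round(value, 3) for value in row])
```

Each entry costs a separate pass over the collection, so this approach is only practical for small corpora; for larger ones the scikit-learn classes are the more natural fit.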

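1.2.3 The `scikit-learn` Library

CountVectorizer() class: computes term counts (absolute frequencies)

Main parameters:

ngram_range: a tuple (min_n, max_n) giving the range of n-gram lengths to extract
stop_words: {'english'} or a list
- When a list is given, it only takes effect if analyzer='word'
analyzer: {'word', 'char', 'char_wb'}, whether features are built from word n-grams or character n-grams
max_df and min_df: a float in [0, 1] or an int; terms whose document frequency is above max_df or below min_df are ignored (a float is a proportion of documents, an int is an absolute number of documents)
- Defaults: max_df=1.0, min_df=1, i.e., no terms are filtered out.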

Example:

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.',            # each element is one document, 4 documents in total
          'This document is the second document.', 
          'And this is the third one.',
          'Is this the first document?',]

vectorizer = CountVectorizer() 
word_cnt = vectorizer.fit_transform(corpus)  # count term occurrences; returns a sparse matrix
word_cnt = word_cnt.toarray()                # convert to a dense np.ndarray
# Output
print(vectorizer.get_feature_names_out())
print(word_cnt)

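TfidfTransformer() class: transforms a count matrix into a tf or tf-idf representation

Note: this class computes idf slightly differently from the formula above.

Main parameters:

norm: {'l1', 'l2'} or None, default 'l2'. Normalizes the tf-idf values of each document (row)
use_idf: default True
- use_idf=False: sets \(\text{idf}_i=1\), so only the term frequency tf is computed
smooth_idf: default True
- smooth_idf=False: idf(t) = log[n/df(t)] + 1
- smooth_idf=True: idf(t) = log[(n+1)/(df(t)+1)] + 1
- Here n is the total number of documents and log is the natural logarithm
sublinear_tf: default False. If True, tf is replaced with 1 + log(tf)

Main attributes:

.idf_: the idf value of each term

Example: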

from sklearn.feature_extraction.text import TfidfTransformer

counts = [[3, 0, 1], [2, 0, 0], [3, 0, 0], 
          [4, 0, 0], [3, 2, 0], [3, 0, 2]]
# Term counts: each row is a document and each column is a term
# (counts of 3 terms across 6 documents)

transformer = TfidfTransformer(smooth_idf=False)
# remaining parameters keep their defaults: norm='l2', use_idf=True, sublinear_tf=False
tfidf = transformer.fit_transform(counts)  # returns a sparse matrix
tfidf = tfidf.toarray() # convert to np.ndarray

# The .idf_ attribute holds the computed idf values
print(transformer.idf_)
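
The numbers produced by this example can be checked by hand. The sketch below recomputes them with the standard library only, assuming the idf formulas listed under the parameters (natural logarithm) and l2 normalization of each row; the helper names `idf_values` and `l2_normalize` are made up for the illustration.

```python
import math

counts = [[3, 0, 1], [2, 0, 0], [3, 0, 0],
          [4, 0, 0], [3, 2, 0], [3, 0, 2]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term occurs
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

def idf_values(df, n_docs, smooth):
    """idf per term, following the smooth_idf=True / smooth_idf=False formulas."""
    if smooth:
        return [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    # Without smoothing, every term must occur in at least one document
    return [math.log(n_docs / d) + 1 for d in df]

def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are left unchanged)."""
    norm = math.sqrt(sum(value * value for value in row))
    return [value / norm for value in row] if norm > 0 else row

idf = idf_values(df, n_docs, smooth=False)
tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

print(df)
print(idf)
print(tfidf[0])
```

The first term occurs in all six documents, so its unsmoothed idf is log(6/6) + 1 = 1, whereas the formula in section 1.1 would give it a negative weight.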

Posted 2022-05-18 17:15 by veager
