TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) measures how important a term is to one document within a collection of documents (a corpus). It is a statistical method.
Term Frequency(TF): $$\text{TF} = \frac{\text{number of times the term appears in the document}}{\text{total number of terms in the document}}$$
Inverse Document Frequency(IDF):
\[\text{IDF} = \log\left(\frac{\text{number of documents in the corpus}}{\text{number of documents in the corpus that contain the term}}\right)
\]
To avoid division by zero (when no document in the corpus contains the term), the following form can be used.
\[\text{IDF} = \log\left(\frac{\text{number of documents in the corpus}}{\text{number of documents in the corpus that contain the term} + 1}\right)
\]
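As a quick worked example of the smoothed form, assume the natural logarithm and a hypothetical corpus of 4 documents in which the term appears in exactly 1 document (both are assumptions made only for illustration):
\[\text{IDF} = \log\left(\frac{4}{1 + 1}\right) = \log 2 \approx 0.693
\]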
TF-IDF is the product of TF and IDF.
\[\text{TF-IDF} = \text{TF} \cdot \text{IDF}
\]
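The formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes documents are already tokenized into lists of words, uses the natural logarithm, and applies the smoothed IDF with 1 added to the denominator. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def term_frequency(term, document):
    """TF: occurrences of `term` divided by the total number of terms in `document`."""
    if not document:
        return 0.0
    return Counter(document)[term] / len(document)


def inverse_document_frequency(term, corpus):
    """Smoothed IDF: log(N / (df + 1)), using the natural logarithm (an assumption)."""
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / (containing + 1))


def tf_idf(term, document, corpus):
    """TF-IDF: the product of TF and IDF."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)


# Hypothetical toy corpus: each document is a list of whitespace-separated tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
    "a quiet morning".split(),
]

for term in ("the", "cat", "mat"):
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

In this toy corpus, "the" appears in 3 of the 4 documents, so its smoothed IDF is log(4/4) = 0 and its TF-IDF is 0 even though it is the most frequent word in the first document. Note that with this smoothing a term that occurs in every document gets a negative IDF, which some implementations avoid by using a different smoothing variant.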
Reference:
https://www.learndatasci.com/glossary/tf-idf-term-frequency-inverse-document-frequency