星云外

tf-idf


The tf-idf weight (term frequency-inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the tf-idf for each query term; many more sophisticated ranking functions are variants of this simple model.


Motivation

Suppose we have a set of English text documents and wish to determine which document is most relevant to the query "the brown cow." A simple way to start out is by eliminating documents that do not contain all three words "the," "brown," and "cow," but this still leaves many documents. To further distinguish them, we might count the number of times each term occurs in each document and sum them all together; the number of times a term occurs in a document is called its term frequency. However, because the term "the" is so common, this will tend to incorrectly emphasize documents which happen to use the word "the" more often, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword for distinguishing relevant from non-relevant documents, while terms like "brown" and "cow" that occur rarely are. Hence an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely.

Mathematical details

The term count in the given document is simply the number of times a given term appears in that document. This count is usually normalized to prevent a bias towards longer documents (which may have a higher term count regardless of the actual importance of that term in the document) to give a measure of the importance of the term ti within the particular document dj. Thus we have the term frequency, defined as follows.

 \mathrm{tf_{i,j}} = \frac{n_{i,j}}{\sum_k n_{k,j}}

where ni,j is the number of occurrences of the considered term (ti) in document dj, and the denominator is the sum of the numbers of occurrences of all terms in document dj (that is, the total number of terms in dj).
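As an illustrative sketch (not part of the original article), the term-frequency formula above can be computed directly from a tokenized document; the function name and the token-list representation are assumptions made for the example:

```python
from collections import Counter

def term_frequency(term, document_tokens):
    # tf_{i,j} = n_{i,j} / sum_k n_{k,j}: occurrences of the term in the
    # document, divided by the total number of term occurrences in it.
    counts = Counter(document_tokens)
    return counts[term] / len(document_tokens)

# e.g. term_frequency("cow", ["the", "brown", "cow", "the"]) -> 1/4 = 0.25
```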

The inverse document frequency is a measure of the general importance of the term (obtained by dividing the number of all documents by the number of documents containing the term, and then taking the logarithm of that quotient).

 \mathrm{idf_{i}} =  \log \frac{|D|}{|\{d: t_{i} \in d\}|}

with

  • | D | : total number of documents in the corpus
  •  |\{d : t_{i} \in d\}|  : number of documents where the term ti appears (that is,  n_{i,j} \neq 0). If the term is not in the corpus, this leads to a division by zero. It is therefore common to use 1 + |\{d : t_{i} \in d\}| instead
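A minimal sketch of the idf formula, including the common 1 + |{d : ti ∈ d}| adjustment mentioned above (the function name and the corpus-as-token-lists representation are assumptions):

```python
import math

def inverse_document_frequency(term, documents):
    # idf_i = log(|D| / (1 + |{d : t_i in d}|)); the +1 avoids a
    # division by zero when the term appears in no document.
    containing = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / (1 + containing))
```

With three documents of which exactly one contains "cow", this returns log(3 / 2).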

Then

 \mathrm{(tf\mbox{-}idf)_{i,j}} = \mathrm{tf_{i,j}} \times  \mathrm{idf_{i}}

A high weight in tf-idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. The tf-idf value for a term will always be greater than or equal to zero.
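The filtering effect described above can be seen in a small sketch (hypothetical helper, using the unsmoothed idf exactly as in the formula): a term such as "the" that occurs in every document gets idf = log(1) = 0 and therefore a tf-idf weight of zero, while a rare term keeps a positive weight.

```python
import math
from collections import Counter

def tf_idf(term, doc_tokens, all_docs):
    # tf: relative frequency of the term in this document
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    # idf: log of (total documents / documents containing the term)
    containing = sum(1 for d in all_docs if term in d)
    idf = math.log(len(all_docs) / containing)
    return tf * idf

docs = [["the", "brown", "cow"], ["the", "dog"], ["the", "cat"]]
print(tf_idf("the", docs[0], docs))  # 0.0 -- "the" is in every document
print(tf_idf("cow", docs[0], docs))  # positive -- "cow" occurs in only one
```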

Example

Consider a document containing 100 words wherein the word cow appears 3 times. Following the previously defined formulas, the term frequency (tf) for cow is then 0.03 (3 / 100). Now, assume we have 10 million documents and cow appears in one thousand of these. Then, the inverse document frequency (using the natural logarithm) is calculated as ln(10 000 000 / 1 000) = 9.21. The tf-idf score is the product of these quantities: 0.03 × 9.21 ≈ 0.28.
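The arithmetic of this example can be checked in a few lines (using the natural logarithm, as the example does):

```python
import math

tf = 3 / 100                        # term frequency of "cow": 0.03
idf = math.log(10_000_000 / 1_000)  # ln(10000), about 9.21
score = tf * idf                    # about 0.2763, which rounds to 0.28
```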

posted on 2010-04-02 13:53  星云外