The Term Count Model

Demystifying Term Vector Calculations

Background

In this section, I present a non-technical tutorial on Term Vector Theory for SEO and SEM specialists. As previously discussed, term weights are computed using local and global information. In the classic Vector Space Model, the weight of a term i is given by

Eq 1: Term Weight = wi = tfi*IDFi

Dr. Corliss, from Marquette University, provides excellent lecture material on this subject [1]. Lee, Chuang and Seamon compare six different term weight models derived from Eq 1 (tf only, IDF only, and combinations of these) [2]. Readers interested in open source term vector code may want to check the MINERVA implementation [3]. Jamie Callan, from Carnegie Mellon University, has a must-read lecture on the Vector Space Model, complete with how-to calculations [4].

Term Counts

As Callan pointed out, while the standard approach for many IR systems is tf*IDF weights, historically weights have been computed using local information only:

Eq 2: Term Weight = wi = tfi

In this model, and as with any weighting scheme, we need three things:

  • a database collection to retrieve documents from
  • an input query
  • an index of terms

The first two are pretty obvious.

The term index is a repository of unique terms that have been extracted from documents according to some selection rules. If I were a modern IR system, I would use my machine power to extract about 50 unique terms from each document (a la AltaVista's Term Vector Database). But if I were a primitive and lazy IR system, I would extract terms from the descriptors of the document. Does the phrase "meta data" or "meta tags" ring a bell?

Demystifying Term Vectors

We can demystify term vectors with a hypothetical example. Consider a term index consisting of the words "car", "auto" and "insurance". The database collection consists of only 3 documents. The term counts, or number of times these terms occur in each document, are:

  1. doc 1: auto (3 times), car (1 time), insurance (3 times)
  2. doc 2: auto (1 time), car (2 times), insurance (4 times)
  3. doc 3: auto (2 times), car (3 times), insurance (0 times)

Let's construct a space with these terms, where each term represents a dimension (coordinate). If we query this collection for "insurance", we can represent the query as a point with coordinates (0, 0, 1). This is so since the counts for the query are 0 for auto, 0 for car and 1 for insurance. We can do the same with each document. This is summarized in the following table:

Term         Doc 1   Doc 2   Doc 3   Query
auto           3       1       2       0
car            1       2       3       0
insurance      3       4       0       1

Let's analyze this table, column by column.

  • 1st column: This column describes the term space. This space consists of three dimensions: auto, car and insurance.
  • 2nd, 3rd and 4th columns: These columns are the term counts. The counts are the coordinates of a point in the term space that correspond to each document. The coordinates of each point are then (3,1,3), (1,2,4) and (2,3,0), respectively. If the origin coordinates are (0, 0, 0), then the displacement of each point from the origin can be represented by a vector. The length or magnitude of this vector can be measured with Pythagoras's Theorem [5].
  • 5th column: Shows the coordinates for the query, which in this case are (0, 0, 1).

This treatment can be extended to include more terms, dimensions, documents, and words in a query. We simply add more rows and columns to the table.
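For readers who prefer to see the table as data, here is a minimal Python sketch; the names term_space, doc_counts and query are my own labels for this toy example, not part of any particular IR system.

  # Term space: each term is one dimension (coordinate) of the space.
  term_space = ["auto", "car", "insurance"]

  # Term counts per document, listed in the same order as term_space.
  doc_counts = {
      "Doc 1": [3, 1, 3],
      "Doc 2": [1, 2, 4],
      "Doc 3": [2, 3, 0],
  }

  # The query "insurance" becomes the point (0, 0, 1) in this space.
  query = [0, 0, 1]

  # Print the table: one row per term, one column per document plus the query.
  header = ["Term"] + list(doc_counts) + ["Query"]
  print("\t".join(header))
  for i, term in enumerate(term_space):
      row = [term] + [str(counts[i]) for counts in doc_counts.values()] + [str(query[i])]
      print("\t".join(row))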

Computing Vector Magnitudes

By definition, a vector has magnitude and direction. Let's calculate these quantities.

To calculate the magnitude of each vector, we apply Pythagoras's Theorem. For n dimensions, we can write |Di| = (a1^2 + a2^2 + a3^2 + ... + an^2)^1/2. In our example n = 3, so the document and query vector magnitudes are

|D1| = (3^2 + 1^2 + 3^2)^1/2 = (19)^1/2 = 4.3589
|D2| = (1^2 + 2^2 + 4^2)^1/2 = (21)^1/2 = 4.5826
|D3| = (2^2 + 3^2 + 0^2)^1/2 = (13)^1/2 = 3.6056
|Q| = (0^2 + 0^2 + 1^2)^1/2 = (1)^1/2 = 1
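These magnitudes can be reproduced with a short Python sketch (again, docs and query are just my illustrative names for the toy data):

  import math

  docs = {"Doc 1": [3, 1, 3], "Doc 2": [1, 2, 4], "Doc 3": [2, 3, 0]}
  query = [0, 0, 1]  # counts for (auto, car, insurance)

  def magnitude(vector):
      """Vector length via Pythagoras: sqrt(a1^2 + a2^2 + ... + an^2)."""
      return math.sqrt(sum(a * a for a in vector))

  for name, counts in docs.items():
      print(name, round(magnitude(counts), 4))  # 4.3589, 4.5826, 3.6056
  print("Query", magnitude(query))              # 1.0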

Computing Dot Products (Scalars)

Before computing cosine values we need to know the dot product between query and document vectors. The dot product (also known as the scalar product or inner product) is computed from coordinate values. For document 1 we have

Q • D1 = q1*d1 + q2*d2 + q3*d3

Q • D1 = 0*3 + 0*1 + 1*3 = 3

As we can see, in our example the dot product is just the sum of products between term and query counts.
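A minimal sketch of this sum-of-products calculation, assuming the same toy counts as before:

  def dot(u, v):
      """Dot product: the sum of coordinate-wise products u1*v1 + ... + un*vn."""
      return sum(a * b for a, b in zip(u, v))

  query = [0, 0, 1]   # counts for (auto, car, insurance)
  doc1 = [3, 1, 3]    # Doc 1
  print(dot(query, doc1))  # 0*3 + 0*1 + 1*3 = 3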

NOTE ON DOT PRODUCTS

Some authors use dot products as similarity measures between documents and queries.

Sim(Q,Dj) = Q • Dj = qT*Dj

where qT is the transpose query matrix. In our example,

Q = qT = (0 0 1)

and

Sim(Q,D1) = (0 0 1)(3 1 3) = 0*3 + 0*1 + 1*3 = 3
Sim(Q,D2) = (0 0 1)(1 2 4) = 0*1 + 0*2 + 1*4 = 4
Sim(Q,D3) = (0 0 1)(2 3 0) = 0*2 + 0*3 + 1*0 = 0

The largest similarity score corresponds to D2 since it repeats the queried term the most (4 times).

The term-document matrix A is defined as

A = [D1, D2, D3,...Dn]

Relevance scores for a query Q are obtained by a simple matrix product qT*A. Elements of the resultant row vector are taken as similarity scores.

qT*A = (3, 4, 0)

The point is that similarity measures expressed in terms of dot products are easy to calculate with linear algebra. One simply constructs a query matrix q, computes its transpose (qT) and postmultiplies it by the term-document matrix A.
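As a rough illustration of that matrix product, here is a small Python sketch using plain lists; the layout of A (terms as rows, documents as columns) follows the example above.

  # Term-document matrix A: rows are terms (auto, car, insurance),
  # columns are documents D1, D2, D3.
  A = [
      [3, 1, 2],
      [1, 2, 3],
      [3, 4, 0],
  ]
  q = [0, 0, 1]  # m x 1 query matrix for "insurance", written as a flat list

  # qT * A: multiply the transposed query against each column of A.
  scores = [sum(q[i] * A[i][j] for i in range(len(q))) for j in range(len(A[0]))]
  print(scores)  # [3, 4, 0] -- the similarity scores for D1, D2, D3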

Computing Cosine Values

Since a dot product is defined as the product of the magnitudes of two vectors times the cosine of the angle between them [5],

dot product = (magnitudes product)*(cosine angle)

then, solving for the cosine angle gives

cosine angle = (dot product)/(magnitudes product) = (Q • Di)/(|Q|*|Di|)

In plain English, all these calculations mean this: project document and query vectors into a term space and calculate the cosine angle between them. The assumption here is that documents whose vectors are close to the query vector are more relevant to the query than documents whose vectors are far from the query vector.
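A sketch of the cosine calculation in Python, using the same toy counts; the cosine helper is my own and returns 0 for all-zero vectors to avoid dividing by zero:

  import math

  def cosine(u, v):
      """cos(angle) = (u . v) / (|u| * |v|); returns 0 when either vector is all zeros."""
      dot = sum(a * b for a, b in zip(u, v))
      norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
      return dot / norms if norms else 0.0

  docs = {"Doc 1": [3, 1, 3], "Doc 2": [1, 2, 4], "Doc 3": [2, 3, 0]}
  query = [0, 0, 1]
  for name, counts in docs.items():
      print(name, round(cosine(query, counts), 4))  # 0.6882, 0.8729, 0.0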

A Linear Algebra Approach

Linear algebra provides a nice, clean shortcut to all these calculations. One only needs to compute cosine similarity measures as the ratio of two products: (a) query-document dot products and (b) Frobenius Norm products. Before proceeding any further, let me explain the latter.

The Frobenius Norm of a matrix, also known as the Euclidean Norm, is defined as the square root of the sum of the absolute squares of its elements. Essentially, take a matrix, square all its elements, add them together and square root the result. The computed number is the Frobenius Norm of the matrix.

Since the rows and columns of a matrix are themselves one-row and one-column matrices, and these represent vectors, their Frobenius Norms equal the lengths of the corresponding vectors. Thus, the Frobenius Norm of Doc 1 is

ld1 = (3^2 + 1^2 + 3^2)^1/2 = (19)^1/2 = 4.3589

which as we saw is its Euclidean Distance from the origin of the term space. We can compute the Frobenius Norm for the query (lq) in the same way.
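A small sketch of this definition (the frobenius helper is mine); applied to a one-column matrix it simply returns the vector length:

  import math

  def frobenius(matrix):
      """Square every element, add the squares, and take the square root."""
      return math.sqrt(sum(x * x for row in matrix for x in row))

  doc1_column = [[3], [1], [3]]            # Doc 1 written as a one-column matrix
  print(round(frobenius(doc1_column), 4))  # 4.3589 -- identical to the vector length |D1|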

To explain how these concepts can be applied, let's construct the following array from the table given in our example.

A = | 3  1  2 |        q = | 0 |
    | 1  2  3 |            | 0 |
    | 3  4  0 |            | 1 |

Now we define the following matrices:

A = m x n term-document matrix consisting of m rows (terms) and n columns (documents)
q = m x 1 query matrix
qT = transpose of the query matrix
ld = 1 x n matrix whose elements are the lengths of the document vectors 
lq = 1 x 1 matrix whose only element is the length of the query vector

Now we compute the corresponding products and take individual ratios.

qT*A = (3, 4, 0)

lq*ld = (4.3589, 4.5826, 3.6056)

cosines = (3/4.3589, 4/4.5826, 0/3.6056) = (0.6882, 0.8729, 0)

As mentioned before, the beauty of linear algebra is that it provides one clean shot at the above calculations. Essentially, compute the dot products between the query and document vectors, compute their lengths, and then take the ratios. Neat! If a vector model defines elements of A as the product of local, global and normalization weights (instead of mere term counts), one can still use this approach.
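Putting the pieces together, a sketch of this one-shot computation on the same toy data might look like this:

  import math

  A = [[3, 1, 2], [1, 2, 3], [3, 4, 0]]  # term-document matrix (terms x documents)
  q = [0, 0, 1]                          # query column for "insurance"
  m, n = len(A), len(A[0])

  dots = [sum(q[i] * A[i][j] for i in range(m)) for j in range(n)]         # qT * A
  ld = [math.sqrt(sum(A[i][j] ** 2 for i in range(m))) for j in range(n)]  # document lengths
  lq = math.sqrt(sum(x * x for x in q))                                    # query length

  cosines = [d / (lq * l) if l else 0.0 for d, l in zip(dots, ld)]
  print([round(c, 4) for c in cosines])  # [0.6882, 0.8729, 0.0]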

Ranking the Results

Finally, we sort and rank documents in descending order of cosine values:

Rank 1: Doc 2 = 0.8729
Rank 2: Doc 1 = 0.6882
Rank 3: Doc 3 = 0

As we can see, for the query "insurance"

  1. Document 2 is very relevant.
  2. Document 1 is less relevant.
  3. Document 3 is completely irrelevant.
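If one wants to automate this final sorting step, a tiny Python sketch (reusing the cosine values computed above) will do:

  cosines = {"Doc 1": 0.6882, "Doc 2": 0.8729, "Doc 3": 0.0}

  # Sort documents by cosine value, highest first, and print the ranking.
  for rank, (doc, score) in enumerate(
          sorted(cosines.items(), key=lambda item: item[1], reverse=True), start=1):
      print(f"Rank {rank}: {doc} = {score}")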

Yet Another, Simpler Approach

Revisiting the linear algebra shortcut above, we might ask if there is an even simpler way of computing cosine similarities. The answer is YES, and the steps are given below. However, note that for each query you would need to recompute the matrix, since the query is part of the initial array.

  1. Construct a term-document matrix. Include the query in this matrix, as the first column.
  2. Rewrite this matrix by length-normalizing all columns, so these represent unit vectors. Let this be matrix A.
  3. Compute the ATA matrix.

You are done! For our example, this single-shot approach produces the following doc-doc similarity matrix.

        Q        D1       D2       D3
Q       1        0.6882   0.8729   0
D1      0.6882   1        0.8511   0.5727
D2      0.8729   0.8511   1        0.4842
D3      0        0.5727   0.4842   1

This produces a matrix that is symmetric around its diagonal. Since unit vectors are used, entries of ATA are both dot products and cosine similarities. So, what advantages does this approach offer over the one previously described?

First, all scores are stored in a single matrix. Second, the first row (or column) of this matrix stores query-document similarity scores. All diagonal elements are equal to 1, since this is the similarity of a vector with itself. And last, but not least, all other entries of this matrix are document-document similarities. Thus, document-document comparisons are possible. Cool! Ah, the power of linear algebra.
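Here is a sketch of the three steps above in Python, assuming the same toy collection; the unit helper and column labels are my own. The first printed row reproduces the query-document cosines, and the remaining rows are document-document similarities.

  import math

  # Step 1: term-document matrix with the query as the first column.
  columns = {
      "Q":  [0, 0, 1],
      "D1": [3, 1, 3],
      "D2": [1, 2, 4],
      "D3": [2, 3, 0],
  }

  # Step 2: length-normalize every column so each one becomes a unit vector.
  def unit(v):
      length = math.sqrt(sum(x * x for x in v))
      return [x / length for x in v] if length else v

  A = {name: unit(v) for name, v in columns.items()}

  # Step 3: ATA -- every entry is a dot product of unit vectors, i.e. a cosine.
  names = list(A)
  for row in names:
      line = [round(sum(a * b for a, b in zip(A[row], A[col])), 4) for col in names]
      print(row, line)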

In general, the closer a cosine is to 1, the more relevant a document should be. If the cosine is zero, then the document and query are orthogonal in the term space. This means that the document and query are not related. This is the case of Document 3. True, we could have arrived at this conclusion by just looking at the term counts. This is not a convenient shortcut, but a serious drawback of the model. Let's see why.

Limitations of the Model

The Term Count Model

  1. is sensitive to term repetition. Lee, Chuang and Seamon compared six different term weight models derived from Eq 1 and found "no particular situation in which" the model would excel [2].
  2. tends to favor large documents, since these contain many words that are often repeated and their term-document matrices have more entries. Thus, long documents score higher simply because they are longer, not because they are more relevant.

We can do better by multiplying tf values by IDF values, that is, by considering both local and global information. In this way we also account for the sheer volume of documents that are sensitive to (i.e., contain) the queried term. Thus, in the above calculations we just need to replace wi = tfi with wi = tfi*IDFi and populate the term-document matrix with these term weights.
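As a sketch of that substitution, and assuming the common definition IDFi = log(D/di), where D is the collection size and di is the number of documents containing term i (the exact IDF formula is my assumption, not stated above):

  import math

  # Term counts: rows are terms (auto, car, insurance), columns are D1, D2, D3.
  tf = [
      [3, 1, 2],
      [1, 2, 3],
      [3, 4, 0],
  ]
  D = len(tf[0])  # number of documents in the collection

  # Assumed IDF definition: IDFi = log(D / di), di = documents containing term i.
  idf = [math.log(D / sum(1 for count in row if count > 0)) for row in tf]

  # Replace wi = tfi with wi = tfi * IDFi and repopulate the term-document matrix.
  weights = [[count * idf[i] for count in row] for i, row in enumerate(tf)]
  for row in weights:
      print([round(w, 4) for w in row])

Note that in this tiny collection auto and car appear in every document, so their IDF is zero and they contribute nothing; only insurance, which is absent from one document, retains a non-zero weight. That is exactly the global discounting the tf*IDF scheme is meant to provide.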

A Word on Keyword Density

At this point, one may think: "Wait a second. I can divide tf by the total number of words, calculate a keyword density value and conclude the same about the importance of a term in a document." Not so fast. Current commercial search engines do not use this model, and with good reason. Such a system can be deceived by just repeating a given term over and over (keyword spamming).

These days search engines use local, global and web graph information, not mere term counts or "keyword density" values. Regarding keyword density, this myth has been debunked in The Keyword Density of Non-Sense.

Next: The Classic Vector Space Model

Prev: Term Vector Theory and Keyword Weights

References
  1. Vector Space Models
  2. Document Ranking and the Vector-Space Model
  3. Term Vector
  4. Retrieval Models: Boolean and Vector Space
  5. Vectors