Problem statement
Deerwester et al. address a common issue in information retrieval: recall is
often unsatisfying because of the mismatch between the terms with which
documents are indexed and the terms users would like to find them with.
Synonymous query terms therefore fail to retrieve a relevant document, which
seriously hurts recall. Moreover, polysemous query terms may return documents
that are of no interest to the user, which hurts precision.
The authors point out three factors that cause current systems to fail at
such tasks:
- Incomplete identification of index words: documents usually do not contain all the terms a user might query the data with.
- There is no way of dealing with polysemous words.
- Word types are treated as independent of one another.
Their assumption is that there is an underlying latent semantic structure in
the data (in which terms might refer either to the document they appear in,
to the overall topic, or to both) that can enhance term-matching retrieval.
In this vein, they introduce "latent semantic indexing" (LSI), which uses
singular-value decomposition (SVD). In effect, they want to model the
relationship between terms and documents directly. Similar documents end up
close to each other in the semantic space because they contain many similar
terms; the opposite holds for dissimilar documents.
Approach
The approach they chose is two-mode factor analysis. This technique allows
them to model documents and terms as points in a common space, where the
similarity between them is given by their dot product. New documents or terms
are ingested by taking the weighted sum of their component weight vectors
("folding in"; a sketch follows below). Note that the space would have looked
different if the terms or documents to be ingested had been present during
the original construction of the semantic space.
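As a rough illustration of the fold-in step, here is a minimal sketch (not the authors' code); the toy matrix, the value of k, and all variable names are my own assumptions, with T_k and S_k being the truncated SVD factors described in the next section.

```python
import numpy as np

# Toy term-document count matrix X (terms x documents); purely illustrative.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(12, 5)).astype(float)

# Truncated SVD factors of the original collection (see the next section).
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T_k, S_k = T0[:, :k], S0[:k]

def fold_in_document(x_new, T_k, S_k):
    """Fold a new document's raw term vector into the existing k-dimensional
    space via d_hat = x_new' T_k S_k^{-1}, without recomputing the SVD."""
    return x_new @ T_k / S_k

# A new pseudo-document that mentions terms 0 and 3.
x_new = np.zeros(12)
x_new[[0, 3]] = 1.0
print(fold_in_document(x_new, T_k, S_k))
```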
Turning to SVD in detail: SVD is similar to eigenanalysis, with the small
difference that it results in three matrices instead of two (T0, S0, D0').
Multiplying these three matrices reproduces the original term-document
matrix, X = T0 S0 D0'. However, the assumption is that this full
reconstruction also carries noise rather than only the relevant information,
which is why the number of dimensions is limited to the k largest values in
S0, yielding a reduced approximation X̂ = TSD' that should capture semantic
similarity better. Here k represents an arbitrary number of artificial
concepts. In this reduced semantic space, the n most similar documents can be
found with the cosine similarity measure (the higher the value, the more
similar the documents). The factor k should not be too small, since we want
to capture the most important and most relevant concepts, but neither should
it be too large, because then we would reintroduce noise and irrelevant
detail.
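To make the truncation-and-retrieval step concrete, here is a minimal, illustrative sketch (not the authors' implementation); the toy matrix, the query, the choice of k, and all names are my own assumptions.

```python
import numpy as np

def lsi_rank(X, query, k=2):
    """Rank the documents (columns of X) against a query term vector in a
    k-dimensional LSI space. Illustrative sketch only."""
    # Full SVD: X = T0 @ diag(S0) @ D0t
    T0, S0, D0t = np.linalg.svd(X, full_matrices=False)
    # Keep only the k largest singular values and their vectors.
    T, S, Dt = T0[:, :k], S0[:k], D0t[:k, :]
    # Document coordinates: rows of D*S (the scaling used for document comparisons).
    docs = Dt.T * S
    # Fold the query in as a pseudo-document; (q' T S^-1) scaled back by S is q' T.
    q = query @ T
    # Cosine similarity between the query and every document.
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q) + 1e-12)
    return np.argsort(-sims), sims

# Toy usage: 6 terms, 4 documents, query mentioning terms 1 and 2.
X = np.array([[1, 0, 0, 1],
              [1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
query = np.array([0, 1, 1, 0, 0, 0], dtype=float)
ranking, sims = lsi_rank(X, query, k=2)
print(ranking, np.round(sims, 3))
```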
- Comparing two terms: X̂X̂' = TS²T'
- Comparing two documents: X̂'X̂ = DS²D'
- Comparing a term and a document: the corresponding cell of X̂ = TSD' (see the sketch below)
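Assuming the k-truncated factors T, S, Dt from the retrieval sketch above (so that X̂ = T diag(S) Dt), the three comparisons can be computed directly; again an illustrative sketch, not the paper's code.

```python
import numpy as np

def lsi_comparisons(T, S, Dt):
    """Return the term-term, document-document, and term-document comparison
    matrices of the rank-k approximation X_hat = T @ diag(S) @ Dt."""
    term_term = T @ np.diag(S**2) @ T.T    # X_hat X_hat': dot products between term rows
    doc_doc = Dt.T @ np.diag(S**2) @ Dt    # X_hat' X_hat: dot products between document columns
    term_doc = T @ np.diag(S) @ Dt         # the cells of X_hat itself
    return term_term, doc_doc, term_doc
```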
Deerwester et al. did not preprocess their data in any way. They used
two data sets (MED and CISI) and tested their approach against three other
retrieval methods.
[Figure: Illustration of SVD, taken from Deerwester et al. (1990)]
Results
Overall, their approach performs better on the MED data set, but only about
as poorly as the other methods on the CISI data set, where precision was
rather low.
Conclusion
Modelling terms and documents in the same semantic space makes retrieving
documents relevant to user queries a straightforward matter. However, it
offers only a partial solution to the polysemy problem.
Rating: *****
Written in simple and plain English. Uses good examples and explains the concepts of LSI and SVD very well!