Skip to main content

Indexing by Latent Semantic Analysis (Deerwester et al., 1990)

Problem statement

Deerwester et al. address a common issue in information retrieval, which is the often unsatisfying recall due to the differences how documents are indexed and with what terms users would like to find them.  As such, synonymous query terms fail to retrieve a smiliar document and thus have a serious impact on recall. What is more, polysemy might return documents which are of no interest to the user, which severes precision.
The authors point out three factors which are responsible for recent systems to fail in such tasks:
  1. Incomplete identification of index words: documents usually never contain all terms a user might query the data with.
  2. There is no way of dealing with polysemeous words.
  3. Independency assumption of word types  
Their assumption is that there is an underlying latent semantic structure (in which terms might either refer to the document they appear in, or to the overall topic, or both) in the data which will enhance term-matching retrieval. In this vein, they introduce "latent semantic indexing" (LSI) which uses singular-value decomposition (SVD). In fact, they want to model the relationship between terms and documents directly. Similar documents would occur close to each other in the semantic space because they contain many similar terms. The opposite would be true for dissimilar documents.

Approach

The approach they chose is two-mode factor analysis. This techniques allows them to model documents and terms as points in a space, where the similarity between them would be given by their dot product. The ingestion of new documents or terms would be achieved by taking the weighted sum of its component weight vectors. Note that the space would have looked differently if the terms or documents to ingest had been present during the original construction of the semantic space. 
Turning to SVD in detail: SVD is similar to eigenanalysis, with the small difference that it results in three instead of two matrices (T0, S0, D0'). Multiplication of these three matrices would result in the original matrix X. However, we assume that this matrix does not capture the relevant information we need, which is why we limit the number of dimensions by the k topmost values in S0, resulting in a new matrix which should capture semantic similiarity better. k here represents an arbitrary number of artificial concepts. In this reduced semantic space, the n most similar documents can be calculated by taking the cosine similarity measure (the higher the value the more similar are the documents). The factor k should not be too small, since we want to capture the most important and most relevant concepts, but it should neither be too high, because we would introduce too much noise or irrelevant detail this way.
  • Comparing two terms: X̂X̂' = TS²T'
  • Comparing two documents = 
  • Comparing a termn and a document: 
Deerwester et al. did not preprocess their data in any way. They used two data sets and tested their approach against three others.

Illustration of SVD taken from Deerwester et al. (1990)

Results

Overall, they perform better on the MED data set, and equally poorly on the CISI data set, where precision was rather poor.

Conclusion

Modelling terms and documents in the same semantic space makes the problem a straightforward matter for retrieving documents relevant to user queries. However, it only offers a partial solution to the polysemy problem.

Rating: *****

Written in simple and plain English. Use good examples and explain the concept of LSI and SVD very good!

Comments

Popular posts from this blog

Extracting Text from PDFs

TET bindings for Python If you are about to extract text from digitised archives, you hope to receive clean XML files, because text extraction from XML files is an almost trivial undertaking. If you receive PDFs, on the other hand, things become a little more difficult. At least, OCR has been done already, so text recognition from pure images is done. For unknown reasons though, the output from the OCR has not been saved in an XML format, where you have the text in one file and the corresponding image (usually a tiff) in another file, but in a PDF where the text is saved in a stream alongside the image. Fortunately, there are libraries which allow you to extract this text stream from PDFs in an efficient manner. One such tool is PDFlib . Within the product range of PDFlib, you will find the Text and Image Extractor Toolkit, or TET for short. We have used the Python bindings to extract the text from the PDFs. A PDF of the first issue of the NZZ. But how to get the text out of

From LSI to PLSI

Probabilistic Latent Semantic Indexing  (Hofmann 1999) (Note: Hofmann has a very concise writing style. This summary thus contains passages which are more or less directly copied from Hofmann's paper, simply because you could not summarise them any further. This is just to point out in all clarity that everything in this summary represents Hofmann's work (without making direct "quotations" explicit)). Problem Statement Hofmann proposes probabilistic latent semantic indexing, which builds on the method by  Deerwester et al. (1990) . The main difference is that Hofmann's method has a solid statistical foundation and that it performs significantly better on term matching tasks. The problem statement is similar: in human machine-interaction, the challenge is to retrieve relevant documents and present those to the user after he or she has formulated a request. These requests are often stated as a natural language query, within which the user enters some key