
From LSI to PLSI

Probabilistic Latent Semantic Indexing (Hofmann 1999)

(Note: Hofmann has a very concise writing style. This summary therefore contains passages which are taken more or less directly from Hofmann's paper, simply because they could not be summarised any further. To be perfectly clear: everything in this summary represents Hofmann's work, even where direct quotations are not marked explicitly.)

Problem Statement

Hofmann proposes Probabilistic Latent Semantic Indexing (PLSI), which builds on the method by Deerwester et al. (1990). The main differences are that Hofmann's method has a solid statistical foundation and that it performs significantly better on term matching tasks. The problem statement is similar: in human-machine interaction, the challenge is to retrieve relevant documents and present them to the user after he or she has formulated a request. Such requests are often stated as natural language queries, in which the user enters some keywords and/or free-form text. While LSI maps terms as well as documents into the latent semantic space via SVD, PLSI arrives at the optimal decomposition by maximising a statistical objective function, the likelihood of the observed data.

Approach

The Aspect Model

The aspect model is a statistical mixture model and serves as the core of PLSI. Its generative story looks like the following:
  • select a document d with probability P\left(d\right)
  • pick a latent class z with probability P\left(z|d\right)
  • generate a word w with probability P\left(w|z\right)
  • where z \in Z = \{z_1, ..., z_K\},  w \in W = \{w_1, ..., w_M\} , and d \in D = \{d_1, ..., d_N\}
If I understood correctly, K defines the number of dimensions of the embedding simplex, which should correspond to the number of dimensions the latent semantic space is reduced to in LSI. The result of the generative process is the observed pair (d, w), while the latent class z is discarded. This can be translated into a joint probability model P(d, w) = P(d)P(w|d), where P(w|d) = \sum_{z \in Z} P(w|z)P(z|d). The model rests on two independence assumptions:
  1. observation pairs (d, w) are generated independently (bag-of-words assumption)
  2. since K \ll N, z acts as a bottleneck variable in predicting w conditioned on d.
The three probabilities can now be determined by maximisation of the log-likelihood function:
  • \mathcal{L} = \sum_{d \in D}\sum_{w \in W}n(d,w)\log P(d, w), where n(d, w) is the term frequency, and P(d, w)=\sum_{z \in Z} P(z)P(w|z)P(d|z) (an equivalent, symmetric parametrisation obtained via Bayes' rule)
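To make these quantities concrete, here is a minimal NumPy sketch with made-up toy dimensions and randomly initialised parameters (the array names p_z, p_w_given_z, p_d_given_z, and n_dw are mine, not Hofmann's). It evaluates the joint distribution P(d, w) and the log-likelihood \mathcal{L}:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 5, 10, 3          # documents, words, latent classes (toy sizes)

def normalise(a, axis):
    return a / a.sum(axis=axis, keepdims=True)

p_z = normalise(rng.random(K), axis=0)               # P(z)
p_w_given_z = normalise(rng.random((M, K)), axis=0)  # P(w|z), columns sum to 1
p_d_given_z = normalise(rng.random((N, K)), axis=0)  # P(d|z), columns sum to 1
n_dw = rng.integers(0, 5, size=(N, M))               # term frequencies n(d, w)

# P(d, w) = sum_z P(z) P(w|z) P(d|z)
p_dw = np.einsum('k,mk,nk->nm', p_z, p_w_given_z, p_d_given_z)

# L = sum_{d, w} n(d, w) log P(d, w)
log_likelihood = np.sum(n_dw * np.log(p_dw))
```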

Model Fitting with Tempered EM

In Expectation Maximisation, we first compute the posterior probabilities for the latent variables z:
P(z|d, w) = \frac{P(z)P(d|z)P(w|z)}{\sum_{z'}P(z')P(d|z')P(w|z')}
Secondly, in the M-step, we re-estimate the probability of a word given its latent class, the probability of a document given its class, and the class probability:
P(w|z)=\frac{\sum_{d}n(d, w)P(z|d, w)}{\sum_{d, w'}n(d, w')P(z|d, w')}
P(d|z)=\frac{\sum_{w}n(d, w)P(z|d, w)}{\sum_{d', w}n(d', w)P(z|d', w)}
P(z) = \frac{1}{R}\sum_{d,w}n(d,w)P(z|d,w)\text{, }R \equiv \sum_{d, w}n(d, w)
The reason Hofmann uses tempered EM is mainly to avoid overfitting. In tempered EM, we parametrise the E-step with a control parameter \beta:
P_\beta(z|d, w) = \frac{P(z)[P(d|z)P(w|z)]^\beta}{\sum_{z'}P(z')[P(d|z')P(w|z')]^\beta}
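Continuing the toy arrays from the sketch above, a single tempered EM iteration could look as follows (again an illustrative sketch of my own, not Hofmann's implementation; \beta = 1 recovers standard EM):

```python
import numpy as np

def tempered_em_step(n_dw, p_z, p_d_given_z, p_w_given_z, beta=1.0):
    # E-step: P_beta(z|d, w) proportional to P(z) [P(d|z) P(w|z)]^beta
    joint = p_z[None, None, :] * (p_d_given_z[:, None, :] * p_w_given_z[None, :, :]) ** beta
    p_z_given_dw = joint / joint.sum(axis=2, keepdims=True)            # shape (N, M, K)

    # M-step: re-estimate the three distributions from expected counts n(d, w) P(z|d, w)
    weighted = n_dw[:, :, None] * p_z_given_dw
    p_w_given_z = weighted.sum(axis=0) / weighted.sum(axis=(0, 1))     # P(w|z)
    p_d_given_z = weighted.sum(axis=1) / weighted.sum(axis=(0, 1))     # P(d|z)
    p_z = weighted.sum(axis=(0, 1)) / n_dw.sum()                       # P(z)
    return p_z, p_d_given_z, p_w_given_z

# One tempered iteration with a slightly lowered beta (new names so the
# toy parameters defined above stay untouched):
p_z_new, p_d_new, p_w_new = tempered_em_step(n_dw, p_z, p_d_given_z, p_w_given_z, beta=0.9)
```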
The following Figure 1 gives a geometric interpretation of the aspect model.

Fig. 1
Geometric interpretation of sub-simplex spanned by the aspect model.
Assuming K=3, the embedding space shows how the multinomial distributions for each word given a latent class span a two-dimensional, i.e. (K-1)-dimensional, sub-simplex (instead of the full (M-1)-dimensional probability simplex). This in itself already constitutes a dimensionality reduction of the kind we also find in LSI. The mixing weights P\left(z|d\right) correspond to the coordinates of a document in the sub-simplex.
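For a small worked example with made-up numbers: take K = 3 and a document d with mixing weights P(z_1|d) = 0.5, P(z_2|d) = 0.3, and P(z_3|d) = 0.2. Its word distribution is then the convex combination P(w|d) = 0.5\,P(w|z_1) + 0.3\,P(w|z_2) + 0.2\,P(w|z_3), i.e. a point inside the two-dimensional sub-simplex spanned by the three topic-specific word distributions; the three mixing weights are precisely the document's coordinates in that sub-simplex.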

Mixture Decomposition vs. SVD

Hofmann identifies the following similarities between SVD and mixture decomposition, where he defines the matrices U\text{, } V\text{, and } \Sigma (T\text{, } D\text{, and } S in Deerwester et al. (1990)) as \hat{U} = (P(d_i|z_k))_{i,k}, \hat{V} = (P(w_j|z_k))_{j,k}, and \hat{\Sigma} = \text{diag}(P(z_k))_k. The joint probability model can then be computed by multiplying these matrices, so that we get P=\hat{U}\hat{\Sigma}\hat{V}^T. As such, the weighted sum over outer products between rows of \hat{U} and \hat{V} reflects the conditional independence assumption in PLSA. Moreover, the factors P(w|z) and the component distributions P(d|z) of the aspect model correspond to the left/right singular vectors of the SVD. Last but not least, the mixing proportions P(z) in PLSA substitute the singular values of the SVD in LSA, as already hinted at above (where K corresponds to the number of singular values we would restrict ourselves to in LSA). However, it should be noted that in contrast to LSA, whose SVD minimises an L_2 (Frobenius) norm approximation error, PLSA arrives at the optimal decomposition by maximising the likelihood of multinomial sampling. This means that the factors of the mixture approximation P of the co-occurrence table, which is itself a well-defined probability distribution, possess a clear probabilistic meaning in terms of mixture component distributions.
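In code, and reusing the toy parameter arrays from the first sketch, the matrix view amounts to the following (an illustration of the correspondence, not the paper's implementation):

```python
import numpy as np

U_hat = p_d_given_z                  # N x K, entries P(d_i|z_k)
V_hat = p_w_given_z                  # M x K, entries P(w_j|z_k)
Sigma_hat = np.diag(p_z)             # K x K, diag(P(z_k))

P = U_hat @ Sigma_hat @ V_hat.T      # N x M matrix of joint probabilities P(d, w)

# For the same parameters this is exactly the mixture computed earlier,
# and as a joint distribution it sums to 1.
assert np.allclose(P, p_dw)
assert np.isclose(P.sum(), 1.0)
```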

Factor Representation

Hofmann notes that a model trained with 128 factors on the TDT-1 collection is able to distinguish between words used in the context of "plane" (where the occupants are usually referred to as "passengers") and those used in the context of "space shuttle" (where the term "astronaut" denotes an occupant). As such, the aspect model seems to be successful in assigning words to these 128 thematic bins.

Folding-In Queries

When users want to retrieve documents, they formulate a query, and this query has to be represented in the model somehow in order to identify the closest matches. In LSA, a query is simply mapped linearly into the model via a (weighted) centroid of its constituent term vectors. In contrast, PLSA computes the mixing proportions via EM. In this case, the factors P(w|z) are kept fixed, so that only the probability of a topic given the query needs to be maximised. Hofmann presents a nice example for the multi-word query "aid", "food", "medical", "people", "UN", and "war" and shows that, if we consider the four different factors which describe the war in Bosnia, the war in Iraq, the crisis in Rwanda, and the earthquake in Kobe, the query matches the Rwanda factor best after twenty iterations.
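A sketch of what folding in a query could look like under the toy setup above: P(w|z) is kept fixed and only the mixing proportions P(z|q) are fitted by EM (function and variable names are mine, not Hofmann's):

```python
import numpy as np

def fold_in_query(n_qw, p_w_given_z, n_iter=20):
    """n_qw: length-M vector of term frequencies of the query."""
    M, K = p_w_given_z.shape
    p_z_given_q = np.full(K, 1.0 / K)                  # uniform initialisation
    for _ in range(n_iter):
        # E-step: P(z|q, w) proportional to P(w|z) P(z|q)
        joint = p_w_given_z * p_z_given_q[None, :]     # M x K
        p_z_given_qw = joint / joint.sum(axis=1, keepdims=True)
        # M-step: P(z|q) proportional to sum_w n(q, w) P(z|q, w)
        p_z_given_q = (n_qw[:, None] * p_z_given_qw).sum(axis=0)
        p_z_given_q /= p_z_given_q.sum()
    return p_z_given_q
```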

Probabilistic Latent Semantic Indexing

Vector-Space Models (VSM) are common when it comes to information retrieval. In LSI, we use a low-dimensional latent space to represent documents. If we want to compute the similarity of a query or a document that is not part of the original model, LSI allows us to do so by folding in the query or document via simple matrix multiplication. Hofmann tests two variants against traditional LSI. PLSI-U, a context-dependent unigram model, provides a multinomial distribution over the entire vocabulary for a word given a document. In order to compare the documents against the query, Hofmann uses the cosine similarity measure. PLSI-Q, on the other hand, actually folds in the query and computes the weights of the topics given the query.
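As a simplified sketch of the PLSI-U idea, reusing p_dw from the first sketch: each document is represented by its smoothed word distribution P(w|d) = P(d, w)/P(d), and documents are ranked by cosine similarity to the query's term vector (the idf/tf-idf weighting mentioned in the results below is omitted here):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

p_w_given_d = p_dw / p_dw.sum(axis=1, keepdims=True)   # N x M, each row is P(w|d)

q_tf = np.zeros(p_w_given_d.shape[1])                  # toy query term vector
q_tf[[1, 4]] = 1.0                                     # pretend two vocabulary terms occur in the query

scores = [cosine(q_tf, p_w_given_d[i]) for i in range(p_w_given_d.shape[0])]
ranking = np.argsort(scores)[::-1]                     # indices of best-matching documents first
```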

Results

Hofmann evaluates the performance of PLSI (both U and Q variants, based on idf on the one hand and tf-idf weighting on the other) against traditional LSI on four data sets (MED, CRAN, CACM, CISI). The PLSA models are trained with 32, 48, 64, and 128 factors, while 10% of each data set was held out for testing. In general, PLSI-U and PLSI-Q significantly outperform standard LSI, especially when their models are combined.

