Probabilistic Latent Semantic Indexing (Hofmann 1999)
(Note: Hofmann has a very concise writing style. This summary thus contains passages which are more or less directly copied from Hofmann's paper, simply because you could not summarise them any further. This is just to point out in all clarity that everything in this summary represents Hofmann's work (without making direct "quotations" explicit)).
Problem Statement
Hofmann proposes probabilistic latent semantic indexing, which builds on the method by Deerwester et al. (1990). The main difference is that Hofmann's method has a solid statistical foundation and that it performs significantly better on term matching tasks. The problem statement is similar: in human machine-interaction, the challenge is to retrieve relevant documents and present those to the user after he or she has formulated a request. These requests are often stated as a natural language query, within which the user enters some keywords and/or free-form text. While LSI maps terms as well as documents in the latent semantic space via SVD, Probabilistic Latent Semantic Indexing (PLSI) uses an objective function to arrive at the optimal decomposition.
Approach
The Aspect Model
The aspect model is a statistical mixture model and serves as the core of PLSI. Its generative story looks like the following:
- select a document with probability
- pick a latent class with probability
- generate a word with probability
- where , , and
If I understood correctly, defines the number of dimensions in the embedding simplex, which should correspond to the number of dimensions the latent semantic space is reduced to in LSI. The result of the generative process is the observed pair , while we can discard the . This can be translated into a joint probabilty model , where . This model is based on two independence assumptions:
- observation paris are generated independently (bag-of-words)
- since , acts as a bottleneck variable in predicting conditioned on .
The three probabilites can now be determined by maximisation of the log-likelihood function:
- , where is the term frequency, and (Bayes' rule)
2.2.2 Model Fitting with Tempered EM
In Expectation Maximisation, we first compute the posterior probabilities for the latent variables :
Secondly, we maximise the probability of a word given its latent class, the probability of a document given its class and the class probability:
The reason Hofmann used tempered EM is mainly to avoid overfitting. With tempered EM, we parametrise the E-step with a control parameter :
The following Figure 1 gives a geometric interpretation of the aspect model.
Assuming , the embedding space shows how the multinomial distributions for each word given a latent class span a two dimensional ( ) sub-simplex (instead of a probability simplex). This procedure per se already constitutes a dimensionality reduction which we find also in LSI. The mixing weights correspond to the coordinates of a document in the sub-simplex.
Mixture Decomposition vs. SVD
Hofmann identifies the following similarities between SVD and mixture decomposition, where he defines the matrices ( in Deerwester et al. (1990)) as , , and . The joint probability model can then be easily computed by multiplying these matrices, so that we get . As such, the weighted sum over outer products betweeen rows of and reflects conditional independence in PLSA. Moreover, the factors and the component distributions of the aspect model correspond to the left/right eigenvectors in SVD. Last but not least, the mixing proportions in PLSA substitute the singular values of the SVD in LSA, as already hinted at above (where would correspond to the the number of singular values we would restrict ourselves to in LSA). However, it should be noted that in contrast to LSA, which uses the -norm, PLSA uses an objective function to arrive at the optimal decomposition. This means that the factors of the mixture approximation of the co-occurence table, which in itself is a well-defined probability distribution, posses a clear probabilistic meaning in terms of mixture and component distributions.
Factor Representation
Hofmann notes that a model trained with 128 factors on the TDT-1 collection is able to distinguish between words used related to "plane" (where the occupants are usually referred to as "passengers") and the ones used in the context of "space shuttle" (where we use the term "austronaut" to denote an occupant). As such, the aspect model seems to be successful in assigning words to these 128 thematic bins.
Folding-In Queries
When users want to retrieve documents and formulate the query, this query has to be somehow represented in the model in order to identify the closest matches. In LSA, a query is simply a linear mapping in the model via a (weighted) centereing of its constituent terms. In contrast, PLSA computes mixing proportions via EM. In this case, the factors are fixed, so that only the probability of a topic given a query needs to be maximised. Hofmann presents a nice example for the multi-word query "aid", "food", "medical", "people", "UN", and "war" and shows that, if we consider the four different factors which describe the war in Bosnia, the war in Iraq, the crisis in Rwanda, and the earthquake in Kobe, after twenty iterations, the query matches Rwanda best.
Probabilistic Latent Semantic Indexing
Vector-Space Models (VSM) are common when it comes to information retrieval. In LSI, we use a low-dimensional latent space to represent documents. If we now want to compute the similarity of a query or a document that is not part of the original model, LSI allows us to do so by folding-in the query or document by simple matrix multiplication. Hofmann uses two variants to text against traditionsl LSI. PLSI-U, a context-dependent unigram model, provides a multinomial distribution for a word given a document over the entire vocabulary. In order to compare the documents against the query, Hofmann uses the cosine similarity measure. PLSI-Q, on the other hand, actually folds in the query and computes the weights of the topic given the query.
Results
Hofmann evaluates the performance of PLSI (U & Q, based on idf on the one hand, and tfidf on the other hand) and traditional LSI based on different data sets (MED, CRAN, CACM, CISI). PLSA models are trained with 32, 48, 64, 128, while 10% of each data set was left out for testing. In general, PLSI-U and PLSI-Q significantly outperform standard LSI (especially their combined models).
In the ever-evolving realm of SEO, embracing innovations like Latent Semantic Indexing is crucial for achieving sustainable success. By understanding the power of LSI keywords and incorporating them intelligently into your content, you not only enhance your search engine visibility but also provide users with valuable and contextually relevant information.
ReplyDeleteIn the intricate dance between websites and search engines, dynamic ways of SEO emerges as the orchestrator of success. From content optimization to technical intricacies, embracing these dynamic ways allows your website to not only survive but thrive in the competitive digital sphere.
Embracing Semantic SEO is a strategic move towards future-proofing your digital presence. By transitioning from a keyword-centric to a context-driven approach, you signal to search engines that your content is not only relevant but also aligned with user intent.