
What is to come?

Up next ...

Here is a short summary of what I plan to write about next. It is mostly bullet points, but maybe it will bring you back sooner or later to check whether there is something new you might be interested in.
  • Review of topic modeling tools
  • Literature review concerning topic modeling
  • Relevant projects: what they did, how they did it, and what we do
  • ...

