
200 years of history await us

And off we go ...

Since this is my first blog post summarising a whole month, I'll touch upon some of the issues I had to deal with. More entries dedicated to specific problems or insights will follow. My aim is to blog about once a week; let's see where that gets us.

It's been a little more than a month now since we could officially announce our SNF project as "on-going". More concretely, three institutions, namely the C²DH Lab at the University of Luxembourg, the DHLAB at the EPFL, and the Institute of Computational Linguistics at the University of Zurich, set out to mine 200 years of newspaper data. We called the project impresso, which is the perfect passive participle of the Latin word "imprimere", meaning "to press", "to imprint", or "to stamp". Quite a fitting project name, I would say!

So far, my role has been quite a modest one in this undertaking. During my (planned) three years of PhD research, I want to explore and develop methods in the realm of topic modeling. That is, what topics do we stumble upon when we dive into historical newspaper collections? How are the topics related? How do topics develop over time? How can we do justice to the multilingual data?

A thought that popped into my mind recently is whether topic modeling could be used to detect fake news, which is a big topic nowadays. While writing these lines, I came across the papers by Shu et al. (2017) and Elyashar, Bendahan, and Puzis (2017). Although I have not read them in detail yet, both at least mention having used LDA techniques. I'll leave a closer examination for later, though. A method to uncover the lies of the past would be exciting.
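For readers unfamiliar with LDA, the core idea can be sketched in a few lines of plain Python: each word token is assigned to a latent topic, and collapsed Gibbs sampling repeatedly resamples those assignments until words that co-occur end up in the same topic. The toy corpus, hyperparameters, and topic count below are my own illustrative choices, not anything from the impresso project; real work would use a proper library such as gensim or MALLET.

```python
import random
from collections import defaultdict

# Toy corpus with two obvious themes (politics vs. sports), purely illustrative.
docs = [
    "election vote parliament minister".split(),
    "vote election government minister".split(),
    "football match goal team".split(),
    "team goal match player".split(),
]

K = 2                    # number of latent topics (chosen by hand here)
ALPHA, BETA = 0.1, 0.01  # symmetric Dirichlet priors
vocab = sorted({w for d in docs for w in d})
random.seed(42)

# Random initial topic assignment for every token.
assign = [[random.randrange(K) for _ in d] for d in docs]

# Count matrices derived from the initial assignments.
doc_topic = [[0] * K for _ in docs]
topic_word = [defaultdict(int) for _ in range(K)]
topic_total = [0] * K
for di, d in enumerate(docs):
    for wi, w in enumerate(d):
        t = assign[di][wi]
        doc_topic[di][t] += 1
        topic_word[t][w] += 1
        topic_total[t] += 1

# Collapsed Gibbs sampling: resample each token's topic in turn.
for _ in range(200):
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = assign[di][wi]
            # Remove the current assignment from the counts.
            doc_topic[di][t] -= 1
            topic_word[t][w] -= 1
            topic_total[t] -= 1
            # Sample a new topic proportional to the LDA full conditional.
            weights = [
                (doc_topic[di][k] + ALPHA)
                * (topic_word[k][w] + BETA)
                / (topic_total[k] + BETA * len(vocab))
                for k in range(K)
            ]
            t = random.choices(range(K), weights=weights)[0]
            assign[di][wi] = t
            doc_topic[di][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

# Inspect the most frequent words per topic.
for k in range(K):
    top = sorted(topic_word[k], key=topic_word[k].get, reverse=True)[:4]
    print(f"topic {k}:", top)
```

On a corpus this small and clean, the sampler typically separates the political and sporting vocabulary into the two topics; scaling this to 200 years of multilingual newspapers is, of course, exactly where the hard research questions begin.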

Since I'm relatively new to automatic text classification and categorisation, I first conducted an extensive review of the available literature. All in all, I think I have identified the most important work and was able to narrow it down to about 50 A-level publications. Speaking of which, deciding what to read and what to skip matters: after all, we can only dedicate a limited amount of time to what others have done already. This is why one of my supervisors pointed me to Robert Munro's blog. In a quite lengthy read, he lists the top 10 NLP conferences and journals and hints at which literature can be skipped without leaving us with a bad feeling.

What else is there to come? Currently, we are working on our website, which will be available soon. Moreover, the first of six workshops, the Kick-Off Workshop so to speak, will take place on 23 and 24 October at the EPFL.

Let's get started :-)
