What is this blog about?

Well, it's about NLP ... what else?

On 1 September 2017, I started my PhD studies at the Institute of Computational Linguistics at the University of Zurich. To keep track of the work I do, I thought it would be nice to have some kind of journal. And then I thought that maybe others could benefit from this work too, which is why a journal in the form of a blog makes the most sense, doesn't it?

However, if what I write here is of no use to anybody, I would like to be informed asap, although this would not change anything, of course. What nobody can take from me, however, is the feeling that something of my "making" remains in this world after I'm gone. Sixty years from now (yes, I still expect to spend a certain amount of time on this planet) nobody might be able to find the blog anymore anyway, and all of the findings I will have presented might be obsolete, but still, a blog is a blog, and nowadays it is fancy for trendy people to have one, so why not me? :-)

Joking aside ... everything related to what I'll be doing during my 3-4 years of PhD research should land on this blog sooner or later. My main interest is topic modeling, so this is what most contributions on this blog will be dedicated to. From literature to tool reviews, from conference visits to interesting talks, I'll try to put as much up on here as possible.

Feedback is always welcome. If you have any, don't hesitate to contact me on Twitter, Facebook, or via the contact form (if that exists on here; I have not checked that out yet :-) ).

Extracting Text from PDFs

TET bindings for Python

If you are about to extract text from digitised archives, you hope to receive clean XML files, because text extraction from XML files is an almost trivial undertaking. If you receive PDFs, on the other hand, things become a little more difficult. At least OCR has already been done, so you do not have to recognise text from pure images. For unknown reasons, though, the OCR output has not been saved in an XML format, where you have the text in one file and the corresponding image (usually a TIFF) in another, but in a PDF where the text is saved in a stream alongside the image. Fortunately, there are libraries which allow you to extract this text stream from PDFs efficiently. One such tool is PDFlib. Within the PDFlib product range, you will find the Text and Image Extractor Toolkit, or TET for short. We have used the Python bindings to extract the text from the PDFs.

A PDF of the first issue of the NZZ. But how to get the text out of

From LSI to PLSI

Probabilistic Latent Semantic Indexing (Hofmann 1999)

(Note: Hofmann has a very concise writing style. This summary thus contains passages which are more or less directly copied from Hofmann's paper, simply because you could not summarise them any further. This is just to point out in all clarity that everything in this summary represents Hofmann's work, without making direct "quotations" explicit.)

Problem Statement

Hofmann proposes probabilistic latent semantic indexing, which builds on the method by Deerwester et al. (1990). The main difference is that Hofmann's method has a solid statistical foundation and that it performs significantly better on term matching tasks. The problem statement is similar: in human-machine interaction, the challenge is to retrieve relevant documents and present them to the user after he or she has formulated a request. These requests are often stated as a natural language query, within which the user enters some key

Data, data, data ...

Data acquisition phase has started

As holds true for every project, I assume, the data we will work with is of vital importance. Our aim in the impresso project is to digitise and process various newspapers from the late 18th to the 20th century (200 years in total) so that historians can explore the data in previously unseen and unknown ways. As such, we will provide new tools to do research in history, and who knows, maybe we will also take a leading role in revolutionising how historical research will be done in a few years ;-). But again, none of this will be possible without data. Already during the project application phase, we contacted libraries in Switzerland and Luxembourg and asked them whether they had data available for our project. Many institutions gladly followed our call and committed themselves to assisting us as best they can. We are very happy that, for example, both the National Library of Luxembourg and its Swiss counterpart are on board! T

lifeofacomputationallinguist
