What is this blog about?

Well, it's about NLP ... what else?

On 1 September 2017, I started my PhD studies at the Institute of Computational Linguistics at the University of Zurich. To keep track of the work I do, I thought it would be nice to keep some kind of journal. And then I figured that others might benefit from this work too, which is why a journal in the form of a blog makes the most sense, doesn't it?

However, should what I write here turn out to be of no use to anybody, I would like to be informed asap, although this would not change anything, of course. What nobody can take from me, however, is the feeling that something of my "making" remains in this world after I'm gone. Sixty years from now (yes, I still expect to spend a certain amount of time on this planet) nobody might be able to find this blog anymore anyway, and all the findings I will have presented might be obsolete, but still, a blog is a blog, and nowadays it is fashionable for trendy people to have one, so why not me? :-)

Joking aside ... everything related to what I'll be doing during my 3-4 years of PhD research should land on this blog sooner or later. My main interest is topic modeling, so that is what most posts will be dedicated to. From literature and tool reviews to conference visits and interesting talks, I'll try to put as much up here as possible.

Feedback is always welcome. If you have any, don't hesitate to contact me on Twitter, Facebook, or via the contact form (if that exists on here; I have not checked that out yet :-) ).
