lifeofacomputationallinguist

Posts

About Topic Modeling on Historical Newspapers ...

... and where it fails
[Image: An article about the premiere of Dürrenmatt's play in the Netherlands.] [Image: An article about the situation in the Netherlands during WWII.]
Let’s assume a historian wants to investigate the battle of Arnhem and would like to use newspaper texts as her primary source. She enters her query “Arnhem” into a search engine and finds the two articles you see above. After a quick inspection, the historian deems the first one irrelevant for her further research, whereas the second one looks important. If we wanted to assign labels to the two texts, we would probably choose something like “arts” for the first one and “war” for the second one. The indicators for these labels are the words that are present in the articles. In the first one we find words like “théatres”, “comédie”, “pièce”, “œuvre”, etc., which we associate with “arts”. In the second one, words like “divisions”, “forces”, “combats”, “troupes”, etc., give the impression of a...
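The idea that the words in an article act as indicators for a label can be sketched in a few lines of Python. This is only a toy illustration, not a topic model: the indicator lists reuse the words quoted above, while the two sample sentences, the `INDICATORS` table and the `label` function are made up for this sketch.

```python
import re

# Indicator words quoted in the post; the label names are the ones suggested there.
INDICATORS = {
    "arts": {"théatres", "comédie", "pièce", "œuvre"},
    "war": {"divisions", "forces", "combats", "troupes"},
}

def label(text):
    """Return the label whose indicator words occur most often, or None if none occur."""
    tokens = re.findall(r"\w+", text.lower())
    counts = {
        name: sum(token in words for token in tokens)
        for name, words in INDICATORS.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None

# Invented sample sentences (not taken from the newspaper articles).
arts_text = "Cette comédie, une œuvre nouvelle, est la pièce la plus jouée des théatres."
war_text = "Les troupes et les divisions ont livré de violents combats contre les forces ennemies."

print(label(arts_text))  # arts
print(label(war_text))   # war
```

A hand-written word list like this has to be supplied in advance for every label, which is exactly the manual step a topic model tries to avoid.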

Creating a Gold Standard for OCR Quality Assessment

Building a Gold Standard for NZZ OCR Quality Assessment
Getting equipped with data feels like Christmas for a digital humanist (this sentence actually got me thinking about the name of my blog and whether it would be better to change it to "lifeofadigitalhumanist" :-), but naah!). And so I was quite happy when we received a 6TB external hard drive with all the NZZ data from 1780 to 2017 on it. Within the impresso project (see www.impresso-project.ch) we work with texts until 1950, which in the case of the NZZ still amounts to 2TB of PDFs full of text and image data. [Image: The external hard drive containing the NZZ newspapers.] So, the work of text mining 170 years of a historical newspaper could begin. Or so we thought. We realised very quickly that the OCRed text the NZZ so kindly delivered was not nearly as good as we had hoped. The quality of the images also leaves much to be desired. There are two main reasons for this: for one, as the NZZ approached its 225-year jubilee in 2005,...

Extracting Text from PDFs

TET bindings for Python
If you are about to extract text from digitised archives, you hope to receive clean XML files, because text extraction from XML files is an almost trivial undertaking. If you receive PDFs, on the other hand, things become a little more difficult. At least OCR has already been done, so we do not have to recognise the text from raw images ourselves. For unknown reasons, though, the OCR output has not been saved in an XML format, where you have the text in one file and the corresponding image (usually a TIFF) in another, but in a PDF where the text is saved in a stream alongside the image. Fortunately, there are libraries which allow you to extract this text stream from PDFs efficiently. One such tool is PDFlib. Within the PDFlib product range, you will find the Text and Image Extractor Toolkit, or TET for short. We have used the Python bindings to extract the text from the PDFs. [Image: A PDF of the first issue of the NZZ.] But how to get the text out of ...

Indexing by Latent Semantic Analysis (Deerwester et al., 1990)

Problem statement
Deerwester et al. address a common issue in information retrieval: recall is often unsatisfactory because the terms used to index documents differ from the terms users enter to find them. A query term that is merely a synonym of an index term fails to retrieve the relevant document, which seriously hurts recall. What is more, polysemy may return documents which are of no interest to the user, which harms precision. The authors point out three factors that cause existing systems to fail at such tasks: (1) incomplete identification of index words: documents rarely contain all the terms a user might query the data with; (2) there is no way of dealing with polysemous words; (3) the independence assumption for word types. Their assumption is that there is an underlying latent semantic structure (in which terms might either refer to the document they appear in, or to the overall to...
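To make the synonymy problem concrete, here is a minimal sketch of latent semantic indexing with a truncated SVD in NumPy. The five-document corpus, the choice of k = 2 and all function names are assumptions made for this illustration and do not come from the paper. For simplicity, queries and documents are both projected onto the top-k left singular vectors without rescaling by the singular values, and raw term counts are used without any weighting.

```python
import numpy as np

# Toy corpus, invented for illustration (not from the paper).
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "theatre play premiere",
    "theatre stage play",
]

vocab = sorted({word for doc in docs for word in doc.split()})
row = {word: i for i, word in enumerate(vocab)}

def term_vector(text):
    """Raw term counts over the toy vocabulary; unknown words are ignored."""
    vec = np.zeros(len(vocab))
    for word in text.split():
        if word in row:
            vec[row[word]] += 1.0
    return vec

# Term-document matrix: one row per term, one column per document.
X = np.column_stack([term_vector(doc) for doc in docs])

# Truncated SVD: keep only the k largest singular values (k is an assumption here).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
U_k = U[:, :k]

def to_latent(text):
    """Project a query or document onto the k-dimensional latent space."""
    return term_vector(text) @ U_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def similarities(query):
    """Cosine similarity between the query and every document in latent space."""
    q = to_latent(query)
    return [cosine(q, to_latent(doc)) for doc in docs]

for doc, sim in zip(docs, similarities("automobile")):
    print(f"{sim:+.2f}  {doc}")
```

In this toy setting the query "automobile" ends up close to "car engine repair" even though the two share no term, because "car" and "automobile" co-occur with the same words and fall onto the same latent dimension. The theatre documents stay near zero.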

1st Workshop at EPFL

Or shaping the future of newspaper archive digitisation
I admit this is a rather ambitious title which raises many expectations. But it is in line with the motto of the SNSF Sinergia programme, which enables researchers to produce breakthrough research. And we want to live up to this motto ;-). So, there we were, on the magnificent EPFL campus on the beautiful morning of the 24th of October, ready for a two-day kick-off workshop. [Image: View from the EPFL campus.] This was in fact the first meet-up of the entire consortium, consisting of the research groups from EPFL, the C2DH lab from Luxembourg, the Institute of Computational Linguistics from the University of Zurich, the historians from the University of Lausanne, archivists from Le Temps, NZZ, the Swiss and Luxembourg National Libraries, the Swiss Economic Archives, the Archives from the Canton of Wallis, representatives from infoclio.ch, and other invited partners and friends. I hope I did not forget anybody :-). ...

Homepage online

The impresso-project Website is Online! Only a minor entry today: our project website is online! Feel free to browse and if you have questions, insights you want to share, or any suggestions, please let us know :-) www.impresso-project.ch

Data, data, data ...

Data acquisition phase has started
As is true of every project, I assume, the data we work with is of vital importance. Our aim in the impresso project is to digitise and process various newspapers from the late 18th to the 20th century (200 years in total) so that historians can explore the data in previously unseen and unknown ways. As such, we will provide new tools for doing research in history, and who knows, maybe we will also take a leading role in revolutionising how historical research will be done in a few years ;-). But again, none of this will be possible without data. Already during the project application phase we contacted libraries in Switzerland and Luxembourg and asked them whether they had data available for our project. Many institutions gladly answered our call and committed themselves to assisting us as best they can. We are very happy that, for example, both the National Library of Luxembourg and its Swiss counterpart are on board! T...