Showing posts with the label NZZ

About Topic Modeling on Historical Newspapers ...

... and where it fails

An article about the premiere of Dürrenmatt's play in the Netherlands.

An article about the situation in the Netherlands during WWII.

Let’s assume a historian wants to investigate the battle of Arnhem and would like to use newspaper texts as her primary source. She enters her query “Arnhem” into a search engine and finds the two articles you see above. After a quick inspection, the historian deems the first one irrelevant for her further research, whereas the second one looks important. If we wanted to assign labels to the two texts, we would probably choose something like “arts” for the first one and “war” for the second one. The indicators for these labels are the words that are present in the articles. In the first one we find words like “théatres”, “comédie”, “pièce”, “œuvre”, etc., which we associate with “arts”. In the second one, words like “divisions”, “forces”, “combats”, “troupes”, etc., give the impression of a...

Creating a Gold Standard for OCR Quality Assessment

Building a Gold Standard for NZZ OCR Quality Assessment

Getting equipped with data feels like Christmas for a digital humanist (this sentence actually got me thinking about the name of my blog and whether it would be better to change it to "lifeofadigitalhumanist" :-), but naah!). And so I was quite happy when we received a 6TB external hard drive with all the NZZ data from 1780 to 2017 on it. Within the impresso project (see www.impresso-project.ch) we work with texts until 1950, which in the case of the NZZ still amounts to 2TB of PDFs full of text and image data.

The external hard drive containing the NZZ newspapers.

So, the work of text mining 170 years of a historical newspaper could begin. Or so we thought. We realised very quickly that the OCRed text the NZZ so kindly delivered was not nearly as good as we had hoped. The quality of the images also leaves much to be desired. There are two main reasons for this: for one, as the NZZ approached its 225-year jubilee in 2005,...