Posts

Showing posts from April, 2018

Creating a Gold Standard for OCR Quality Assessment

Building a Gold Standard for NZZ OCR Quality Assessment

Getting equipped with data feels like Christmas for a digital humanist (this sentence actually got me thinking about the name of my blog and whether it would be better to change it to "lifeofadigitalhumanist" :-), but naah!). So I was quite happy when we received a 6TB external hard drive holding all the NZZ data from 1780 to 2017. Within the impresso project (see www.impresso-project.ch) we work with texts up to 1950, which in the case of the NZZ still amounts to 2TB of PDFs full of text and image data.

The external hard drive containing the NZZ newspapers.

So, the work of text mining 170 years of a historical newspaper could begin. Or so we thought. We realised very quickly that the OCRed text the NZZ so kindly delivered was not nearly as good as we had hoped. The quality of the images also leaves much to be desired. There are two main reasons for this: for one, as the NZZ approached its 225-year jubilee in 2005,