Showing posts from June, 2018

About Topic Modeling on Historical Newspapers ...

... and where it fails

Image 1: An article about the premiere of Dürrenmatt's play in the Netherlands.
Image 2: An article about the situation in the Netherlands during WWII.

Let’s assume a historian wants to investigate the Battle of Arnhem and would like to use newspaper texts as her primary source. She enters the query “Arnhem” into a search engine and finds the two articles described above. After a quick inspection, she deems the first one irrelevant to her research, whereas the second one looks important.

If we wanted to assign labels to the two texts, we would probably choose something like “arts” for the first and “war” for the second. The indicators for these labels are the words that appear in the articles. In the first one we find words like “théatres”, “comédie”, “pièce”, and “œuvre”, which we associate with “arts”. In the second one, words like “divisions”, “forces”, “combats”, and “troupes” give the impression of a...