
1st Workshop at EPFL

Or shaping the future of newspaper archive digitisation

I admit this is a rather promising title which raises high expectations. But it is in line with the motto of the SNSF Sinergia programme, which enables researchers to produce breakthrough research. And we want to live up to this motto ;-). So there we were, on the magnificent EPFL campus on the beautiful morning of the 24th of October, ready for a two-day kick-off workshop.
view from the EPFL campus
This was in fact the first meet-up of the entire consortium, consisting of the research groups from EPFL, the C2DH lab from Luxembourg, the Institute of Computational Linguistics from the University of Zurich, the historians from the University of Lausanne, archivists from Le Temps, the NZZ, the Swiss and Luxembourg National Libraries, the Swiss Economic Archives, and the Archives of the Canton of Wallis, representatives from infoclio.ch, and other invited partners and friends. I hope I did not forget anybody :-).
impresso project consortium

So, what is there to expect when so many people with different backgrounds meet: lots of fruitful discussions, you would think. And thankfully, they really did occur! I have to confess that I was rather skeptical at the beginning about how things might turn out, although the workshop was carefully planned and prepared (with a lot of effort on Maud Ehrmann's side, which should be mentioned here!). But in the end I left Lausanne with many ideas, inspiration, and a drive that will hopefully last a long time. But let's give a short recap of the whole workshop.

Day 1 - Kick-off Meeting

The first session was dedicated to the project as such: what is its purpose? What do we want to achieve? What will we do? What will we NOT do? And so on and so forth. We aimed to get everybody on the same page and introduced the whole consortium so that everyone knows who is doing what in the project.

Day 1 - Clemens Neudecker

During the first part of the afternoon session, Clemens Neudecker shared the experiences he had gathered during the Europeana Newspapers project. Europeana Newspapers converted 10 million newspaper pages to full text and made them searchable, thereby providing better access to the archives for researchers and other interested users. Clemens pointed to issues with copyright, OCR and OLR, standardisation problems, and last but not least memory issues. In a short conversation after the workshop had closed, we further exchanged ideas and experiences, which we appreciated a lot!

Day 1 - The Data

In the next round, we introduced the data we had acquired, or rather identified. You can find an overview of some preliminary information in an earlier blog post. Every data provider furthermore elaborated a little more on their collection, and some also stated their expectations of this project. While some would already be satisfied with better OCR quality in their archive, others explicitly expressed the need for better search and browsing functionality. More concretely, some archives wish to have their data enriched with named entity recognition (NER) and to integrate a search for named entities in their interface.
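Just to make that wish a bit more concrete: below is a minimal sketch of what such an enrichment step could look like, here using spaCy and its small French model purely as an illustration. The sentence is made up, and the actual tools and models to be used in the project are still to be decided.

```python
import spacy

# Hypothetical example: load a pretrained French pipeline
# (assumes `fr_core_news_sm` has been installed via
#  `python -m spacy download fr_core_news_sm`)
nlp = spacy.load("fr_core_news_sm")

# A made-up sentence standing in for an OCRed newspaper passage
text = ("Le Conseil fédéral s'est réuni à Berne pour discuter "
        "du percement du tunnel du Gothard.")

doc = nlp(text)
for ent in doc.ents:
    # Each recognised entity comes with its surface form and a label
    # such as PER, LOC or ORG, which could be stored alongside the
    # article and indexed for entity search.
    print(ent.text, ent.label_)
```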

Day 1 - Historian's Round Table

Next up were the historians. In the late afternoon session they could voice their needs and expectations for a system that makes their historical research more efficient. We also planned to come up with a more narrowly defined set of sources to include in our core set and our extended set. However, the quite extensive list might have put the historians off a little, which is why these two sets still need to be defined. Nevertheless, it should eventually be the historians who have the final say on which data we are going to process for them.

Day 2 - State of the Art

On the morning of the second day we presented the state of the art in NLP and archive interfaces. How can collections be explored today? What is NER? What is the benefit of topic modeling? What role does multilinguality play in a collection? These were the questions we tried to answer in this session. Two major insights: many search interfaces do not provide everything that is needed. Plus, they sometimes look quite ugly ;-). And: topic modeling isn't as widespread as I assumed, ergo: lots of potential for research on our side ☺.
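For readers who have not come across topic modeling before, here is a tiny, self-contained sketch using gensim's LDA implementation on a handful of made-up snippets. It only illustrates the general idea of grouping co-occurring words into topics and says nothing about how we will apply it to the actual newspaper collections.

```python
from gensim import corpora, models

# Four made-up snippets standing in for OCRed newspaper articles
documents = [
    "federal council debates railway tunnel financing",
    "railway tunnel construction begins in the alps",
    "national exhibition opens in geneva with great success",
    "geneva exhibition attracts many visitors from abroad",
]

# Tokenise, then build a dictionary and a bag-of-words corpus
texts = [doc.split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a tiny LDA model with two topics
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=42)

# Show the most probable words per topic
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```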

Day 2 - Research Scenarios

The workshop was rounded off with what I believe to be the most valuable session of the whole kick-off meeting. The C2DH lab created six different research scenarios for the historians. The historians then had to explain how they would approach these research questions: what would they look for? How would they look for it? What tools would they use? How would they assess the quality of the results? And so on and so forth. The historians were the key players in this session, and we (especially people from the more technical side) learned a lot about the historians' research methodologies. Their input will be of great help in designing a tool they like and will hopefully want to use every day.

Final Thoughts

  • I envy the students of EPFL for their campus.
  • Go and see the exhibition of the Venice Time Machine.
  • The archivists contributed a lot of interesting ideas, too, and were very engaged in the discussions.
  • There are more and more projects on newspaper digitisation.
  • The following weekend with the return to standard time was desperately needed!
  • Last but not least, I would like to thank everybody for their participation in the workshop. I'm already looking forward to the next workshop in July 2018.
