
1st Workshop at EPFL

Or shaping the future of newspaper archive digitisation

I admit this is a rather promising title which raises high expectations. But it is in line with the motto of the SNSF Sinergia programme, which enables researchers to produce breakthrough research. And we want to live up to this motto ;-). So there we were, on the magnificent EPFL campus on the beautiful morning of the 24th of October, ready for a two-day kick-off workshop.
view from the EPFL campus
This was in fact the first meet-up of the entire consortium, consisting of the research groups from EPFL, the C2DH lab from Luxembourg, the Institute of Computational Linguistics from the University of Zurich, the historians from the University of Lausanne, archivists from Le Temps, the NZZ, the Swiss and Luxembourg National Libraries, the Swiss Economic Archives, and the Archives of the Canton of Wallis, representatives from infoclio.ch, and other invited partners and friends. I hope I did not forget anybody :-).
impresso project consortium

So, what is there to expect when so many people with different backgrounds meet: lots of fruitful discussions, you would think. And thankfully, they really did occur! I have to confess that I was rather skeptical at the beginning about how things might turn out, although the workshop was carefully planned and prepared (with a lot of effort on Maud Ehrmann's side, which should be mentioned here!). But in the end I left Lausanne with many ideas, inspiration, and a drive that will hopefully last a long time. But let's give a short recap of the whole workshop.

Day 1 - Kick-off Meeting

The first session was dedicated to the project as such: what is its purpose? What do we want to achieve? What will we do? What will we NOT do? And so on and so forth. We aimed to get everybody on the same page and introduced the whole consortium so that everyone knows who is doing what in the project.

Day 1 - Clemens Neudecker

During the first part of the afternoon session, Clemens Neudecker shared the experiences he had gathered during the Europeana Newspapers project. Europeana Newspapers converted 10 million newspaper pages to full text and made them searchable, thereby providing better access to the archives for researchers and other interested users. Clemens pointed to issues with copyright, OCR and OLR, standardisation problems, and last but not least memory issues. In a short conversation after the workshop had closed, we further exchanged ideas and experiences, which we appreciated a lot!

Day 1 - The Data

In the next round, we introduced the data we had acquired, or rather identified. You can find an overview of some preliminary information in an earlier blog post. Every data provider furthermore elaborated a little more on their collection, and some also stated their expectations of this project. While some would already be satisfied with better OCR quality in their archive, others explicitly expressed the need for better search and browsing functionality. More concretely, some archives wish to have their data enriched with named entity recognition (NER) and to integrate a search for named entities in their interface.
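Just to make that wish a bit more concrete: below is a minimal sketch of what such an enrichment step could look like, here using spaCy and its small French model purely as an illustration. The sentence is made up, and the actual tools and models to be used in the project are still to be decided.

```python
import spacy

# Hypothetical example: load a pretrained French pipeline
# (assumes `fr_core_news_sm` has been installed via
#  `python -m spacy download fr_core_news_sm`)
nlp = spacy.load("fr_core_news_sm")

# A made-up sentence standing in for an OCRed newspaper passage
text = ("Le Conseil fédéral s'est réuni à Berne pour discuter "
        "du percement du tunnel du Gothard.")

doc = nlp(text)
for ent in doc.ents:
    # Each recognised entity comes with its surface form and a label
    # such as PER, LOC or ORG, which could be stored alongside the
    # article and indexed for entity search.
    print(ent.text, ent.label_)
```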

Day 1 - Historian's Round Table

Next up were the historians. In the late afternoon session they could voice their needs and expectations for a system that makes their historical research more efficient. We also planned to come up with a more narrowly defined set of sources to include in our core set and our extended set. However, the quite extensive list might have put the historians off a little, which is why these two sets still need to be defined. Nevertheless, it should eventually be the historians who have the final say on which data we are going to process for them.

Day 2 - State of the Art

On the morning of the second day we presented the state of the art in NLP and archive interfaces. How can collections be explored today? What is NER? What is the benefit of topic modeling? What role does multilinguality play in a collection? These were the questions we tried to answer in this session. Two major insights: many search interfaces do not provide everything that is needed. Plus, they sometimes look quite ugly ;-). And: topic modeling isn't as widespread as I assumed, ergo: lots of potential for research on our side ☺.
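For readers who have not come across topic modeling before, here is a tiny, self-contained sketch using gensim's LDA implementation on a handful of made-up snippets. It only illustrates the general idea of grouping co-occurring words into topics and says nothing about how we will apply it to the actual newspaper collections.

```python
from gensim import corpora, models

# Four made-up snippets standing in for OCRed newspaper articles
documents = [
    "federal council debates railway tunnel financing",
    "railway tunnel construction begins in the alps",
    "national exhibition opens in geneva with great success",
    "geneva exhibition attracts many visitors from abroad",
]

# Tokenise, then build a dictionary and a bag-of-words corpus
texts = [doc.split() for doc in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit a tiny LDA model with two topics
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      passes=10, random_state=42)

# Show the most probable words per topic
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```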

Day 2 - Research Scenarios

The workshop was rounded off with what I believe to be the most valuable session of the whole kick-off meeting. The C2DH lab created six different research scenarios for the historians. The historians then had to explain how they would approach these research questions: what would they look for? How would they look for it? What tools would they use? How would they assess the quality of the results? And so on and so forth. The historians were the key players in this session, and we (especially people from the more technical side) learned a lot about the historians' research methodologies. Their input will be of great help in designing a tool they like and will hopefully want to use every day.

Final Thoughts

  • I envy the students of EPFL for their campus.
  • Go and see the exhibition of the Venice Time Machine.
  • The archivists contributed a lot of interesting ideas, too, and were very engaged in the discussions.
  • There are more and more projects on newspaper digitisation.
  • The following weekend with the return to standard time was desperately needed!
  • Last but not least, I would like to thank everybody for their participation in the workshop. I'm already looking forward to the next workshop in July 2018.
