
Data, data, data ...

Data acquisition phase has started

As is true for every project, I assume, the data we will be working with is of vital importance. Our aim in the impresso project is to digitise and process various newspapers from the late 18th to the 20th century (200 years in total) so that historians can explore the data in previously unseen and unknown ways. In doing so, we will provide new tools for research in history, and who knows, maybe we will also take a leading role in revolutionising how historical research is done in a few years ;-).

But again, none of this will be possible without data. Already during the project application phase, we contacted libraries in Switzerland and Luxembourg and asked them whether they had data available for our project. Many institutions gladly followed our call and committed themselves to assisting us as best they can. We are very happy that, for example, both the National Library of Luxembourg and its Swiss counterpart are on board! Together with some smaller archives, they have compiled a list of ca. 750 newspapers and periodicals which we need to consider.

If we were to include all these publications in our final tool, we would have to deal with roughly 56 billion words, and that is only for the newspapers written in German. Although it would be tempting to simply include every newspaper in our source database, the question is also how valuable a resource is for a historian. This is why, in a next phase, we need to decide on the set of publications we want to work with. In this undertaking the historians will have the final say; their opinion is therefore very important to us.

Before the historians can decide on a selection of the most valuable newspapers, however, we need some statistics. After all, including only minor publications with a limited publication period is of no use to the historians. One of our visualisation pros, Thijs van Beek, has taken the data the archives provided and built a small visualisation tool. With it you can examine all available sources, their sizes, languages, and origins. The data is not complete in all cases, but the visualisation sums it up nicely, I think.

You can find the interactive version here: http://impressostats.midasweb.lu/
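For the curious, the kind of per-language summary such a visualisation builds on could be computed from a metadata export along these lines. This is only a minimal sketch: the file name and column names are assumptions for illustration, not the actual format the archives delivered.

```python
# Minimal sketch: aggregate newspaper metadata per language.
# The CSV name and columns (title, language, start_year, end_year, n_issues)
# are hypothetical placeholders, not the real archive export.
import pandas as pd

df = pd.read_csv("newspaper_metadata.csv")  # one row per newspaper title

summary = (
    df.groupby("language")
      .agg(titles=("title", "count"),        # how many titles per language
           issues=("n_issues", "sum"),       # total number of issues
           earliest=("start_year", "min"),   # earliest publication year
           latest=("end_year", "max"))       # latest publication year
      .sort_values("issues", ascending=False)
)
print(summary)
```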

Choosing from more than 750 publications will be hard. Since multiple institutions and partners collaborate in this project, everyone has different needs and requirements. For these reasons, the data selection will happen at the first workshop next week, which will take place at EPFL in Lausanne. All the collaborators will be present and will debate a core set and an extended set of sources. I'm looking forward to interesting discussions with computational linguists, archivists, visualisers, historians, and digital humanists.

After the workshop, we will finally know what data we will be working with for the next three years. So, next week is when the work really begins ;-).

See you next time!


