Extracting Text from PDFs

TET bindings for Python


If you are about to extract text from digitised archives, you hope to receive clean XML files, because extracting text from XML is an almost trivial undertaking. If you receive PDFs, on the other hand, things become a little more difficult. At least OCR has already been done, so the text has been recognised from the scanned images. For unknown reasons, though, the OCR output has not been saved in an XML format, with the text in one file and the corresponding image (usually a TIFF) in another, but in a PDF, where the text is stored in a stream alongside the image. Fortunately, there are libraries which allow you to extract this text stream from PDFs in an efficient manner. One such tool is PDFlib. Within the product range of PDFlib, you will find the Text and Image Extraction Toolkit, or TET for short. We have used its Python bindings to extract the text from the PDFs.

A PDF of the first issue of the NZZ. But how to get the text out of there?
Alright, let's get to work! An installation of TET comes with bindings for other programming languages like C++, .NET, or Java as well. We use the Python bindings in the following. The setup is quite easy, and there is also rather extensive documentation, which is required reading if you are looking for very specific options.

In general, the command line tool works in a very simple way:

tet --text /path/to/file

... and you're done. If you want a bit more control, however, you will need some more options. We walk you through a little example here, using TET v4.3 (the newest one is v5.1, I believe):

First, you set the environment variables straight. You can either do this globally or within the script itself; they must point to the Python bindings of TET and to the location where you have stored your tetlib_py.so file.

import sys

# make the TET Python bindings importable
sys.path.append('/path/to/TET/v4.3/bind/python')
sys.path.append('/path/to/TET/v4.3/bind/python/python33')
Now you can import TET.
from PDFlib.TET import *
The TET class is now available. In order to extract the text from a PDF document, you first create a new TET object.
pdf_obj = TET()
Next, you should set the licence key (if you have a paid version of TET; otherwise you can only work on documents of up to 10 pages, I believe).
pdf_obj.set_option('licensefile=/path/to/licence/file')
Now you open the PDF from which you want to extract the text. The TET object provides a method called 'open_document()' for this purpose. In order to open a document, simply pass the path to the PDF document as the first parameter. You can already define an option list as the second parameter; it may be empty, but it is required.
doc_handle = pdf_obj.open_document('/path/to/PDF', '')
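By the way, 'open_document()' returns -1 if the document could not be opened (at least according to the TET documentation), so it pays to check the handle before you continue. A minimal sketch, with the error-reporting calls borrowed from the sample scripts that ship with TET:

doc_handle = pdf_obj.open_document('/path/to/PDF', '')
if doc_handle == -1:
    # get_errnum(), get_apiname() and get_errmsg() report the last error
    raise RuntimeError('Error %d in %s(): %s' %
                       (pdf_obj.get_errnum(), pdf_obj.get_apiname(), pdf_obj.get_errmsg()))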
On success, this gives you a document handle with which you can process the document in subsequent steps. But back to the option list: if you want to generate TETML output, you need to say so in the second function parameter. An option list basically consists of key-value pairs, which you assign with the help of the equals sign:
'key=value'
You can also define more than one option, in which case you need to separate different options with a blank space.
'key1=value1 key2=value2'
If a value itself consists of further key-value pairs, you need a so-called nested option list, using curly brackets as grouping symbols. If you want to configure your TETML output, for example, you can do this like so:
'tetml={elements={docinfo=true docxmp=true options=true}}'
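Such a nested option list is then simply passed as the second parameter of 'open_document()', along the lines of:

doc_handle = pdf_obj.open_document('/path/to/PDF', 'tetml={elements={docinfo=true docxmp=true options=true}}')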
Option lists can get quite long, as one example with which we experimented shows:
'layouthint={header=true} layouteffort=extra contentanalysis={bidilevel=ltr includeboxorder=1 linespacing=small} docstyle=papers layoutanalysis={layoutdetect=2 layoutrowhint={full separation=thick}} granularity=word'
So, what does this option list say? We used it to process a page, which we describe below. First, though, we would like to point out what the different parameters do:
The 'layouthint' option can be useful if we know something about the general structure of a document. If we want to indicate header sections, for example, we can set the 'header' keyword to true.
By setting 'layouteffort' to 'extra' we lose some performance, but the layout detection should work more precisely and more consistently (although the documentation does not say what exactly will be done).
With the 'contentanalysis' option we define several criteria: 'bidilevel=ltr' says we only deal with text written from left to right, including information about box order might help preserve the order of the text, and setting the line spacing to 'small' should enhance the detection of text rows.
TET comes with several preconfigured parameter sets for different document styles ('docstyle'). We chose 'papers', since we are dealing with newspapers and journals, and we expect to profit from this setting because it will probably be better at detecting columns.
There is more: you can also fine-tune the layout analysis via the 'layoutanalysis' option, as we did above with 'layoutdetect' and 'layoutrowhint'.
And last but not least, you need to define with 'granularity' what exactly you want to extract. Is it only words? Lines? Glyphs? Or just the whole page as one text? This parameter determines how your output will be structured.
But back to text extraction. In order to process a page, you use either the function 'open_page()' or 'process_page()'. Both functions expect similar input; as far as we can tell from the documentation, 'open_page()' is meant for programmatic extraction (with 'get_text()'), while 'process_page()' is meant for producing TETML output. A call of such a function could look as follows:
process_handle = pdf_obj.process_page(doc_handle, 1, 'docstyle=papers')
Depending on which route you took, you can now extract the text either with 'get_text()' (after 'open_page()') or with 'get_xml_data()' (after 'process_page()'; deprecated in v5.1).
One word of "caution": Certain options cannot be combined!
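To tie the pieces together, here is a minimal end-to-end sketch, closely modelled on the 'extractor' sample that ships with TET (paths and option lists are placeholders; adapt them to your setup). 'get_text()' returns one chunk per call, according to the chosen granularity, and None once the page is exhausted:

from PDFlib.TET import *

pdf_obj = TET()
try:
    doc_handle = pdf_obj.open_document('/path/to/PDF', '')
    if doc_handle == -1:
        raise RuntimeError(pdf_obj.get_errmsg())
    # query the number of pages via the built-in pCOS interface
    n_pages = int(pdf_obj.pcos_get_number(doc_handle, 'length:pages'))
    for page_no in range(1, n_pages + 1):
        page_handle = pdf_obj.open_page(doc_handle, page_no, 'granularity=word')
        # one chunk (here: one word) per call, None when the page is done
        text = pdf_obj.get_text(page_handle)
        while text is not None:
            print(text)
            text = pdf_obj.get_text(page_handle)
        pdf_obj.close_page(page_handle)
    pdf_obj.close_document(doc_handle)
except TETException as exc:
    print('TET exception: %s' % exc)
finally:
    pdf_obj.delete()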

The result

So, what do you get in the end? Ideally, text, which could look like this:
Extracted text from NZZ using TET command line tool
You can see there are many OCR errors in there, which is a different kind of problem. However, for tools which highlight search terms in images, text-only output is not enough: you also need coordinates. A sample output as generated with the TET Python bindings could look like this:
TET output using Python bindings
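If you need coordinates programmatically rather than via TETML, TET also offers 'get_char_info()', which returns one record per glyph of the chunk most recently fetched with 'get_text()'. A rough sketch; the attribute names (x, y) follow the TET documentation for the char info object, but treat them as an assumption to verify against your version:

# assumes pdf_obj and doc_handle as created above
page_handle = pdf_obj.open_page(doc_handle, 1, 'granularity=word')
word = pdf_obj.get_text(page_handle)
while word is not None:
    ci = pdf_obj.get_char_info(page_handle)  # first glyph of the current word
    if ci is not None:
        print('%s at x=%.2f y=%.2f' % (word, ci.x, ci.y))
    while ci is not None:  # skip the remaining glyphs of this word
        ci = pdf_obj.get_char_info(page_handle)
    word = pdf_obj.get_text(page_handle)
pdf_obj.close_page(page_handle)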
I have written a little script which should be relatively easy to use. Feel free to comment on this or to share your experiences.

ATTENTION: With v4.3, the XML output seems to be invalid. You can fix this by adding one more line of code to the script, which writes the missing closing tags to the output file, like so:

output_xml.write(b'</Pages>\n</Document>\n</TET>\n')
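In context, the v4.3 TETML route might look roughly like the sketch below. Take this with a grain of salt: the 'get_xml_data()' signature is given as I remember it from the v4 manual, and whether it returns bytes or str may depend on the binding, so verify both against your installation.

# v4.3-style sketch; get_xml_data() is deprecated in v5.1
doc_handle = pdf_obj.open_document('/path/to/PDF', 'tetml={elements={docinfo=true}}')
n_pages = int(pdf_obj.pcos_get_number(doc_handle, 'length:pages'))
for page_no in range(1, n_pages + 1):
    pdf_obj.process_page(doc_handle, page_no, 'granularity=word')
with open('output.tetml', 'wb') as output_xml:
    output_xml.write(pdf_obj.get_xml_data(doc_handle, ''))
    # workaround for v4.3: append the missing closing tags by hand
    output_xml.write(b'</Pages>\n</Document>\n</TET>\n')
pdf_obj.close_document(doc_handle)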
The next step will be to find out which settings (or option lists in this case) work best for our collection. So, stay tuned!


