Extract text from a PDF

12/1/2015 By Hans

Extract text from a PDF document

The purpose of PDF is to provide information that is readable by humans. It goes to great lengths to provide documents with very clear typography and graphics.

A PDF formats information so that it can be printed or shown on a screen, for a human to read or interact with. It is not designed to format information in a way that a computer can easily read. So if you have a PDF document and you want to extract the text from it, things can get a bit complicated. One might expect that the text is always present somewhere in a PDF document, so that the only problem is knowing where to look. However, this is not the case.

pdf-raw-bytes.png

Just a bunch of fragments

pdf-text-fragments.png

The text is best seen as a bunch of small fragments scattered across a page. Do not expect them to be stored in an order that makes semantic sense. Furthermore, each fragment is in fact one or more glyph-ids together with a location on the page (a glyph-id is a number that identifies how the glyph must be drawn). These numbers do not necessarily have any relation to the Unicode values that you want, because they only describe what the text looks like to a human.

Extracting and sorting

In order to extract the text so it can be processed by a computer, there are two steps to take:

1. Firstly, the glyph-ids must be converted to character-ids. Some TALLcomponents products will do that for you.

2. Secondly, the characters must be sorted so that the text is extracted in the right order. A good start is to sort them first from top to bottom and then from left to right. Our code samples include some examples of this.

This is a good start for text that starts at the top left and is read towards the bottom right. For any other reading order, the sorting algorithm must be changed (which is not that difficult).
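As a rough illustration of the two steps, here is a minimal Python sketch. It is not TALLcomponents code: the `Fragment` class, the `GLYPH_TO_CHAR` table and the function names are made up for this example. It assumes that y grows upwards (as in the default PDF coordinate system) and that fragments on the same line share exactly the same y, which real documents often do not.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    glyph_ids: List[int]  # glyph numbers as found on the page
    x: float              # left edge of the fragment
    y: float              # baseline position; a larger y is higher on the page


# Hypothetical glyph-to-character table. A real table depends on the font.
GLYPH_TO_CHAR = {1: "H", 2: "e", 3: "l", 4: "o", 5: "W", 6: "r", 7: "d"}


def decode(fragment: Fragment) -> str:
    """Step 1: convert glyph-ids to characters (unknown glyphs become U+FFFD)."""
    return "".join(GLYPH_TO_CHAR.get(g, "\ufffd") for g in fragment.glyph_ids)


def extract_text(fragments: List[Fragment]) -> str:
    """Step 2: sort top to bottom, then left to right, and join per line."""
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))
    lines: List[List[str]] = []
    current_y = None
    for fragment in ordered:
        if fragment.y != current_y:  # a new y value starts a new line
            lines.append([])
            current_y = fragment.y
        lines[-1].append(decode(fragment))
    return "\n".join(" ".join(words) for words in lines)


fragments = [
    Fragment([5, 4, 6, 3, 7], x=60, y=700),  # "World", stored first
    Fragment([1, 2, 3, 3, 4], x=10, y=700),  # "Hello"
    Fragment([1, 2, 3, 3, 4], x=10, y=680),  # "Hello" on the next line
]
print(extract_text(fragments))
```

For a right-to-left or bottom-to-top reading order, only the sort key in `extract_text` would need to change.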

Superscript and subscript

But there are more problems, and one of them is superscript and subscript:

pdf_superscript.png

If these are just sorted vertically and horizontally, then the result would be "Hello note1 World", while it should have been "Hello World note1".

There is a code sample about extracting and sorting glyphs that solves this. The code examines how much a glyph overlaps vertically with the text on the baseline, and decides from that whether the glyph belongs to that line. The required overlap is a fraction of the height of the text on the baseline.
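The idea can be sketched in a few lines of Python. This is not the code sample mentioned above, just an illustration under assumptions: the `Box` class, the function names and the threshold of 0.25 are invented for this example, and the tallest boxes on a line are assumed to be the baseline text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    text: str
    x: float       # left edge
    bottom: float  # lower edge; y grows upwards
    top: float     # upper edge


def overlap_ratio(box: Box, line_bottom: float, line_top: float) -> float:
    """Vertical overlap as a fraction of the height of the line's text."""
    height = line_top - line_bottom
    if height <= 0:
        return 0.0
    overlap = min(box.top, line_top) - max(box.bottom, line_bottom)
    return max(0.0, overlap) / height


def group_into_lines(boxes: List[Box], min_ratio: float = 0.25) -> List[str]:
    lines = []  # each entry: {"bottom": ..., "top": ..., "boxes": [...]}
    # Tallest boxes first, so each line is anchored by its baseline text.
    for box in sorted(boxes, key=lambda b: b.top - b.bottom, reverse=True):
        for line in lines:
            if overlap_ratio(box, line["bottom"], line["top"]) >= min_ratio:
                line["boxes"].append(box)
                break
        else:
            lines.append({"bottom": box.bottom, "top": box.top, "boxes": [box]})
    lines.sort(key=lambda line: -line["top"])  # top to bottom
    return [
        " ".join(b.text for b in sorted(line["boxes"], key=lambda b: b.x))
        for line in lines
    ]


boxes = [
    Box("note1", x=75, bottom=107, top=113),  # small, raised superscript
    Box("Hello", x=10, bottom=100, top=112),
    Box("World", x=45, bottom=100, top=112),
    Box("Next", x=10, bottom=84, top=96),
]
print(group_into_lines(boxes))
```

Here the superscript overlaps the first line by 5 units out of a line height of 12, which is above the threshold, so it is kept on that line and placed by its x position.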

Flat low characters

pdf_low_characters.png

The same problem exists for flat low characters like the underscore. When sorted, these characters may end up in the next line, while they are actually positioned in between the words of the current line. This is also handled in the code sample, in which the flat low characters are recognized and their height is modified so that they are sorted correctly.
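A small Python sketch of that height adjustment could look like this. It is an assumption of how such a fix might work, not the code sample itself: the `FLAT_LOW_CHARS` set, the `font_size` field and the function names are hypothetical.

```python
from dataclasses import dataclass, replace

# Hypothetical set of characters treated as "flat and low".
FLAT_LOW_CHARS = {"_"}


@dataclass(frozen=True)
class Box:
    text: str
    x: float
    bottom: float     # lower edge; y grows upwards
    top: float        # upper edge
    font_size: float  # assumed to be known for every glyph


def normalize_flat_low(box: Box) -> Box:
    """Stretch a flat low character upwards to a nominal text height."""
    if box.text in FLAT_LOW_CHARS:
        return replace(box, top=box.bottom + box.font_size)
    return box


def vertical_overlap(a: Box, b: Box) -> float:
    return max(0.0, min(a.top, b.top) - max(a.bottom, b.bottom))


word = Box("some", x=10, bottom=100, top=112, font_size=12)
underscore = Box("_", x=40, bottom=98, top=99, font_size=12)

print(vertical_overlap(underscore, word))                      # no overlap
print(vertical_overlap(normalize_flat_low(underscore), word))  # overlaps now
```

After the adjustment the underscore overlaps the neighbouring word vertically, so an overlap-based line grouping keeps it on the same line.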

Multiple columns

pdf-multi-columns.png

Extracting text gets really complicated when the document layout becomes more complex, for example when the text is wrapped over more than one column. A PDF document does not contain any information about columns, so it is hard to recognize them. This is not addressed in the code sample. For further reading, there is an article on this topic: Searching Text and Recognizing Columns. It also contains a code sample that detects these columns.
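To give an impression of what column detection involves, here is a simple heuristic in Python. It is not the method from the article mentioned above: it merely looks for wide horizontal gaps that no fragment covers. The function names, the tuple layout and the `min_gap` value are assumptions made for this sketch.

```python
from typing import List, Tuple


def find_column_gaps(
    x_ranges: List[Tuple[float, float]], min_gap: float = 20.0
) -> List[Tuple[float, float]]:
    """Return horizontal gaps of at least min_gap that no fragment covers."""
    gaps = []
    covered_until = None
    for left, right in sorted(x_ranges):
        if covered_until is None:
            covered_until = right
            continue
        if left - covered_until >= min_gap:
            gaps.append((covered_until, left))
        covered_until = max(covered_until, right)
    return gaps


def split_into_columns(
    fragments: List[Tuple[str, float, float]], min_gap: float = 20.0
) -> List[List[str]]:
    """fragments are (text, left, right) tuples, already in reading order."""
    gaps = find_column_gaps([(l, r) for _, l, r in fragments], min_gap)
    boundaries = [(start + end) / 2 for start, end in gaps]
    columns: List[List[str]] = [[] for _ in range(len(boundaries) + 1)]
    for text, left, _ in fragments:
        index = sum(1 for b in boundaries if left > b)
        columns[index].append(text)
    return columns


fragments = [
    ("A1", 50, 250), ("B1", 300, 500),
    ("A2", 50, 240), ("B2", 300, 480),
]
print(split_into_columns(fragments))
```

A heading that spans the full page width covers the gap and defeats this heuristic, which is one reason why real column detection needs more work.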

More

Many other things can make the extraction of text difficult, like mathematical formulas, rotated text, creative layouts and so on. These are not addressed in this blog post. For now it is sufficient to say that extracting text was never a design goal of the PDF specification, and therefore it can be complex to work around that.