To find text in a document, the content must be scanned in the correct reading order. This article looks at how that order can be recovered from a PDF page. For example, take a look at the following piece of text:
We will read it as follows:
"The National Transportation Safety Board said a 12-member team ..."
We definitely do not read it like this:
"The National Transpor- House Speaker John tation Safety Board said a Boehner said ..."
From this it is clear that text is by nature a sequential stream of information. To get correct search results, we have to preserve that sequence.
OK, it is quite easy for me as a human being to recognize the columns in the text. My recognition system has had a long time to practice this (though I still need practice at applying image filters directly in my brain). But what about machines? How does software recognize the layout structure of text within a PDF file?
Most PDF documents do not store this layout information. If you look inside the document, you can barely distinguish columns, paragraphs, sentences or even words. You can, however, extract runs of characters and their coordinates within a page.
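To make this concrete, here is a minimal Python sketch of how word positions might be recovered from positioned glyphs. It does not use PDFKit.NET. The `Glyph` and `Word` types, the `group_into_words` function and the `max_gap` threshold are all hypothetical names chosen for illustration, and the sketch assumes that glyphs on the same line share exactly the same baseline.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline of the glyph
    width: float  # advance width of the glyph


@dataclass
class Word:
    text: str
    x: float      # left edge of the first glyph
    y: float


def _make_word(glyphs):
    # A word starts where its first glyph starts.
    text = "".join(g.char for g in glyphs)
    return Word(text, glyphs[0].x, glyphs[0].y)


def group_into_words(glyphs, max_gap=1.0):
    """Group glyphs into words (assumption: same line means same baseline).

    A new word starts at a whitespace glyph, at a baseline change, or when
    the horizontal gap to the previous glyph exceeds max_gap.
    """
    words = []
    current = []
    for g in sorted(glyphs, key=lambda g: (g.y, g.x)):
        if current:
            prev = current[-1]
            gap = g.x - (prev.x + prev.width)
            if g.char.isspace() or g.y != prev.y or gap > max_gap:
                words.append(_make_word(current))
                current = []
        if not g.char.isspace():
            current.append(g)
    if current:
        words.append(_make_word(current))
    return words
```

Only the `x` value of each resulting word is needed for the histograms discussed later. A real page would need a tolerance when comparing baselines and a gap threshold that scales with the font size.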
So how do we obtain information about the text structure? In this article we are going to build word position histograms in order to recognize columns.
Using histograms as a first guess at finding columns
How can we make the initial guess? Let’s take a look at where the words begin. All the words on the left edge of a column have the same X coordinate. Therefore it should be interesting to build a histogram that shows how often a word starts at each position along the horizontal axis.
To build the histogram I use PDFKit.NET, because it can extract all glyphs on a page. The glyphs give the position of each word, and from that data I construct the histogram. The picture below shows how the histogram corresponds to the page. The histogram clearly shows two peaks. These peaks are probably the left edges of the columns.
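The histogram itself is simple to compute once the word start positions are known. The following Python sketch is an illustration only and is not the PDFKit.NET project from this article. The function name `word_start_histogram`, the `bin_width` parameter and the sample coordinates are assumptions made up for the example.

```python
from collections import Counter


def word_start_histogram(word_starts, bin_width=1.0):
    """Count how many words begin in each horizontal bin.

    word_starts is an iterable of X coordinates (for example in points).
    The result maps a bin index to the number of words starting in it.
    """
    histogram = Counter()
    for x in word_starts:
        histogram[int(x // bin_width)] += 1
    return histogram


# Made-up sample: two columns with left edges near x=72 and x=306,
# plus a few words that start somewhere inside the columns.
sample_starts = [72.0, 72.2, 72.4, 110.5, 151.0, 306.0, 306.1, 306.3, 340.8]
histogram = word_start_histogram(sample_starts)

# Print a crude text rendering of the histogram, one row per non-empty bin.
for bin_index in sorted(histogram):
    print(f"{bin_index:4d} {'#' * histogram[bin_index]}")
```

A wider `bin_width` makes the peaks more robust against small positioning differences, at the cost of horizontal resolution.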
The histogram of the following page has six big peaks and we conclude that the page has six columns.
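Counting the big peaks can also be automated. The sketch below shows one possible approach, not necessarily the one used for the pictures in this article: a bin counts as a peak if it is a local maximum and reaches a given fraction of the tallest bin. The `find_peaks` function and the `min_fraction` threshold are hypothetical and would need tuning on real pages.

```python
def find_peaks(histogram, min_fraction=0.5):
    """Return the bin indices that look like column edges.

    histogram maps a bin index to a word count. A bin is reported when its
    count is at least min_fraction of the largest count and it is not
    smaller than its left neighbour and strictly larger than its right one.
    """
    if not histogram:
        return []
    threshold = max(histogram.values()) * min_fraction
    peaks = []
    for bin_index in sorted(histogram):
        count = histogram[bin_index]
        if count < threshold:
            continue  # too small, treat it as noise
        left = histogram.get(bin_index - 1, 0)
        right = histogram.get(bin_index + 1, 0)
        if count >= left and count > right:
            peaks.append(bin_index)
    return peaks


# Made-up histogram with two strong peaks and some low noise.
sample = {72: 40, 73: 3, 120: 2, 150: 4, 306: 35, 310: 5}
print(find_peaks(sample))  # prints [72, 306]
```

The number of peaks found is then the first guess at the number of columns.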
In fact, the histogram shows more than just the number of columns. Notice that there is almost no noise between the first peak and the second one.
That happens because the first column doesn’t overlap with the second one.
However, there is some noise distributed over the remaining five columns. Moreover, the noise between the second and the fifth column is at almost the same level throughout.
This noise is caused by the text under the picture, which is not divided into columns. The same can be observed from the noise between the fifth and sixth columns.
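The noise level between neighbouring peaks can be measured as well. The following Python sketch is one rough way to do it, under the assumption that the histogram is a mapping from bin index to word count and that the peak positions are already known. The function name `noise_between_peaks` and the sample numbers are invented for the example.

```python
def noise_between_peaks(histogram, peaks):
    """Return the average count per bin between each pair of adjacent peaks.

    histogram maps a bin index to a word count, peaks is a sorted list of
    bin indices. The peak bins themselves are excluded from the average.
    """
    levels = []
    for left, right in zip(peaks, peaks[1:]):
        width = right - left - 1  # number of bins strictly between the peaks
        total = sum(histogram.get(b, 0) for b in range(left + 1, right))
        levels.append(total / width if width > 0 else 0.0)
    return levels


# Made-up data: a clean gap after the first peak, noise after the second.
sample = {10: 50, 40: 45, 45: 4, 50: 4, 55: 4, 70: 48}
levels = noise_between_peaks(sample, [10, 40, 70])
print(levels)
```

A gap with a level close to zero suggests a clean column boundary, while a similar non-zero level across several gaps hints at text that runs across those columns.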
Download this Visual Studio .NET project to construct histograms yourself.