Extract glyphs and sort by reading order

Content extraction
2/14/2014

Downloads

The PDF format is designed to be a finished document after all the editing is done. It is intended for printing, viewing and interacting (such as clicking links or filling fields), but not for further editing. However the extraction of text is possible, although with a few caveats, which this article will address.

Text extraction: default ordering

The primary purpose of PDF is the presentation of information to humans and not to provide computer readable information. Most PDF documents do not contain layout information such as columns, tables and paragraphs. The textual content is basically just a collection of glyphs, each with its own X and Y coordinates. To extract text this collection must be ordered somehow, usually by sorting the glyphs on their X and Y position. This assumes that the glyphs on the top-left are the first and on the bottom right are the last in the reading order. This is how PDFKit.NET orders the glyphs as returned by page.Glyphs by default:

C# code sample

1 Document document = new Document(fileIn); 2 Page page = document.Pages[0]; 3 GlyphCollection glyphs = page.Glyphs; 4 glyphs.Sort(); 5 foreach (Glyph glyph in glyphs) 6 { 7 // glyphs sorted from top-left to bottom right here... 8 }

VB.NET

1 Dim document As New Document(fileIn) 2 Dim page As Page = document.Pages(0) 3 Dim glyphs As GlyphCollection = page.Glyphs 4 glyphs.Sort() 5 ' glyphs sorted from top-left to bottom right here... 6 For Each glyph As Glyph In glyphs 7 Next

In most cases, this gives the expected result. This fails however in case of superscript text like this:

superscript1.JPG

The problem here is that in both the text ‘abc’ and ‘xyz’ are sorted first on their vertical position. So what we get as result is:

superscript2.PNG

Although 'abc' comes after 'superscript' in the reading order, it is returned first.

Text extraction: custom ordering

To address this problem we can write a custom sorter that is used this way:

C# code sample

1 Document document = new Document(fileIn); 2 Page page = document.Pages[0]; 3 GlyphCollection glyphs = page.Glyphs; 4 var glyphSorter = new SortGlyphs(); 5 glyphs.Sort(glyphSorter, true); 6

VB.NET

1 Dim document As New Document(fileIn) 2 Dim page As Page = document.Pages(0) 3 Dim glyphs As GlyphCollection = page.Glyphs 4 Dim glyphSorter = New SortGlyphs() 5 glyphs.Sort(glyphSorter, True)

The implementation of SortGlyphs:

1 internal class SortGlyphs : IGlyphComparer 2 { 3 private int minimalPercentageForGlyphsToOverlapVertically = 40; // in percents 4 5 public struct glyphArea 6 { 7 public float X; 8 public float Y; 9 public float Width; 10 public float Height; 11 public bool HasIllegalValue; 12 } 13 14 // Find out whether glayp A or B should come first in the output 15 public int Compare(Glyph glyphA, Glyph glyphB) 16 { 17 var areaA = getGlyphArea(glyphA); 18 var areaB = getGlyphArea(glyphB); 19 20 // default compare result: 21 // If nothing special and not about on the same Y position, return: 22 // 1 if glyph A is positioned higher than glyph B, meaning A is first 23 // -1 if glyph A is below glyph B 24 var compare = (areaA.Y < areaB.Y) ? 1 : -1; // default: -1 = a < b, 1 = a > b 25 26 // Below, we account for characters with a negative height. Usually, this 27 // means that the characters are upside down. Just make sure that these characters 28 // are considered smaller than the rest, and equal amongst themselves. 29 if (areaA.HasIllegalValue) 30 { 31 compare = (areaB.HasIllegalValue) ? 0 : -1; 32 } 33 else if (areaA.HasIllegalValue) 34 { 35 compare = 1; 36 } 37 else 38 { 39 // If here, the glyphs can be compared by their X and Y position, 40 // First look if the glyphs are about on the same height 41 // by checking if there is any overlap... 42 var yOverlapTop = Math.Min(areaA.Y + areaA.Height, areaB.Y + areaB.Height); 43 var yOverlapBottom = Math.Max(areaA.Y, areaB.Y); 44 if (yOverlapTop > yOverlapBottom) 45 { 46 // Found that A and B have at least a bit of a vertical overlap: 47 // Now check if there is a a minimal amount to overlap or that they are just 48 // touching each other. 49 // Calculate the overlap in percents of the shortest one, in this way it is 50 // more likely to match the sub- and superscript glyphs with the rest of the line 51 var yOverlap = yOverlapTop - yOverlapBottom; 52 var yOverlapPercent = 100.00 * (yOverlap / Math.Min(areaA.Height, areaB.Height)); 53 if (yOverlapPercent > this.minimalPercentageForGlyphsToOverlapVertically) 54 { 55 // Overlapping enough: take a look at the x-position and mark the most 56 // left sided first 57 compare = (areaA.X > areaB.X) ? 1 : // a > b 58 (areaA.X < areaB.X) ? -1 : // a < b 59 (areaA.Width > areaB.Width) ? 1 : // a > b 60 (areaA.Width < areaB.Width) ? -1 : // a < b 61 0 ; // a == b 62 } 63 } 64 } 65 return compare; 66 } 67 68 // Get the area around the glyph, possibly changed for special characters 69 private static glyphArea getGlyphArea(Glyph glyph) 70 { 71 var area = new glyphArea() 72 { 73 X = glyph.BottomLeft.X, 74 Y = glyph.BottomLeft.Y, 75 Width = glyph.TopRight.X - glyph.TopLeft.X, 76 Height = glyph.TopRight.Y - glyph.BottomRight.Y, 77 HasIllegalValue = false 78 }; 79 area.HasIllegalValue = (area.Height < 0) || (area.Width < 0); 80 81 // This gives a possibily to tweak the dimensions of some special characters 82 // like the underscore '_'. 83 if (glyph.Characters.Length == 1) 84 { 85 switch (glyph.Characters[0]) 86 { 87 case '_': //underscore 88 float dY = (float)(glyph.TopLeft.Y - glyph.BottomLeft.Y); 89 dY *= 0.4f; //the correction will be 40%, so 60% (height) will be left. 90 break; 91 } 92 } 93 return area; 94 } 95 }

VB.NET

1 Friend Class SortGlyphs 2 Implements IGlyphComparer 3 Private minimalPercentageForGlyphsToOverlapVertically As Integer = 40 4 ' in percents 5 Public Structure glyphArea 6 Public X As Single 7 Public Y As Single 8 Public Width As Single 9 Public Height As Single 10 Public HasIllegalValue As Boolean 11 End Structure 12 13 ' Find out whether glayp A or B should come first in the output 14 Public Function Compare(glyphA As Glyph, glyphB As Glyph) As Integer Implements IGlyphComparer.Compare 15 Dim areaA = getGlyphArea(glyphA) 16 Dim areaB = getGlyphArea(glyphB) 17 18 ' default compare result: 19 ' If nothing special and not about on the same Y position, return: 20 ' 1 if glyph A is positioned higher than glyph B, meaning A is first 21 ' -1 if glyph A is below glyph B 22 Dim compare__1 = If((areaA.Y < areaB.Y), 1, -1) 23 ' default: -1 = a < b, 1 = a > b 24 ' Below, we account for characters with a negative height. Usually, this 25 ' means that the characters are upside down. Just make sure that these characters 26 ' are considered smaller than the rest, and equal amongst themselves. 27 If areaA.HasIllegalValue Then 28 compare__1 = If((areaB.HasIllegalValue), 0, -1) 29 ElseIf areaA.HasIllegalValue Then 30 compare__1 = 1 31 Else 32 ' If here, the glyphs can be compared by their X and Y position, 33 ' First look if the glyphs are about on the same height 34 ' by checking if there is any overlap... 35 Dim yOverlapTop = Math.Min(areaA.Y + areaA.Height, areaB.Y + areaB.Height) 36 Dim yOverlapBottom = Math.Max(areaA.Y, areaB.Y) 37 If yOverlapTop > yOverlapBottom Then 38 ' Found that A and B have at least a bit of a vertical overlap: 39 ' Now check if there is a a minimal amount to overlap or that they are just 40 ' touching each other. 41 ' Calculate the overlap in percents of the shortest one, in this way it is 42 ' more likely to match the sub- and superscript glyphs with the rest of the line 43 Dim yOverlap = yOverlapTop - yOverlapBottom 44 Dim yOverlapPercent = 100.0 * (yOverlap / Math.Min(areaA.Height, areaB.Height)) 45 If yOverlapPercent > Me.minimalPercentageForGlyphsToOverlapVertically Then 46 ' Overlapping enough: take a look at the x-position and mark the most 47 ' left sided first 48 ' a > b 49 ' a < b 50 ' a > b 51 ' a < b 52 ' a == b 53 compare__1 = If((areaA.X > areaB.X), 1, If((areaA.X < areaB.X), -1, If((areaA.Width > areaB.Width), 1, If((areaA.Width < areaB.Width), -1, 0)))) 54 End If 55 End If 56 End If 57 Return compare__1 58 End Function 59 60 ' Get the area around the glyph, possibly changed for special characters 61 Private Shared Function getGlyphArea(glyph As Glyph) As glyphArea 62 Dim area = New glyphArea() With { 63 .X = glyph.BottomLeft.X, 64 .Y = glyph.BottomLeft.Y, 65 .Width = glyph.TopRight.X - glyph.TopLeft.X, 66 .Height = glyph.TopRight.Y - glyph.BottomRight.Y, 67 .HasIllegalValue = False 68 } 69 area.HasIllegalValue = (area.Height < 0) OrElse (area.Width < 0) 70 71 ' This gives a possibily to tweak the dimensions of some special characters 72 ' like the underscore '_'. 73 If glyph.Characters.Length = 1 Then 74 Select Case glyph.Characters(0) 75 Case "_"c 76 'underscore 77 Dim dY As Single = CSng(glyph.TopLeft.Y - glyph.BottomLeft.Y) 78 dY *= 0.4F 79 'the correction will be 40%, so 60% (height) will be left. 80 Exit Select 81 End Select 82 End If 83 Return area 84 End Function 85 End Class

This sorter considers glyphs to be on the same line if they overlap 40% or more.

Multiple columns

As stated beFore, the text in a PDF document is just an unordered collection of glyphs. Extracting the text becomes much more difficult if the layout of the PDF is not ordered in a very simple way, e.g. the PDF can have multi columns. We do not try to solve these kind of problems, but we published an article on how to recognize columns.