Convert PDF to plain text

Manipulate PDF, Content extraction
11/2/2011

The following code sample shows how to convert the collection of glyphs on a PDF page to a text string. The algorithm infers spaces and line breaks from the glyph positions, and it handles glyphs that overlap to create visual effects.

Code sample to convert PDF to plain text

C# code sample

// Requires System.IO (FileStream, StreamWriter) in addition to the PDF library's namespaces.
using (FileStream fileIn = new FileStream(@"..\..\..\inputdocuments/sometext.pdf", FileMode.Open, FileAccess.Read))
{
    Document document = new Document(fileIn);

    // Get the first page.
    Page page = document.Pages[0];

    // Retrieve all glyphs from the current page.
    // Keep a strong reference to the glyph collection; otherwise the GC may reclaim it.
    GlyphCollection glyphs = page.Glyphs;

    // By default, the glyphs are ordered as they appear in the PDF file.
    // We want them in reading order.
    glyphs.Sort();

    using (FileStream fileOut = new FileStream(@"..\..\extractedText.txt", FileMode.Create, FileAccess.Write))
    {
        StreamWriter writer = new StreamWriter(fileOut);

        Glyph previousGlyph = null;

        foreach (Glyph glyph in glyphs)
        {
            // CheckSpaces returns the number of spaces to insert, or -1 for a line break.
            int spaces = CheckSpaces(previousGlyph, glyph);

            for (int i = 0; i < spaces; i++)
            {
                // Insert a space.
                writer.Write(" ");
            }

            if (spaces == -1)
            {
                // Insert a line break.
                writer.WriteLine();
            }

            // Write the characters of the glyph.
            foreach (char ch in glyph.Characters)
            {
                writer.Write(ch);
            }

            previousGlyph = glyph;
        }

        writer.Flush();
    }
}

VB.NET code sample

' Requires System.IO (FileStream, StreamWriter) in addition to the PDF library's namespaces.
Using fileIn As New FileStream("..\..\..\inputdocuments/sometext.pdf", FileMode.Open, FileAccess.Read)
    Dim document As New Document(fileIn)

    ' Get the first page.
    Dim page As Page = document.Pages(0)

    ' Retrieve all glyphs from the current page.
    ' Keep a strong reference to the glyph collection; otherwise the GC may reclaim it.
    Dim glyphs As GlyphCollection = page.Glyphs

    ' By default, the glyphs are ordered as they appear in the PDF file.
    ' We want them in reading order.
    glyphs.Sort()

    Using fileOut As New FileStream("..\..\extractedText.txt", FileMode.Create, FileAccess.Write)
        Dim writer As New StreamWriter(fileOut)

        Dim previousGlyph As Glyph = Nothing

        For Each glyph As Glyph In glyphs
            ' CheckSpaces returns the number of spaces to insert, or -1 for a line break.
            Dim spaces As Integer = CheckSpaces(previousGlyph, glyph)

            For i As Integer = 0 To spaces - 1
                ' Insert a space.
                writer.Write(" ")
            Next

            If spaces = -1 Then
                ' Insert a line break.
                writer.WriteLine()
            End If

            ' Write the characters of the glyph.
            For Each ch As Char In glyph.Characters
                writer.Write(ch)
            Next

            previousGlyph = glyph
        Next

        writer.Flush()
    End Using
End Using

C# code sample

// Some PDF files contain no space characters. Instead of one string "word1 word2",
// the page holds two strings "word1" and "word2", where word2 is simply placed
// further along the line to simulate a " ".
// To account for this, this function compares the positions of two consecutive glyphs.
// It returns the number of spaces to insert, or -1 if a line break is needed.
static int CheckSpaces(Glyph firstGlyph, Glyph secondGlyph)
{
    if (firstGlyph == null)
    {
        // This is the first glyph, so there is nothing to compare it with.
        return 0;
    }

    if (firstGlyph.BottomLeft.Y != secondGlyph.BottomLeft.Y)
    {
        // The glyphs are not on the same line (-1 is converted to a line break).
        return -1;
    }

    double spaceBetween = secondGlyph.BottomLeft.X - firstGlyph.BottomRight.X;

    if (spaceBetween < 0.1)
    {
        // Overlapping or almost touching text: no space.
        return 0;
    }

    // Width of a single space in the font and size of the first glyph.
    double spaceLength = firstGlyph.Font.CalculateWidth(" ", firstGlyph.FontSize);

    double spaces = spaceBetween / spaceLength;

    return (int)Math.Round(spaces);
}

VB.NET code sample

' Some PDF files contain no space characters. Instead of one string "word1 word2",
' the page holds two strings "word1" and "word2", where word2 is simply placed
' further along the line to simulate a " ".
' To account for this, this function compares the positions of two consecutive glyphs.
' It returns the number of spaces to insert, or -1 if a line break is needed.
Private Function CheckSpaces(firstGlyph As Glyph, secondGlyph As Glyph) As Integer
    If firstGlyph Is Nothing Then
        ' This is the first glyph, so there is nothing to compare it with.
        Return 0
    End If

    If firstGlyph.BottomLeft.Y <> secondGlyph.BottomLeft.Y Then
        ' The glyphs are not on the same line (-1 is converted to a line break).
        Return -1
    End If

    Dim spaceBetween As Double = secondGlyph.BottomLeft.X - firstGlyph.BottomRight.X

    If spaceBetween < 0.1 Then
        ' Overlapping or almost touching text: no space.
        Return 0
    End If

    ' Width of a single space in the font and size of the first glyph.
    Dim spaceLength As Double = firstGlyph.Font.CalculateWidth(" ", firstGlyph.FontSize)

    Dim spaces As Double = spaceBetween / spaceLength

    Return CInt(Math.Round(spaces))
End Function
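
Tolerance-based variant (C# sketch)

The CheckSpaces function above compares the Y coordinates of two glyphs for exact equality, so glyphs whose baselines differ by a tiny fraction of a point would be treated as separate lines. The following is a minimal, library-independent sketch of the same gap-to-spaces arithmetic with a small vertical tolerance and a guard against a zero space width. SimpleGlyph, GlyphText, CountSpaces, Join and the 0.5 point LineTolerance are hypothetical names and values chosen for illustration; they are not part of the PDF library, and the tolerance may need tuning per document. The record syntax assumes C# 9 or later.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a positioned glyph; NOT a type from the PDF library.
// SpaceWidth is the width of a space in the glyph's font and size.
public sealed record SimpleGlyph(string Text, double Left, double Right, double Baseline, double SpaceWidth);

public static class GlyphText
{
    // Assumed tolerance (in points) for treating two baselines as the same line.
    public const double LineTolerance = 0.5;

    // Returns -1 for a line break, otherwise the number of spaces to insert.
    public static int CountSpaces(SimpleGlyph? previous, SimpleGlyph current)
    {
        if (previous is null)
        {
            return 0; // first glyph: nothing to compare with
        }

        if (Math.Abs(previous.Baseline - current.Baseline) > LineTolerance)
        {
            return -1; // baselines differ by more than the tolerance
        }

        double gap = current.Left - previous.Right;

        if (gap < 0.1 || previous.SpaceWidth <= 0)
        {
            return 0; // overlapping text, or no usable space width
        }

        return (int)Math.Round(gap / previous.SpaceWidth);
    }

    // Joins glyphs that are already in reading order into one string.
    public static string Join(IEnumerable<SimpleGlyph> glyphs)
    {
        var text = new StringBuilder();
        SimpleGlyph? previous = null;

        foreach (SimpleGlyph glyph in glyphs)
        {
            int spaces = CountSpaces(previous, glyph);

            if (spaces == -1)
            {
                text.Append('\n');
            }
            else
            {
                text.Append(' ', spaces);
            }

            text.Append(glyph.Text);
            previous = glyph;
        }

        return text.ToString();
    }

    public static void Main()
    {
        var glyphs = new List<SimpleGlyph>
        {
            new SimpleGlyph("Hello", 0, 25, 700, 2.5),
            new SimpleGlyph("world", 27.5, 52, 700.2, 2.5), // gap of one space width, baseline within tolerance
            new SimpleGlyph("Next", 0, 20, 688, 2.5),       // lower baseline: new line
        };

        Console.WriteLine(Join(glyphs));
    }
}
```

With the sample data in Main, the sketch prints "Hello world" on the first line and "Next" on the second. The same tolerance idea could be applied to the library-based CheckSpaces by replacing the exact Y comparison with a Math.Abs difference check.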