Search text in PDF

Content extraction, Manipulate PDF
11/3/2011

This article shows how to search a PDF for text in C# using the Document.Find method and the TextFindCriteria and TextMatchEnumerator classes.

This code sample visualizes the results by saving a copy of the searched document with all matches highlighted and marked with the match position (first match, second match, etc.).

Resulting PDF page

The code sample below produces the following result:

[Image: search-text-in-pdf-using-c-sharp.png]

C# code sample - search text in PDF

You can search a single page by calling Page.Find, or the entire PDF document by calling Document.Find. Both methods have the same signature: they take a TextFindCriteria instance that defines the search criteria and return the matches as a TextMatchEnumerator instance. This code sample demonstrates Document.Find, but everything applies to Page.Find as well.
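As a minimal sketch of the per-page variant, the snippet below calls Page.Find on the first page only and counts the matches. It relies on the statement above that Page.Find has the same signature as Document.Find. The namespace names in the using directives and the zero-based `document.Pages[0]` indexer are assumptions that are not shown in this article, so check them against your version of the library.

```csharp
using System;
using System.IO;
using TallComponents.PDF;                 // assumed namespace for Document and Page
using TallComponents.PDF.TextExtraction;  // assumed namespace for the text search types

class PageSearchSketch
{
    static void Main()
    {
        // Same input file as the full sample in this article.
        using (FileStream fileIn = new FileStream(
            @"..\..\..\inputDocuments\PackingLightBrochure.pdf", FileMode.Open, FileAccess.Read))
        {
            Document document = new Document(fileIn);

            // Same criteria as the full sample: search for "PDF" with both flags off.
            TextFindCriteria criteria = new TextFindCriteria("PDF", false, false);

            // Assumption: Pages can be indexed by zero-based page number.
            Page firstPage = document.Pages[0];

            // Page.Find takes the same criteria and returns the same enumerator type.
            TextMatchEnumerator matches = firstPage.Find(criteria);

            int count = 0;
            foreach (TextMatch match in matches)
            {
                count++;
            }

            Console.WriteLine("Matches on the first page: " + count);
        }
    }
}
```

Searching one page at a time may be useful when only part of a large document is of interest.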

C# code sample

    // open PackingLightBrochure.
    using ( FileStream fileIn = new FileStream( @"..\..\..\inputDocuments\PackingLightBrochure.pdf", FileMode.Open, FileAccess.Read ) )
    {
        //create document
        Document document = new Document( fileIn );

        //criteria for searching "PDF", it does not have to match a whole word and the case also does not have to match
        TextFindCriteria criteria = new TextFindCriteria( "PDF", false, false );

        //get a list of search results
        TextMatchEnumerator enumerator = document.Find( criteria );

        int resultIndex = 0;

        foreach( TextMatch match in enumerator )
        {
            GlyphCollection glyphs = match.Glyphs;

            Glyph firstGlyph = glyphs[0];
            Glyph lastGlyph = glyphs[glyphs.Count - 1];

            //create a rectangle over the found text.
            RectangleShape rect = new RectangleShape(
                firstGlyph.BottomLeft.X,
                firstGlyph.BottomLeft.Y,
                lastGlyph.TopRight.X - firstGlyph.BottomLeft.X,
                lastGlyph.TopRight.Y - firstGlyph.BottomLeft.Y );

            rect.Brush = new SolidBrush( System.Drawing.Color.Yellow );
            rect.Pen = null; //the rectangle has no outline
            rect.Opacity = 128;

            //coordinates are in Overlay space, not VisualOverlay space.
            match.Page.Overlay.Add( rect );

            //now we create a textshape to print the result index onto.
            MultilineTextShape textShape = new MultilineTextShape();
            TranslateTransform translate = new TranslateTransform( firstGlyph.BottomLeft.X, firstGlyph.BottomLeft.Y );
            textShape.Transform = translate;

            textShape.Width = lastGlyph.TopRight.X - firstGlyph.BottomLeft.X;
            textShape.Height = lastGlyph.TopRight.Y - firstGlyph.BottomLeft.Y;
            translate.Y += textShape.Height; //correct different origin.
            textShape.HorizontalAlignment = HorizontalAlignment.Center;
            textShape.Opacity = 200;
            Fragment fragment = new Fragment( (++resultIndex).ToString(), 0 ); //0 means autosized.
            fragment.TextColor = RgbColor.Red;
            textShape.Fragments.Add( fragment );

            //add the text to the page and print it over the yellow rectangle
            match.Page.Overlay.Add( textShape );
        }

        //write the new PDF document
        using ( FileStream fileOut = new FileStream( @"..\..\searchtext.pdf", FileMode.Create, FileAccess.Write ) )
        {
            document.Write( fileOut );
        }
    }
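The sample above sizes the highlight from the bottom-left corner of the first glyph and the top-right corner of the last glyph, which fits a match that sits on a single line. As a hedged alternative, the sketch below computes the smallest rectangle that encloses every glyph of a match. `GlyphBox` and `MatchBounds` are hypothetical helper types written for this illustration; they are not part of the library, and you would fill them from each glyph's BottomLeft and TopRight points.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one glyph's extent; not a library type.
public readonly struct GlyphBox
{
    public GlyphBox(double left, double bottom, double right, double top)
    {
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }

    public double Left { get; }
    public double Bottom { get; }
    public double Right { get; }
    public double Top { get; }
}

public static class MatchBounds
{
    // Returns (x, y, width, height) of the smallest rectangle enclosing all boxes,
    // in the same argument order the sample passes to RectangleShape.
    public static (double X, double Y, double Width, double Height) Enclose(
        IReadOnlyList<GlyphBox> boxes)
    {
        if (boxes == null || boxes.Count == 0)
        {
            throw new ArgumentException("At least one glyph box is required.", nameof(boxes));
        }

        double left = boxes.Min(b => b.Left);
        double bottom = boxes.Min(b => b.Bottom);
        double right = boxes.Max(b => b.Right);
        double top = boxes.Max(b => b.Top);

        return (left, bottom, right - left, top - bottom);
    }
}
```

The four returned values can be passed on as the x, y, width and height of the highlight rectangle. Taking the minimum and maximum over all glyphs also covers glyphs that dip below or rise above the first and last ones.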

VB.NET code sample

    ' open PackingLightBrochure.
    Using fileIn As New FileStream("..\..\..\inputDocuments\PackingLightBrochure.pdf", FileMode.Open, FileAccess.Read)
        'create document
        Dim document As New Document(fileIn)

        Dim criteria As New TextFindCriteria("PDF", False, False)
        'criteria.Backwards = True

        Dim enumerator As TextMatchEnumerator = document.Find(criteria)

        Dim order As Integer = 0

        For Each match As TextMatch In enumerator
            Console.WriteLine("Match found at page: " & match.Page.Index)

            Dim glyphs As GlyphCollection = match.Glyphs

            Dim firstGlyph As Glyph = glyphs(0)
            Dim lastGlyph As Glyph = glyphs(glyphs.Count - 1)

            'create a rectangle over the found text.
            Dim rect As New RectangleShape(firstGlyph.BottomLeft.X, firstGlyph.BottomLeft.Y, lastGlyph.TopRight.X - firstGlyph.BottomLeft.X, lastGlyph.TopRight.Y - firstGlyph.BottomLeft.Y)

            rect.Brush = New SolidBrush(System.Drawing.Color.Yellow)
            rect.Pen = Nothing 'the rectangle has no outline
            rect.Opacity = 128

            'coordinates are in Overlay space, not VisualOverlay space.
            match.Page.Overlay.Add(rect)

            'print the order.
            Dim textShape As New MultilineTextShape()

            Dim translate As New TranslateTransform(firstGlyph.BottomLeft.X, firstGlyph.BottomLeft.Y)
            textShape.Transform = translate

            textShape.Width = lastGlyph.TopRight.X - firstGlyph.BottomLeft.X
            textShape.Height = lastGlyph.TopRight.Y - firstGlyph.BottomLeft.Y
            translate.Y += textShape.Height 'correct different origin.
            textShape.HorizontalAlignment = HorizontalAlignment.Center
            textShape.Opacity = 200
            Dim fragment As New Fragment((System.Threading.Interlocked.Increment(order)).ToString(), 0) '0 means autosized.
            fragment.TextColor = TallComponents.PDF.Colors.RgbColor.Red
            textShape.Fragments.Add(fragment)

            'add the text to the page and print it over the yellow rectangle
            match.Page.Overlay.Add(textShape)
        Next

        'write the new PDF document
        Using fileOut As New FileStream("..\..\searchtext.pdf", FileMode.Create, FileAccess.Write)
            document.Write(fileOut)
        End Using
    End Using