Read and write PDF metadata

11/18/2016

This article shows how to read metadata from, and write metadata to, a PDF document using C#.

What is metadata

Metadata is data about the document, such as the author and the creation date. These are the standard PDF metadata properties. PDF also supports custom metadata, which can hold any kind of information, for example tracking information for a business process that involves the document. Metadata consists of name-value pairs grouped under a metadata schema. The values are strings. A value can be any string, but structured data is very commonly stored as XML, in which case the metadata schema holds a URL to the schema that describes the structure of the XML.

How to read metadata

Metadata in a PDF document is structured as follows:

  • A Document has a property MetadataSchemas of type MetadataSchemaCollection.
  • A MetadataSchemaCollection is a collection of MetadataSchema objects.
  • A MetadataSchema has the properties Prefix, NamespaceUri and Names.
  • A MetadataSchema holds a collection of name-value pairs.
  • The Names property contains the names of all name-value pairs that belong to the MetadataSchema.
  • Given a name, the corresponding value can be retrieved from the MetadataSchema.
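
The lookup path described in the list above (document, schema collection, schema, name, value) can be sketched as a small helper. This is only a sketch under stated assumptions, not part of the library: `MetadataLookup` and `FindValue` are hypothetical names, the `TallComponents.PDF` namespace is assumed to provide the types, `NamespaceUri` is assumed to be a string, and only values exposed as `MetadataString` are handled.

```csharp
using System;
using TallComponents.PDF; // assumption: namespace that provides Document, MetadataSchema and MetadataString

public static class MetadataLookup
{
    // Hypothetical helper: returns the string value stored under `name` in the
    // schema identified by `namespaceUri`, or null when nothing matches.
    public static string FindValue(Document document, string namespaceUri, string name)
    {
        foreach (MetadataSchema schema in document.MetadataSchemas)
        {
            // Assumption: NamespaceUri is a string.
            if (!string.Equals(schema.NamespaceUri, namespaceUri, StringComparison.Ordinal))
            {
                continue;
            }

            // Walk the Names collection so the indexer is only used with a known name.
            foreach (string candidate in schema.Names)
            {
                if (candidate == name)
                {
                    // Non-string values yield null here.
                    MetadataString metadataString = schema[candidate] as MetadataString;
                    return metadataString?.Value;
                }
            }
        }

        return null;
    }
}
```

Returning null for a missing schema, a missing name, or a non-string value keeps the helper simple; a caller that needs to tell these cases apart would want a richer result type.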

C# code sample that reads all the metadata from a PDF document

// Open the input PDF for reading.
using (FileStream inputStream = new FileStream(@"..\..\input.pdf", FileMode.Open, FileAccess.Read))
{
    Document document = new Document(inputStream);

    MetadataSchemaCollection metadataSchemas = document.MetadataSchemas;
    Console.WriteLine($"MetadataSchemas.Count = {metadataSchemas.Count}");

    // Each schema groups name-value pairs under a prefix and a namespace URI.
    foreach (MetadataSchema metadataSchema in metadataSchemas)
    {
        Console.WriteLine($"MetadataSchema: {metadataSchema.Prefix} --> {metadataSchema.NamespaceUri}");
        foreach (string name in metadataSchema.Names)
        {
            // Only string values are printed; any other value is reported as "unknown".
            string value = "unknown";
            MetadataString metadataString = metadataSchema[name] as MetadataString;
            if (metadataString != null)
            {
                value = metadataString.Value;
            }
            Console.WriteLine($"  - name = '{name}' value = '{value}'");
        }
    }
}
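
Because values are plain strings, a reader has to decide per value whether it holds structured XML or free text. The following is a minimal sketch of such a check, using `System.Xml.Linq` from the .NET base class library. `MetadataValueInspector` and `TryParseXml` are hypothetical names and are not part of the PDF library.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class MetadataValueInspector
{
    // Returns true and the parsed document when `value` is well-formed XML;
    // returns false for null, empty, or plain-text values.
    public static bool TryParseXml(string value, out XDocument parsed)
    {
        parsed = null;
        if (string.IsNullOrWhiteSpace(value))
        {
            return false;
        }

        try
        {
            parsed = XDocument.Parse(value);
            return true;
        }
        catch (XmlException)
        {
            // Not well-formed XML, so the value is treated as an opaque string.
            return false;
        }
    }
}
```

In the reading loop, `value` could be passed to `TryParseXml` before it is printed, so that XML values are handled as a tree and everything else as text.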

How to write metadata

Now that you know how to read metadata from a PDF, this section provides a code sample that shows how to write metadata to a PDF.

Sample custom metadata XML from DocuSign

<?xml version="1.0" encoding="UTF-8"?>
<DocuSignDocumentTemplate xmlns="http://www.docusign.net/API/3.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" canAddTabs="true">
   <Header>
      <Name>template</Name>
      <Version>1.0</Version>
...
   </Header>
   <Document>
      <Name>ADV_AGRMT_DOC_EX1</Name>
      <Source requireInstanceDoc="true">C:\Documents and Settings\a331760\Desktop\ADV_AGRMT_DOC_EX1.pdf</Source>
      <NumPages>1</NumPages>
      <Size>21050</Size>
   </Document>
   <Recipients>
      <Recipient>
...
         <RoleName>ESIG_PRIMARY_OWNER</RoleName>
         <RoutingOrder>1</RoutingOrder>
         <Type>Signer</Type>
         <RequireAccessCode>false</RequireAccessCode>
...
         <Tabs>
            <Tab>
               <PageNumber>1</PageNumber>
               <XPosition>88</XPosition>
               <YPosition>492</YPosition>
               <Type>SignHere</Type>
            </Tab>
         </Tabs>
      </Recipient>
...
   </Recipients>
</DocuSignDocumentTemplate>
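
An application that receives XML like the sample above as a metadata value will usually parse it to get at individual fields. The sketch below shows one way to do that with `System.Xml.Linq`. `TemplateTabs`, `Describe` and `SampleFragment` are hypothetical names, and the fragment is a trimmed version that keeps only elements shown in the sample. Because the sample declares a default namespace, every element name has to be qualified with that namespace in the query.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TemplateTabs
{
    // Namespace taken from the xmlns attribute of the sample XML.
    private static readonly XNamespace Ns = "http://www.docusign.net/API/3.0";

    // Trimmed, illustrative fragment containing a single tab.
    public const string SampleFragment =
        "<DocuSignDocumentTemplate xmlns=\"http://www.docusign.net/API/3.0\">" +
        "<Recipients><Recipient><Tabs><Tab>" +
        "<PageNumber>1</PageNumber><XPosition>88</XPosition>" +
        "<YPosition>492</YPosition><Type>SignHere</Type>" +
        "</Tab></Tabs></Recipient></Recipients>" +
        "</DocuSignDocumentTemplate>";

    // Returns one "Type@Page:(X,Y)" description per <Tab> element.
    public static List<string> Describe(string xml)
    {
        XDocument template = XDocument.Parse(xml);
        return template
            .Descendants(Ns + "Tab")
            .Select(tab => string.Format(
                "{0}@{1}:({2},{3})",
                (string)tab.Element(Ns + "Type"),
                (int)tab.Element(Ns + "PageNumber"),
                (int)tab.Element(Ns + "XPosition"),
                (int)tab.Element(Ns + "YPosition")))
            .ToList();
    }
}
```

Calling `TemplateTabs.Describe(TemplateTabs.SampleFragment)` yields a single entry, `SignHere@1:(88,492)`. The integer casts throw if a tab lacks one of the position elements, which is acceptable for a sketch but would need handling in production code.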

C# sample that writes metadata to a PDF document

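// Create a new document with a single A4 page.
Document document = new Document();

document.Pages.Add(new Page(PageSize.A4));

// Add a custom schema with the prefix "tc" and its namespace URI.
MetadataSchema metadataSchema = document.MetadataSchemas.Add("tc", @"http://test.tallcomponents.com/testschema/");
// Values are strings; the second value carries structured data as XML.
metadataSchema.Add("Songwriter", "Johansson");
metadataSchema.Add("Songs", "<songs><song>Close to my soul</song><song>Winter is over</song></songs>");

// Write the document, including the custom metadata, to disk.
using (FileStream outputStream = new FileStream(@"output.pdf", FileMode.Create, FileAccess.Write))
{
    document.Write(outputStream);
}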
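
The `Songs` value in the sample is a hand-typed XML string, which works for fixed text. When the titles come from user data, building the string with `System.Xml.Linq` keeps it well-formed, because special characters such as `&` are escaped. The following is a minimal sketch; `SongsXml` and `Build` are hypothetical names, and the result is meant to be passed as the second argument of `metadataSchema.Add`.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class SongsXml
{
    // Builds a <songs> element with one <song> child per title and
    // returns it as a single-line string without indentation.
    public static string Build(IEnumerable<string> titles)
    {
        XElement songs = new XElement(
            "songs",
            titles.Select(title => new XElement("song", title)));
        return songs.ToString(SaveOptions.DisableFormatting);
    }
}
```

For the two titles used in the sample, `Build` returns the same string that the sample passes to `metadataSchema.Add`.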