In this post I will roughly outline the architecture of our HTML-to-PDF converter. I believe it largely applies to any HTML consumer.

STEP 1: Tidy

The first step is to tidy the HTML. Most HTML documents are not well formed, meaning that they are not valid XML documents. By first converting the HTML document into an equivalent XML document, we create the required conditions for the subsequent transformations. Typical tidy transformations include the following:

Before

<h1>heading <h2>subheading</h3>

<i><h1>heading</h1></i>
<p>new paragraph <b>bold text <p>some more bold text

<body>
<li>1st list item
<li>2nd list item

After

<h1>heading</h1>
<h2>subheading</h2>

<h1><i>heading</i></h1>
<p>new paragraph <b>bold text</b> <p><b>some more bold text</b>

<body>
<ul>
<li>1st list item</li>
<li>2nd list item</li>
</ul>
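To give a feel for what one of these repairs involves, here is a minimal sketch of a single tidy rule: closing elements that were left open. It is not our actual tidy implementation. The function name `closeOpenTags` and both element sets are assumptions made up for this example, and it only covers implied end tags for `li` and `p`, stray end tags, and void elements. It does not attempt the heading or inline-element repairs shown above.

```javascript
// Hypothetical helper for illustration only.
// Elements whose start tag implicitly closes an open element of the same name.
const IMPLICITLY_CLOSED = new Set(["li", "p"]);
// Elements that never have content; emitted in XML self-closing form.
const VOID_ELEMENTS = new Set(["br", "hr", "img", "link", "meta", "input"]);

function closeOpenTags(html) {
  const open = []; // stack of currently open element names
  let out = "";
  // Split into tags and text runs. Assumes attribute values contain no ">".
  const tokens = html.match(/<[^>]+>|[^<]+/g) || [];
  for (const token of tokens) {
    const end = token.match(/^<\/\s*([a-zA-Z0-9]+)/);
    const start = token.match(/^<\s*([a-zA-Z0-9]+)/);
    if (end) {
      const name = end[1].toLowerCase();
      const index = open.lastIndexOf(name);
      if (index === -1) continue; // stray end tag: drop it
      // Close everything opened after the matching start tag, then the tag itself.
      while (open.length > index) out += `</${open.pop()}>`;
    } else if (start) {
      const name = start[1].toLowerCase();
      if (VOID_ELEMENTS.has(name)) {
        out += token.replace(/\s*\/?>$/, "/>");
        continue;
      }
      if (IMPLICITLY_CLOSED.has(name) && open[open.length - 1] === name) {
        out += `</${open.pop()}>`;
      }
      open.push(name);
      out += token;
    } else {
      out += token; // text, comments, doctype: copied through unchanged
    }
  }
  // Close whatever is still open at the end of the input.
  while (open.length > 0) out += `</${open.pop()}>`;
  return out;
}

console.log(closeOpenTags("<ul><li>1st list item<li>2nd list item"));
// prints "<ul><li>1st list item</li><li>2nd list item</li></ul>"
```

A real tidy step needs many more rules (for example, moving inline elements inside block elements and inserting the missing `<ul>`), but most of them follow the same pattern of tracking open elements on a stack.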

STEP 2: Execute JavaScript

This step is far from trivial because it involves executing JavaScript, so one needs to build a JavaScript interpreter (or use an existing one). When we started working on our HTML converter, we already had our own JavaScript interpreter, but it was tailored to the kind of JavaScript that one typically finds in PDF documents. This type of JavaScript tends to be far less complex than the JavaScript in current HTML documents, and we rarely receive a bug report related to JavaScript embedded in PDF documents. When we started using the same interpreter with real-life HTML documents, however, it turned out to have many deficiencies. Supporting the jQuery library in particular turned out to be an enormous task, because jQuery makes extensive use of more advanced JavaScript features such as prototypes and closures.

Here is a typical conversion that takes place when executing JavaScript:

Before

<html>
  <head>
  <script type="text/javascript">
    function load() {
        var x = document.getElementById("foo")
        x.innerHTML = "Hello TallComponents!";
    }
  </script>
  </head>
  <body onload="load()">
    <h1 id="foo">Hello World!</h1>
  </body>
</html>

After

<html>
  <body>
    <h1>Hello TallComponents!</h1>
  </body>
</html>
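The conversion above can be imitated with a very small stand-in for the DOM. This is only a sketch of the idea, not how our interpreter works: it borrows the host JavaScript engine through `new Function` instead of a custom interpreter, and `fakeDocument`, `elements`, and `staticHtml` are names invented for this example.

```javascript
// Minimal stand-in for a DOM: a single element, looked up by id.
// All names here are illustrative assumptions.
const elements = { foo: { tag: "h1", innerHTML: "Hello World!" } };
const fakeDocument = {
  getElementById(id) {
    return elements[id] || null;
  },
};

// Script text and onload handler as they would be extracted from the page.
const scriptSource = `
  function load() {
    var x = document.getElementById("foo");
    x.innerHTML = "Hello TallComponents!";
  }
`;
const onloadHandler = "load()";

// Run the script with `document` bound to the stand-in, then fire onload.
const run = new Function("document", `${scriptSource}\n${onloadHandler};`);
run(fakeDocument);

// Serialize the mutated element back to static markup.
const { tag, innerHTML } = elements.foo;
const staticHtml = `<${tag}>${innerHTML}</${tag}>`;
console.log(staticHtml); // prints "<h1>Hello TallComponents!</h1>"
```

Once the scripts have run, only the resulting static markup is passed on, which is why the script element and the onload attribute no longer appear in the "After" document.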

STEP 3: Resolve CSS

Cascading style sheets are used to separate content from presentation. CSS is resolved after JavaScript has been executed because JavaScript may also manipulate CSS. CSS can be specified in three ways:

1. External stylesheet. These are specified using the LINK element as follows:

<HEAD>
  <LINK href="special.css" rel="stylesheet" type="text/css">
</HEAD>

2. Inline stylesheet. These are specified using the STYLE element as follows:

<HEAD>
  <STYLE type="text/css">
    H1 {border-width: 1; border: solid; text-align: center}
  </STYLE>
</HEAD>

3. Inline style. Here is a typical example:

<P style="font-size: 12pt; color: fuchsia">
  Aren't style sheets wonderful?
</P>

The CSS resolver transforms the HTML document such that it only has inline styles while being equivalent to the original document.
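As a rough illustration of what "only inline styles" means, the sketch below merges stylesheet rules into an element's style attribute. It is a deliberately simplified assumption of how a resolver could work: it only matches plain element selectors, and it ignores specificity, inheritance, and `!important`. The function names and the rule and element shapes are made up for this example.

```javascript
// Parse "a: b; c: d" into a Map of property -> value.
function parseDeclarations(text) {
  const declarations = new Map();
  for (const part of text.split(";")) {
    const colon = part.indexOf(":");
    if (colon === -1) continue;
    const property = part.slice(0, colon).trim().toLowerCase();
    const value = part.slice(colon + 1).trim();
    if (property && value) declarations.set(property, value);
  }
  return declarations;
}

// Merge the stylesheet rules whose selector equals the element's tag name,
// then let the element's own style attribute override them.
function resolveInlineStyle(element, rules) {
  const merged = new Map();
  for (const rule of rules) {
    if (rule.selector.toLowerCase() !== element.tag.toLowerCase()) continue;
    for (const [property, value] of parseDeclarations(rule.declarations)) {
      merged.set(property, value);
    }
  }
  for (const [property, value] of parseDeclarations(element.style || "")) {
    merged.set(property, value);
  }
  return [...merged].map(([property, value]) => `${property}: ${value}`).join("; ");
}

// Hypothetical input: one stylesheet rule plus the inline style from above.
const rules = [{ selector: "P", declarations: "color: black; text-align: center" }];
const paragraph = { tag: "p", style: "font-size: 12pt; color: fuchsia" };
console.log(resolveInlineStyle(paragraph, rules));
// prints "color: fuchsia; text-align: center; font-size: 12pt"
```

The inline `color` wins over the stylesheet's `color`, while `text-align` is carried over from the stylesheet, so the element no longer depends on any external or embedded rule.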


STEP 4: Layout

Now that the HTML document has been normalized with respect to syntax and CSS, and the initialization JavaScript has been executed and removed, all visual elements have to be positioned and formatted. This is referred to as the layout step.

Although we convert to PDF, our layout module has been designed such that, by implementing a specific interface, it can just as well produce SVG or GDI+ output.
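The following sketch shows the general shape of such a design. The interface itself is not described in this post, so the three methods used here (`beginPage`, `drawText`, `endPage`), the `SvgBackend` class, and the toy `layout` function are all assumptions for illustration, not our real API.

```javascript
// Escape characters that are special in XML text content.
function escapeXml(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Hypothetical backend contract: any object with beginPage, drawText,
// and endPage methods can receive the output of the layout step.
class SvgBackend {
  constructor() {
    this.parts = [];
  }
  beginPage(width, height) {
    this.parts.push(
      `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${height}">`
    );
  }
  drawText(x, y, text) {
    this.parts.push(`<text x="${x}" y="${y}">${escapeXml(text)}</text>`);
  }
  endPage() {
    this.parts.push("</svg>");
  }
  toString() {
    return this.parts.join("");
  }
}

// Toy layout: stacks each line of text below the previous one.
// It knows nothing about SVG; it only talks to the backend methods.
function layout(lines, backend, pageWidth = 595, pageHeight = 842, lineHeight = 20) {
  backend.beginPage(pageWidth, pageHeight);
  lines.forEach((line, index) => backend.drawText(40, 40 + index * lineHeight, line));
  backend.endPage();
}

const svg = new SvgBackend();
layout(["Hello TallComponents!"], svg);
console.log(svg.toString());
```

Because the layout function never mentions the output format, a PDF or GDI+ backend would only need to provide the same three methods.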


While the above four steps have been presented as sequential steps, in reality they are performed more or less in parallel for performance reasons.