Fixing Non-Compliant HTML and DOCTYPE

5/16/2011 By Frank

A customer reported a problem with our HTML to PDF converter: the background color of table cells was not being respected. After looking at the HTML document, we found that the colors were specified incorrectly, as follows:

<td bgcolor="FF0000">this is expected to be red</td>

It should be (note the # prefix):

<td bgcolor="#FF0000">this is expected to be red</td>

When we open this HTML document in IE9, the cell background is red, as the customer expected. Clearly, IE9 tolerates this deviation from the specification. We changed our Tidy module so that FF0000 is converted to #FF0000 before subsequent modules process the content.
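The conversion step described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Tidy module code; the function name `normalize_color` and the choice to handle only six-digit values are assumptions made for the example.

```python
import re

# Assumption: only bare six-digit hex values (the case from this post) are handled.
_HASHLESS_HEX = re.compile(r"[0-9a-fA-F]{6}")


def normalize_color(value: str) -> str:
    """Prepend '#' when the value is a bare six-digit hex color.

    Hypothetical helper for illustration; any other value
    (named colors, values that already start with '#') is returned unchanged.
    """
    candidate = value.strip()
    if _HASHLESS_HEX.fullmatch(candidate):
        return "#" + candidate
    return value
```

With this sketch, `normalize_color("FF0000")` yields `#FF0000`, while `red` and `#FF0000` pass through untouched.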

As part of our QA process we ran the W3C tests and noticed that this fix broke a few tests. Here is one test that broke:

<head>
<style type="text/css">
body { color: green; }
p { color: 1111ff; }
</style>
</head>
<body>
<p>This line should be green.</p>
</body>

Before the fix, the line was rendered green because we ignored the invalid value 1111ff and used the inherited color. After the fix, 1111ff was converted to #1111ff and the line was rendered blue.

When we tried the major browsers (IE, Chrome, Opera, and Firefox), we noticed that all of them render both documents as expected: the table cell is red and the test line is green. This puzzled us at first. After a closer look, we found that the DOCTYPE of an HTML document switches between the two behaviors. If the DOCTYPE is as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

then “FF0000” should be interpreted as “#FF0000”. If the DOCTYPE is as follows:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

then “FF0000” should be ignored.
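A minimal sketch of this DOCTYPE-dependent rule, again in Python and purely illustrative: the function names are hypothetical, only the two public identifiers shown in this post are recognized, and treating a missing or unrecognized DOCTYPE as strict is an arbitrary assumption that the post does not cover.

```python
import re
from typing import Optional

# Extracts the public identifier from a DOCTYPE declaration.
_PUBLIC_ID = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]*)"', re.IGNORECASE)
_HEX6 = re.compile(r"[0-9a-fA-F]{6}")

# Assumption: only this identifier from the post enables the tolerant behavior.
_TOLERANT_PUBLIC_ID = "-//W3C//DTD HTML 4.0 Transitional//EN"


def tolerates_hashless_colors(html: str) -> bool:
    """Return True when the document's DOCTYPE allows colors like FF0000."""
    match = _PUBLIC_ID.search(html)
    # Missing or unrecognized DOCTYPE: treated as strict here (assumption).
    return match is not None and match.group(1) == _TOLERANT_PUBLIC_ID


def resolve_color(value: str, html: str) -> Optional[str]:
    """Return the color to use, or None when the value must be ignored."""
    if _HEX6.fullmatch(value):
        # Bare hex value: fix it up or drop it, depending on the DOCTYPE.
        return "#" + value if tolerates_hashless_colors(html) else None
    return value  # named colors and '#'-prefixed values pass through
```

Returning `None` stands in for "ignore the declaration", so a caller would fall back to the inherited color, as in the W3C test above.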

We implemented the fix and, if all tests go well, we will release a maintenance update.