We encountered a PDF document from a customer where we failed to geometrically sort the glyphs correctly. Glyphs appearing lower on the page were returned earlier in the sort order by the text extractor.
It turned out that the font size, the horizontal scaling, and finally the CTM (current transformation matrix) each applied a negative scale to the text, so that each direction was negated twice. The net result is, of course, positive. Our software should have handled all of these scales with one generic algorithm, but we dropped one sign: we took the absolute value of the horizontal scale (Tz operator).
Here is a screenshot from our internal PDF nitpicker PDFSpy:
Operation 2 sets the vertical scale to -1. Operation 9 sets the font size to -5, which amounts to a negative scale in both the horizontal and the vertical direction. Operation 11 sets the horizontal scale to -1 (the operand of the Tz operator is a percentage, so it must be divided by 100). Taking all of this into account, each direction is negated twice, so the net scale in both directions is positive.
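The sign arithmetic can be sketched in a few lines of Python. This is only an illustration, not the PDFKit.NET code: the function `net_scale` and its parameters are hypothetical, the text is assumed to be axis-aligned (no rotation or skew), and the horizontal component of the CTM is assumed to be 1 because the post does not state it. The `buggy` flag mimics taking the absolute value of the Tz operand.

```python
def net_scale(font_size, tz, ctm_sx, ctm_sy, buggy=False):
    """Return the (horizontal, vertical) net scale of axis-aligned text.

    font_size: operand of Tf (may be negative)
    tz:        operand of Tz, a percentage (100 means no scaling)
    ctm_sx, ctm_sy: horizontal and vertical scale of the CTM
    buggy:     if True, drop the sign of the horizontal scale (the mistake
               described above); hypothetical flag for illustration only
    """
    font_size = float(font_size)
    th = tz / 100.0          # Tz operand is a percentage
    if buggy:
        th = abs(th)         # the sign is lost here
    horizontal = font_size * th * ctm_sx
    vertical = font_size * float(ctm_sy)
    return horizontal, vertical


# Values from the screenshot description; ctm_sx=1 is an assumption.
print(net_scale(-5, -100, 1, -1))              # (5.0, 5.0)
print(net_scale(-5, -100, 1, -1, buggy=True))  # (-5.0, 5.0)
```

With every sign kept, both directions come out positive. With the absolute value applied to Tz, the horizontal scale comes out negative even though the text is not mirrored. How that wrong sign then affects the sort order depends on the extractor and is not modeled here.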
We fixed this in maintenance update 188.8.131.52 of PDFKit.NET 3.0.