We recently received reports that Adobe Reader XI wants to install Asian language packs for files generated by our software that are, in fact, completely “western”.
Adobe Reader X did not require such a thing, so we wondered what had happened.
East Meets West
It turns out that this happens for documents that contain non-embedded TrueType fonts. In PDF, one must include some information for each font that is used in a document, even if the font is not embedded. This can be done in various ways. In the past, we chose to include so-called “CID”-style font information. This has the advantage that it can address the complete range of Unicode characters, so in principle one can safely write out any Unicode text with such a font, without worrying about it any further (mixing Western, Asian, and even Klingon, provided that the font contains definitions for these glyphs).
One of the requirements that the PDF specification imposes on this type of font is that it must have an encoding from a predefined set of encodings. This limited set happens to contain only encodings that are classified as “Eastern” (Chinese, Japanese, Korean, etc.). There is, however, nothing particularly “Eastern” about these. An encoding merely specifies which sequence of bytes in a text maps to which glyph ID in the font. Languages are not really involved at this point.
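To make the point concrete, here is a toy sketch of what an encoding conceptually is: a mapping from byte sequences in the text stream to glyph IDs in the font. The byte values and glyph IDs below are made up for illustration; they do not come from any real CMap.

```python
# Illustrative only: a two-byte "CID-style" encoding, where each
# two-byte sequence in the text stream selects a glyph ID in the font.
# Nothing here is language-specific; it is just a byte-to-glyph mapping.
cid_encoding = {
    b"\x00\x41": 36,  # bytes 00 41 -> glyph 36 (say, 'A')
    b"\x00\x42": 37,  # bytes 00 42 -> glyph 37 (say, 'B')
}

def decode_text(data: bytes, encoding: dict) -> list:
    """Map a text-stream byte string to glyph IDs, two bytes at a time."""
    glyphs = []
    for i in range(0, len(data), 2):
        glyphs.append(encoding[data[i:i + 2]])
    return glyphs

print(decode_text(b"\x00\x41\x00\x42", cid_encoding))  # [36, 37]
```

Whether the predefined encoding is labeled “Japanese” or anything else makes no difference to this mapping; that is why the label alone says nothing about the languages actually used in the document.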
And so we used these “Eastern” encodings for many years, and this never led to any issues, such as having to needlessly install eastern language packs. After all, what really matters is not the encoding, but the characters that are actually used in the document.
Well, Adobe Reader XI changed all that. It now confuses our customers by prompting them to install an Asian language pack for documents that do not contain a single eastern character. All because the encoding of the CID fonts used is “Eastern” (it has to be, because Adobe requires it for CID fonts).
To avoid this, it appears that we will have to stop using CID fonts for western documents. This means that we will have to inspect each piece of text, check whether its characters fit within a particular western single-byte code page, and output a TrueType font description with that single-byte encoding. It also means that we may have to output multiple such font descriptions into a document if not all characters fit within a single western single-byte code page.
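A minimal sketch of that check, not our actual implementation: determine whether a piece of text fits a single western single-byte code page, and if not, which code pages would be needed. The candidate list below (`cp1252`, `cp1250`, `cp1257`) is just an example set.

```python
# Example candidates: Western European, Central European, Baltic.
WESTERN_CODE_PAGES = ["cp1252", "cp1250", "cp1257"]

def fits_code_page(text: str, codepage: str) -> bool:
    """True if every character of `text` encodes in the given code page."""
    try:
        text.encode(codepage)
        return True
    except UnicodeEncodeError:
        return False

def code_pages_needed(text: str) -> list:
    """Assign each character to the first code page that can encode it.

    Returns the list of code pages required, mirroring the case where a
    document needs multiple single-byte font descriptions.
    """
    needed = []
    for ch in text:
        page = next((cp for cp in WESTERN_CODE_PAGES
                     if fits_code_page(ch, cp)), None)
        if page is None:
            raise ValueError(f"{ch!r} fits no candidate code page")
        if page not in needed:
            needed.append(page)
    return needed

print(code_pages_needed("Grüße"))   # ['cp1252']
print(code_pages_needed("Dvořák"))  # ['cp1252', 'cp1250'] - ř needs cp1250
```

Each distinct code page in the result would correspond to an extra single-byte font description in the PDF, which is exactly the overhead we would rather avoid.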
It seems like an awkward step to take, now that Unicode should actually have solved all these code page troubles many years ago.