
Alphabet Soup: The Internationalization of Linux, Part 2

Stephen Turnbull

Issue #60, April 1999

Mr. Turnbull takes a look at the problems faced with different character sets and the need for standardization.

A large body of standards has evolved to handle the problems with text manipulation I presented last month. In general, ad hoc handling methods are considered to be localization, while a method that conforms to some standard and is generalizable to many cultural environments is considered internationalization.

POSIX

Currently, the central standard for internationalization is the locale model of POSIX. Unfortunately, in the current state of the art, localization via the POSIX model is something of a Procrustean bed. For example, in Japanese there are two common ways of notating the currency unit yen: postfixing the kanji 円 and prefixing the yen sign ¥. It is not uncommon for both conventions to be used in the same document in different contexts: the former is common in running text, the latter in tables. POSIX does not provide for this. It is easy enough to implement by creating a Japanese-table locale to complement the Japanese-text locale, but this places the burden of setting the correct locale on the application programmer. Although much smaller in scale, this burden is much like that imposed by multilingualization. Nor was POSIX designed to support such fine discrimination; this is better left to the individual application anyway.

POSIX-style internationalization provides a comfortable, functional environment for almost all users and applications. Specifically, a POSIX locale determines:

  • the character set and encoding to be used

  • classification of characters (e.g., alpha, hex-digit, whitespace, etc.)

  • the sorting order for strings in the language

  • digit separator and decimal-point conventions

  • date and time presentation

  • currency presentation

  • message format (in particular, strings for yes and no)

All of these features are implemented by changing the functionality of standard library functions or by adding new ones. That is, the isalpha function in libc no longer consults a fixed table, but instead the table is varied according to the current locale. Displaying monetary values can be done by using the new function strfmon. Unfortunately, the locale support in Linux libc is still only partially documented as of libc-2.0.7t; no man page for strfmon exists, although there is an entry point in the library. A useful discussion by Ulrich Drepper, one of the authors of GNU libc, may be found at http://i44s11.info.uni-karlsruhe.de/~drepper/conf96/paper.html.
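
A minimal sketch of how an application opts in to these facilities follows; the output depends entirely on which locale the user's environment selects, and the example only illustrates the calls named above:

#include <locale.h>
#include <monetary.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    char buf[80];
    time_t now = time(NULL);

    /* Switch from the default "C" locale to whatever the environment
       (LC_ALL, LC_MONETARY, LANG, ...) requests. */
    setlocale(LC_ALL, "");

    /* Date and time presented according to LC_TIME. */
    strftime(buf, sizeof(buf), "%x %X", localtime(&now));
    printf("date/time: %s\n", buf);

    /* Currency presented according to LC_MONETARY. */
    strfmon(buf, sizeof(buf), "%n", 1234.56);
    printf("amount:    %s\n", buf);

    return 0;
}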

POSIX Internationalization Levels

The POSIX standard defines a number of levels of compliance with internationalization standards. These levels are a somewhat useful guide to how far an internationalization effort has progressed. Level 1 compliance is achieved when a system is 8-bit clean. Obviously this is a bare minimum, since some characters may get corrupted. Level 2 compliance is achieved when a flexible system for producing localized time, date and monetary formats is implemented. As described above, these facilities are provided by GNU libc, so disciplined use of appropriate formatting functions and the setlocale call is going to be sufficient for most applications to achieve Level 2 compliance. Level 3 compliance is achieved when the application can use localized message catalogs. This facility is provided by the GNU gettext library. Controlling gettext is nearly as simple as setting the locale. Unfortunately, the rules of precedence are somewhat different. However, disciplined use of gettext and its supporting functions will make localization much easier. (See “Internationalizing Messages in Linux Programs” by Pancrazio de Mauro, March 1999.)
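
As a sketch of what Level 3 support looks like inside a program, here is the usual GNU gettext boilerplate; the domain name "myprog" and the catalog directory are purely illustrative:

#include <libintl.h>
#include <locale.h>
#include <stdio.h>

#define _(String) gettext(String)

int main(void)
{
    setlocale(LC_ALL, "");                          /* pick up the user's locale */
    bindtextdomain("myprog", "/usr/share/locale");  /* where the catalogs live   */
    textdomain("myprog");                           /* which catalog to use      */

    /* The English string is the lookup key; if no translation exists,
       it is printed unchanged. */
    printf(_("Hello, world\n"));
    return 0;
}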

Level 4 refers to Asian language support. The Asian languages are given a special status because of the variety of complex subsystems needed to support them. For example, many implementations of X have two separate families of string display functions, one for strings encoded in one byte and another for strings composed of characters encoded in two bytes. In Japanese, one- and two-byte characters are mixed freely, so an internationalized application which needs to deal with Japanese would have to analyze strings into one-byte and two-byte substrings on the fly. In fact, dealing with Japanese by itself forces the programmer to deal with many of the problems posed by true multilingual applications.

Character Sets

It would be nice if we could think of character sets as corresponding directly to the scripts we use to write by hand, but unfortunately things are not so simple. For example, there are nearly 200 countries in the world, each with its own currency. Of course, many share the same symbol, but it is clear that if your keyboard had a key for every currency symbol, it would be about twice as big as the one you use today. Other useful symbols are the paragraph and sectioning marks used by lawyers and the various operators and non-Latin symbols used by mathematicians. Since new characters are being created all the time (for example, the symbol for the new European monetary unit, the euro), it is impossible to include them all. So in fact, a character set is someone's idea of a useful set of characters.

Representation as bit strings inside a computer imposes further constraints. Since modern computers all work in terms of bytes as the smallest efficient unit of access, there is a big difference in the space and processing requirements for text based on a 256-character set, which can be encoded in a single byte, and a 257-character set, which cannot. One might think that the extension to two bytes, or 65536 characters, would be enough to satisfy anyone, but it turns out that even this is not enough. The process of selecting about 20,000 ideographic characters of Chinese origin occasioned many arguments while the Unicode character set was being designed. Even those 20,000 may not be enough; while there are only a few people in the world who care about some of the excluded ideographic characters, to them it may be the most important character in the world, as it is the one they use to write their name.

The result is that many character sets have been designed and populated, and standards have been written to codify their use.

ASCII

The most influential standard of all is the American Standard Code for Information Interchange, abbreviated ASCII. This is a list of the 128 7-bit bit strings, with an assignment of each one to either a character commonly used in American English or a control function. Many of the control functions are not used today, but so much software has been written on the assumption that hex values 0x00 to 0x1F are not printing characters that no one considers assigning a few more characters to some of those code points.

Because nearly all existing computer languages are compatible with the ASCII character set, ASCII in some form is a subset of most electronic character sets. However, there are many variants. For example, the JIS Roman character set used in most Japanese computers is almost identical to ASCII, except that a couple of the glyphs are changed and the Japanese yen symbol is substituted for the backslash. In order to codify this development, ASCII-like character sets are defined by the International Standards Organization (ISO) in standard ISO 646. U.S. ASCII is designated the international reference version for ISO 646 and is occasionally referred to as ISO 646-IRV (for example, in naming fonts for the X Window System).

The ISO 8859 Family of Character Sets

ASCII is simply not sufficient for use in an internationalized environment. For example, most European languages use accented characters. Certainly, it is possible to represent “Latin small letter a with acute accent” (á) as a two character ligature (e.g., 'a), but this is inconvenient for sorting and possibly ambiguous. Furthermore, it is not obvious how to represent the caron using only ASCII characters. In order to maintain compatibility with ASCII for the sake of existing software, and accommodate many of the countries most intensively using computers, the ISO 8859 standard was designed. ISO 8859 had three main goals: maintain ASCII compatibility, implement within the constraints of ISO 2022 and provide the broadest coverage of languages within a single-octet encoding. Unfortunately, these three goals are not compatible. Several important scripts which can be encoded in a single octet require several dozen code points each for their characters because they do not overlap with ASCII or each other: Greek, Russian, Hebrew and Arabic.

The solution arrived at was not to define a single character set, but rather a family of character sets. Each ISO 8859 character set contains ASCII (ISO-646-IRV) as a subset, and the encoding is defined so that, interpreted as integers (C chars), the ASCII characters are encoded identically in ASCII and ISO 8859. Then a list of supplementary character sets, each including at most 96 characters to conform to ISO 2022 (described below), was defined. These supplementary characters are then assigned to the code points 0xA0 to 0xFF. Where the supplementary set is derived from an alphabet, the natural collating order is followed, but for the collections of accented characters the order is necessarily arbitrary. The current supplementary character sets are listed in Table 1.

Table 1.

Unicode and the ISO-10646 Universal Character Sets

The next step is to unify all of the various character sets. Of course, the national standards have two main advantages. They are space-efficient, encoding the characters needed for daily use and computer programming in one byte, and they are time-efficient, since they can be arranged in the natural collating order. The second advantage has already been conceded by the majority of European languages when using encodings in the ISO 8859 family. Indeed, ISO 8859-1 has been an enormous success since it effectively unifies all the major Western European and American languages in a single multilingual encoding. With system library support for sorting (the LC_COLLATE portion of POSIX locales), it is hard to justify using anything else where it will serve.

In this context, it was natural to try to extend the success of ISO 8859 by abandoning the efficiency of one-byte encodings in favor of a single comprehensive encoding for all characters used by all the world's languages. Two complementary efforts, proceeding in parallel, were conducted by a commercial consortium and the ISO. Unsurprisingly, the ISO's working group called its effort by the ponderous name Universal Multiple-Octet Coded Character Set (abbreviated UCS), while the commercial consortium adopted the sprightly “Unicode”. Also unsurprisingly, the Unicode Consortium (driven by the commercial advantages of a uniform two-byte encoding) was able to formulate a standard unifying nearly all of the world's scripts in a single two-byte encoding by 1991, as well as codifying a dictionary of properties of each character guiding such usages as ligatures and bidirectional text, while the ISO ended up defining both a two-byte version and a four-byte (31-bit) version of the UCS in 1993 without the additional properties. Also in 1993, the Unicode and UCS-2 character sets and encodings were unified, although each standard retains unique features.

Why separate efforts? Surely 65536 different characters are enough for anyone. Who needs two billion characters?

The reason for separate efforts is easy enough to explain. The Unicode effort was driven by the commercial advantages of a single encoding. Much effort has been expended in the standardization of Internet protocols, first working around the problems caused by “8-bit-dirty” Internet software, then in adding support for Asian languages, and finally in creating protocols for negotiating character sets. It would be nice if all that effort and the necessary implementation inefficiencies could be avoided by having one standard encoding. As we will see, it is not that easy, but standardizing on Unicode could result in large cost savings, both in development and processor time and protocol overhead.

On the other hand, the ISO group was primarily concerned that a truly universal framework be created so as to avoid the need for yet another “universal” standardization effort in the future. It worried more about generality and eschewed standardizing poorly-understood areas, such as treatment of bidirectional text. In fact, UCS-4 currently contains only those characters defined by the Unicode standard, adopted en masse as the Basic Multilingual Plane of UCS-4 and equivalent to UCS-2.

The reason for their concern is that it is already painfully obvious that 65536 characters are not enough for some purposes. Although over 18,000 unassigned code positions remain in Unicode, classical scholars of hieroglyphics or Chinese could rapidly fill them with ideographs. The current set of “unified Han” (Chinese ideographs used in Chinese, Japanese, Korean and Vietnamese) was reduced to 20,902 only through a highly contentious unification process, suggesting that some of the contested characters may yet need code points of their own. Archaic Hangul (composed Korean syllables) would add thousands more. Unicode also explicitly excludes standardized graphic notations such as those used in music, dance and electronics. It is clear that a truly universal character set will easily exceed the limit of 65536 imposed by a two-octet encoding.

Why does ISO 10646 specify a 31-bit encoding? Current hardware is byte-oriented, but there is no particular reason to stop at 24 bits, since only certain video hardware can efficiently use three-byte words of memory. The word size most efficiently accessed by most current hardware is four bytes. With potentially billions of characters, it was considered wise to reserve a bit in each character for arbitrary internal processing purposes; however, this bit must be cleared before passing the character on to an entity expecting a UCS character.

Similarly, large contiguous private spaces have been reserved containing 1/8 of the three-octet codes, i.e., those with the high octet 0, and 1/4 of the four-octet codes. This means that an application can embed entire national standard character sets in this space in a natural way (in particular, preserving their orderings) if desired, without any possibility of conflict with the standard, current or any future extensions. ISO 10646 does not necessarily recommend such techniques, but certainly permits them. This still leaves over 1.5 billion code points reserved for future standardization; it seems certain most will remain reserved but unused for a good long time.

However, it seems unlikely that Unicode, let alone UCS-4, will soon have the success enjoyed by ISO 8859-1. First, the Oriental languages' digital character set standards are not yet satisfactory, in part because the languages are not fully standardized. Standardization efforts for all the Han character languages remain active. If the Japanese, for example, have not yet settled on a national character set, how can they be satisfied with the unified Han characters of Unicode? A recent tract entitled Japanese is in Danger! claims that Unicode will be the death of the Japanese language, and many computer-literate Japanese show varying degrees of sympathy with its arguments.

Second, in multilingual texts it may be desirable to search for some specifically Chinese character (as opposed to its Korean or Japanese cognates). In Unicode, this requires maintaining substantial amounts of surrounding context which would contain markup tags indicating language and would be impossible by definition in Unicode-encoded plain text. Although you could point to similar difficulties with ISO 8859-1 text, it is not the same. A Chinese character is a semantic unit with specific meaning, unlike an alphabetic character. In fact, the Han unification process normally ignores semantics. Thus, it confounds a Japanese character with the same shape as a given Chinese character, but a different meaning. ISO 8859-1 characters, on the other hand, are rarely searched for in isolation; if so, they have no semantic content.

Third, Asians are simply not yet as multilingual across the Asian languages as Europeans are across European ones, although this is changing rapidly. Still, it is unlikely that we will ever see an “Asian Switzerland” with Chinese, Japanese and Vietnamese simultaneously in use as official languages. Thus, the advantage of Unicode over national standards is not so great.

Fourth, from the Western European point of view, most of the gains to a single character set supporting multilingual processing have already been achieved by ISO 8859-1. Western Europeans have little need for Unicode.

In the near future, Unicode will be most useful to computer and operating system vendors, Linux included. Supporting Unicode as the basic internal code set provides an unambiguous way to avoid linguistic confusion. Adding a new language then becomes simply a matter of providing fonts, a Unicode-to-font-encoding mapping table and translated messages. No additional programming effort is necessary, and backwards compatibility is guaranteed. This is not trivial. An example is discussed below: a kernel patch used to make directory listings of Japanese Windows file systems, mounted as either MS-DOS or VFAT file systems, readable. That patch is certainly never going to be integrated into the kernel source code, because it is impossible to ensure it won't mangle non-Japanese names.

National Standard and Private Character Sets

Besides the national standard character sets mentioned above, many others are still in common use. A few of the more important ones include the Russian KOI-8 (an alternative to ISO 8859-5), ISCII for Indian languages written in the Devanagari script and VISCII for Vietnamese. Of course, U.S. ASCII is available.

Other important character sets are those defined by industry or individual firms. An important characteristic of these private character sets is that their encodings often do not conform to the ISO 2022 standard, making interchange among systems difficult. Microsoft is at the forefront, defining and registering myriads of Windows character sets. Most of these are ASCII derivatives and closely related to ISO 8859 encodings, so although the small differences are annoying to programmers, they are often insignificant to users. However, in the field of Asian languages the non-ISO-2022-compliant encodings called Shift JIS, an encoding for Japanese used by Microsoft and Apple operating systems, and Big 5, an encoding for traditional Chinese defined by a consortium of five large Taiwanese manufacturers, are important. In both cases, portions of the code space not used by international standard character sets were employed.

In the case of Shift JIS, the idea was to include the so-called half-width katakana, the 70 or so characters necessary to phonetically transcribe the Japanese language. Rarely used in normal text, they are somewhat convenient for file names in the DOS 8.3 format. This requires that they be encoded as a single octet. The Japanese standard JIS X 0201 encodes ASCII in its usual code points and places the katakana in the octets with values 0xA1 to 0xDF. Shift JIS is based on this standard and uses a simple algorithm to transform standard JIS kanji codes into two-octet codes with the first octet in the ranges 0x81 to 0x9F and 0xE0 to 0xEF, which are unused by JIS X 0201.
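
The transformation is simple enough to sketch in a few lines of C. The arithmetic below is the commonly documented form of the mapping; it is shown only as an illustration and does no validation of its input:

/* Convert one JIS X 0208 character (two bytes, each in 0x21-0x7E)
   into its Shift JIS representation. */
void jis_to_sjis(unsigned char j1, unsigned char j2,
                 unsigned char *s1, unsigned char *s2)
{
    if (j1 & 1) {                        /* odd row                  */
        *s2 = j2 + 0x1F;
        if (j2 >= 0x60)
            (*s2)++;                     /* skip the 0x7F code point */
    } else {                             /* even row                 */
        *s2 = j2 + 0x7E;
    }
    *s1 = ((j1 - 0x21) >> 1) + 0x81;     /* lead byte 0x81-0x9F ...  */
    if (*s1 > 0x9F)
        *s1 += 0x40;                     /* ... or 0xE0-0xEF         */
}

The lead bytes land exactly in the ranges mentioned above, which is what lets Shift JIS text coexist with the single-byte JIS X 0201 codes in the same stream.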

Character Set Extension: ISO 2022

Unicode and UCS-4 solve the problem of character representation permanently. However, as seen above, Unicode is not quite sufficient for all purposes and UCS-4 is much too wasteful for general use. Furthermore, an enormous amount of hardware and software is oriented toward one-octet character sets. Thus, the ISO 2022:1994 standard for code extension techniques (in particular, using several character sets in one data stream), the most recent edition of a standard first published as ECMA-35 by the European Computer Manufacturer's Association in 1971, remains relevant.

ISO 2022 is a rather abstract standard. A brief outline of its provisions follows.

  • Division of codes into 7-bit and 8-bit types; the 256 code points in the 8-bit table are divided into the left (L, 0x00 to 0x7F) and right (R, 0x80 to 0xFF) halves. 7-bit codes are considered to use only the left half.

  • Further division of the 128 points in each half into control (C, 0x00 to 0x1F) and graphic (G, 0x20 to 0x7F) codes.

  • Codes 0x1B (escape), 0x20 (space) and 0x7F (delete) in CL and GL are fixed. Codes 0xA0 and 0xFF in GR are often left unused.

  • Provisions are made for handling control characters similar to those for graphic characters described below, but these are uninteresting in a discussion of internationalization.

  • Graphic character sets must be encoded in a fixed number of bytes per character. Either all bytes of all characters are in the range 0x20 to 0x7F, or all bytes of all characters are in the range 0xA0 to 0xFF. A character set in which the bytes 0x20 and 0x7F or 0xA0 and 0xFF are never used is referred to as a 94^n-character set, where n is the number of bytes. Otherwise, the character set is a 96^n-character set.

  • An encoding may use up to four character sets simultaneously, denoted G0, G1, G2 and G3. G0 must be a 94^n-character set; the other three may be 96^n-character sets. The interpretation of a byte depends on the shift state. Any of G0, G1, G2 or G3 may be invoked into GL by the locking shift control codes LS0, LS1, LS2 and LS3 respectively. When a character set is invoked into GL, it is used to interpret bytes in the range 0x20 to 0x7F. Similarly, in 8-bit encodings, right locking shifts are used to invoke character sets G1, G2 or G3 into GR by the control codes LS1R, LS2R and LS3R. Then that character set is used to interpret bytes in the range 0xA0 to 0xFF.

  • A single character may be invoked from the G2 and G3 sets by use of the single shift control codes SS2 and SS3.

  • Escape sequences are provided for the purpose of designating new character sets into the G0, G1, G2 and G3 elements.

A given version of ISO 2022 need not provide all of the above shifting and designating facilities. ASCII, for example, provides none. To the extent that they are provided by a derivative standard, the control codes must take the values as shown in Table 2.

Table 2.

Three examples of codes which may be considered applications of the ISO 2022 standard are ASCII, ISO 8859-1 and EUC-JP. ASCII is the standard encoding for American English. It is a 7-bit code with the ASCII control codes designated to C0, the ASCII graphic characters designated to G0, and C1, G1, G2 and G3 not used. C0 is invoked in CL; G0 is invoked in GL. No shift functions are used.

ISO 8859-1 is an 8-bit code, with C0 left unspecified (but normally C0 has the ASCII control characters in it), the ASCII graphic characters are designated to G0 and the Latin-1 character set is designated to G1. C1, G2 and G3 are unused. C0 is invoked in CL and G0 is invoked in GL. No shift functions are used.

Packed-format EUC-JP is an 8-bit code, with C0 unspecified but normally using the ASCII control characters; the JIS X 0201 Roman version of ISO 646 designated to G0; the main Japanese character set JIS X 0208 containing several alphabets, punctuation, the Japanese kana syllabaries, some dingbats and about 6000 of the most common kanji (ideographs) designated to G1; the half-width katakana from JIS X 0201 designated to G2; and the JIS X 0212 set of about 8000 less common kanji designated to G3. C0 is invoked in CL, G0 is invoked in GL and G1 is invoked in GR. No locking shift functions are used. Half-width katakana and the JIS X 0212 kanji must be accessed using the single shifts SS2 and SS3 respectively, and they are shifted into GR.
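
As a sketch of what this means for software, a decoder only has to look at the lead byte of each character to know which of the four sets it comes from. This assumes the conventional 8-bit values 0x8E and 0x8F for the single shift codes SS2 and SS3:

/* Classify the lead byte of the next character in packed-format EUC-JP. */
enum euc_class {
    EUC_ASCII,      /* one byte from G0 (JIS X 0201 Roman / ASCII) */
    EUC_JISX0208,   /* first of two bytes from G1                  */
    EUC_KATAKANA,   /* SS2: one following byte from G2             */
    EUC_JISX0212,   /* SS3: two following bytes from G3            */
    EUC_INVALID
};

enum euc_class euc_classify(unsigned char c)
{
    if (c < 0x80)
        return EUC_ASCII;
    if (c == 0x8E)
        return EUC_KATAKANA;
    if (c == 0x8F)
        return EUC_JISX0212;
    if (c >= 0xA1 && c <= 0xFE)
        return EUC_JISX0208;
    return EUC_INVALID;
}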

Finally, ISO 2022 is commonly used in Internet mail and multilingual documents. The 7-bit version is used and every character set must be designated to G0 before use.
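
For example, the designation sequences commonly seen in ISO-2022-JP mail are ESC $ B (JIS X 0208 to G0) and ESC ( B (ASCII to G0). A fragment mixing English and Japanese is laid out as in the following sketch, with the two-byte JIS codes themselves elided:

/* An ISO-2022-JP fragment: ASCII, a run of JIS X 0208, then ASCII again.
   The "..." stands for the two-byte JIS X 0208 codes. */
const char fragment[] =
    "The meeting is at "  /* ASCII (the initial designation)      */
    "\x1b$B" "..."        /* ESC $ B: JIS X 0208 designated to G0 */
    "\x1b(B" ".";         /* ESC ( B: ASCII designated back to G0 */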

The single most important aspect of ISO 2022 is that code points in the range of ASCII control characters may not be used for graphic characters. This means that text files using encodings conforming to ISO 2022 will behave like text (with line breaks and not causing strange behaviour in your terminal or emulator) when displayed. If you do not have the fonts or your software does not understand the designation escape sequences, you will see gibberish, but at least your terminal will continue working.

A second useful fact is that in most cases ASCII or some version of ISO 646 is designated to G0. An encoding like EUC-JP with ASCII designated to G0 and invoked to GL and all of the other character sets invoked to GR is “file system safe” in 8-bit clean systems. This is more or less the definition of the EUC variant of ISO 2022.

Some encodings which do not conform and thus often cause problems in software not specifically prepared for them are Shift JIS, Big 5, VISCII and KOI-8. Shift JIS in particular annoys me, because I dual-boot Linux and Windows 95 for Japanese OCR, conversion of Microsoft Word documents to plaintext and FreeCell, and directory listings with Japanese in them are invariably messed up. Fortunately, I have yet to find a reason to access a file named in Japanese. Kernel patches are available which help deal with this, but they are unofficial and will stay that way because they are inherently dangerous. That is, they work with Japanese most but not all of the time, and they will not handle non-Japanese 8-bit encodings correctly.

Internet Messaging

One of the earliest and most important applications for the Internet is messaging, either direct to recipients (electronic mail) or broadcast (Usenet newsgroups). From the internationalization point of view, these are basically the same; internationalization doesn't care about the transmission mechanism, only how the content is handled.

Because messaging was an early application, it assumes a rather restricted environment. In particular, it assumes the data stream is limited to 7-bit bit-strings, and one cannot even be sure that all ASCII characters will be transmitted without error. For example, if a message originates in the UNIX world, is passed through BITNET (i.e., EBCDIC encoding) and back to UNIX, some characters are likely to be corrupted. Of course, these days such corruption is unlikely, but when the standards were designed, it was commonplace. Now these restrictions are defined in standards and widely implemented in software, so they are likely to continue for the foreseeable future, even though the hardware and software for Internet transmission of data is extremely reliable.

The Internet mail transmission protocol (SMTP) is defined in RFC-821. The main provision of interest is that the transmission channel must transmit all 128 ASCII characters properly. 8-bit-clean channels are encouraged, but implicitly 7-bit characters are the norm. Internet messages are standardized in RFC-822 for electronic mail and RFC-1036 for Usenet. RFC-1036 adopts RFC-822 nearly in full, so I will refer to these three standards together as RFC-822.

RFC-822 is intended first of all to be compatible with RFC-821. The content of a message is divided into the part that is relevant to the mail transport system, the headers, and the part that is irrelevant to transporting the message, the body. RFC-822 allows users to send 8-bit content in the body at their own risk, but the headers must be in a 7-bit code, in particular, ASCII. This is rather annoying to non-English-speaking users. To permit non-English text in subject headers and in comments (particularly the full names associated with addresses), and to provide reliable transport for non-ASCII body content, both non-English text and binary data of various kinds, the Multipurpose Internet Mail Extensions (MIME) suite of protocols was defined. Today, this standard occupies no fewer than five RFCs (2045-2049). We will be interested only in those parts related to internationalization.

MIME Transfer Encodings

The MIME transfer encodings are like the UCS transformation formats discussed above. They allow arbitrary content to be expressed in a way that will not choke the transmission channel or be damaged by it. MIME defines two transfer encodings, quoted printable and BASE64.

The quoted-printable encoding is very simple. Any octet may be represented by its hexadecimal code, preceded by an equals sign. So a space character is represented as =20 and the Spanish small enye (ñ) is =F1. The Latin capital letter A is =41. However, in general these are used only in three circumstances. First, since the equal sign is an escape character, it must be represented by =3D. Second, some software strips trailing whitespace, in particular on systems with record-oriented storage that do not use control characters to represent line breaks. A space or tab that ends a line will be encoded =20 or =09, respectively. This is important to the signature convention used on Usenet newsgroups. Finally, non-ASCII octets including most control characters will be encoded. Thus, the quoted-printable encoding is intended for applications, such as Western European languages, where most characters come from the basic Latin (i.e., ASCII) set. In fact, one quickly learns to accurately read quoted printable text without decoding it.
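
A few lines of C make the rule concrete. This sketch handles only the escaping itself; the line-length limits and trailing-whitespace rules the full encoding also imposes are omitted:

#include <stdio.h>

/* Emit one octet in quoted-printable form: printable ASCII other than
   '=' passes through, everything else becomes =XX. */
void qp_putc(unsigned char c, FILE *out)
{
    if (c >= 0x20 && c <= 0x7E && c != '=')
        fputc(c, out);
    else
        fprintf(out, "=%02X", c);
}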

Note that this is a transfer encoding. It is a purely mechanical transformation and provides no information about the intended meaning of the character. Although ñ is one interpretation of =F1, there are many others including a different one for each of the ten ISO 8859 character sets. Quoted printable encoding provides no indication of which is intended.

The BASE64 encoding is intended to be a robust encoding for arbitrary binary data, including images and audio. However, it is also commonly used for languages like Japanese, where interpreting each octet separately as an ASCII character yields gibberish anyway. It is more efficient than quoted printable, using only 33% more space than the original text, whereas each quoted-printable-escaped character takes three times as much space as the unencoded octet. BASE64 is similar to the famous uuencode format long used in UNIX for the same purpose, but the characters used for the encoding are limited to the 52 Latin letters, the 10 decimal digits, the plus sign and slash.

The equals sign is also used, as padding. The reason for this choice is that base 64 is a convenient radix for byte-oriented encoding, since four base-64 digits can encode 24 bits or 3 octets. The characters chosen are passed intact by all known systems, which is not true of some of the punctuation marks used in the uuencode algorithm. The encoding algorithm is obvious:

  1. Break up the data stream into groups of three octets. The last group may have one or two octets and will be treated specially.

  2. For each group of three, concatenate the octets into a 24-bit string, then break it into four 6-bit groups. Interpret each as a 6-bit binary integer and use it as an index into the 64-character alphabet described above. This results in a group of four base-64 digits. Add them to the output.

  3. If there is a remaining group, it has either one or two octets in it. Add one or two null octets to complete the group of three. Now treat it as in Step 2, except that if there was one octet in the group, add the first two base-64 digits to the output and pad the end with two equals signs to make a group of four. If there were two octets in the final group, add the first three base-64 digits to the output and pad with a final equals sign to make a group of four.

Notice that by using the equals sign it is always possible to exactly decode the original text; there will not even be a spurious null character at the end. Furthermore, the algorithm is very fast and space-efficient, given the restrictions.
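
A compact C version of the three steps is sketched below; a real encoder would also break its output into lines of at most 76 characters, which is omitted here:

#include <stdio.h>

static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode len octets from src as BASE64 on out. */
void base64_encode(const unsigned char *src, size_t len, FILE *out)
{
    size_t i;

    for (i = 0; i + 3 <= len; i += 3) {      /* step 2: full groups of three */
        unsigned long n = ((unsigned long)src[i] << 16)
                        | ((unsigned long)src[i + 1] << 8) | src[i + 2];
        fprintf(out, "%c%c%c%c",
                b64[(n >> 18) & 0x3F], b64[(n >> 12) & 0x3F],
                b64[(n >> 6) & 0x3F],  b64[n & 0x3F]);
    }
    if (i < len) {                           /* step 3: one or two octets left */
        unsigned long n = (unsigned long)src[i] << 16;
        if (i + 1 < len)
            n |= (unsigned long)src[i + 1] << 8;
        fprintf(out, "%c%c", b64[(n >> 18) & 0x3F], b64[(n >> 12) & 0x3F]);
        fputc((i + 1 < len) ? b64[(n >> 6) & 0x3F] : '=', out);
        fputc('=', out);
    }
}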

MIME-specific Headers

A message conforming to the MIME standard must have a version header of the form

MIME-Version: 1.0

Some mailers are sufficiently picky as to refuse to do MIME processing on mail lacking a valid MIME-Version header. This would be amusing, except for the fact that many mailers either do not implement the MIME functions correctly, produce an illegal MIME-Version header, or fail to insert the MIME-Version header at all.

The only version of the MIME header formats is 1.0. The MIME standard has undergone several revisions and expansions, but the basic format has remained unchanged at version 1.0. These revisions and standards have added new values for some of the parameters and specified interpretations for some ambiguous areas, but the syntax is unchanged. Case is irrelevant, in both the header tags and the values. The style of capitalization used below is more or less conventional, but not required.

One way to protect the content, or at least check that it has not been truncated, is to provide a Content-Length header. This is allowed by the MIME standard. The general type of encoding of the body is stated in the content-transfer-encoding header. The default is

Content-Transfer-Encoding: 7bit

Other allowed values are “quoted-printable” and “base64” (both implicitly 7-bit) and “8bit”.

Next, the content type of the body is specified. In most messages it will be plain text, specified as

Content-Type: text/plain

Other text types commonly found in mail these days are text/enriched and text/html. A forwarded message (with no prefatory comments) may have its content type specified as message/rfc822. Messages can also be multipart. This is commonly used to add multimedia attachments, but can also be used to break up the body into components in different languages.

The MIME standard specifies that the character set is ASCII unless otherwise noted. RFC-822 requires that all headers be ASCII, so the MIME character set specification applies only to the body of the message. This specification is done using the charset parameter of the content type header. The default could be explicitly specified as

Content-Type: text/plain;charset=us-ascii

Note that the optional parameters are specified in keyword=value form. The correct way to specify ASCII is “us-ascii”, because that is the preferred form as registered with the IANA. A list of valid character sets for MIME is at http://www.hunnysoft.com/mime/. Europeans will commonly use

Content-Transfer-Encoding: 8bit
Content-Type: text/plain;charset=iso-8859-1

The Japanese standard for electronic messages is a version of ISO 2022 called ISO-2022-JP. In fact, this encoding needs to be extended only slightly. It can be used for Chinese and Korean as well and even as a multilingual encoding. The extended version is known as ISO-2022-JP-2 or ISO-2022-INT.

MIME-encoded Words: Non-ASCII Text in Headers

The MIME standard also provides a mechanism for putting non-ASCII text in headers. RFC-822 makes this illegal, so use of this mechanism will result in gibberish being displayed by mail programs that do not implement MIME. However, most mail programs today are MIME-aware, so this should not present any problems. If your correspondents complain, tell them to get a MIME-aware mailer.

The mechanism is simple. Non-ASCII text is encoded using either quoted-printable encoding or BASE64 encoding according to convenience, and bundled up into an encoded word. The reason it must be bundled into an encoded word is that the Content-Type header applies to the body, and if the body is multipart, there will be no charset parameter. Using a special header to control the format of headers seems silly, so the encoded word itself will contain the necessary character set information.

The format of an encoded word begins with the characters =?, continues with the name of the character set used, the character ?, either the letter Q (for “quoted printable”) or the letter B (for BASE64), the character ? again, the encoded text, and finally the characters ?=. For example, the French word “voilà” is encoded =?ISO-8859-1?Q?voil=E0?=. Incredibly inefficient, of course, but these will be used only a few times per message. Note that one extra restriction is put on quoted printable encoding, not present in the basic encoding: any question marks in the encoded text must be encoded. Otherwise, the sequence <question mark><encoded octet> would be interpreted as the end of the encoded word.
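
A sketch of how a mailer might build such a Q-encoded word for a Latin-1 header field follows; the character set name is hard-coded only for illustration, and a real mailer would also respect the length limit on encoded words:

#include <stdio.h>

/* Write text as one encoded word using Q encoding.  '=', '?', '_', space
   and every non-ASCII octet are escaped ('_' because it has a reserved
   meaning in headers encoded this way); everything else passes through. */
void q_encoded_word(const unsigned char *text, FILE *out)
{
    fputs("=?ISO-8859-1?Q?", out);
    for (; *text != '\0'; text++) {
        unsigned char c = *text;
        if (c > 0x20 && c < 0x7F && c != '=' && c != '?' && c != '_')
            fputc(c, out);
        else
            fprintf(out, "=%02X", c);
    }
    fputs("?=", out);
}

Called on the Latin-1 string for “voilà”, this produces exactly the encoded word shown above.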

Content Negotiation

As yet, there are no general standards, but HTTP 1.1 is an example of a protocol that provides facilities for the browser and server to negotiate the type of content to be provided. In particular, the browser can automatically specify the language and preferred encoding of content. The server may ignore this, if content in that language is unavailable. This method is certainly more convenient for users than providing links to translations in various languages.
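
For example, a browser might announce its preferences with request headers like these (the values shown are only illustrative):

GET /index.html HTTP/1.1
Host: www.example.org
Accept-Language: ja, en;q=0.5
Accept-Charset: iso-2022-jp, iso-8859-1;q=0.8

and the server, if it can, replies with content labelled to match:

HTTP/1.1 200 OK
Content-Type: text/html; charset=iso-2022-jp
Content-Language: ja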

Another example of content negotiation is provided by the MIME multipart/alternative format. This format allows the same content to be presented in several ways. For example, a mail message can be formatted as both plain text and as HTML. Many UNIX mail user agents do not understand HTML, but Netscape certainly does. This allows “dumb” MUAs (or people who hate HTML e-mail) with a minimal understanding of MIME to read the e-mail as plain text, while those using Netscape to read their mail get the (dubious, in my opinion) benefit of the HTML presentation.
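
A skeleton of such a message looks like this (the boundary string is arbitrary and chosen here purely for illustration):

Content-Type: multipart/alternative; boundary="next-part"

--next-part
Content-Type: text/plain; charset=us-ascii

The message as plain text.

--next-part
Content-Type: text/html; charset=us-ascii

<p>The message as <b>HTML</b>.</p>

--next-part--

The parts are listed in order of increasing preference, so a minimal MUA can simply show the first one and stop.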

Conclusion

These two articles have presented an overview of the principles of internationalization. It hasn't been brief, but it is hardly complete or comprehensive. Linux is now in fairly good shape with respect to the basic facilities for internationalization with the wide dissemination of GNU libc version 2 (usually known on Linux systems as glibc or libc6).

A few issues still remain to be worked out, especially with respect to Asian languages. We can expect the standards to become more comprehensive over time. For example, locales may deal with line wrapping conventions, or the locale model may be extended to support multilingual applications directly.

However, the main effort today must be on the part of applications programmers and multilingual volunteers. Applications programmers need to use the POSIX locale facilities and GNU gettext to internationalize their programs. Multilingual volunteers should join the GNU translation project and help translate message catalogs for their favorite programs.


Stephen Turnbull (turnbull@sk.tsukuba.ac.jp) is an economist teaching and researching in Japan. He is excited about the way the Open-Source movement is turning conventional economics on its head, but is too busy playing with his Linux systems to do much economic analysis.
