LSB
This is a writeup prepared in response to a request from the Linux Standard Base (LSB) to help draft a specification for Indian language support on Linux desktops. It has also been posted to their mailing lists. I am not yet sure which list is the appropriate one to discuss this on, but it is probably lsb-wg@freestandards.org.
Refer to IndicTesting also.
The essential requirements for supporting an Indian (or other Unicode) language in a Linux distribution are:
Font
- The font encoding should be Unicode (ISO-10646). The Unicode version supported should also be specified.
- For conjuncts used in Indian languages, the font will typically be an OpenType one. While it is theoretically possible to have a Unicode-compliant TrueType font for Indian languages, it is much more difficult to create and maintain than an OpenType one, and I am not aware of the existence of any such TrueType font.
- Ideally, the specification should include a minimal set of glyphs (possibly, along with OpenType rules) that provide complete coverage of the letters and conjuncts in the language. However, this might reasonably be considered a quality of implementation issue, rather than something to be standardized by the LSB.
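To illustrate why conjuncts depend on font rules rather than on dedicated code points, here is a minimal Python sketch. It assumes the Devanagari conjunct "kssa" as an example (the choice of conjunct and the helper name `describe` are mine, not part of any specification). It lists the code points that make up the conjunct and reports the Unicode version known to the Python runtime, which is one way to check the "Unicode version supported" mentioned above.

```python
import unicodedata

# Assumed example: the Devanagari conjunct "kssa" is stored as three
# code points (consonant + virama + consonant), not as a single one.
CONJUNCT_KSSA = "\u0915\u094d\u0937"


def describe(text):
    """Return (code point, Unicode character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


if __name__ == "__main__":
    # Version of the Unicode character database bundled with this Python.
    print("Unicode data version:", unicodedata.unidata_version)
    for code_point, name in describe(CONJUNCT_KSSA):
        print(code_point, name)
```

Since the text holds only the three-code-point sequence, drawing it as one conjunct glyph is left to the font and the renderer, which is why the glyph set and the OpenType rules matter.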
Locale data
(glibc, X, GNOME, OpenOffice, and Unicode CLDR)
- The glibc locale includes data specific to the language and/or culture, such as month and day names, time and date representations, monetary symbols, and number groupings. Perhaps the most important of these is the collation sequence, which defines how characters in the language are sorted. For languages where a legally mandated sorting standard exists, the specification could require adherence to that standard. As these locale data are part of glibc, it might be sufficient to specify the version of glibc in use in the distribution. There is a proposal to include a broader set of such culturally sensitive data as part of GNOME, and similarly, openoffice.org maintains its own locale information. Ideally, all of these disparate locale data will eventually be subsumed into the Common Locale Data Repository (CLDR) maintained by the Unicode Consortium. Until then, as with the glibc locales, the LSB could specify the versions of the GNOME and OpenOffice locale data and of the Unicode CLDR.
- X locales contain some information about the encoding used by fonts in the language, though the format does not seem to be publicly documented. Currently, many Indian locales are not available in X, the symptom being a warning message at the time an X application is started from within such a locale. However, this does not seem to affect the functioning of the program. For the purposes of the LSB, specifying the version of X should suffice.
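As a rough illustration of how an application picks up the glibc collation sequence, the following Python sketch sorts a word list using the standard `locale` module. The locale name `hi_IN.UTF-8` and the helper `sort_for_locale` are assumptions made for the example. If the named locale is not installed, the sketch falls back to plain code-point order, which is generally not the culturally expected order.

```python
import locale


def sort_for_locale(words, locale_name):
    """Sort words by the collation rules of locale_name, if it is installed.

    Falls back to code-point order when the locale is unavailable.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # Locale not installed: code-point order is the only option left.
        return sorted(words)
    try:
        # strxfrm transforms each string so that ordinary comparison
        # follows the locale's collation sequence.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    # Three Devanagari words, written as escapes to keep the source ASCII.
    words = ["\u0916\u0917", "\u0915\u092e\u0932", "\u0906\u092e"]
    print(sort_for_locale(words, "hi_IN.UTF-8"))  # assumed locale name
```

The result depends on the locale data shipped with the installed glibc, which is the reason for pinning a glibc version in the specification.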
Keyboard input methods
(xkb, SCIM, UIM, IIIMF, application-specific)
These allow for text entry in Indian languages from a standard keyboard. Several such keymappings exist, the two best known being Inscript, formalized by the Indian government, and the pseudo-phonetic ITRANS, which is a de facto standard. Keyboard input in Linux is handled at various levels, from xkb, which is tied to X at a very low level, to specific mappings that work only with a particular application. Methods like SCIM, UIM, and IIIMF operate at an intermediate level, using the X Input Method (XIM).
- xkb will work with all X-based applications, but has the limitation that it only allows a one-to-one mapping, whereas a many-to-many mapping is needed for inputting Indian language text with a phonetic scheme. Nevertheless, because xkb is supported at such a low level, it is important that at least an Inscript xkb map be supplied for each language.
- SCIM, UIM, IIIMF, etc., probably represent the future of keyboard input methods in X; my personal opinion is that SCIM is going to become the most widely accepted of them. They do share some commonality, and it is possible to make a keymap using the m17n library that will work with all of these input methods. At a minimum, Inscript and ITRANS keymaps should be supplied for each language. Presently, SCIM comes with built-in ITRANS maps for Indian languages and will soon include our Inscript maps, so specifying support for a particular version of SCIM should suffice.
- In general, it is asking too much to standardize application-specific keymapping schemes, but one application that should be considered for standardization is yudit. yudit is perhaps the best at handling Unicode text, and is cross-platform. Inscript and ITRANS keymaps for most Indian languages exist for yudit, and after these are rolled into the yudit distribution, one could specify at least that particular release of yudit.
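The difference between a one-to-one keymap and the many-to-many mapping needed for a phonetic scheme can be sketched in a few lines of Python. The tables below are a small illustrative subset in the style of ITRANS for Devanagari, chosen by me for the example. They are not the complete scheme, and the function names are hypothetical. Several keystrokes can produce one character ("kh"), and a single keystroke can produce two code points (a consonant followed by a virama when no vowel follows).

```python
# Illustrative subset only; not the full ITRANS scheme.
CONSONANTS = {"kh": "\u0916", "k": "\u0915", "g": "\u0917",
              "m": "\u092e", "l": "\u0932", "n": "\u0928"}
# Vowel signs used after a consonant; "a" is the inherent vowel (no sign).
MATRAS = {"aa": "\u093e", "ii": "\u0940", "a": "", "i": "\u093f"}
# Independent vowels used when no consonant precedes them.
VOWELS = {"aa": "\u0906", "ii": "\u0908", "a": "\u0905", "i": "\u0907"}
VIRAMA = "\u094d"


def longest_match(text, pos, table):
    """Return the longest key of table that starts at text[pos], or None."""
    for length in (2, 1):
        key = text[pos:pos + length]
        if key in table:
            return key
    return None


def transliterate(text):
    """Convert a Latin keystroke sequence to Devanagari (sketch)."""
    out = []
    pos = 0
    while pos < len(text):
        key = longest_match(text, pos, CONSONANTS)
        if key is not None:
            out.append(CONSONANTS[key])
            pos += len(key)
            vowel = longest_match(text, pos, MATRAS)
            if vowel is not None:
                out.append(MATRAS[vowel])
                pos += len(vowel)
            else:
                # No vowel follows: suppress the inherent vowel.
                out.append(VIRAMA)
            continue
        key = longest_match(text, pos, VOWELS)
        if key is not None:
            out.append(VOWELS[key])
            pos += len(key)
            continue
        out.append(text[pos])  # pass unknown keystrokes through unchanged
        pos += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("kamala"))
    print(transliterate("nagma"))
```

A one-to-one map such as an xkb layout cannot express the look-ahead used here, which is why phonetic schemes are handled by input methods at a higher level.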
Font rendering
(X, GNOME: GTK/Pango, KDE: QT, OpenOffice: ICU, m17n, application-specific)
As the base X windowing system did not support Unicode for a long time, various desktops and application suites rolled their own rendering, as noted above. I am not sure of the current status of OpenType and Unicode font rendering in X, though there appear to be some functions for rendering UTF-8 input. Renderers like Pango, QT, ICU, m17n, etc., have exhibited bugs in rendering at least some Indian languages, so the version of each renderer that must be supported needs to be specified.
Printing support
(from within GNOME, KDE, OpenOffice, Mozilla/Firefox)
For the most part, printing in Indian languages from GNOME, KDE, and OpenOffice works fine, but printing from the Mozilla family of browsers has problems. These are in the process of being fixed, so a particular version of the browser should be specified. Parenthetically, it should be noted that Mozilla and Firefox currently need to be compiled specifically with Indian language support in order to display Indian text properly, and such support should be required.
System support for Unicode
- Linux installer: I am not aware of any installer that currently supports Unicode, though there is ongoing work on such a graphical installer for the next version of Debian.
- System utilities: Many standard system utilities such as tr, sed, sort, uniq, etc., do a poor job of supporting Unicode. Likewise, glibc has had bugs related to Unicode support, notably in tools for processing language data. Specifying
- Applications: Unicode support for databases, editors, etc. Again, it might be too cumbersome to standardize specific applications, but it could be possible to mandate Unicode support in some important classes of applications, such as databases, office suites, Internet browsers, etc.
- Programming languages (C, C++, Python, Perl, etc.): There are two issues here: whether the language allows the handling of Unicode text, and whether one can use Unicode text in a program written in that language. Many languages support the former, but not the latter. As support for the handling of Unicode text was typically added at some point after the initial standardization of the language, support for at least a certain version of the language standard should be specified.
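The points above about byte-oriented utilities and about the two kinds of language support can be made concrete with a short Python 3 sketch. The sample word and the helper names are assumptions for illustration. It shows that a tool counting or cutting bytes sees a different length than one counting code points, that cutting in the middle of a multi-byte sequence yields invalid UTF-8, and that Python 3 accepts a Devanagari word as an identifier.

```python
# Sample word (the Devanagari spelling of "Hindi"), written as escapes.
WORD = "\u0939\u093f\u0928\u094d\u0926\u0940"


def byte_truncate(text, n):
    """Cut the UTF-8 encoding of text after n bytes, as a byte-based tool would."""
    return text.encode("utf-8")[:n]


def is_valid_utf8(data):
    """Report whether data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # Handling Unicode text: code points versus bytes.
    print(len(WORD), "code points,", len(WORD.encode("utf-8")), "bytes")
    # Each character here takes three bytes, so a cut at byte 4 splits one.
    print(is_valid_utf8(byte_truncate(WORD, 3)))
    print(is_valid_utf8(byte_truncate(WORD, 4)))
    # Using Unicode text in the program itself: identifier check.
    print("\u0928\u093e\u092e".isidentifier())
```

The same checks can serve as a quick smoke test when deciding which language version to name in a specification.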