CollationData

From IndLinux

Compiling collation data for a locale.

A collation sequence determines the ordering of characters in a locale. A simple, implicit way to define it is through the encoding: the order follows the numerical order of the code points, e.g. in ASCII A = 65, B = 66, C = 67, etc.
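As a small illustration of code point ordering, here is a sketch in Python, whose default string comparison goes by Unicode code point (identical to ASCII for these characters). The word list is made up for the example.

```python
# Code point sorting: Python compares strings by Unicode code point,
# which matches the ASCII values for these characters.
words = ["banana", "Cherry", "apple", "Banana"]  # sample data, made up

# Numeric code points of a few letters: A = 65, B = 66, C = 67, a = 97, ...
code_points = {ch: ord(ch) for ch in "ABCabc"}

# With no key function, sorted() orders purely by code point,
# so every uppercase letter sorts before every lowercase letter.
by_code_point = sorted(words)

print(code_points)
print(by_code_point)
```

Note that "Cherry" lands before "apple" only because C (67) is numerically smaller than a (97), not because of any language rule.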

Collation data is used by sort and search routines, and is therefore vital for their efficient operation. Code point sorting is good enough for simple scripts such as the Latin script used by European languages, where the number of characters is small. The disadvantage of code point sorting is that the order is fixed forever: if the encoded script is used by multiple languages that have different sorting rules, it becomes difficult to accommodate them. Since many scripts are shared across regions and languages, it is imperative that the collation sequence be independent of the encoding.
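The point about one script serving several sort orders can be sketched with a table-driven sort key. This is a minimal illustration, not how glibc implements collation: `make_sort_key`, the two tables, and the word list are all hypothetical. The first table follows the Swedish alphabet, where å, ä, ö come after z; the second is a simplified dictionary-style table that folds accented letters onto their base letters.

```python
def make_sort_key(alphabet, equivalents=None):
    """Build a sort key function from an explicit character order.

    alphabet:    string listing characters in their collation order
    equivalents: optional map of characters to the character they sort as
    (Hypothetical helper for illustration only.)
    """
    equivalents = equivalents or {}
    rank = {ch: i for i, ch in enumerate(alphabet)}

    def key(word):
        ranks = []
        for ch in word.lower():
            ch = equivalents.get(ch, ch)
            # Characters missing from the table sort after all listed ones,
            # falling back to code point order among themselves.
            ranks.append(rank.get(ch, len(alphabet) + ord(ch)))
        return ranks

    return key


BASE = "abcdefghijklmnopqrstuvwxyz"

# Table 1: Swedish alphabet order, with å, ä, ö after z.
swedish_key = make_sort_key(BASE + "åäö")

# Table 2: simplified folding table (an assumption for this sketch):
# accented letters sort as their base letters.
folded_key = make_sort_key(BASE, equivalents={"å": "a", "ä": "a", "ö": "o"})

words = ["zon", "öl", "ost", "äng", "ål"]  # sample data, made up

print(sorted(words))                   # code point order
print(sorted(words, key=swedish_key))  # order from table 1
print(sorted(words, key=folded_key))   # order from table 2
```

The same encoded strings yield three different orders. Code point order puts "äng" before "ål" only because ä (228) precedes å (229) in the encoding, whereas each table expresses a language's own rule without touching the encoding.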


glibc locale status
As of now, only Tamil (ta_IN), Assamese (as_IN) and Oriya (or_IN) have a custom collation sequence defined in their locale. All other locales copy the iso14651_t1 table, which defines sorting for scripts other than the Indic ones (Latin, Arabic, etc.). An extended table exists but is not used.

A draft sort order defined by Gora Mohanty is available here: indicsort.tar.gz

BengaliSort

GujaratiSort

HindiSort

KannadaSort

MalayalamSort

MarathiSort

PunjabiSort

TamilSort

TeluguSort

TibetanSort