Sorting (String Ordering)

The process of arranging strings in a specific order such as dictionary order, Unicode code-point order, or locale-dependent order. Because the "correct" order varies by language and culture, configuring the collation sequence is critical for internationalization.

Sorting is the process of arranging a list of strings in a defined order. It is used everywhere text data appears: file-name listings, contact names, dictionary headwords, and search-result rankings. However, the "correct" order differs by language and culture, making internationalized sorting far more complex than it first appears.

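The simplest sort is by Unicode code-point order. JavaScript's Array.sort() does something very close to this by default: it converts elements to strings and compares them by UTF-16 code unit, which matches code-point order for every character in the Basic Multilingual Plane. Under this ordering, however, "Z" (U+005A) comes before "a" (U+0061), producing counterintuitive results when uppercase and lowercase letters are mixed. Numeric strings are also sorted lexicographically rather than numerically, so "1, 2, 3, 10, 20" comes out as "1, 10, 2, 20, 3".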
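As a rough illustration of the behavior described above, the following sketch runs the default sort on mixed-case words and numeric strings. The sample arrays and the compareByCodePoint helper are made up for this example and are not part of any library; the helper is only one possible way to obtain true code-point order for characters outside the Basic Multilingual Plane.

```javascript
// Default sort: elements are compared as strings, by UTF-16 code unit.
const words = ["banana", "apple", "Zebra"];
const defaultSorted = [...words].sort();
console.log(defaultSorted); // [ 'Zebra', 'apple', 'banana' ]

const numericStrings = ["3", "20", "2", "10", "1"];
const lexicographic = [...numericStrings].sort();
console.log(lexicographic); // [ '1', '10', '2', '20', '3' ]

// An explicit comparator restores numeric order.
const numeric = [...numericStrings].sort((a, b) => Number(a) - Number(b));
console.log(numeric); // [ '1', '2', '3', '10', '20' ]

// Hypothetical helper: compare two strings by code point instead of code unit.
function compareByCodePoint(a, b) {
  const as = Array.from(a); // Array.from splits a string into code points
  const bs = Array.from(b);
  const n = Math.min(as.length, bs.length);
  for (let i = 0; i < n; i++) {
    const diff = as[i].codePointAt(0) - bs[i].codePointAt(0);
    if (diff !== 0) return diff;
  }
  return as.length - bs.length;
}

// U+FF5E (fullwidth tilde) vs. U+1F600 (an emoji stored as a surrogate pair).
const symbols = ["\uFF5E", "\u{1F600}"];
const byCodeUnit = [...symbols].sort();
const byCodePoint = [...symbols].sort(compareByCodePoint);
console.log(byCodeUnit[0] === "\u{1F600}");  // true: surrogate 0xD83D < 0xFF5E
console.log(byCodePoint[0] === "\uFF5E");    // true: 0xFF5E < 0x1F600
```

The two orderings differ only for characters above U+FFFF, which is why the distinction rarely shows up in everyday text.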

Japanese sorting is particularly complex. There are multiple ordering schemes for kanji: by on'yomi (Sino-Japanese reading), by kun'yomi (native reading), by radical and stroke count, and by JIS code. Phone directories use the gojuon (fifty-sound) order of readings, kanji dictionaries use radical-stroke order, and JIS standards use kuten code order. Because the same kanji can have multiple readings (e.g., "生" can be read as "sei," "shō," "nama," or "i(kiru)"), the reading cannot be derived from the written form alone, so reading-based sorting requires furigana data.
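Reading-based sorting can be sketched as follows, assuming each record already carries its furigana. The contacts array and its name and reading fields are hypothetical and exist only for this example; the sketch sorts on the reading and displays the kanji.

```javascript
// Hypothetical contact records: `reading` holds the furigana in hiragana.
const contacts = [
  { name: "田中", reading: "たなか" },
  { name: "佐藤", reading: "さとう" },
  { name: "阿部", reading: "あべ" },
  { name: "鈴木", reading: "すずき" },
];

// Compare the readings, not the kanji, using Japanese collation rules.
const jaCollator = new Intl.Collator("ja");
const byReading = [...contacts].sort((a, b) =>
  jaCollator.compare(a.reading, b.reading)
);

const sortedNames = byReading.map((c) => c.name);
console.log(sortedNames); // [ '阿部', '佐藤', '鈴木', '田中' ]
```

Sorting the name field directly would order the entries by the kanji themselves, which generally does not match gojuon order.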

The ICU (International Components for Unicode) library is the de facto industry standard for locale-aware sorting. The major JavaScript engines implement Intl.Collator on top of ICU; new Intl.Collator('ja').compare(a, b) compares two strings according to Japanese collation rules, so it can be passed to Array.sort() to produce a natural Japanese order. In German, whether "ä" is sorted alongside "a" (dictionary order) or treated as "ae" (phone book order) depends on the locale and collation variant, and Intl.Collator handles such language-specific rules correctly.
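The following sketch shows how locale and collation variant change the result. It assumes a runtime with full ICU locale data (the default in current Node.js releases and browsers); the sample words are chosen for illustration, and "de-u-co-phonebk" is the locale tag that requests German phone book collation.

```javascript
// The same letters sort differently in German and Swedish.
const letters = ["Z", "a", "z", "ä"];
const german = [...letters].sort(new Intl.Collator("de").compare);
const swedish = [...letters].sort(new Intl.Collator("sv").compare);
console.log(german);  // [ 'a', 'ä', 'z', 'Z' ]
console.log(swedish); // [ 'a', 'z', 'Z', 'ä' ]

// German dictionary order vs. phone book order ("ä" treated as "ae").
const germanWords = ["Afrika", "Ärzte", "Adler"];
const dictionaryOrder = [...germanWords].sort(
  new Intl.Collator("de").compare
);
const phonebookOrder = [...germanWords].sort(
  new Intl.Collator("de-u-co-phonebk").compare
);
console.log(dictionaryOrder); // [ 'Adler', 'Afrika', 'Ärzte' ]
console.log(phonebookOrder);  // [ 'Adler', 'Ärzte', 'Afrika' ]

// Japanese collation interleaves hiragana and katakana in gojuon order,
// whereas the default sort puts all hiragana before all katakana.
const kana = ["さくら", "カエデ", "あおい"];
const codeUnitOrder = [...kana].sort();
const japaneseOrder = [...kana].sort(new Intl.Collator("ja").compare);
console.log(codeUnitOrder); // [ 'あおい', 'さくら', 'カエデ' ]
console.log(japaneseOrder); // [ 'あおい', 'カエデ', 'さくら' ]
```

Creating the collator once and reusing its compare function is usually cheaper than calling localeCompare() on every comparison.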

Natural sort order interprets numbers within strings as numeric values. Sorting "file1, file2, file10" lexicographically yields "file1, file10, file2," but natural sort produces "file1, file2, file10." This approach is used for filenames and version numbers to deliver intuitive results.
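One way to get natural sort order in JavaScript, without writing a custom parser, is the numeric option of Intl.Collator. The sketch below is a minimal example with made-up file names and version strings; it assumes plain dotted version numbers and does not attempt to handle pre-release tags or other versioning conventions.

```javascript
const files = ["file10", "file2", "file1"];

// Default sort compares character by character, so "10" lands before "2".
const lexicographicFiles = [...files].sort();
console.log(lexicographicFiles); // [ 'file1', 'file10', 'file2' ]

// With numeric: true, runs of digits are compared as numbers.
const naturalCollator = new Intl.Collator("en", { numeric: true });
const naturalFiles = [...files].sort(naturalCollator.compare);
console.log(naturalFiles); // [ 'file1', 'file2', 'file10' ]

// The same collator works for simple dotted version numbers.
const versions = ["1.10.0", "1.2.0", "1.9.0"];
const sortedVersions = [...versions].sort(naturalCollator.compare);
console.log(sortedVersions); // [ '1.2.0', '1.9.0', '1.10.0' ]
```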

Sorting also interacts with character counting, because generating sort keys often requires string normalization. Unifying full-width and half-width characters, handling dakuten and handakuten, and normalizing letter case can all change the string's character count. In practice, the original (pre-normalization) string is displayed to the user, while the sort key (post-normalization) is used only internally.
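The separation between display string and sort key can be sketched as follows. The sortKey and sortByKey functions are hypothetical helpers written for this example; NFKC normalization followed by lowercasing is just one plausible choice of key, not a prescribed one.

```javascript
// Hypothetical sort key: NFKC folds full-width/half-width variants and
// composes dakuten; toLowerCase() folds letter case.
function sortKey(s) {
  return s.normalize("NFKC").toLowerCase();
}

// Normalization can change the length in either direction.
const halfWidthGa = "\uFF76\uFF9E"; // half-width "ka" + half-width dakuten
console.log(halfWidthGa.length);          // 2
console.log(sortKey(halfWidthGa));        // ガ
console.log(sortKey(halfWidthGa).length); // 1

const squareCorp = "\u337F"; // single "square corporation" character
console.log(squareCorp.length);           // 1
console.log(sortKey(squareCorp).length);  // 4 (株式会社)

// Sort on the key, but return the original strings for display.
function sortByKey(items) {
  return items
    .map((original) => ({ original, key: sortKey(original) }))
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0))
    .map((entry) => entry.original);
}

const sortedLabels = sortByKey(["abd", "ＡＢＣ", "ABB"]);
console.log(sortedLabels); // [ 'ABB', 'ＡＢＣ', 'abd' ]
```

Because the key is computed once per item and then discarded, the displayed strings and their character counts remain exactly as the user entered them.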
