Unicode Normalization

The process of unifying different representations of the same character. Four forms exist: NFC, NFD, NFKC, and NFKD.

Unicode normalization is the process of converting different code point sequences that represent the same text into a single, unified form. For example, the Japanese character "ga" (が) can be represented either as a single precomposed code point (U+304C) or as "ka" (か, U+304B) followed by a combining dakuten (U+3099).

There are four normalization forms: NFC (canonical composition), NFD (canonical decomposition), NFKC (compatibility composition), and NFKD (compatibility decomposition). The compatibility forms additionally fold formatting variants, such as half-width katakana, into their standard equivalents. NFC is recommended for the web, and JavaScript provides String.prototype.normalize() for conversion. Unicode text processing books cover normalization in detail.
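The ga example above can be checked directly with String.prototype.normalize(); a minimal sketch:

```javascript
const precomposed = "\u304C";       // "が" as a single code point
const decomposed = "\u304B\u3099";  // "か" + combining dakuten

// The two sequences render identically but are not equal as strings.
console.log(precomposed === decomposed);                  // false
console.log(precomposed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === precomposed); // true

// Compatibility forms also fold formatting variants:
// half-width "ｶ" (U+FF76) becomes full-width "カ" (U+30AB) under NFKC.
console.log("\uFF76".normalize("NFKC")); // "カ"
```

normalize() defaults to NFC when called with no argument, which matches the form recommended for the web.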

Without normalization, visually identical strings may fail equality comparisons. This causes issues in database searches and filename comparisons.

macOS file systems handle filenames in decomposed form: HFS+ stores them in a variant of NFD, and APFS compares names normalization-insensitively. This can cause filename mismatches when files are exchanged with operating systems that use NFC. Internationalization programming books explain practical normalization techniques.