What Is UTF-8 - Character Counter

UTF-8

A variable-length Unicode encoding. It is the dominant character encoding on the web, used by more than 98% of websites.

UTF-8 is a variable-length character encoding that implements Unicode. It represents each character in 1 to 4 bytes, staying fully compatible with ASCII while covering every Unicode character. As of 2024, over 98% of web pages use UTF-8, making it the de facto standard character encoding on the internet.

The number of bytes UTF-8 uses depends on the character. ASCII (U+0000-U+007F) takes 1 byte; extended Latin and Cyrillic (U+0080-U+07FF) take 2 bytes; most Japanese, Chinese, and Korean characters, including kanji, hiragana, and katakana (U+0800-U+FFFF), take 3 bytes; and emoji and some CJK characters (U+10000-U+10FFFF) take 4 bytes. This variable-length design lets English text be stored at ASCII size while still representing multilingual text efficiently. Introductory books on web technology treat UTF-8 as essential knowledge.
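The ranges above can be checked with a short Python sketch. The helper name `utf8_length` is an illustrative assumption, not part of any library; it simply maps a code point to the byte count given in the text and cross-checks the result against Python's built-in encoder.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 uses for a single code point."""
    cp = ord(ch)
    if cp <= 0x7F:      # ASCII
        return 1
    if cp <= 0x7FF:     # extended Latin, Cyrillic, ...
        return 2
    if cp <= 0xFFFF:    # most CJK, hiragana, katakana
        return 3
    return 4            # emoji and other characters above U+FFFF


for ch in ["A", "é", "Я", "あ", "漢", "😀"]:
    # Cross-check the range table against Python's own UTF-8 encoder.
    assert utf8_length(ch) == len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} {ch!r}: {utf8_length(ch)} byte(s)")
```

The sample characters are arbitrary; any non-surrogate code point should satisfy the same check.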

UTF-8 became the web standard for several reasons. Because it is ASCII-compatible, existing ASCII-based protocols (HTTP, SMTP, URLs) and tools keep working unchanged. It has no byte order (endianness) issues and does not require a BOM (Byte Order Mark). It is also self-synchronizing: character boundaries can be found starting from any point in the byte stream, which limits the damage caused by corrupted or truncated data.
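Self-synchronization follows from the bit layout: every continuation byte in UTF-8 has the form 10xxxxxx, while the first byte of a character never does. A minimal sketch, assuming a hypothetical helper `char_start` and valid UTF-8 input, steps backwards from an arbitrary offset to the start of the character that contains it.

```python
def char_start(data: bytes, i: int) -> int:
    """Step back from offset i to the first byte of the character containing it."""
    # Continuation bytes match the bit pattern 10xxxxxx.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i


data = "aあ😀".encode("utf-8")  # 1 + 3 + 4 = 8 bytes

print(char_start(data, 0))  # 0: 'a' starts at offset 0
print(char_start(data, 3))  # 1: offset 3 is the last byte of 'あ'
print(char_start(data, 6))  # 4: offset 6 is inside '😀'
```

At most three steps back are ever needed, since a character has at most three continuation bytes.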

In HTML, the character encoding is declared with <meta charset="UTF-8">. Without this declaration, browsers may guess the encoding incorrectly and display garbled text. The encoding can also be specified in the HTTP response header Content-Type: text/html; charset=utf-8, and specifying both is best practice.
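The garbling that results from a wrong guess can be reproduced in a few lines of Python. This is only an illustration of the effect, assuming the reader misinterprets UTF-8 bytes as Latin-1; actual browser detection logic is more involved.

```python
text = "é"
raw = text.encode("utf-8")      # two bytes: 0xC3 0xA9

wrong = raw.decode("latin-1")   # each byte is read as a separate character
right = raw.decode("utf-8")     # the declared, correct encoding

print(wrong)  # Ã©
print(right)  # é
```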

Handling UTF-8 in databases requires care. MySQL's utf8 character set has historically supported at most 3 bytes per character, so it cannot store 4-byte characters such as emoji. To store every Unicode character, including emoji, use utf8mb4. PostgreSQL's UTF8 has supported 4-byte characters from the start. Books on character encoding cover this in more detail.
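One way to spot text that would not fit a 3-byte character set is to look for code points above U+FFFF, since those are the ones UTF-8 encodes in 4 bytes. The function name `needs_utf8mb4` below is a hypothetical helper for illustration, not a MySQL or driver API.

```python
def needs_utf8mb4(text: str) -> bool:
    """Return True if any character in text takes 4 bytes in UTF-8."""
    # Code points above U+FFFF are exactly the 4-byte UTF-8 characters.
    return any(ord(ch) > 0xFFFF for ch in text)


print(needs_utf8mb4("Hello"))       # False
print(needs_utf8mb4("こんにちは"))    # False: 3 bytes per character
print(needs_utf8mb4("Hello 😀"))    # True: the emoji needs 4 bytes
```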

For character counting, it is important to understand the difference between the UTF-8 byte count and the character count. "Hello" is 5 characters and 5 bytes in UTF-8, but "こんにちは" is 5 characters yet 15 bytes (3 bytes × 5 characters). Because database column sizes and file size limits are often specified in bytes, you need to know both the character count and the byte count. Character counting tools that display both help users work within either kind of limit.
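The two counts from the example above can be computed side by side. In this sketch the function name `count_chars_and_bytes` is an assumption made for illustration; note that Python's `len` counts Unicode code points, which may differ from what a user perceives as one character when combining marks or emoji sequences are involved.

```python
def count_chars_and_bytes(text: str) -> tuple[int, int]:
    """Return (code point count, UTF-8 byte count) for text."""
    return len(text), len(text.encode("utf-8"))


for sample in ["Hello", "こんにちは"]:
    chars, size = count_chars_and_bytes(sample)
    print(f"{sample}: {chars} characters, {size} bytes")
# Hello: 5 characters, 5 bytes
# こんにちは: 5 characters, 15 bytes
```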
