Control Character

Special characters that are not displayed on screen but instruct how text should be processed. Examples include line feed (LF), tab (HT), carriage return (CR), and null (NUL).

Control characters do not render visually; instead they direct how text is processed or displayed. The first 32 characters of ASCII (U+0000 through U+001F, the C0 controls) and DEL (U+007F) are classified as control characters. Unicode adds a second block, the C1 controls, at U+0080 through U+009F. All 65 of these code points share the Unicode general category Cc.
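As a minimal sketch of the ranges above, Python's standard unicodedata module can report a character's general category, and control characters are the ones labeled Cc. The helper name is_control is an illustrative choice, not part of any library.

```python
import unicodedata


def is_control(ch: str) -> bool:
    """Return True if ch belongs to Unicode general category Cc (control)."""
    return unicodedata.category(ch) == "Cc"


# Scan U+0000 through U+00FF and keep the code points classified as controls.
control_points = [cp for cp in range(0x100) if is_control(chr(cp))]

# Print the contiguous ranges so they can be compared with the text above.
print(f"U+{control_points[0]:04X} .. U+{control_points[31]:04X}")
print(f"U+{control_points[32]:04X}")
print(f"U+{control_points[33]:04X} .. U+{control_points[-1]:04X}")
```

The scan finds 65 code points: the 32 C0 controls, DEL, and the 32 C1 controls.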

Only a handful of control characters see everyday use. LF (Line Feed, U+000A) represents a newline and is the standard line ending on Unix-like systems such as Linux and macOS. CR (Carriage Return, U+000D) originally moved the cursor to the beginning of the line; Windows uses the two-character sequence CR+LF for line breaks. HT (Horizontal Tab, U+0009) is the tab character used for indentation and column alignment. NUL (U+0000) is the null character, used in C to mark the end of a string.
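To see which line-ending convention a piece of text uses, one option is to count each kind of line break separately. The sketch below is illustrative only; count_line_endings is a hypothetical helper name, and it treats a CR+LF pair as one line break rather than as a CR plus an LF.

```python
def count_line_endings(text: str) -> dict:
    """Count CR+LF pairs, lone LF and lone CR line breaks in text."""
    crlf = text.count("\r\n")
    # Every CR+LF pair contains one LF and one CR, so subtract the pairs
    # to get the number of characters that stand alone.
    lone_lf = text.count("\n") - crlf
    lone_cr = text.count("\r") - crlf
    return {"CRLF": crlf, "LF": lone_lf, "CR": lone_cr}


sample = "first\r\nsecond\nthird\tcolumn\n"
print(count_line_endings(sample))
print([hex(ord(ch)) for ch in "\n\r\t\0"])  # code points of LF, CR, HT, NUL
```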

Control characters are a common source of confusion in character counting. When you press Enter in a text editor, the display simply moves to a new line, but internally either one character (LF) or two (CR+LF) is inserted. In ASCII or UTF-8 that is 1 or 2 bytes; in UTF-16 it is 2 or 4. A text file created on Windows will therefore have a higher character count than the same content created on Unix, because each line break contributes two characters (CR+LF) instead of one (LF).
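The difference is easy to reproduce. The sketch below builds the same three-line text with both conventions and compares the counts; normalize_newlines is a hypothetical helper showing one way to make counts comparable, by converting every line break to LF before counting.

```python
def normalize_newlines(text: str) -> str:
    """Convert CR+LF and lone CR line breaks to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


unix_text = "one\ntwo\nthree\n"
windows_text = unix_text.replace("\n", "\r\n")

unix_count = len(unix_text)        # 11 letters + 3 LF
windows_count = len(windows_text)  # 11 letters + 3 CR + 3 LF
normalized_count = len(normalize_newlines(windows_text))

# Byte counts depend on the encoding, not only on the character count.
utf8_bytes = len(windows_text.encode("utf-8"))
utf16_bytes = len(windows_text.encode("utf-16-le"))

print(unix_count, windows_count, normalized_count, utf8_bytes, utf16_bytes)
```

Whether a counter should normalize line breaks first is a design decision; the point is that the rule needs to be stated explicitly.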

In the web context, HTML collapses consecutive whitespace characters (spaces, tabs, newlines) into a single space for display. This means the character count in HTML source code does not match the character count of the rendered text. The <pre> element or CSS white-space: pre preserves spaces, tabs, and line breaks exactly as written.

From a security perspective, control characters can be exploited in injection attacks. HTTP header injection inserts CR+LF into a header value to end that header early and append attacker-controlled headers, and log injection inserts newlines to forge log entries. Stripping or escaping control characters in user input before it reaches headers, logs, or other line-oriented formats is a fundamental security practice.
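A minimal sketch of such a filter, assuming the input is already a decoded string, drops every character in Unicode category Cc. The function name strip_control_chars and its keep parameter are illustrative choices, and a real application may prefer to reject or escape such input rather than silently remove characters.

```python
import unicodedata


def strip_control_chars(value: str, keep: str = "") -> str:
    """Remove category Cc characters from value, except those listed in keep."""
    return "".join(
        ch for ch in value
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


# An attempted header injection: CR+LF followed by a forged header line.
user_input = "admin\r\nSet-Cookie: session=stolen"
safe_value = strip_control_chars(user_input)
print(repr(safe_value))

# Tabs can be allowed through when the target format permits them.
print(repr(strip_control_chars("col1\tcol2\x00", keep="\t")))
```

With CR and LF removed, the forged text stays on the same line as the original value instead of starting a new header or log entry.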

Unicode also defines a category called "format characters" (general category Cf) that is distinct from control characters (Cc). Zero-width space (U+200B), zero-width joiner (U+200D), and the directional marks used in bidirectional text, left-to-right mark (U+200E) and right-to-left mark (U+200F), fall into this group. Like control characters, they are invisible, but they influence how text is displayed, so their treatment in character counting must be clearly defined.
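One way to make that treatment explicit is to count by general category. The sketch below assumes a counting rule that skips both Cc and Cf characters; that rule and the name count_visible are assumptions for illustration, not an established standard.

```python
import unicodedata


def count_visible(text: str, ignored=("Cc", "Cf")) -> int:
    """Count characters whose general category is not in ignored."""
    return sum(1 for ch in text if unicodedata.category(ch) not in ignored)


# "ab" with a zero-width space between the letters and a zero-width joiner after.
text = "a\u200bb\u200d"
print(len(text), count_visible(text))

# The categories can be inspected directly.
print([unicodedata.category(ch) for ch in "\u200b\u200d\u200e\u200f\n"])
```

Here len reports 4 code points while the category-based count reports 2. Note that a zero-width joiner can change how neighboring characters render, for example in emoji sequences, so ignoring it affects the count but not necessarily the meaning.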
