UTF-8
Definition
UTF-8 is a variable-length character encoding for Unicode. It represents each character using 1 to 4 bytes and is the dominant encoding on the web (used by ~98% of websites). UTF-8 is backward compatible with ASCII; the first 128 characters use one byte each.
How UTF-8 Encoding Works
UTF-8 encodes each Unicode code point in 1 to 4 bytes. The number of bytes depends on the code point's range, and the leading bits of the first byte signal how long the sequence is.
1 byte (0xxxxxxx): U+0000 to U+007F, the full ASCII character set, so ASCII text is byte-for-byte identical in UTF-8.
2 bytes (110xxxxx 10xxxxxx): U+0080 to U+07FF, which includes Latin extended, Greek, Cyrillic, Hebrew, and Arabic scripts.
3 bytes (1110xxxx 10xxxxxx 10xxxxxx): U+0800 to U+FFFF, which covers the rest of the Basic Multilingual Plane, including most CJK (Chinese, Japanese, Korean) ideographs.
4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx): U+10000 to U+10FFFF, which includes most emoji and other supplementary characters.
For example, the letter A (U+0041) encodes as 0x41 (1 byte), the euro sign (U+20AC) encodes as 0xE2 0x82 0xAC (3 bytes), and the rocket emoji (U+1F680) encodes as 0xF0 0x9F 0x9A 0x80 (4 bytes).
UTF-8 is compact for ASCII-heavy text because each ASCII character takes just 1 byte, while UTF-32 uses 4 bytes per character regardless of code point.
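The byte patterns described above can be sketched as a small hand-written encoder. This is only an illustration of the bit layout, not how any particular library implements it; the function name encode_code_point is made up for this example, and in practice Python's built-in str.encode("utf-8") does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    # Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# The three examples from the text: A, the euro sign, and the rocket emoji.
for ch in ("A", "\u20ac", "\U0001F680"):
    manual = encode_code_point(ord(ch))
    # Cross-check against Python's built-in encoder.
    assert manual == ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {manual.hex(' ').upper()}")
```

Run as a script, this prints the same byte sequences listed above (41, E2 82 AC, and F0 9F 9A 80).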
UTF-8 vs UTF-16 vs UTF-32
UTF-8 is the web standard, used by over 98% of websites as of January 2024, and encodes each code point in 1 to 4 bytes. UTF-16 uses 2 or 4 bytes per code point and is the internal string format of JavaScript, Java, .NET, and the Windows API. UTF-32 uses exactly 4 bytes per code point and appears in some Unix systems where fixed-width access to code points is needed. For ASCII-dominant text, UTF-8 is the most compact choice. For text heavy with Chinese, Japanese, or Korean characters, UTF-16 can be more storage-efficient because most of those code points take 3 bytes in UTF-8 but only 2 bytes in UTF-16. UTF-32 is rarely used for storage because of its size. JavaScript strings are sequences of UTF-16 code units, which explains why emoji behave oddly: most emoji sit above U+FFFF and are stored as surrogate pairs, so a single visible character such as the rocket has a length of 2 in JavaScript. Python 3 assumes UTF-8 for source files by default. JSON and HTML both specify UTF-8 as their standard encoding; HTTP itself does not mandate a charset, so it is safest to declare UTF-8 explicitly in the Content-Type header.
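The size trade-offs above can be checked with a short sketch. The sample strings and the helper names encoded_sizes and utf16_code_units are chosen for this example only; the "-le" codec variants are used so that Python does not prepend a byte order mark to the UTF-16 and UTF-32 output.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded size in bytes of `text` in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units, which is what JavaScript's .length reports."""
    return len(text.encode("utf-16-le")) // 2


samples = {
    "ascii": "hello",
    "cjk": "\u65e5\u672c\u8a9e",   # three CJK ideographs in the BMP
    "emoji": "\U0001F680",          # rocket, above U+FFFF
}

for label, text in samples.items():
    print(label, encoded_sizes(text), "UTF-16 code units:", utf16_code_units(text))
```

For the ASCII sample UTF-8 is smallest, for the CJK sample UTF-16 is smallest, and the single rocket emoji takes 4 bytes in all three encodings while counting as 2 UTF-16 code units.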
Common UTF-8 Issues in Development
The most frequent issue is the BOM (Byte Order Mark, EF BB BF) that some Windows tools prepend to files. Certain parsers treat the BOM as content, causing unexpected characters at the start of CSV files or XML documents. A second common issue is truncating UTF-8 strings mid-sequence. Cutting a 3-byte character after its second byte produces an invalid sequence that many decoders reject or replace with a placeholder character. Always split strings at code point boundaries using proper string libraries rather than raw byte slicing. A third issue is specific to MySQL: the character set named "utf8" (an alias for "utf8mb3") stores at most 3 bytes per character and cannot hold emoji or other 4-byte characters. Use "utf8mb4" instead if you need full Unicode support. A fourth issue is string normalization: visually identical strings can have different byte representations in Unicode (for example, NFC vs NFD forms), so a naive comparison reports them as different. Use String.prototype.normalize() in JavaScript or unicodedata.normalize() in Python to canonicalize strings before comparing them.
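A minimal Python sketch of three of these fixes follows: dropping a BOM, truncating on a code point boundary, and comparing after normalization. The helper names (decode_without_bom, truncate_utf8, same_text) are hypothetical and exist only for this example; the "utf-8-sig" codec and unicodedata.normalize() are part of the standard library.

```python
import unicodedata


def decode_without_bom(data: bytes) -> str:
    """Decode UTF-8, dropping a leading EF BB BF if one is present."""
    return data.decode("utf-8-sig")


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut `text` to at most `max_bytes` of UTF-8 without splitting a character.

    Assumes max_bytes is non-negative.
    """
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    end = max_bytes
    # Continuation bytes look like 10xxxxxx. If the first excluded byte is one,
    # the cut falls inside a character, so back up to that character's start.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end].decode("utf-8")


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(repr(decode_without_bom(b"\xef\xbb\xbfid,name")))
print(repr(truncate_utf8("a\u20ac", 3)))        # the euro sign does not fit
print(same_text("\u00e9", "e\u0301"))           # precomposed vs combining accent
```

Truncating on a code point boundary can still split a user-perceived character that is built from several code points (for example a letter followed by a combining accent), so this sketch addresses invalid byte sequences only.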
Related Terms
Base64
Base64 is a binary-to-text encoding scheme that represents binary data using 64 printable ASCII characters (A-Z, a-z, 0-9, +, /). It increases data size by ~33% but allows binary data to be safely transmitted over text-only channels like email and URLs.
URL Encoding
URL encoding (percent-encoding) converts characters that are not allowed in URLs into a safe format by replacing each byte with a % followed by two hexadecimal digits. Spaces become %20 (or + in form-encoded query strings), and reserved characters such as & and = are encoded when they appear as data inside query parameters.
ASCII
ASCII (American Standard Code for Information Interchange) is a character encoding standard that assigns the numbers 0-127 to characters: 0-31 and 127 (DEL) are control characters, and 32-126 are printable characters (space, letters, digits, punctuation). ASCII is the basis for UTF-8 and most modern text encodings.
Unicode
Unicode is a universal character set standard that aims to assign a unique code point (number) to every character in every writing system, including emoji and symbols. As of version 15, the Unicode Standard covers over 149,000 characters across 161 scripts. UTF-8, UTF-16, and UTF-32 are encodings of Unicode.