
UTF-8


Definition

UTF-8 is a variable-length character encoding for Unicode. It represents each code point using 1 to 4 bytes and is the dominant encoding on the web (used by ~98% of websites). UTF-8 is backward compatible with ASCII: the first 128 code points (U+0000 to U+007F) each use a single byte with the same value as in ASCII.

How UTF-8 Encoding Works

UTF-8 encodes each Unicode code point in 1 to 4 bytes, depending on the code point's value. The leading bits of the first byte signal how long the sequence is, and every continuation byte starts with the bits 10.

1 byte (0xxxxxxx): U+0000 to U+007F, the full ASCII character set, so ASCII text is byte-for-byte identical in UTF-8.

2 bytes (110xxxxx 10xxxxxx): U+0080 to U+07FF, which includes Latin extended, Greek, Cyrillic, Hebrew, and Arabic scripts.

3 bytes (1110xxxx 10xxxxxx 10xxxxxx): U+0800 to U+FFFF, which includes most CJK (Chinese, Japanese, Korean) ideographs and the majority of commonly used scripts. The surrogate range U+D800 to U+DFFF falls inside this span but is reserved for UTF-16 and is not valid in UTF-8.

4 bytes (11110xxx 10xxxxxx 10xxxxxx 10xxxxxx): U+10000 to U+10FFFF, which includes most emoji and other supplementary characters.

For example, the letter A (U+0041) encodes as 0x41 (1 byte), the euro sign (U+20AC) encodes as 0xE2 0x82 0xAC (3 bytes), and the rocket emoji (U+1F680) encodes as 0xF0 0x9F 0x9A 0x80 (4 bytes).

UTF-8 is the most compact of the three Unicode encodings for ASCII-heavy text because each ASCII character takes just 1 byte, while UTF-16 uses at least 2 bytes and UTF-32 uses 4 bytes per character regardless of code point.
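The byte patterns above can be sketched as a small hand-written encoder. This is an illustrative sketch only: the function name encode_code_point is hypothetical, and real code should simply call str.encode("utf-8"), which the last lines use as a cross-check.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    # Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


if __name__ == "__main__":
    for cp in (0x41, 0x20AC, 0x1F680):
        manual = encode_code_point(cp)
        # Cross-check against Python's built-in encoder.
        assert manual == chr(cp).encode("utf-8")
        print(f"U+{cp:04X} -> {manual.hex(' ')}")
```

Run as a script, this prints the same byte sequences as the three examples in the text (41, e2 82 ac, f0 9f 9a 80).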

UTF-8 vs UTF-16 vs UTF-32

UTF-8 is the web standard, used by over 98% of websites as of 2024-01-01, and encodes each code point in 1 to 4 bytes. UTF-16 uses 2 or 4 bytes per code point and is the internal string representation in JavaScript, Java, the .NET runtime, and the Windows API. UTF-32 uses exactly 4 bytes per code point and appears where fixed-width access to code points is needed, for example in the wide-character type on many Unix-like systems.

For ASCII-dominant text, UTF-8 is the most compact choice. For text heavy with Chinese, Japanese, or Korean characters, UTF-16 can be more storage-efficient because most of those code points take 3 bytes in UTF-8 but only 2 bytes in UTF-16. UTF-32 is rarely used for storage or transmission because of its size.

Because JavaScript strings are sequences of UTF-16 code units, emoji behave oddly: most emoji sit above U+FFFF, so each is stored as a surrogate pair and reports a length of 2 even though it is a single visible character. Python 3 assumes UTF-8 as the default encoding for source files. JSON exchanged between systems is required to be UTF-8, the HTML standard requires UTF-8 for new documents, and UTF-8 is the usual choice for text sent over HTTP.
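The size trade-offs above can be checked with a short Python sketch. The helper names encoded_sizes and utf16_code_units are hypothetical, introduced only for illustration. The little-endian codec names ("utf-16-le", "utf-32-le") are used so that Python does not prepend a byte order mark and inflate the counts.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded byte length of text in UTF-8, UTF-16, and UTF-32."""
    # The "-le" codecs write no BOM, so only the characters are counted.
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


def utf16_code_units(text: str) -> int:
    """Count UTF-16 code units, which is what a JavaScript string length reports."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    ascii_text = "hello"
    cjk_text = "\u65e5\u672c\u8a9e"  # three CJK ideographs
    rocket = "\U0001F680"            # rocket emoji, above U+FFFF

    print(encoded_sizes(ascii_text))  # {'utf-8': 5, 'utf-16': 10, 'utf-32': 20}
    print(encoded_sizes(cjk_text))    # {'utf-8': 9, 'utf-16': 6, 'utf-32': 12}
    print(len(rocket), utf16_code_units(rocket))  # 1 2
```

Python counts code points, so len() of the rocket emoji is 1, while the UTF-16 code unit count of 2 matches the length JavaScript would report for the same string.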

Common UTF-8 Issues in Development

The most frequent issue is the BOM (Byte Order Mark, bytes EF BB BF) that some Windows tools prepend to UTF-8 files. Parsers that do not expect it treat the BOM as content, which puts an unexpected invisible character at the start of CSV files or XML documents.

A second common issue is truncating UTF-8 strings mid-sequence. Cutting a 3-byte character after its second byte leaves an invalid sequence that decoders either reject or replace with a placeholder character (U+FFFD). Always split strings at code point boundaries using proper string libraries rather than raw byte slicing.

A third issue is specific to MySQL: the character set named "utf8" is an alias for "utf8mb3", which stores at most 3 bytes per character and therefore cannot hold emoji or other 4-byte characters. Use "utf8mb4" if you need full Unicode support.

A fourth issue is string normalization: visually identical strings can have different byte representations because Unicode allows both precomposed (NFC) and decomposed (NFD) forms, so a plain comparison can report a mismatch for text that looks the same. Use String.prototype.normalize() in JavaScript or unicodedata.normalize() in Python to canonicalize strings before comparing them.
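Three of these issues (BOM handling, safe truncation, and normalization) can be sketched in Python using only the standard library. The function names strip_bom, truncate_utf8, and same_text are hypothetical helpers, and the choice of NFC as the comparison form is an assumption; NFD would work equally well as long as both sides use the same form.

```python
import unicodedata


def strip_bom(data: bytes) -> str:
    """Decode UTF-8 bytes, discarding a leading BOM (EF BB BF) if present."""
    # The "utf-8-sig" codec accepts input with or without a BOM.
    return data.decode("utf-8-sig")


def truncate_utf8(text: str, max_bytes: int) -> str:
    """Shorten text so its UTF-8 form fits in max_bytes without splitting a code point."""
    cut = text.encode("utf-8")[:max_bytes]
    # The input was valid UTF-8, so the only possible invalid bytes are an
    # incomplete sequence at the very end; errors="ignore" drops them.
    return cut.decode("utf-8", errors="ignore")


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (assumed form)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    raw = b"\xef\xbb\xbfid,name"
    print(repr(raw.decode("utf-8")))  # BOM survives as '\ufeff' at the start
    print(repr(strip_bom(raw)))       # 'id,name'

    print(repr(truncate_utf8("\u20acuro", 2)))  # '' (the 3-byte euro sign does not fit)
    print(repr(truncate_utf8("\u20acuro", 4)))  # euro sign followed by 'u'

    composed = "\u00e9"      # e with acute accent, one code point
    decomposed = "e\u0301"   # e followed by a combining acute accent
    print(composed == decomposed)           # False
    print(same_text(composed, decomposed))  # True
```

Note that truncate_utf8 respects code point boundaries only. A user-perceived character built from several code points, such as a letter plus a combining accent, can still be split, and handling that would need grapheme cluster segmentation, which this sketch does not attempt.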
