Question 1

What is UTF-8 encoding?

Accepted Answer

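UTF-8 is a variable-length encoding that represents each Unicode code point as 1 to 4 bytes. ASCII characters (U+0000 to U+007F) use 1 byte; code points U+0080 to U+07FF, which cover accented Latin letters as well as Greek, Cyrillic, Hebrew, and Arabic, use 2 bytes; the rest of the Basic Multilingual Plane, including most Chinese, Japanese, and Korean characters, uses 3 bytes; emoji and other supplementary characters use 4 bytes. It is the dominant encoding on the web.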
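The byte counts can be checked directly. The following is a minimal Python sketch; the helper name `utf8_length` and the sample characters are chosen here for illustration and are not part of any library.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes one character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# One sample character per byte-length class (escapes avoid editor encoding issues).
samples = {
    "A": "ASCII letter",                    # U+0041
    "\u00e9": "Latin small e with acute",   # U+00E9
    "\u4e2d": "CJK ideograph",              # U+4E2D
    "\U0001F600": "emoji (grinning face)",  # U+1F600
}

for ch, description in samples.items():
    print(f"U+{ord(ch):04X} {description}: {utf8_length(ch)} byte(s)")
```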

Question 2

What is the difference between UTF-8 and UTF-16?

Accepted Answer

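UTF-8 uses 1 to 4 bytes per code point and is dominant on the web. UTF-16 is also variable-length: it uses 2 bytes for code points in the Basic Multilingual Plane and 4 bytes (a surrogate pair) for all others, and it is used internally by JavaScript, Java, and Windows. UTF-8 is more compact for ASCII-heavy text; UTF-16 is more compact for text dominated by Chinese, Japanese, or Korean characters, which take 3 bytes in UTF-8 but 2 in UTF-16.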
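To see the size trade-off, one can encode the same string both ways and compare lengths. This is a sketch only; `encoded_sizes` is a hypothetical helper, and it uses the `utf-16-le` codec so that Python does not prepend a 2-byte BOM to the UTF-16 output.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded size of `text` in bytes for UTF-8 and UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # "utf-16-le" fixes the byte order, so no BOM is added to the count.
        "utf-16": len(text.encode("utf-16-le")),
    }


ascii_text = "hello"                # 5 ASCII characters
cjk_text = "\u65e5\u672c\u8a9e"     # 3 CJK characters
emoji_text = "\U0001F600"           # 1 supplementary character

for label, text in [("ASCII", ascii_text), ("CJK", cjk_text), ("emoji", emoji_text)]:
    print(label, encoded_sizes(text))
```

Supplementary characters such as emoji take 4 bytes in both encodings, so neither has a size advantage there.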

Question 3

What is a BOM in UTF-8?

Accepted Answer

The BOM (Byte Order Mark) is an optional 3-byte sequence (EF BB BF) at the start of a UTF-8 file. UTF-8 has no byte-order ambiguity, so the BOM is unnecessary (unlike in UTF-16, where it signals byte order), and it can cause problems when a tool treats those bytes as content rather than as a signature. Most UTF-8 files omit the BOM.
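In Python, one way to tolerate a BOM is the standard `utf-8-sig` codec, which drops a leading BOM when decoding and behaves like plain UTF-8 when none is present. A minimal sketch, where `decode_utf8_tolerant` is a name invented for this example:

```python
import codecs


def decode_utf8_tolerant(data: bytes) -> str:
    """Decode UTF-8 bytes, discarding a leading BOM if one is present."""
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")  # EF BB BF + payload
without_bom = "hello".encode("utf-8")

# Plain "utf-8" keeps the BOM as an invisible U+FEFF character at the start.
plain_decoded = with_bom.decode("utf-8")

print(repr(plain_decoded))
print(repr(decode_utf8_tolerant(with_bom)))
print(repr(decode_utf8_tolerant(without_bom)))
```

The same codec name can be passed to `open(..., encoding="utf-8-sig")` when reading files that may or may not start with a BOM.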

Question 4

Why do I see ??? or garbage characters when reading a file?

Accepted Answer

This is a character encoding mismatch. The file was saved in one encoding (for example Latin-1 or Windows-1252) but read as another (for example UTF-8), or vice versa, so the decoder misinterprets the byte sequences. Fix it by explicitly specifying the file's actual encoding wherever the data is read. In Python: open(file, encoding='utf-8'). In MySQL: SET NAMES utf8mb4. In HTML: place <meta charset="utf-8"> inside the <head> element before any other content.
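Both directions of the mismatch can be reproduced in a few lines of Python. This is an illustrative sketch using the sample word "café"; the variable names are chosen for this example.

```python
original = "caf\u00e9"  # "café"

# Case 1: UTF-8 bytes read as Latin-1 produce mojibake ("cafÃ©").
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("latin-1")

# Case 2: Latin-1 bytes read as UTF-8 are invalid. Strict decoding raises;
# errors="replace" substitutes U+FFFD, which often displays as "?" or a box.
latin1_bytes = original.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
replaced = latin1_bytes.decode("utf-8", errors="replace")

# Mojibake from case 1 is reversible if no bytes were lost along the way.
repaired = misread.encode("latin-1").decode("utf-8")

print(misread, replaced, repaired)
```

The repair step only works when the wrong decoding was lossless; once characters have been replaced with U+FFFD, the original bytes cannot be recovered.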

Question 5

What is the difference between UTF-8 and ASCII?

Accepted Answer

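ASCII is a 7-bit encoding defining 128 characters (0-127): English letters, digits, basic punctuation, and control characters. UTF-8 is a superset of ASCII: the first 128 Unicode code points are identical to ASCII, and UTF-8 encodes each of them as the same single byte. Beyond that range, UTF-8 can encode all 1,112,064 valid Unicode code points (every code point except the surrogates). Any valid ASCII text is therefore also valid UTF-8, byte for byte, which makes UTF-8 backward compatible with ASCII data. The reverse does not hold: software that assumes 7-bit input or one byte per character can still mishandle non-ASCII UTF-8 text.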
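The compatibility claim and the code point count can both be verified with a short Python sketch. The helper `is_ascii_bytes` and the constant `SCALAR_VALUES` are names introduced here for illustration.

```python
def is_ascii_bytes(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range (0-127)."""
    return all(b < 0x80 for b in data)


text = "Hello, world!"
ascii_encoded = text.encode("ascii")
utf8_encoded = text.encode("utf-8")

# ASCII-only text produces identical bytes under both encodings.
print(ascii_encoded == utf8_encoded)

# Non-ASCII characters are encoded with bytes >= 0x80, so they never
# collide with ASCII byte values.
print(is_ascii_bytes("caf\u00e9".encode("utf-8")))

# Code space U+0000..U+10FFFF minus the surrogate range U+D800..U+DFFF.
SCALAR_VALUES = 0x110000 - (0xE000 - 0xD800)
print(SCALAR_VALUES)
```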
