Unicode vs UTF-8 vs UTF-16: Text Encoding Untangled

Unicode, UTF-8 and UTF-16 are often used as synonyms, and the confusion they cause is not academic: it is the mojibake in your CSV import, the é that became Ã©, the emoji that broke a database column. The untangling takes one sentence: Unicode assigns every character a number; UTF-8 and UTF-16 are two different ways of writing those numbers as bytes. This guide unpacks that sentence with real byte values, using our free UTF-8 to hex converter to make the invisible encoding layer visible.

Unicode: the numbering, not the bytes

Unicode is a registry: every character of every script, plus symbols and emoji, gets a number called a code point, written as U+ followed by hex digits. The Latin A is U+0041, the Greek α is U+03B1, the grinning emoji is U+1F600. The registry says nothing about storage; “this file is in Unicode” is, strictly, an incomplete statement, like saying a song is “in audio”. Storage needs an encoding, a rule turning code points into bytes, and the two rules that matter most in practice are UTF-8 and UTF-16. Everything confusing about text files lives in the gap between the number a character has and the bytes a particular encoding writes for it.
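The gap between character and number is easy to see in code. The following is a minimal sketch, assuming Python 3; the helper name code_point_label is ours, chosen for illustration. It uses the built-in ord() to look up a character's code point and prints it in the U+ notation.

```python
def code_point_label(ch: str) -> str:
    """Return the U+XXXX label for a single character (hypothetical helper)."""
    # ord() gives the code point as an integer; Unicode notation uses
    # uppercase hex padded to at least four digits.
    return f"U+{ord(ch):04X}"


for ch in ["A", "α", "😀"]:
    print(ch, code_point_label(ch))
```

Nothing here involves bytes yet: the code point is the same whichever encoding later writes the character to disk.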

UTF-8: one to four bytes, ASCII for free

UTF-8 spends bytes according to how high a code point sits above the ASCII core: one byte for ASCII, U+0000 to U+007F (A is just 41); two up to U+07FF, covering most European and Middle Eastern letters (α is CE B1); three up to U+FFFF, covering most Asian scripts and symbols (€ is E2 82 AC); four for everything above, emoji and rare characters (😀 is F0 9F 98 80). The design carries two famous gifts. First, plain ASCII text is already valid UTF-8, byte for byte, which let the web upgrade without converting its archives, and is the main reason UTF-8 won the format war. Second, the byte patterns are self-marking: every byte of a multi-byte character has its high bit set, so it can never be mistaken for ASCII, and lead bytes look different from continuation bytes, so tools can cut, search and recover mid-stream safely. Paste any text into the hex converter or binary converter and the one-to-four-byte rhythm is right there in the output.
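The same rhythm can be reproduced outside the converter. This is a small sketch, assuming Python 3.8 or later (for the separator argument of bytes.hex); utf8_hex is an illustrative name of ours, not part of any tool on this site.

```python
def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated uppercase hex."""
    return text.encode("utf-8").hex(" ").upper()


for ch in ["A", "α", "€", "😀"]:
    print(ch, utf8_hex(ch))

# Gift one: ASCII text encodes to the same bytes under both codecs.
assert "plain text".encode("ascii") == "plain text".encode("utf-8")

# Gift two: every byte of a multi-byte character has its high bit set,
# so none of them can collide with an ASCII byte (00 to 7F).
assert all(b >= 0x80 for b in "€".encode("utf-8"))
```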

UTF-16: two bytes, with a trapdoor

UTF-16 writes most characters as exactly two bytes: A is 00 41, € is 20 AC (shown here high byte first; little-endian systems store each pair reversed). Its regularity made it the internal format of Windows, Java and JavaScript, all designed in the era when two bytes seemed enough for every character that would ever matter. Then Unicode outgrew 65,536 code points, and UTF-16 acquired its trapdoor: characters above U+FFFF, emoji most visibly, are stored as a surrogate pair, two reserved two-byte units that only mean something together, 😀 becoming D8 3D DE 00. The trapdoor is why JavaScript reports the length of “😀” as 2, why naive substring code can slice an emoji in half, and why “fixed-width” UTF-16 is a promise the format can no longer keep.
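The surrogate arithmetic is short enough to write out. The sketch below assumes Python 3 and follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into a high half added to D800 and a low half added to DC00. The function name surrogate_pair is ours.

```python
from typing import Tuple


def surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


high, low = surrogate_pair(ord("😀"))
print(f"{high:04X} {low:04X}")

# Cross-check against Python's own big-endian UTF-16 encoder.
encoded = "😀".encode("utf-16-be")
assert encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big")

# Two 16-bit code units: the "length 2" that JavaScript reports.
print(len(encoded) // 2)
```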

The same four characters in both

Character | Code point | UTF-8 bytes | UTF-16 bytes
A | U+0041 | 41 | 00 41
α | U+03B1 | CE B1 | 03 B1
€ | U+20AC | E2 82 AC | 20 AC
😀 | U+1F600 | F0 9F 98 80 | D8 3D DE 00

The table shows the trade in miniature: UTF-8 is smaller for ASCII-heavy text and grows gracefully; UTF-16 is smaller for some Asian-script text (two bytes against UTF-8’s three) and pays with the surrogate machinery. For files, interchange and the web, UTF-8 won so completely that the practical advice is one word long: UTF-8. UTF-16 survives where platforms baked it in decades ago.
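The size trade can be measured directly. This is a rough sketch, assuming Python 3; encoded_sizes is a hypothetical helper and the sample strings are our own. It uses the utf-16-be codec so that no byte order mark inflates the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the byte length of text under UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
    }


# ASCII-heavy text favours UTF-8; CJK text favours UTF-16;
# emoji cost four bytes either way.
for sample in ["Hello, world", "日本語", "😀"]:
    print(sample, encoded_sizes(sample))
```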

The bugs this explains

Mojibake, é arriving as Ã©, is UTF-8 bytes read with a one-byte legacy encoding: C3 A9 is é in UTF-8 and Ã© in Latin-1, the same bytes under two dictionaries, fixed by declaring the right encoding rather than editing the text. The two cafés that refuse to match are Unicode’s composed-versus-decomposed forms: é can be one code point (U+00E9, stored in UTF-8 as C3 A9) or an e plus a combining acute accent (U+0065 U+0301, stored as 65 CC 81), identical on screen, different in bytes; the Unicode character counter exposes the difference and explains the “identical” strings that fail equality checks. The smart-quote explosion in legacy systems is what the Unicode to ASCII converter exists for, flattening typographic characters to their plain cousins before a strict consumer chokes. The byte-level toolbox for all of this lives in the number systems pillar and the encoding guide.
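Both bugs can be reproduced in a few lines. The sketch below assumes Python 3 and its standard unicodedata module; the function names are ours. The repair step only works when the wrong decoding was lossless, which holds for Latin-1 because it maps all 256 byte values, and may not hold for other legacy encodings.

```python
import unicodedata


def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode with the wrong dictionary (Latin-1)."""
    return text.encode("utf-8").decode("latin-1")


def repair_mojibake(garbled: str) -> str:
    """Undo the misreading: recover the original bytes, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


garbled = misread_as_latin1("é")
print(garbled)                    # the two-character mojibake
print(repair_mojibake(garbled))   # back to a single é

# The two cafés, written with escapes so the forms cannot be confused.
composed = "caf\u00e9"        # é as one code point
decomposed = "cafe\u0301"     # e followed by a combining acute accent
print(composed == decomposed)

# Normalizing both sides to the same form (NFC here) makes them comparable.
same = (unicodedata.normalize("NFC", composed)
        == unicodedata.normalize("NFC", decomposed))
print(same)
```

Normalizing at the boundary, for example when text enters a database or a search index, is one common way to keep such comparisons consistent.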

Frequently asked questions

What is a BOM and do I need one?

The byte order mark is an optional signature at a file’s start: useful in UTF-16, where byte order genuinely varies, and mostly a nuisance in UTF-8, where some tools display it as stray junk and others require it. The modern default is UTF-8 without BOM, and the main reason to know the term is recognizing its three UTF-8 bytes (EF BB BF) when they leak into output.
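Spotting and removing a leaked BOM might look like the following sketch, assuming Python 3 and its standard codecs module; strip_utf8_bom and the sample data are ours. Python's utf-8-sig codec does the same job at decode time and accepts input with or without the mark.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "id,name".encode("utf-8")

print(with_bom[:3].hex(" ").upper())        # the three signature bytes
print(repr(with_bom.decode("utf-8")))       # BOM survives as U+FEFF
print(repr(with_bom.decode("utf-8-sig")))   # BOM removed while decoding
```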

Is there a UTF-32?

Yes: four bytes per character, every character, no exceptions. It is the simplest encoding and the most wasteful, quadrupling ASCII-heavy text, so it appears almost only as an internal processing format where fixed width simplifies code and memory is cheap.
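A quick check of the fixed width, as a sketch assuming Python 3.8 or later; utf32_hex is an illustrative name of ours, and the big-endian codec is used so that no BOM is prepended.

```python
def utf32_hex(text: str) -> str:
    """Return the big-endian UTF-32 bytes of text as spaced uppercase hex."""
    return text.encode("utf-32-be").hex(" ").upper()


print(utf32_hex("A"))
print(utf32_hex("😀"))

# Five ASCII letters cost twenty bytes instead of five.
print(len("hello".encode("utf-32-be")))
```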

Why does my database count string lengths differently than my app?

They are counting different things: bytes, code units (UTF-16’s two-byte steps), or code points all give different totals for the same text the moment emoji or accents appear. The 😀 that is 4 UTF-8 bytes, 2 UTF-16 units and 1 code point illustrates all three at once. Length checks across systems should agree on which definition they mean.
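The three counts can be computed side by side. This is a minimal sketch, assuming Python 3, where len() on a string counts code points; text_lengths is a hypothetical helper, and which count a given database or language reports depends on that system.

```python
def text_lengths(text: str) -> dict:
    """Return three common 'lengths' of the same text."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-be")) // 2,  # 2 bytes per unit
        "code_points": len(text),
    }


print(text_lengths("😀"))
print(text_lengths("café"))
```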

Does UTF-8 handle Greek, Cyrillic and Arabic properly?

Fully: every Unicode script encodes in UTF-8, typically at two bytes per letter for those alphabets. There is no quality difference between encodings, only size and compatibility differences; any text that looks wrong was decoded with the wrong dictionary, not stored in a lesser format.


Written by Nick (ATV Team)

We build and maintain the 600+ free, client-side tools on this site, and every guide is written against the tools themselves: each figure is computed and checked before it is published, and every linked tool is tested in the browser. More about how we work on the about page, and the full library of guides lives on the blog.