Code Points to UTF-8 Converter Free

Free online Unicode code points to UTF-8 converter. Shows actual UTF-8 byte sequences per character. Client-side, instant, secure - no uploads.

Turn Unicode code points (U+XXXX, hex, or decimal) into rendered text or their actual UTF-8 byte sequences. Runs entirely in your browser.

How to Use Code Points to UTF-8 Converter Free

  1. Paste your code points into the input area. Formats accepted: U+1F600, 0x1F600, bare hex 1F600, or bare decimal 128512. Separate with spaces, commas, or semicolons.
  2. Pick the bare-token base. Choose auto-hex if your bare tokens are hex (the common Unicode convention), or decimal if you're feeding in plain decimal numbers. Prefixed tokens (U+, 0x) are always hex.
  3. Pick an output format. text gives you rendered characters (same as a basic converter). UTF-8 bytes (hex) shows the actual encoded bytes like F0 9F 98 80 - this is what UTF-8 really looks like on disk (a code sketch of the whole conversion follows these steps).
  4. Enable the byte breakdown if you're learning. The panel shows each token → character → UTF-8 bytes + byte count, making the 1/2/3/4-byte size rules obvious.
  5. Press "Encode to UTF-8" (or just type - live preview runs with 150ms debounce). Stats below show total bytes, byte-class distribution, and any invalid inputs flagged by reason.
  6. Copy or download. Copy pushes the output to your clipboard; Download saves a utf8-*.txt file with the exact current output format.
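
Conceptually, the conversion boils down to parsing tokens and feeding the resulting characters to the browser's native TextEncoder. Here is a minimal sketch of that pipeline under the defaults described above; the function name and parsing rules are illustrative, not the tool's actual source:

```ts
// Illustrative sketch: code-point tokens -> UTF-8 bytes as spaced hex.
function codePointsToUtf8Hex(input: string, bareBase: 10 | 16 = 16): string {
  const encoder = new TextEncoder(); // native API, always emits UTF-8
  const bytes: number[] = [];
  for (const token of input.split(/[\s,;]+/).filter(Boolean)) {
    const isPrefixed = /^(u\+|0x)/i.test(token);          // U+1F600 or 0x1F600 are always hex
    const cp = parseInt(token.replace(/^u\+/i, ""), isPrefixed ? 16 : bareBase);
    // String.fromCodePoint throws a RangeError for anything outside U+0000..U+10FFFF
    bytes.push(...encoder.encode(String.fromCodePoint(cp)));
  }
  return bytes.map(b => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

codePointsToUtf8Hex("U+1F600, U+00E9"); // "F0 9F 98 80 C3 A9"
```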

Frequently asked questions

How is this different from the “code points to Unicode” tool?

Both tools accept code points and can render characters. The difference: this tool additionally exposes the real UTF-8 byte sequences via the native TextEncoder API. If you need to see what a code point actually becomes on disk (1, 2, 3, or 4 bytes), this is the one. If you only want the rendered glyphs, either works.

How does UTF-8 encode a code point into bytes?

UTF-8 uses a prefix code. U+0000-U+007F fits in one byte with leading bit 0. U+0080-U+07FF uses two bytes: 110xxxxx 10xxxxxx. U+0800-U+FFFF uses three: 1110xxxx 10xxxxxx 10xxxxxx. U+10000-U+10FFFF uses four: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Continuation bytes always start with 10, so a decoder can resync mid-stream.
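Those bit patterns translate directly into shifts and masks. As a study sketch (real code should simply use TextEncoder), a manual encoder looks like this:

```ts
// Manual UTF-8 encoder, mirroring the 1/2/3/4-byte patterns above.
function utf8Bytes(cp: number): number[] {
  if (cp < 0x80)    return [cp];                                    // 0xxxxxxx
  if (cp < 0x800)   return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];  // 110xxxxx 10xxxxxx
  if (cp < 0x10000) return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  return [0xf0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3f), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
}

utf8Bytes(0x1f600).map(b => b.toString(16)); // ["f0", "9f", "98", "80"]
```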

Why does U+1F600 need 4 bytes but U+00E9 only needs 2?

UTF-8’s byte count depends on the code point’s numeric value, not the character. U+00E9 (é, decimal 233) is above 0x7F but fits in 11 bits, so it takes 2 bytes. U+1F600 (decimal 128512) needs 17 bits, more than the 16 the 3-byte form can carry, so it spills into 4 bytes. The boundaries are exactly 0x80, 0x800, 0x10000, and 0x110000.
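Those boundaries make the byte count a simple range check; this illustrative helper just restates them in code:

```ts
// Byte count from the code point value alone.
function utf8ByteLength(cp: number): number {
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  if (cp < 0x110000) return 4;
  throw new RangeError("beyond U+10FFFF");
}

utf8ByteLength(0xe9);    // 2
utf8ByteLength(0x1f600); // 4
```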

What are the bit patterns of each UTF-8 byte?

The first byte tells you the length: 0xxxxxxx (1 byte), 110xxxxx (2 bytes), 1110xxxx (3 bytes), 11110xxx (4 bytes). Every following “continuation” byte is 10xxxxxx. The x bits hold the code point’s value, most significant bits first. Enable the byte breakdown toggle above to see real examples.
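In code, classifying a lead byte is a handful of mask comparisons. This is a study sketch, not part of the tool:

```ts
// How many bytes the sequence starting at this byte should have.
function sequenceLength(firstByte: number): number {
  if ((firstByte & 0b1000_0000) === 0b0000_0000) return 1; // 0xxxxxxx
  if ((firstByte & 0b1110_0000) === 0b1100_0000) return 2; // 110xxxxx
  if ((firstByte & 0b1111_0000) === 0b1110_0000) return 3; // 1110xxxx
  if ((firstByte & 0b1111_1000) === 0b1111_0000) return 4; // 11110xxx
  return 0; // 10xxxxxx: a continuation byte, so skip forward to resync
}

sequenceLength(0xf0); // 4 (start of a 4-byte sequence such as U+1F600)
sequenceLength(0x9f); // 0 (continuation byte, not a sequence start)
```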

Why can’t UTF-8’s first byte exceed 0xF4?

Because the Unicode range stops at U+10FFFF. A first byte of 0xF5 would imply a code point ≥ U+140000, which Unicode does not allocate. RFC 3629 (2003) explicitly restricts UTF-8 to this range – primarily to stay compatible with UTF-16, which cannot represent anything higher.
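You can observe the limit with the browser's strict TextDecoder, which treats 0xF5 as an invalid lead byte:

```ts
// A strict decoder rejects 0xF5-led sequences, since they would map beyond U+10FFFF.
const strict = new TextDecoder("utf-8", { fatal: true });
try {
  strict.decode(new Uint8Array([0xf5, 0x80, 0x80, 0x80]));
} catch (e) {
  console.log((e as Error).name); // "TypeError" - not valid UTF-8
}
```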

Are surrogate code points valid in UTF-8?

No. U+D800-U+DFFF are reserved as UTF-16 surrogate halves and are explicitly banned in UTF-8 by RFC 3629. Feeding a surrogate into this tool with the default settings gives you a “surrogate code unit – not valid UTF-8” error. The “Allow surrogates” toggle bypasses this check for WTF-8 or Modified UTF-8 study, but the resulting bytes are not standards-compliant.
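Note that the standard TextEncoder never emits surrogate encodings itself: a lone surrogate in a JavaScript string is replaced with U+FFFD, the replacement character, as this small check shows:

```ts
// A lone high surrogate is not encodable; TextEncoder substitutes U+FFFD instead.
const bytes = new TextEncoder().encode("\uD800");
[...bytes].map(b => b.toString(16)); // ["ef", "bf", "bd"]  (the encoding of U+FFFD)
```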

What is “overlong encoding” and why is it forbidden?

UTF-8 requires the shortest encoding for each code point. Writing A (U+0041) as the 2-byte sequence C1 81 is technically decodable but is called an overlong form. RFC 3629 bans these because attackers have historically used them to smuggle special characters past filters – e.g., encoding a slash / in an overlong way to bypass path-traversal defenses.
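A strict decoder makes the ban concrete: the browser's TextDecoder with fatal: true refuses C1 81 outright.

```ts
// C1 81 is an "overlong" 2-byte spelling of U+0041; conforming decoders must reject it.
const decoder = new TextDecoder("utf-8", { fatal: true });
try {
  decoder.decode(new Uint8Array([0xc1, 0x81]));
} catch (e) {
  console.log("rejected overlong form"); // thrown as a TypeError
}
new TextDecoder().decode(new Uint8Array([0x41])); // "A" - the only valid encoding
```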

What happens with combining marks like U+0301?

Each code point is encoded independently. “Á” typed as U+0041 + U+0301 becomes 3 UTF-8 bytes total: 41 CC 81 (1 + 2). It visually renders as one character but the byte stream sees two. Compare with U+00C1 (precomposed Á) which is just 2 bytes: C3 81. Enable the breakdown to see both side by side.
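You can verify the two spellings with TextEncoder directly:

```ts
// Both spellings render as "Á", but their UTF-8 byte sequences differ.
const enc = new TextEncoder();
enc.encode("\u0041\u0301"); // Uint8Array [0x41, 0xCC, 0x81] - 3 bytes (A + combining acute)
enc.encode("\u00C1");       // Uint8Array [0xC3, 0x81]       - 2 bytes (precomposed Á)
```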

Why do ZWJ emoji sequences like 👨‍👩‍👧 take so many bytes?

That emoji is three emoji code points joined by two zero-width joiners (ZWJ): U+1F468 + U+200D + U+1F469 + U+200D + U+1F467. Each non-BMP emoji is 4 UTF-8 bytes and each ZWJ is 3, totaling 18 bytes for one visual glyph. This is why an “emoji character count” wildly overestimates display length – it counts code points, not bytes or grapheme clusters.
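A quick way to see the different “lengths” of the same glyph, using standard string APIs and TextEncoder:

```ts
// One visible glyph, three very different counts.
const family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}"; // 👨‍👩‍👧
[...family].length;                       // 5  code points
family.length;                            // 8  UTF-16 code units
new TextEncoder().encode(family).length;  // 18 UTF-8 bytes
```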

Is this converter free, offline, and private?

Yes to all three. It’s free with no account, no quota, no ads. It runs 100% in your browser using the native TextEncoder – nothing is uploaded and nothing is stored. Load the page once, disconnect, and it keeps working.