Convert Data URI to UTF-8

Online Data URI to UTF-8 decoder with a byte-breakdown panel for emoji and CJK. Client-side, instant, secure - no uploads.

Decode a data: URI and view the UTF-8 text with an optional per-codepoint byte breakdown. Handles Base64 and URL-encoded payloads and correctly surfaces emoji, CJK, and multi-byte characters.

How to Use Convert Data URI to UTF-8

  1. Paste the full URI into the input. It must start with data: and contain a comma separating the header from the payload.
  2. The tool auto-detects whether the payload is Base64 or percent-encoded from the URI header, then decodes the raw bytes.
  3. Decoding is always UTF-8. If the URI declares a different charset, the stats line calls that out with a note such as "⚠ declared charset: windows-1252 (decoded as UTF-8)". To decode with the declared charset instead, use the generic Data URI to ASCII decoder.
  4. Enable the byte breakdown to see each code point's UTF-8 byte sequence. A single emoji like 😀 shows up as 4 bytes (F0 9F 98 80); an accented Latin character like é shows 2 bytes (C3 A9). This explains why some strings take more bytes than their visible length suggests.
  5. Optional: Strip UTF-8 BOM. If the decoded text starts with U+FEFF (EF BB BF bytes), enabling this removes it so the result is clean.
  6. Read the stats. You get byte-class counts like 1-byte:5 3-byte:2 - handy when sizing payloads or debugging unexpected sizes.
  7. Copy or download. Copy places the decoded text on your clipboard. Download saves a data-uri-utf8-*.txt file.
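
The decoding steps above can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions: decodeDataUriAsUtf8 and the shape of its return value are hypothetical names, not the tool's actual source, and error handling is kept minimal.

```javascript
// Sketch only: decodeDataUriAsUtf8 is a hypothetical helper, not the tool's source.
function decodeDataUriAsUtf8(uri) {
  if (!uri.startsWith("data:")) throw new Error("URI must start with data:");
  const comma = uri.indexOf(",");
  if (comma === -1) throw new Error("URI needs a comma before the payload");

  // Header looks like "text/plain;charset=utf-8;base64" (every part is optional).
  const params = uri.slice(5, comma).split(";").map((p) => p.trim().toLowerCase());
  const payload = uri.slice(comma + 1);
  const charsetParam = params.find((p) => p.startsWith("charset="));
  const declaredCharset = charsetParam ? charsetParam.slice("charset=".length) : null;

  const bytes = [];
  if (params.includes("base64")) {
    // atob returns a "binary string": one character per byte, each 0-255.
    for (const ch of atob(payload)) bytes.push(ch.charCodeAt(0));
  } else {
    // Percent-decode to raw bytes; literal characters are encoded as UTF-8.
    const encoder = new TextEncoder();
    for (const part of payload.split(/(%[0-9a-fA-F]{2})/)) {
      if (/^%[0-9a-fA-F]{2}$/.test(part)) bytes.push(parseInt(part.slice(1), 16));
      else bytes.push(...encoder.encode(part));
    }
  }

  // ignoreBOM: true keeps a leading U+FEFF in the text, so stripping stays optional.
  const text = new TextDecoder("utf-8", { ignoreBOM: true }).decode(new Uint8Array(bytes));
  return { text, byteLength: bytes.length, declaredCharset };
}

const emoji = decodeDataUriAsUtf8("data:text/plain;base64,8J+YgA==");
const legacy = decodeDataUriAsUtf8("data:text/plain;charset=windows-1252,abc");
```

Here emoji.text is the single character 😀 decoded from 4 bytes, and legacy.declaredCharset is "windows-1252", which is the value a stats line could use for its charset warning.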

Frequently Asked Questions

How is this different from the generic Data URI to ASCII decoder?

The generic decoder honours whatever charset the URI declares (UTF-8, Windows-1252, ISO-8859-1, etc.) – it converts the bytes using that charset. This tool always decodes as UTF-8 and additionally exposes a per-codepoint UTF-8 byte breakdown. Pick this one when you know the payload is UTF-8 and want to see the byte-level structure; pick the generic one when the source charset is legacy or unknown.

Why does my UTF-8 emoji take 4 bytes but show as 1 character?

Because UTF-8 packs larger Unicode code points into more bytes. U+1F600 (😀) sits in the supplementary planes above U+FFFF, which requires UTF-8’s 4-byte form (F0 9F 98 80). The decoded string has 1 code point, but any string-length measurement that counts code units (JavaScript’s .length) will say 2, and the UTF-8 byte count is 4. The breakdown panel exposes all three numbers.
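
A per-code-point breakdown like the one described can be approximated with TextEncoder. The function name utf8Breakdown and its output shape are assumptions made for this sketch, not the panel's real implementation.

```javascript
// Hypothetical helper: lists each code point with its UTF-8 bytes in hex.
function utf8Breakdown(text) {
  const encoder = new TextEncoder();
  // Array.from iterates by code point, so a surrogate pair stays together.
  return Array.from(text, (ch) => ({
    char: ch,
    codePoint: "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"),
    bytes: Array.from(encoder.encode(ch), (b) =>
      b.toString(16).toUpperCase().padStart(2, "0")
    ).join(" "),
  }));
}

const rows = utf8Breakdown("\u00E9\u{1F600}"); // "é" followed by "😀"
```

For this input, rows holds two entries: U+00E9 with bytes "C3 A9" and U+1F600 with bytes "F0 9F 98 80".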

What are the 1/2/3/4-byte UTF-8 classes?

UTF-8 uses variable-width encoding: U+0000-U+007F is 1 byte (plain ASCII), U+0080-U+07FF is 2 bytes (Latin extended, Greek, Cyrillic, Arabic), U+0800-U+FFFF is 3 bytes (CJK, most BMP characters), U+10000-U+10FFFF is 4 bytes (emoji, supplementary scripts). The stats panel breaks down your payload by these classes.
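
The ranges above map directly onto a small classifier. The sketch below is one way to produce class counts in the same "1-byte:5 3-byte:2" style; the function names are illustrative and the tool's own stats code may differ.

```javascript
// Number of UTF-8 bytes needed for a single Unicode code point.
function utf8ByteLength(codePoint) {
  if (codePoint <= 0x7f) return 1;    // U+0000-U+007F
  if (codePoint <= 0x7ff) return 2;   // U+0080-U+07FF
  if (codePoint <= 0xffff) return 3;  // U+0800-U+FFFF
  return 4;                           // U+10000-U+10FFFF
}

// Hypothetical helper: count code points per byte class.
function byteClassCounts(text) {
  const counts = { 1: 0, 2: 0, 3: 0, 4: 0 };
  for (const ch of text) counts[utf8ByteLength(ch.codePointAt(0))] += 1;
  return counts;
}

// Format only the non-empty classes, e.g. "1-byte:5 3-byte:2".
function formatCounts(counts) {
  return [1, 2, 3, 4]
    .filter((n) => counts[n] > 0)
    .map((n) => `${n}-byte:${counts[n]}`)
    .join(" ");
}

const stats = formatCounts(byteClassCounts("hello\u65E5\u672C")); // "hello日本"
```

With five ASCII letters and two CJK characters, stats is "1-byte:5 3-byte:2".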

What do U+FFFD replacement characters in the output mean?

The decoder hit bytes that aren’t valid UTF-8. The most common cause is that the source was actually in some other charset (e.g., Windows-1252), while this tool always assumes UTF-8. If you see lots of U+FFFD, try the generic Data URI to ASCII decoder with the original charset, or check whether the URI was truncated mid-sequence.
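
The effect is easy to reproduce with TextDecoder. In this sketch the byte values are chosen for illustration (the word "café" as Windows-1252 would store it), and isValidUtf8 is a hypothetical helper that uses the decoder's fatal option to detect invalid input instead of silently replacing it.

```javascript
// "café" in Windows-1252: the final byte 0xE9 is not valid UTF-8 on its own.
const legacyBytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);

// Default mode substitutes U+FFFD for each invalid sequence.
const lenient = new TextDecoder("utf-8").decode(legacyBytes);

// Hypothetical helper: fatal mode throws a TypeError on invalid UTF-8.
function isValidUtf8(bytes) {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch (err) {
    return false;
  }
}

// Count replacement characters in decoded text.
const replacementCount = Array.from(lenient).filter((ch) => ch === "\uFFFD").length;
```

Here lenient is "caf" followed by one U+FFFD, replacementCount is 1, and isValidUtf8(legacyBytes) returns false.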

How do I know if my URI has a UTF-8 BOM?

After decoding, check if the output starts with U+FEFF (a zero-width no-break space that serves as a byte-order mark). Tick the “Strip UTF-8 BOM” box to remove it automatically. In Base64, the BOM appears as the prefix 77u/ (which decodes to EF BB BF).
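
Both checks can be sketched as follows. The helper names are hypothetical; the payload "77u/aGk=" is the Base64 form of a BOM followed by the two ASCII letters "hi".

```javascript
// Hypothetical helper: true when the raw bytes begin with EF BB BF.
function hasUtf8Bom(bytes) {
  return bytes.length >= 3 && bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
}

// Hypothetical helper: drop a leading U+FEFF from already-decoded text.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

const bomBytes = Uint8Array.from(atob("77u/aGk="), (ch) => ch.charCodeAt(0));

// ignoreBOM: true keeps U+FEFF in the output; by default TextDecoder removes it.
const withBom = new TextDecoder("utf-8", { ignoreBOM: true }).decode(bomBytes);
const clean = stripBom(withBom);
```

In this example hasUtf8Bom(bomBytes) is true, withBom has length 3 (U+FEFF plus "hi"), and clean is "hi".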

Does this handle URL-safe Base64 inside a data URI?

Yes. RFC 2397 technically specifies the standard Base64 alphabet (+ and /), but URL-safe Base64 (- and _) leaks into data URIs that were built for web contexts. This tool normalises both alphabets and auto-pads missing = characters.
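
One possible normalisation step, run before handing the payload to atob, is shown below. normalizeBase64 is an illustrative name, and the sketch assumes the payload has no embedded whitespace.

```javascript
// Hypothetical helper: map the URL-safe alphabet to the standard one and re-pad.
function normalizeBase64(payload) {
  const standard = payload
    .replace(/-/g, "+")
    .replace(/_/g, "/")
    .replace(/=+$/, ""); // drop any existing padding, then recompute it
  const padding = (4 - (standard.length % 4)) % 4;
  return standard + "=".repeat(padding);
}

const normalized = normalizeBase64("8J-YgA"); // URL-safe, unpadded form of 😀
const decodedBytes = Uint8Array.from(atob(normalized), (ch) => ch.charCodeAt(0));
```

Here normalized is "8J+YgA==" and decodedBytes holds F0 9F 98 80. A payload whose unpadded length leaves a remainder of 1 modulo 4 cannot be valid Base64, so atob will still reject it after this step.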

Why is the byte count different from my string’s length?

Because UTF-8 bytes, UTF-16 code units (what JavaScript’s .length counts), and Unicode code points (Array.from(s).length) are three different things. For ASCII-only text they match. For anything else they diverge: “é” is 2 UTF-8 bytes but 1 code point and 1 code unit; “😀” is 4 UTF-8 bytes, 1 code point, but 2 UTF-16 code units. The breakdown panel shows all the math.
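
The three measurements can be compared side by side with standard APIs. The measure function is a hypothetical wrapper written for this example.

```javascript
// Hypothetical helper: report the three "lengths" of a string.
function measure(s) {
  return {
    utf8Bytes: new TextEncoder().encode(s).length, // bytes on the wire
    utf16Units: s.length,                          // what .length counts
    codePoints: Array.from(s).length,              // Unicode code points
  };
}

const ascii = measure("abc");
const accented = measure("\u00E9");   // "é"
const emoji = measure("\u{1F600}");   // "😀"
```

The results are 3/3/3 for "abc", 2 bytes/1 unit/1 code point for "é", and 4 bytes/2 units/1 code point for "😀".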

What happens with URL-encoded payloads and UTF-8?

Percent-encoded multi-byte UTF-8 is handled correctly. data:,%E6%97%A5%E6%9C%AC (6 percent-escapes, 18 chars of payload) decodes to 日本 (2 code points, 6 UTF-8 bytes). decodeURIComponent is UTF-8-aware by default, so this just works.
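
The numbers in that example can be checked directly. The sketch also shows a caveat of decodeURIComponent: it throws a URIError when the escapes do not form valid UTF-8, so a caller has to catch that case. tryDecode is a hypothetical wrapper, and how the tool itself handles such input is not stated here.

```javascript
const payload = "data:,%E6%97%A5%E6%9C%AC".split(",")[1];
const escapes = payload.match(/%[0-9A-Fa-f]{2}/g) ?? [];
const text = decodeURIComponent(payload);

const summary = {
  payloadChars: payload.length,
  escapes: escapes.length,
  codePoints: Array.from(text).length,
  utf8Bytes: new TextEncoder().encode(text).length,
};

// Hypothetical wrapper: return null instead of throwing on malformed escapes.
function tryDecode(p) {
  try {
    return decodeURIComponent(p);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err;
  }
}

const malformed = tryDecode("caf%E9"); // 0xE9 alone is not valid UTF-8
```

Here summary reports 18 payload characters, 6 escapes, 2 code points, and 6 UTF-8 bytes, and malformed is null.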

Can I decode a Windows-1252 or Latin-1 data URI with this tool?

You can, but you’ll get U+FFFD replacement characters wherever the bytes don’t form valid UTF-8. The tool flags that in the stats line. For lossless legacy-charset decoding, use the generic Data URI to ASCII decoder, which honours the declared charset. This tool is UTF-8-only by design so you can see the byte structure clearly.

Is it free, offline, and private?

Yes. Decoding uses the browser’s native atob, decodeURIComponent, and TextDecoder. Nothing is uploaded, nothing is logged, no account is needed. Load the page once and the tool works offline indefinitely.