
Convert UTF-8 to Code Points

In short

Convert UTF-8 text to Unicode code points in 7 formats (U+, decimal, HTML hex, HTML decimal, CSS, JS, Python). Bidirectional. Free, client-side, instant, secure.

  • Runs in your browser
  • Nothing uploaded
  • Free, no sign-up

Convert UTF-8 text to Unicode code points in 7 output formats: U+XXXX, decimal, HTML entity (hex or decimal), CSS escape (\HHHH), JavaScript escape (\u{XXXX}), and Python escape (\uXXXX or \UXXXXXXXX). Each codepoint is reported as ONE value, regardless of how many UTF-8 bytes it occupies.

  • 100% Private: no server uploads, ever
  • Instant: runs in your browser
  • No Watermarks: clean output, always
  • Free Forever: no accounts, no limits

How to Use Convert UTF-8 to Code Points

  1. Paste UTF-8 text. The tool iterates Unicode codepoints (not UTF-16 code units), so a supplementary-plane character such as 🌍 becomes a single value, not a surrogate pair. Emoji built from several codepoints (flags, ZWJ sequences) are listed codepoint by codepoint.
  2. Pick a format: U+ notation for documentation; decimal for math; HTML entity for embedding in HTML; CSS / JS / Python escapes for direct paste into source code.
  3. Set padding: 4 covers the BMP; 6 covers everything up to U+10FFFF; 8 matches Python's \U escape literal.
  4. Read the per-character grid: it shows the codepoint, plane (ASCII/BMP/SMP/SIP/TIP/SSP), UTF-8 byte count and bytes in hex, and the chosen output format.
  5. Swap to reverse: the tool auto-detects any of the 7 formats by token shape and reconstructs the original text.
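The per-character grid from step 4 can be approximated in a few lines of Python. This is only a sketch of the idea, not the tool's own code; the function name `describe` and the row keys are made up for illustration.

```python
def describe(text: str) -> list[dict]:
    """Return one row per codepoint: U+ label, UTF-8 byte count, bytes in hex."""
    rows = []
    for ch in text:  # a Python str iterates by codepoint, not by UTF-16 code unit
        encoded = ch.encode("utf-8")
        rows.append({
            "char": ch,
            "codepoint": f"U+{ord(ch):04X}",        # padded to at least 4 hex digits
            "utf8_len": len(encoded),
            "utf8_hex": encoded.hex(" ").upper(),   # bytes.hex(sep) needs Python 3.8+
        })
    return rows


for row in describe("A€🌍"):
    print(row)
```

Each row holds exactly one codepoint even when the UTF-8 form spans several bytes: 🌍 yields U+1F30D with the four bytes F0 9F 8C 8D.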

Frequently Asked Questions

What’s the difference between a codepoint and a UTF-8 byte?

A codepoint is a number identifying a character in the Unicode standard (U+0000 to U+10FFFF). A UTF-8 byte is one of 1-4 bytes the codepoint serializes to in UTF-8. 🌍 is ONE codepoint (U+1F30D) but takes FOUR UTF-8 bytes (F0 9F 8C 8D). This tool extracts codepoints; the sister UTF-8 to Bytes tool extracts the bytes.

Why are emoji single codepoints if they’re 4 UTF-8 bytes?

Codepoints and bytes are different abstractions. Codepoints are the conceptual character identifiers; UTF-8 is just one of several ways to serialize them. UTF-32 would store each codepoint in 4 fixed-width bytes; UTF-16 may use 2 or 4 bytes (surrogate pairs); UTF-8 uses 1-4. This tool reports the codepoint, which is encoding-independent.
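To see the same codepoint take different sizes under different encodings, a small Python check is enough. The helper name `encoded_sizes` is hypothetical; the `-be` codec variants are used so that no byte-order mark is counted.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Byte length of one character under three Unicode encodings (no BOM)."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }


print(encoded_sizes("A"))    # {'utf-8': 1, 'utf-16': 2, 'utf-32': 4}
print(encoded_sizes("🌍"))   # {'utf-8': 4, 'utf-16': 4, 'utf-32': 4}
```

In both cases `ord(ch)` returns a single number, which is the encoding-independent value this tool reports.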

Which output format should I use?

For documentation: U+ notation. For HTML embedding: &#xHH; (hex) or &#NN; (decimal). For JavaScript source: \u{XXXX} (ES2015+) or \uXXXX (pre-ES2015, BMP only). For Python source: \uXXXX for BMP, \UXXXXXXXX for non-BMP. For CSS: \HHHH followed by a space (the trailing space terminates the escape). Decimal: rare, but it appears in some encoding specs.
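As a rough illustration of the seven formats, here is a Python sketch that renders one codepoint in each of them. The function name `format_codepoint` and the format keys ("u+", "html-hex", and so on) are invented for this example and are not the tool's internal names; only the U+ branch applies the `pad` setting here.

```python
def format_codepoint(cp: int, fmt: str, pad: int = 4) -> str:
    """Render one codepoint in the requested notation (illustrative keys)."""
    if fmt == "u+":
        return "U+" + format(cp, "X").zfill(pad)   # zero-pad to a minimum width
    if fmt == "decimal":
        return str(cp)
    if fmt == "html-hex":
        return f"&#x{cp:X};"
    if fmt == "html-dec":
        return f"&#{cp};"
    if fmt == "css":
        return f"\\{cp:X} "                        # trailing space ends the escape
    if fmt == "js":
        return f"\\u{{{cp:X}}}"                    # ES2015+ form, any codepoint
    if fmt == "python":
        # \uXXXX covers the BMP; anything higher needs the 8-digit \U form
        return f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}"
    raise ValueError(f"unknown format: {fmt}")


for name in ("u+", "decimal", "html-hex", "html-dec", "css", "js", "python"):
    print(name, format_codepoint(0x1F30D, name))
```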

What does padding do?

Pads short hex values with leading zeros to a minimum width. A (U+41) with padding 4 → U+0041; with padding 0 → U+41. Python's \U escape always needs exactly 8 hex digits regardless of value, so padding 8 is mandatory for that format. Padding 6 covers the full Unicode range up to U+10FFFF.

How does the reverse decode detect formats?

Regex-based extraction in priority order: U+HEX, \u{HEX}, \UHHHHHHHH, \uHHHH, &#xHEX;, &#NN;, CSS \HEX , 0xHEX. Anything matched is parsed as that format. If nothing matches, the input is split on whitespace and parsed as hex.
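A minimal Python sketch of this kind of detection is shown below. It assumes that "priority order" can be modeled as one regular expression whose alternatives are tried left to right at each position; the tool's actual patterns and digit limits are not documented here, so the ones below are guesses. The names `TOKEN` and `decode` are hypothetical.

```python
import re

# One capture group per format, listed in the priority order given above.
TOKEN = re.compile(
    r"U\+([0-9A-Fa-f]{1,6})"        # 1: U+HEX
    r"|\\u\{([0-9A-Fa-f]{1,6})\}"   # 2: \u{HEX}
    r"|\\U([0-9A-Fa-f]{8})"         # 3: \UHHHHHHHH
    r"|\\u([0-9A-Fa-f]{4})"         # 4: \uHHHH
    r"|&#x([0-9A-Fa-f]+);"          # 5: &#xHEX;
    r"|&#([0-9]+);"                 # 6: &#NN; (the only decimal form)
    r"|\\([0-9A-Fa-f]{1,6}) ?"      # 7: CSS \HEX plus optional space
    r"|0x([0-9A-Fa-f]+)"            # 8: 0xHEX
)


def decode(text: str) -> str:
    """Rebuild text from codepoint tokens; fall back to whitespace-split hex."""
    matches = list(TOKEN.finditer(text))
    if not matches:
        return "".join(chr(int(tok, 16)) for tok in text.split())
    chars = []
    for m in matches:
        base = 10 if m.lastindex == 6 else 16   # group 6 holds decimal digits
        chars.append(chr(int(m.group(m.lastindex), base)))
    return "".join(chars)


print(decode("U+0041 U+1F30D"))   # A🌍
print(decode("&#65;&#x42;"))      # AB
```

This sketch does no range checking of its own; `chr` rejects values above U+10FFFF but accepts surrogates.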

What happens with invalid codepoints?

Values above U+10FFFF and surrogates (U+D800-U+DFFF) raise an error that reports the position of the offending value. Surrogates are reserved exclusively for UTF-16 encoding – they shouldn’t appear as standalone codepoints.
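The two checks can be sketched in Python as follows. The function name `validate` and the wording of the error messages are assumptions for illustration; the tool's real messages may differ.

```python
def validate(codepoints: list[int]) -> None:
    """Raise ValueError naming the position of the first invalid codepoint."""
    for pos, cp in enumerate(codepoints):
        if cp > 0x10FFFF:
            raise ValueError(f"position {pos}: U+{cp:X} is above U+10FFFF")
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"position {pos}: U+{cp:04X} is a surrogate")


validate([0x41, 0x1F30D])          # valid input passes silently
try:
    validate([0x41, 0xD800])
except ValueError as err:
    print(err)                     # position 1: U+D800 is a surrogate
```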

What are the planes (ASCII/BMP/SMP/SIP)?

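The grid labels each codepoint by its Unicode plane, with ASCII called out separately. ASCII: U+0000-U+007F (the first 128 codepoints of the BMP). BMP (Basic Multilingual Plane): U+0000-U+FFFF. SMP (Supplementary Multilingual Plane): U+10000-U+1FFFF (emoji, ancient scripts). SIP (Supplementary Ideographic Plane): U+20000-U+2FFFF (rare CJK). The higher planes are edge cases: TIP (Tertiary Ideographic Plane) is U+30000-U+3FFFF and SSP (Supplementary Special-purpose Plane) is U+E0000-U+EFFFF.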
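Because each plane spans exactly 0x10000 codepoints, the plane number is simply the codepoint shifted right by 16 bits. The Python sketch below assumes ASCII is checked before the BMP label; `plane_label` and the fallback "Plane N" text are hypothetical names.

```python
def plane_label(cp: int) -> str:
    """Map a codepoint to a short plane label, with ASCII as a special case."""
    if cp <= 0x7F:
        return "ASCII"
    plane = cp >> 16  # each plane holds 0x10000 codepoints
    names = {0: "BMP", 1: "SMP", 2: "SIP", 3: "TIP", 14: "SSP"}
    return names.get(plane, f"Plane {plane}")


for cp in (0x41, 0x20AC, 0x1F30D, 0x20000):
    print(f"U+{cp:04X}", plane_label(cp))
```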

Is text uploaded?

No. Everything runs locally in your browser – nothing is sent to a server, logged, or stored, and the tool keeps working offline once the page has loaded.

What’s the input cap?

200,000 characters. Codepoint output can be up to ~5× input characters depending on format.

How does this compare to the Unicode to Code Points tool?

The Unicode→Code Points tool uses the same algorithm. This one is filed under UTF-8 and emphasizes the codepoint-vs-byte distinction, with UTF-8 byte counts in the per-character grid.

