
Extract Unicode Graphemes

In short

Segment text into grapheme clusters via Intl.Segmenter - shows every code point per grapheme plus UTF-16 / UTF-8 sizes. Free, offline, client-side, instant, secure.

  • Runs in your browser
  • Nothing uploaded
  • Free, no sign-up

Split text the way humans see it - 👨‍👩‍👧‍👦 as one grapheme, not 7 code points; 🇬🇷 as one flag, not 2 regional indicators; é as one cluster whether it's pre-composed or e + ◌́. Powered by the native Intl.Segmenter API.

  • 100% Private: no server uploads, ever
  • Instant: runs in your browser
  • No Watermarks: clean output, always
  • Free Forever: no accounts, no limits

How to Use Extract Unicode Graphemes

  1. Type or paste text into the input - anything with emoji, accented letters, CJK, or combining marks works best for showing what the tool does. Output updates live (150 ms debounce).
  2. Read the stats line. Total graphemes, total code points, total UTF-16 code units, total UTF-8 bytes, count of multi-code-point graphemes, and which engine ran (the native Intl.Segmenter on modern browsers, or the code-point fallback on older ones).
  3. Pick an output format. Aligned table (default) shows index, the grapheme in quotes, the full code-point sequence, and the byte sizes. List gives just the graphemes. Numbered adds the code points inline. CSV / JSON are for further processing.
  4. Notice multi-code-point graphemes. The family emoji 👨‍👩‍👧‍👦 appears as one grapheme but its code-point column shows all seven (man + ZWJ + woman + ZWJ + girl + ZWJ + boy). That's the key insight: what you see as one character can be several code points.
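  5. Compare with UTF-16 and UTF-8 sizes. Code points above U+FFFF (astral characters, which include most emoji) cost 2 UTF-16 code units each; everything in the BMP costs 1. UTF-8 costs vary from 1 byte (ASCII) up to 4 bytes per code point, so the family emoji above spans 25 UTF-8 bytes (four 4-byte emoji plus three 3-byte ZWJs).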
  6. Copy or download. Ctrl/Cmd + Enter copies. Download saves graphemes.txt, .csv, or .json by format.
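
The per-grapheme columns described in the steps above can be approximated in a few lines. This is a minimal sketch, not the tool's actual source: the function name describeGraphemes and the row field names are hypothetical, and it assumes an environment where Intl.Segmenter and TextEncoder are available.

```javascript
// Hypothetical helper: one row per grapheme, similar to the aligned table.
function describeGraphemes(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  const encoder = new TextEncoder(); // TextEncoder always produces UTF-8
  return Array.from(segmenter.segment(text), ({ segment }, i) => ({
    index: i,
    grapheme: segment,
    // Iterating a string with Array.from walks it by code point, not code unit.
    codePoints: Array.from(segment, (ch) =>
      "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
    ),
    utf16Units: segment.length, // .length counts UTF-16 code units
    utf8Bytes: encoder.encode(segment).length,
  }));
}

// e + combining acute accent, followed by the ZWJ family emoji
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const rows = describeGraphemes("e\u0301" + family);
console.log(rows[0].codePoints); // [ 'U+0065', 'U+0301' ]
console.log(rows[1].codePoints.length, rows[1].utf16Units, rows[1].utf8Bytes); // 7 11 25
```

Each row carries the full code-point sequence, so a multi-code-point grapheme is visible at a glance.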

Frequently Asked Questions

What is a grapheme cluster?

What a user perceives as a single character, even when it spans multiple Unicode code points. The ZWJ family emoji 👨‍👩‍👧‍👦 is one grapheme to the user but 7 code points to the encoder. The Greek flag 🇬🇷 is one grapheme but 2 regional indicators. A pre-composed é (U+00E9) is one code point and one grapheme; a decomposed é (e + U+0301) is two code points but still one grapheme. The Unicode Standard Annex #29 defines exactly when characters cluster.
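
The examples in this answer can be reproduced directly with Intl.Segmenter. The sketch below is illustrative only; splitGraphemes is a hypothetical helper name, and the strings are written as escapes so the code points are explicit.

```javascript
// Hypothetical helper: split text into grapheme clusters.
function splitGraphemes(text) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  // segment() yields objects of the form { segment, index, input }.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}"; // ZWJ family
const flag = "\u{1F1EC}\u{1F1F7}"; // regional indicators G + R (Greek flag)
const precomposed = "\u00E9";      // é as a single code point
const decomposed = "e\u0301";      // e + combining acute accent

console.log(splitGraphemes(family).length);      // 1
console.log(splitGraphemes(flag).length);        // 1
console.log(splitGraphemes(precomposed).length); // 1
console.log(splitGraphemes(decomposed).length);  // 1
console.log(Array.from(family).length);          // 7 (code points)
```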

Why does the tool show all code points now?

Because a grapheme made of multiple code points is exactly what makes this tool interesting. Showing only the first code point of a family emoji misleads you about the underlying data: the sequence is 7 code points, not 1, and string.length in JavaScript returns 11 for it because it counts UTF-16 code units rather than graphemes. The full sequence reveals the truth.

What’s the difference between code points, UTF-16 units, and UTF-8 bytes?

Three different ways to count. A code point is the abstract Unicode character ID (e.g. U+1F468 is “man”). A UTF-16 code unit is JavaScript’s internal storage unit; characters above U+FFFF take two code units (a surrogate pair). A UTF-8 byte is what gets written to disk or sent over the wire: ASCII is 1 byte, most accented Latin letters are 2, most other BMP characters are 3, and astral characters (including most emoji) are 4. The family emoji 👨‍👩‍👧‍👦 is 1 grapheme, 7 code points, 11 UTF-16 units (4 × 2 + 3 × 1), and 25 UTF-8 bytes (4 × 4 + 3 × 3).
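
The byte and code-unit costs follow directly from the numeric value of each code point. As a rough sketch (the helper names utf8Length, utf16Length, and sizes are hypothetical, and the input is assumed to be well-formed with no lone surrogates), the three counts can be computed without any encoder:

```javascript
// Bytes needed to encode one code point in UTF-8.
function utf8Length(codePoint) {
  if (codePoint <= 0x7f) return 1;   // ASCII
  if (codePoint <= 0x7ff) return 2;  // e.g. most accented Latin, Greek, Cyrillic
  if (codePoint <= 0xffff) return 3; // rest of the BMP
  return 4;                          // astral planes
}

// Code units needed to encode one code point in UTF-16.
function utf16Length(codePoint) {
  return codePoint > 0xffff ? 2 : 1; // above the BMP needs a surrogate pair
}

function sizes(text) {
  const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
  return {
    codePoints: codePoints.length,
    utf16Units: codePoints.reduce((n, cp) => n + utf16Length(cp), 0),
    utf8Bytes: codePoints.reduce((n, cp) => n + utf8Length(cp), 0),
  };
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
console.log(sizes("A").utf8Bytes);      // 1
console.log(sizes("\u00E9").utf8Bytes); // 2 (é)
console.log(sizes("\u20AC").utf8Bytes); // 3 (€)
console.log(sizes(family));             // { codePoints: 7, utf16Units: 11, utf8Bytes: 25 }
```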

What does the engine label mean?

The stats line shows “engine: Intl.Segmenter” when the native API is available (Chrome 87+, Edge 87+, Safari 14.1+, Firefox 125+), or “engine: fallback (code-point split)” when it isn’t. The fallback iterates by code point: it handles surrogate pairs correctly but cannot recognise multi-code-point grapheme clusters like ZWJ sequences, so emoji families come out split. Practically, current versions of all major browsers use the native API; Firefox was the last to add it, in version 125 (April 2024).
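
Feature detection of this kind is straightforward. The following is a sketch of one way it could be done, not the tool's actual code; makeSplitter and codePointSplit are hypothetical names, and the engine labels simply mirror the ones quoted above.

```javascript
// Fallback: split by code point. Surrogate pairs stay intact,
// but ZWJ sequences and combining marks are split apart.
function codePointSplit(text) {
  return Array.from(text);
}

// Pick the native segmenter when it exists, otherwise the fallback.
function makeSplitter() {
  if (typeof Intl !== "undefined" && typeof Intl.Segmenter === "function") {
    const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
    return {
      engine: "Intl.Segmenter",
      split: (text) => Array.from(segmenter.segment(text), (s) => s.segment),
    };
  }
  return { engine: "fallback (code-point split)", split: codePointSplit };
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const splitter = makeSplitter();
console.log(splitter.engine);
console.log(splitter.split(family).length); // 1 with the native engine
console.log(codePointSplit(family).length); // 7: the family comes out split
```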

Does this work for all scripts?

Yes. Intl.Segmenter applies the UAX #29 rules, which cover Latin, Cyrillic, Greek, Arabic, Hebrew, Devanagari, Chinese, Japanese, Korean, Thai, and every emoji sequence known to the browser’s Unicode version. The tool initialises with locale 'en'; the default grapheme rules in UAX #29 are locale-independent, so in practice the locale doesn’t change the result for graphemes (it would matter for word/sentence granularity).
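
A quick way to check the locale claim on your own engine is to segment the same string under several locales and compare. This is only a sketch: graphemesFor is a hypothetical helper, the locales are arbitrary picks, and identical output here reflects how current engines behave rather than a guarantee of the API.

```javascript
// Hypothetical helper: grapheme segments of `text` under a given locale.
function graphemesFor(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "grapheme" });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// "café" with a decomposed é, a space, then waving hand + skin-tone modifier
const sample = "cafe\u0301 \u{1F44B}\u{1F3FD}";

const results = ["en", "de", "ja"].map((locale) => graphemesFor(sample, locale));
console.log(results[0].length); // 6
// Compare every locale's output against the first one.
const allSame = results.every(
  (r) => JSON.stringify(r) === JSON.stringify(results[0])
);
console.log(allSame);
```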

How big a text can I segment?

The native Intl.Segmenter is fast: typically millions of characters in under a second on current hardware. The tool runs synchronously on the main thread, so very large inputs (10 MB+) might briefly lock up the page. The 150 ms input debounce keeps typing smooth on any practical input size.

Why is grapheme extraction useful for developers?

Cursor positioning, text selection, character counting (Twitter-style “280 characters”), and string truncation all need graphemes, not code units or code points, to behave the way users expect. Splitting a string on code units risks cutting a family emoji in half (broken render) or a surrogate pair in half (corrupt data). The CSV/JSON output of this tool is useful as a reference dataset when writing tests for grapheme-aware code.
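
Truncation is the easiest of these to get wrong. The sketch below shows one possible grapheme-safe approach next to a naive code-unit slice; truncateGraphemes is a hypothetical helper, not part of this tool or of any library.

```javascript
// Keep at most `maxGraphemes` grapheme clusters from the start of `text`.
function truncateGraphemes(text, maxGraphemes) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
  let count = 0;
  for (const { index } of segmenter.segment(text)) {
    // `index` is the UTF-16 offset where this grapheme starts,
    // so slicing there never cuts through a cluster.
    if (count === maxGraphemes) return text.slice(0, index);
    count += 1;
  }
  return text; // fewer graphemes than the limit: nothing to cut
}

const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}";
const text = "Hi " + family + "!";

const safe = truncateGraphemes(text, 4); // "Hi " + the whole family emoji
const naive = text.slice(0, 4);          // "Hi " + a lone high surrogate
console.log(safe === "Hi " + family);    // true
console.log(naive === "Hi \uD83D");      // true: corrupt trailing half of a pair
```

Reading the cut position from the segment's index avoids rebuilding the string from an array of segments.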

Is my data uploaded?

No. The page loads three static files (HTML, CSS, JS) and then runs entirely in your browser. Your text never leaves the device – no fetch, no XHR, no analytics, no cookies. You can disconnect from the internet after the page loads and the tool still works.

Is this tool free?

Yes – free, unlimited, no signup, no watermark. The extracted graphemes are yours to use anywhere. Attribution is appreciated but not required.

Keep going

Related Tools

All Unicode tools →

Extract Unicode Range

Filter text to characters in one or more hex Unicode ranges - Greek, Cyrillic,…

Center Unicode Text

Center Unicode text within a fixed width, with real grapheme counting for emoji and…

Check Spoofed Unicode Text

Detect Unicode confusables and homoglyphs from Cyrillic, Greek, Armenian, and Hebrew that imitate Latin…

Chunkify Unicode Text

Split Unicode text into equal chunks with grapheme, code-point, or UTF-16 modes. Keeps emoji…

ASCII to Unicode Converter

ASCII to Unicode & Decode decimal, hex, octal, or U+XXXX values to Unicode characters…

Convert Code Points to Unicode

Convert Code Points to Unicode (U+XXXX, hex, decimal) to characters - handles emoji, CJK,…

Convert Unicode to ASCII

Convert Unicode to ASCII with transliteration (é → e, ñ → n), replace, or…

Convert Unicode to Base64

Encode Unicode text to Base64 (and decode) with standard, URL-safe, MIME variants. UTF-8 proper.…

Convert Unicode to Binary

Convert Unicode to binary in 3 modes (UTF-8, codepoint, UTF-16). Per-character breakdown. Free, offline,…

Convert Unicode to Bytes

Convert Unicode to UTF-8 bytes in hex, decimal, or binary. Per-byte grid, reverse direction.…

Convert Unicode to Code Points

Convert Unicode to code points (U+XXXX, HTML/CSS/JS escapes) and back. Per-character breakdown. Free, offline,…

Convert Unicode to Data URL

Convert Unicode to data URLs with base64 or URL-encoding, 12 MIME types, charset toggle.…
