Question 1

What's the difference between a character, a code point, and a grapheme?

Accepted Answer

A code point is a single Unicode number, such as U+1F600 (😀). A grapheme is what a human reads as one character: 👨‍👩‍👧‍👦 is 7 code points (four person emoji joined by three zero-width joiners) but one grapheme. 'Character' is ambiguous; in Python it usually means a code point, and in JavaScript it usually means a UTF-16 code unit. For user-facing counts, use extract-unicode-graphemes.
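To make the distinction concrete, here is a small Python sketch that measures the same family emoji in code points, UTF-16 code units, and UTF-8 bytes. The variable names are illustrative only. Python's standard library has no grapheme counter, so that count is not computed here.

```python
# Family emoji: man, woman, girl, boy joined by three zero-width joiners (U+200D).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

# Python's len() on a str counts code points.
code_points = len(family)

# Each UTF-16 code unit is 2 bytes; this matches what JavaScript's .length reports.
utf16_units = len(family.encode("utf-16-le")) // 2

# UTF-8 byte length, for comparison.
utf8_bytes = len(family.encode("utf-8"))

# List the individual code points in U+XXXX notation.
labels = [f"U+{ord(ch):04X}" for ch in family]

print(code_points, utf16_units, utf8_bytes)  # 7 11 25
print(labels)
```

All three numbers describe the same single visible symbol, which is why "length" needs a unit before it means anything.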

Question 2

Why do two strings that look identical sometimes compare unequal?

Accepted Answer

They use different normalization forms. Some sources (macOS file names are a common case) give é as e + U+0301 combining acute (NFD), while copy-paste from a web page usually gives the single code point U+00E9 (NFC). Both render the same, but they differ in a byte-for-byte comparison. Run both through normalize-unicode-text with NFC and the check passes.
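The same effect can be reproduced with Python's standard unicodedata module. This is a minimal sketch; the helper name same_text is made up for illustration.

```python
import unicodedata

decomposed = "e\u0301"  # e + combining acute accent (the NFD spelling of é)
composed = "\u00e9"     # single precomposed code point (the NFC spelling of é)

# A plain comparison sees two different code point sequences.
raw_equal = decomposed == composed


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(raw_equal)                        # False
print(same_text(decomposed, composed))  # True
```

Normalizing both sides to the same form (NFC here) before comparing is the important part; which form you pick matters less than applying it consistently.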

Question 3

How do I detect a Cyrillic or Greek lookalike impersonating a Latin string?

Accepted Answer

Run check-spoofed-unicode-text. It scans every code point and flags any character from a script that doesn't match the surrounding majority, for example Cyrillic а (U+0430) inside a Latin word, or Greek ο (U+03BF) swapped for o. Common targets include Pаypal, Amаzon, and Gооgle in phishing links and typo-squatted usernames.
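As a rough illustration of the majority-script idea, the sketch below guesses each letter's script from the first word of its Unicode character name and reports the outliers. This is an assumption-heavy heuristic, not how check-spoofed-unicode-text is implemented, and the function names are invented for the example. A production check would use the Unicode Script property and confusables data instead.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Rough script guess: the first word of the character's Unicode name."""
    if not ch.isalpha():
        return None  # ignore digits, punctuation, spaces
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def find_script_outliers(text):
    """Return (index, char, script) for letters outside the majority script."""
    tagged = [(i, ch, script_of(ch)) for i, ch in enumerate(text)]
    counts = Counter(script for _, _, script in tagged if script)
    if not counts:
        return []
    majority = counts.most_common(1)[0][0]
    return [(i, ch, s) for i, ch, s in tagged if s and s != majority]


print(find_script_outliers("P\u0430ypal"))  # Cyrillic а at index 1
print(find_script_outliers("paypal"))       # []
```

Legitimately mixed-script text (a Latin brand name inside a Russian sentence, say) will also be flagged, so results like these are best treated as a prompt for review rather than proof of spoofing.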

Question 4

What does the \uXXXX sequence in JSON and Python strings mean?

Accepted Answer

It's a Unicode escape: \u00e9 represents U+00E9 (é). Python's json.dumps emits these for non-ASCII characters when ensure_ascii=True (the default), and log pipelines often keep them to sidestep encoding traps. unicode-escape-decoder reverses them into readable characters, and escape-unicode produces them when a config file must stay pure ASCII.
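A short sketch with Python's standard json module shows both directions. Note that characters outside the Basic Multilingual Plane need two \uXXXX escapes (a surrogate pair) in JSON, since each escape holds only four hex digits.

```python
import json

text = "caf\u00e9"  # "café"

# ensure_ascii=True is the default, so non-ASCII becomes \uXXXX escapes.
escaped = json.dumps(text)

# With ensure_ascii=False the character is written as-is.
readable = json.dumps(text, ensure_ascii=False)

# json.loads turns the escapes back into real characters.
round_trip = json.loads(escaped)

# U+1F600 does not fit in four hex digits, so JSON uses a surrogate pair.
emoji_escaped = json.dumps("\U0001F600")

print(escaped)        # "caf\u00e9"
print(readable)       # "café"
print(emoji_escaped)  # "\ud83d\ude00"
```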

Question 5

How do emoji break character counters?

Accepted Answer

A single emoji can span multiple code points and multiple UTF-16 units. 😀 is one code point (U+1F600) but two UTF-16 units, so JavaScript's '😀'.length returns 2. A family emoji 👨‍👩‍👧‍👦 is 7 code points. A skin-toned 👍🏽 is 2 code points: the base emoji plus a skin-tone modifier. For tweet limits and SMS segment counts, use count-unicode-characters in grapheme mode.
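To show why a grapheme-aware count differs from a code point count, here is a deliberately simplified Python sketch. It is not the Unicode segmentation algorithm (UAX #29) and is not how count-unicode-characters works; it only folds zero-width-joiner sequences, combining marks, variation selectors, and skin-tone modifiers into the preceding symbol. The function names are hypothetical, and cases such as flag emoji (pairs of regional indicators) are not handled.

```python
import unicodedata

ZWJ = "\u200d"


def is_extender(ch):
    """True for code points that attach to the previous symbol (simplified)."""
    return (
        unicodedata.category(ch) in ("Mn", "Me")  # combining marks, variation selectors
        or 0x1F3FB <= ord(ch) <= 0x1F3FF          # emoji skin-tone modifiers
    )


def approx_grapheme_count(text):
    """Approximate the number of user-perceived characters in text."""
    count = 0
    joined = False  # True when the previous code point was a ZWJ
    for ch in text:
        if ch == ZWJ:
            joined = count > 0  # a leading ZWJ has nothing to join to
            continue
        if is_extender(ch):
            continue            # belongs to the previous symbol
        if not joined:
            count += 1          # start of a new visible symbol
        joined = False
    return count


family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
thumbs = "\U0001F44D\U0001F3FD"

print(len(family), approx_grapheme_count(family))  # 7 1
print(len(thumbs), approx_grapheme_count(thumbs))  # 2 1
```

For anything user-facing, a full UAX #29 implementation is the safer choice; the sketch is only meant to show where the extra code points come from.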

Question 6

UTF-8 vs. UTF-16 vs. UTF-32: which should I use?

Accepted Answer

UTF-8 uses 1 to 4 bytes per code point and dominates the web and Linux; over 98% of public web pages served in 2025 used it. UTF-16 uses 2 or 4 bytes per code point and is JavaScript's internal string encoding and the native format of Windows APIs. UTF-32 uses a fixed 4 bytes per code point and is mainly an internal representation. Store and transmit in UTF-8 via convert-unicode-to-utf8; reach for convert-unicode-to-utf16 only when debugging a JS surrogate-pair bug.
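The size trade-off is easy to see by encoding a few sample characters in Python. The helper name encoded_sizes is illustrative; the "-le" codec variants are used so that no byte order mark is added to the counts.

```python
def encoded_sizes(text):
    """Return the encoded byte length of text in UTF-8, UTF-16, and UTF-32."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


samples = {
    "A": "A",                    # ASCII letter
    "e-acute": "\u00e9",         # é, U+00E9
    "euro": "\u20ac",            # €, U+20AC
    "grinning": "\U0001F600",    # 😀, U+1F600
}

for label, ch in samples.items():
    print(label, encoded_sizes(ch))
```

ASCII-heavy text is smallest in UTF-8, while UTF-32 always spends 4 bytes per code point regardless of content.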

Question 7

How do I safely put Arabic, Chinese, or emoji in a URL?

Accepted Answer

Percent-encode them with url-encode-unicode. A search query like شاي turns into %D8%B4%D8%A7%D9%8A, the UTF-8 byte sequence expressed as percent-encoded triplets. Modern browsers display the readable form in the address bar but send the encoded version over the wire, so old proxies, CDN rules, and server logs stay happy.
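The encoding of the example query can be checked with Python's standard urllib.parse module, which percent-encodes the UTF-8 bytes by default. This is a minimal sketch; the variable names are illustrative.

```python
from urllib.parse import quote, unquote

query = "\u0634\u0627\u064a"  # the Arabic word شاي from the example

# safe="" also encodes "/" so the result can be used as a single query value.
encoded = quote(query, safe="")
decoded = unquote(encoded)

print(encoded)           # %D8%B4%D8%A7%D9%8A
print(decoded == query)  # True
```

Each non-ASCII code point becomes one %XX triplet per UTF-8 byte, so the three Arabic letters here turn into six triplets.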

Add Combining Characters →

ASCII to Unicode Converter →

Center Unicode Text →

Check Spoofed Unicode Text →

Chunkify Unicode Text →

Convert Code Points to Unicode →

Convert Emoji to Image →

Convert Unicode to ASCII →

Convert Unicode to Base64 →

Convert Unicode to Binary →

Convert Unicode to Bytes →

Convert Unicode to Code Points →

Convert Unicode to Data URL →

Convert Unicode to Decimal →

Convert Unicode to Hex →

Convert Unicode to HTML →

Convert Unicode to Image →

Convert Unicode to Octal →

Convert Unicode to String Literal →

Convert Unicode to UTF-16 →

Convert Unicode to UTF-32 →

Convert Unicode to UTF-8 →

Count Unicode Characters →

Cyclically Shift Unicode →

Decrement Code Points →

Emoji Picker →

Escape Unicode →

Extract Unicode Graphemes →

Extract Unicode Range →

Reverse Unicode Text, Emoji Safe →

Split Text into Characters →

Truncate Text →
