
Counting Characters: Code Points vs Graphemes

Counting characters sounds simple until emoji and accents get involved, because what a person sees as one character can be several code points underneath. A flag emoji is two code points, and a four-person family emoji is seven, yet each looks like a single symbol. This guide explains the difference between code points and graphemes, shows why counts disagree, and points to free tools that count correctly.

Three ways to count

A string has three different lengths: its bytes, its code points, and its graphemes. The byte count depends on the encoding, the code point count is the number of Unicode characters, and the grapheme count is the number of units a reader perceives as single characters. For plain ASCII text such as unaccented English, encoded as UTF-8, all three match, but for emoji and accented text they diverge. The byte side is in our UTF-8 to bytes guide, and word-level counting is in our word and character count guide.
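As a rough illustration, the Python sketch below measures a string in UTF-8 bytes, UTF-16 code units, and code points. The helper name `lengths` is hypothetical, and graphemes are left out here because Python's standard library does not include a grapheme segmenter.

```python
def lengths(text: str) -> dict[str, int]:
    """Return byte, code-unit, and code-point lengths for text."""
    return {
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 code units are two bytes each; "-le" avoids a byte-order mark.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # len() on a Python str counts code points.
        "code_points": len(text),
    }


# "caf\u00e9" is "café" with a precomposed é; "\U0001F44D" is a thumbs-up emoji.
for sample in ("hello", "caf\u00e9", "\U0001F44D"):
    print(repr(sample), lengths(sample))
```

Only the ASCII sample reports the same number three times; the accented word and the emoji already split the byte count from the code point count.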

Counting code points

A code point count tallies the Unicode characters in a string, which is usually closer to a human count than bytes are. The Unicode character counter reports this number, and it is the right measure when a limit is defined in characters rather than bytes, such as some database fields and message limits.
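A minimal sketch of a code point count in Python, where `len()` on a string already counts code points. The helpers `code_points` and `within_limit` are hypothetical names for illustration, not part of any tool on this site.

```python
import unicodedata


def code_points(text: str) -> list[str]:
    """List each code point as 'U+XXXX NAME'."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


def within_limit(text: str, limit: int) -> bool:
    """Check a limit that is defined in code points, not bytes."""
    return len(text) <= limit


word = "na\u00efve"  # "naïve" with a precomposed ï (U+00EF)
print(len(word))             # 5 code points
print(len(word.encode()))    # 6 bytes in UTF-8
print(code_points(word)[2])  # U+00EF LATIN SMALL LETTER I WITH DIAERESIS
```

A five-character limit accepts this word when measured in code points, while a five-byte limit would reject it.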

Counting graphemes

A grapheme is what a person calls one character, even if it is built from several code points joined together. An accented letter written as a base plus a combining mark is two code points but one grapheme. The Unicode grapheme tool splits text into these perceived units, which is the count that matches what a reader would tally by eye.
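The base-plus-combining-mark case can be sketched in Python with the standard `unicodedata` module. This is a deliberate simplification, not the full Unicode segmentation algorithm: the hypothetical `count_graphemes_simple` only attaches combining marks to the preceding character and ignores emoji sequences and other rules.

```python
import unicodedata


def count_graphemes_simple(text: str) -> int:
    """Approximate grapheme count: a combining mark joins the previous character.

    Assumption: only general categories Mn, Mc, and Me extend a cluster.
    Emoji sequences, Hangul syllables, and other rules are not handled.
    """
    count = 0
    for ch in text:
        is_mark = unicodedata.category(ch) in ("Mn", "Mc", "Me")
        # A mark at the very start has nothing to attach to, so it counts alone.
        if not is_mark or count == 0:
            count += 1
    return count


decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

print(len(decomposed), count_graphemes_simple(decomposed))  # 2 1
print(len(composed), count_graphemes_simple(composed))      # 1 1
```

Both spellings of the accented letter give one grapheme, even though their code point counts differ.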

Why emoji break counts

Emoji are the clearest example of the gap. A country flag is two code points, a skin-tone emoji adds a modifier, and a family emoji joins several people with invisible joiner characters into one grapheme. So a single visible emoji can report as one grapheme, several code points, and many bytes. This is why naive length checks miscount emoji and sometimes cut them in half.
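The three emoji cases above can be checked with a small Python sketch. The `emoji_clusters` helper is hypothetical and covers only the rules mentioned here (regional indicator pairs, skin-tone modifiers, zero-width joiners, and the emoji variation selector); it is an approximation that can over-join in other text, so a real application should rely on a full Unicode segmentation library.

```python
ZWJ = "\u200d"                    # zero width joiner
VARIATION_SELECTOR_16 = "\ufe0f"  # requests emoji presentation


def is_regional_indicator(ch: str) -> bool:
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF


def is_skin_tone(ch: str) -> bool:
    return 0x1F3FB <= ord(ch) <= 0x1F3FF


def emoji_clusters(text: str) -> list[str]:
    """Group code points into approximate grapheme clusters (emoji rules only)."""
    clusters: list[str] = []
    for ch in text:
        prev = clusters[-1] if clusters else ""
        joins_previous = bool(prev) and (
            ch in (ZWJ, VARIATION_SELECTOR_16)
            or is_skin_tone(ch)
            or prev.endswith(ZWJ)
            # Regional indicators pair up: join only a single waiting indicator.
            or (is_regional_indicator(ch)
                and len(prev) == 1 and is_regional_indicator(prev))
        )
        if joins_previous:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


samples = {
    "flag": "\U0001F1EB\U0001F1F7",
    "thumbs up, medium skin tone": "\U0001F44D\U0001F3FD",
    "family": "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466",
}
for label, s in samples.items():
    # Columns: graphemes, code points, UTF-8 bytes
    print(label, len(emoji_clusters(s)), len(s), len(s.encode("utf-8")))
# flag 1 2 8
# thumbs up, medium skin tone 1 2 8
# family 1 7 25
```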

When the difference matters

The right count depends on the job. Storage and buffers care about bytes. A character limit cares about code points or graphemes. A cursor moving through text should step by graphemes so it never lands inside an emoji. Picking the wrong one causes truncated strings, off-by-one limits, and broken emoji, which is why knowing which length you mean is worth the small effort.
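For the storage case, one possible approach in Python is to cut at a byte budget and then discard any partial character the cut leaves behind. The function name `truncate_utf8` is an assumption for illustration.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a code point."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # errors="ignore" drops the incomplete trailing sequence left by the cut.
    return data[:max_bytes].decode("utf-8", errors="ignore")


message = "caf\u00e9 \U0001F44D"  # "café" + space + thumbs-up, 10 bytes in UTF-8

print(repr(truncate_utf8(message, 10)))  # whole message fits
print(repr(truncate_utf8(message, 8)))   # 'café ' (emoji dropped, not halved)
print(repr(truncate_utf8(message, 4)))   # 'caf'   (é needs two bytes)
```

This keeps every code point whole, but it can still end in the middle of a multi-code-point grapheme such as a family emoji, so text shown to readers is better cut on grapheme boundaries.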

Frequently asked questions

What is the difference between a code point and a grapheme?

A code point is one numbered Unicode character, such as U+00E9 for é, while a grapheme is what a reader perceives as a single character, which can be built from several code points.

How many code points is a flag emoji?

Usually two, a pair of regional indicator symbols, even though it shows as one flag.

Why do character counts disagree?

Because bytes, code points, and graphemes are three different lengths, and they match only for plain ASCII text. Emoji, accents, and other non-ASCII characters pull them apart.

Which count should a character limit use?

Code points or graphemes, not bytes, so multi-byte characters are not unfairly penalized or miscounted.

Why do emoji sometimes get cut in half?

Because a length check or cut based on code units, such as the 16-bit units of UTF-16, can land in the middle of an emoji that spans several units. Counting and cutting by graphemes avoids this.
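A small Python sketch of that failure, assuming UTF-16 as the code-unit encoding. The helper `cut_utf16_units` is hypothetical and deliberately naive.

```python
def cut_utf16_units(text: str, units: int) -> bytes:
    """Naively keep the first `units` UTF-16 code units as little-endian bytes."""
    return text.encode("utf-16-le")[: units * 2]


thumbs_up = "\U0001F44D"  # one code point, two UTF-16 code units
half = cut_utf16_units(thumbs_up, 1)

try:
    half.decode("utf-16-le")
    split_cleanly = True
except UnicodeDecodeError:
    # The cut left a lone high surrogate, which is not valid text on its own.
    split_cleanly = False

print(len(thumbs_up.encode("utf-16-le")) // 2)  # 2
print(split_cleanly)                             # False
```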


Written by Nick (ATV Team)

We build and maintain the 600+ free, client-side tools on this site, and every guide is written against the tools themselves: each figure is computed and checked before it is published, and every linked tool is tested in the browser. More about how we work on the about page, and the full library of guides lives on the blog.