UTF-8 to Bytes and Code Points Explained

UTF-8 stores each code point as one to four bytes, and converting a UTF-8 string to bytes or code points shows exactly how it is encoded under the hood. The letter A is a single byte, 65, while the grinning face emoji is four bytes (240, 159, 152, 128) and has the code point U+1F600. This guide explains the difference between bytes and code points, how to convert each way, and which free tools do the job.

Bytes versus code points

A code point is the number Unicode assigns to a character, written like U+0041 for A or U+1F600 for a smiley. A byte is a unit of storage that holds a value from 0 to 255. UTF-8 is the rule that turns code points into bytes: code points up to U+007F become one byte, up to U+07FF two bytes, up to U+FFFF three bytes, and everything above that four bytes. So one character can be one code point but several bytes. Our text encoding guide covers how UTF-8, UTF-16, and UTF-32 differ.
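To make the rule concrete, here is a minimal sketch of a hand-written encoder in Python. The function name encode_code_point is ours, chosen for illustration; in practice you would call str.encode("utf-8"), and the sketch only exists to show where the bits of a code point end up.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative helper)."""
    # Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not encodable.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# The hand-written encoder agrees with Python's built-in one.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_code_point(cp) == chr(cp).encode("utf-8")
```

The first byte of a multi-byte sequence announces the length through its leading bits, and every following byte starts with the bits 10, which is how a decoder can find character boundaries in a raw buffer.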

UTF-8 to bytes

Converting UTF-8 to bytes shows the actual stored sequence. The character A is one byte, 65. An accented e (é, U+00E9) is two bytes, 195 and 169. The grinning face emoji is four. The UTF-8 to bytes converter lists every byte for a string, which is exactly what you need when a file size, a buffer, or a network frame is measured in bytes rather than characters.
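The same view is easy to get in code. This short Python sketch uses the built-in str.encode; the helper name utf8_bytes is an illustrative choice, not part of any library.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 bytes of text as decimal values (illustrative helper)."""
    return list(text.encode("utf-8"))


# A, an accented e, and the grinning face emoji, written as escapes.
for ch in ["A", "\u00e9", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", utf8_bytes(ch))
# U+0041 [65]
# U+00E9 [195, 169]
# U+1F600 [240, 159, 152, 128]
```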

UTF-8 to code points

Code points identify the characters regardless of how they are stored. The UTF-8 to code points converter gives the U+ number for each character, which is the right view when you care about which characters a string contains rather than its byte length. This distinction is why counting characters and counting bytes can give different answers for the same text.
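As a rough sketch of the code point view, Python strings already iterate by code point, so ord is all that is needed. The helper name code_points is hypothetical.

```python
def code_points(text: str) -> list[str]:
    """Return the U+ notation for each code point in text (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


sample = "A\u00e9\U0001F600"  # A, accented e, grinning face

print(code_points(sample))          # ['U+0041', 'U+00E9', 'U+1F600']
print(len(sample))                  # 3 code points
print(len(sample.encode("utf-8")))  # 7 bytes: 1 + 2 + 4
```

The last two lines show the gap directly: the same text is three code points but seven bytes.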

Bytes back to text

Going the other way, a sequence of bytes is decoded back into characters using the same UTF-8 rules. The bytes to UTF-8 converter reassembles the multi-byte sequences into readable text, which is how a program turns a raw buffer back into a string. If a byte sequence is not valid UTF-8, a strict decoder reports an error and a lenient one substitutes the replacement character U+FFFD, which is a common source of garbled text.
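Here is a small Python sketch of both outcomes, using only the built-in bytes.decode. The helper name decode_lenient is illustrative, and the invalid input is a made-up example: the word "caf" followed by the first byte of a two-byte sequence with its second byte missing.

```python
def decode_lenient(raw: bytes) -> str:
    """Decode UTF-8, replacing invalid bytes with U+FFFD (illustrative helper)."""
    return raw.decode("utf-8", errors="replace")


good = bytes([240, 159, 152, 128])
assert good.decode("utf-8") == "\U0001F600"  # four bytes, one grinning face

bad = b"caf\xc3"  # truncated: 0xC3 starts a two-byte sequence that never ends

# Strict decoding (the default) raises an error on invalid input.
try:
    bad.decode("utf-8")
except UnicodeDecodeError as err:
    print("strict decode failed:", err.reason)

# Lenient decoding keeps going and marks the damage with U+FFFD.
assert decode_lenient(bad) == "caf\ufffd"
```

Strict decoding is usually the safer default for data you control, since the replacement character hides the fact that information was lost.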

Why this matters

The byte versus character gap causes real bugs. A database column sized in bytes can overflow on multi-byte input, a substring cut at a byte boundary can split a character, and a length check that counts bytes can reject text that fits the character limit. Seeing the bytes and code points of a string makes these issues obvious, and it is essential when debugging encoding problems or working with binary protocols that carry text.
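The split-character bug is the easiest to show in code. The Python sketch below trims a string to a byte budget without leaving half a character behind; truncate_utf8 is a hypothetical helper, and it assumes the limit is expressed in UTF-8 bytes.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Trim text to at most max_bytes of UTF-8 without splitting a code point.

    Illustrative helper: it cuts the encoded bytes, then drops any incomplete
    sequence left at the end of the cut.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # The input was valid UTF-8, so the only invalid bytes after slicing are
    # a partial sequence at the very end; errors="ignore" discards just those.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


word = "caf\u00e9"                 # 4 code points, 5 bytes
naive = word.encode("utf-8")[:4]   # b'caf\xc3', ends mid-character

print(naive)                       # b'caf\xc3'
print(truncate_utf8(word, 4))      # caf
print(truncate_utf8(word, 5))      # café
```

This keeps code points whole, but a visible character made of several code points, such as a flag emoji, can still be cut between its parts. Handling that needs grapheme cluster segmentation, which is outside this sketch.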

Frequently asked questions

What is the difference between a byte and a code point?

A code point is the Unicode number for a character, while a byte is a unit of storage. UTF-8 encodes one code point as one to four bytes.

How many bytes is an emoji in UTF-8?

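Most emoji that are a single code point, such as U+1F600, are four bytes in UTF-8. Emoji built from several code points, such as flags or skin-tone variants, take more bytes even though they appear as a single character on screen.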

Why do character count and byte count differ?

Characters outside the ASCII range take two to four bytes each in UTF-8, so a string with accents or emoji has more bytes than characters.

What is a code point written as U+1F600?

It is the Unicode number for a character in hexadecimal, here a grinning face emoji, independent of how many bytes store it.

What happens with invalid UTF-8 bytes?

The decode fails or produces replacement characters, which is a common cause of garbled or broken text.

Written by Nick (ATV Team)

We build and maintain the 600+ free, client-side tools on this site, and every guide is written against the tools themselves: each figure is computed and checked before it is published, and every linked tool is tested in the browser. More about how we work on the about page, and the full library of guides lives on the blog.