Character encodings and UTF-8

A "string" in your favorite language is a sequence of something. The something is the question this page answers. Across sixty years of computing the answer has been ASCII, then dozens of incompatible 8-bit code pages, then UCS-2, then UTF-16, then finally - everywhere except a few legacy holdouts - UTF-8. Knowing how UTF-8 is laid out at the byte level explains a long list of bugs that look unrelated: why len(s) and "number of characters" disagree, why slicing a string mid-character corrupts it, why filenames break across operating systems, why emoji are sometimes one "character" and sometimes two.

This page walks the arc: what a character actually is, how the encodings represent them in bytes, where each one shows up in real systems, and the production gotchas that come from mixing them.

1. What a "character" actually is

Three distinct concepts are usually called "character," and conflating them is the source of most string bugs:

Code point. A number assigned to an abstract character by the Unicode standard. A is code point U+0041. 🚀 is code point U+1F680. é is code point U+00E9 (but it can also be represented as e followed by a combining acute accent U+0301 - two code points that render as the same visual). The Unicode code space holds 1,114,112 code points (U+0000 through U+10FFFF), of which about 150,000 are assigned to characters.
Code unit. The unit of storage an encoding uses. UTF-8 has 8-bit code units; UTF-16 has 16-bit code units; UTF-32 has 32-bit code units. One code point may take one or more code units.
Grapheme cluster. What a human would call "one character" on screen - including base letter plus all combining marks. é is one grapheme cluster even when written as two code points. 🇺🇸 is one grapheme cluster even though it is two code points (U+1F1FA + U+1F1F8).

len("café") in Go returns the number of bytes (UTF-8 code units), which is 5 (c, a, f are one byte each, é is two). len("café") in Python 3 returns the number of code points, which is 4. Neither is the number of grapheme clusters (also 4 here, but if you write the é as e plus a combining accent you get 5 code points / 6 bytes in UTF-8 / still 4 grapheme clusters). When someone says "how long is this string," ask: bytes, code points, or graphemes?
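The two spellings of "café" can be compared with a short Go sketch. The helper name counts is made up for this illustration; it uses only the standard library, so it reports bytes and code points but not grapheme clusters, which need a segmentation library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts reports the UTF-8 byte length and the code point count of s.
// (Hypothetical helper for this illustration.)
func counts(s string) (bytes, codePoints int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	composed := "caf\u00e9"    // é as the single code point U+00E9
	decomposed := "cafe\u0301" // e followed by U+0301 COMBINING ACUTE ACCENT

	for _, s := range []string{composed, decomposed} {
		b, cp := counts(s)
		// %+q prints the string with non-ASCII code points escaped.
		fmt.Printf("%+q: %d bytes, %d code points\n", s, b, cp)
	}
	// "caf\u00e9": 5 bytes, 4 code points
	// "cafe\u0301": 6 bytes, 5 code points
}
```

Both strings render as the same four grapheme clusters, which is exactly the number neither call returns.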

2. ASCII: the 7-bit ancestor

ASCII (1963) assigns 128 numbers (0-127) to characters: 33 control codes, plus letters, digits, punctuation. Every printable English keystroke and the most common control characters fit in 7 bits.

  0x41  A      0x61  a       0x30  0
  0x42  B      0x62  b       0x31  1
  ...          ...           ...
  0x5A  Z      0x7A  z       0x39  9
  0x20  space  0x0A  newline 0x09  tab

ASCII fits in one byte with the top bit unused. For decades that top bit was used by something, and what it was used for depended on the system: Latin-1, Windows-1252, MacRoman, KOI8-R for Russian, Big5 for Chinese, Shift-JIS for Japanese. Each was a different "code page" assigning meaning to bytes 128-255. Move a file between systems and the byte 0xE9 could be é (Latin-1), È (MacRoman), И (KOI8-R), or the first half of a two-byte character (Shift-JIS, Big5).

Two things made the mess: ASCII is too small (no accented Latin, let alone CJK, Arabic, Hebrew, Cyrillic, Indic scripts), and the 8th bit got fragmented across dozens of incompatible standards. Unicode was the response.

3. Unicode and the encoding split

Unicode (1991) does one thing: assign every character a unique code point number, regardless of language. A is U+0041 (same number ASCII uses, by design). Я (Cyrillic capital Ya) is U+042F. 中 is U+4E2D. 🚀 is U+1F680.

That settles "what number?" but leaves "how do you store the numbers in bytes?" unanswered. Three real encodings emerged:

UTF-32: every code point is a 4-byte integer. Simple, fixed-width, wasteful (most text is mostly ASCII, where 3 of every 4 bytes are zero).
UTF-16: code points up to U+FFFF take one 2-byte unit; everything else takes two 2-byte units via a "surrogate pair." Internal format of Java strings, JavaScript strings, and Windows APIs. Has endianness (see endianness) so files need a BOM.
UTF-8: variable-length, 1-4 bytes per code point, ASCII is unchanged, no endianness. The web's encoding. The Linux filesystem's encoding. The encoding you should default to for almost everything new.
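To make the size trade-off concrete, here is a small sketch that measures one code point in all three encodings. The function name encodedSizes is an assumption for this example; the calls it makes (utf8.RuneLen, utf16.Encode) are from the Go standard library.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// encodedSizes returns how many bytes the code point r occupies in
// UTF-8, UTF-16, and UTF-32. (Hypothetical helper for this illustration.)
func encodedSizes(r rune) (u8, u16, u32 int) {
	u8 = utf8.RuneLen(r)                   // 1 to 4 bytes
	u16 = 2 * len(utf16.Encode([]rune{r})) // one or two 16-bit code units
	u32 = 4                                // always one 32-bit code unit
	return
}

func main() {
	for _, r := range []rune{'A', 'Я', '中', '🚀'} {
		u8, u16, u32 := encodedSizes(r)
		fmt.Printf("%U  UTF-8: %d  UTF-16: %d  UTF-32: %d bytes\n", r, u8, u16, u32)
	}
}
```

A comes out as 1/2/4 bytes, Я as 2/2/4, 中 as 3/2/4, and 🚀 as 4/4/4: UTF-8 is the smallest for ASCII-heavy text, and UTF-16 is smaller only for BMP characters above U+07FF.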

The internet decided in the 2000s: UTF-8 won. Today over 98% of web pages are UTF-8. Go, Rust, and Swift all use UTF-8 for their native strings. Windows still uses UTF-16 for its native APIs and Java's char is still a 16-bit code unit, but that is increasingly the legacy side.

4. UTF-8: the byte layout

UTF-8 is brilliant because of the way it tells you "how many bytes is this character?" from the bits of the first byte. The rules:

Code point range	Bytes	Layout
U+0000 - U+007F	1	`0xxxxxxx`
U+0080 - U+07FF	2	`110xxxxx 10xxxxxx`
U+0800 - U+FFFF	3	`1110xxxx 10xxxxxx 10xxxxxx`
U+10000 - U+10FFFF	4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`

The leading bits of byte 1 encode "how many bytes total." The 10 prefix on every continuation byte means "I am a continuation, not the start of a new character." This is the magic property: you can resynchronize from any byte in a UTF-8 stream. Scan forward until you see a byte whose top two bits are not 10; you are at a character boundary. The legacy multi-byte encodings (Shift-JIS, Big5) do not have this property: a byte taken out of context can be either a whole character or the second half of one.
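Both rules fit in a few lines of bit masking. This is a sketch with made-up helper names (seqLen, nextBoundary); it looks only at bit patterns and does not validate the sequence the way unicode/utf8 does.

```go
package main

import "fmt"

// seqLen returns the sequence length announced by a lead byte, judged by
// bit pattern only, or 0 if b is a continuation byte or matches no pattern.
func seqLen(b byte) int {
	switch {
	case b&0x80 == 0x00: // 0xxxxxxx
		return 1
	case b&0xE0 == 0xC0: // 110xxxxx
		return 2
	case b&0xF0 == 0xE0: // 1110xxxx
		return 3
	case b&0xF8 == 0xF0: // 11110xxx
		return 4
	}
	return 0
}

// nextBoundary returns the first index >= i where a character starts,
// skipping over continuation bytes (10xxxxxx).
func nextBoundary(buf []byte, i int) int {
	for i < len(buf) && buf[i]&0xC0 == 0x80 {
		i++
	}
	return i
}

func main() {
	buf := []byte("é🚀") // C3 A9 F0 9F 9A 80

	fmt.Println(seqLen(buf[0]), seqLen(buf[2])) // 2 4
	fmt.Println(nextBoundary(buf, 1))           // 2: the start of 🚀
	fmt.Println(nextBoundary(buf, 3))           // 6: end of buffer
}
```

Resynchronizing never needs to look at more than three continuation bytes, because no sequence is longer than four bytes.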

4.1 Worked example: `A`

A is U+0041. Range 0-127, so 1 byte. The byte is 0x41 = 0100_0001. Same as ASCII. Every ASCII file is already a valid UTF-8 file - this is why UTF-8 was such a clean transition.

4.2 Worked example: `é`

é is U+00E9. Range 128-2047, so 2 bytes:

  U+00E9 in binary:        1110 1001
  pad to 11 bits:          00011 101001    (the x positions in the layout, split 5 + 6)
  fit into 110xxxxx 10xxxxxx:
      110|00011 10|101001
      = 11000011 10101001
      = 0xC3 0xA9

So é in UTF-8 is the byte pair C3 A9. If you open a UTF-8 file in a viewer that thinks it is Latin-1, you will see two characters: Ã©. That is the smoking gun of a UTF-8 file mis-decoded as Latin-1, and you have seen it before in emails from misconfigured servers.
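The mis-decoding is easy to reproduce. The sketch below assumes a hand-written decodeLatin1 helper (not a library function); it works because Latin-1 byte values coincide with code points U+0000 through U+00FF.

```go
package main

import "fmt"

// decodeLatin1 interprets each byte as one Latin-1 character.
// Latin-1 byte values equal the code points U+0000-U+00FF, so the
// mapping is a plain conversion. (Hypothetical helper.)
func decodeLatin1(b []byte) string {
	runes := make([]rune, len(b))
	for i, c := range b {
		runes[i] = rune(c)
	}
	return string(runes)
}

func main() {
	utf8Bytes := []byte("é")

	fmt.Printf("% X\n", utf8Bytes)       // C3 A9
	fmt.Println(decodeLatin1(utf8Bytes)) // Ã©
}
```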

4.3 Worked example: `🚀`

🚀 is U+1F680. Range 65536-1114111, so 4 bytes:

  U+1F680 in binary:       0 0001 1111 0110 1000 0000
  pad to 21 bits:          000 011111 011010 000000
  fit into 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx:
      11110|000 10|011111 10|011010 10|000000
      = F0 9F 9A 80

A single rocket emoji is four bytes in UTF-8. This is why len("🚀") is 4 in Go (bytes) but 1 in Python 3 (code points) and "depends" in JavaScript (where the string is internally UTF-16 and a rocket is one code point but two UTF-16 code units, so "🚀".length === 2).
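The three worked examples can be reproduced with a hand-written encoder that follows the layout table row by row. This is a teaching sketch under a made-up name (encodeUTF8); real code should use unicode/utf8, which also rejects surrogates and out-of-range values.

```go
package main

import "fmt"

// encodeUTF8 packs code point r into 1-4 bytes following the layout table.
// Teaching sketch: it does not reject surrogates (U+D800-U+DFFF) or values
// above U+10FFFF, which a real encoder must.
func encodeUTF8(r rune) []byte {
	switch {
	case r < 0x80: // 0xxxxxxx
		return []byte{byte(r)}
	case r < 0x800: // 110xxxxx 10xxxxxx
		return []byte{
			0xC0 | byte(r>>6),
			0x80 | byte(r&0x3F),
		}
	case r < 0x10000: // 1110xxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xE0 | byte(r>>12),
			0x80 | byte(r>>6&0x3F),
			0x80 | byte(r&0x3F),
		}
	default: // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
		return []byte{
			0xF0 | byte(r>>18),
			0x80 | byte(r>>12&0x3F),
			0x80 | byte(r>>6&0x3F),
			0x80 | byte(r&0x3F),
		}
	}
}

func main() {
	for _, r := range []rune{'A', 'é', '🚀'} {
		fmt.Printf("%U -> % X\n", r, encodeUTF8(r))
	}
	// U+0041 -> 41
	// U+00E9 -> C3 A9
	// U+1F680 -> F0 9F 9A 80
}
```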

4.4 Decoding in code

package main

import "fmt"

func main() {
    s := "café 🚀"
    fmt.Println("bytes:", len(s))                    // 10 (c a f = 3, é = 2, space = 1, 🚀 = 4)
    fmt.Println("runes:", len([]rune(s)))            // 6 (code points: c a f é space 🚀)

    for i, r := range s {
        fmt.Printf("  byte offset %d -> rune %U (%q)\n", i, r, string(r))
    }
}

for i, r := range s in Go decodes UTF-8 on the fly: i is a byte offset, r is a code point, and r's width in bytes is utf8.RuneLen(r) (or just read the next i to see the next character's start). This is one of Go's standout language features - it makes UTF-8 the path of least resistance, not the path of most pain.
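What range does implicitly can also be written out by hand with utf8.DecodeRuneInString, which is useful when you need to control the stepping yourself. The helper name runeOffsets is an assumption for this sketch.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeOffsets returns the byte offset at which each code point of s starts,
// stepping through the string with utf8.DecodeRuneInString.
// (Hypothetical helper for this illustration.)
func runeOffsets(s string) []int {
	var offsets []int
	for i := 0; i < len(s); {
		_, size := utf8.DecodeRuneInString(s[i:])
		offsets = append(offsets, i)
		i += size
	}
	return offsets
}

func main() {
	fmt.Println(runeOffsets("café 🚀")) // [0 1 2 3 5 6]

	// An invalid byte decodes as utf8.RuneError (U+FFFD) with width 1,
	// so the loop above always makes progress on malformed input.
	r, size := utf8.DecodeRuneInString("\xff")
	fmt.Println(r == utf8.RuneError, size) // true 1
}
```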

5. UTF-16, surrogates, and the BMP

UTF-16 is what Java's String, JavaScript's strings, and the Windows API use. It encodes:

Code points U+0000 - U+FFFF (the Basic Multilingual Plane or BMP) as a single 16-bit code unit.
Code points U+10000 - U+10FFFF (the supplementary planes - emoji, less common CJK, ancient scripts) as a surrogate pair: a high surrogate (U+D800 - U+DBFF) followed by a low surrogate (U+DC00 - U+DFFF).

The code point range U+D800 - U+DFFF is permanently reserved - no character will ever be assigned there. That reservation is what makes the surrogate trick unambiguous: a code unit in that range cannot be a real character, so it must be half of a pair.

The cost of UTF-16's "mostly fixed-width" design is that string.length in JavaScript and String.length() in Java return code units, not code points or characters. "🚀".length is 2 in JavaScript because the rocket is a surrogate pair. Index s.charAt(0) and you get half a rocket - an unpaired surrogate, which is not a valid character on its own. This is why JavaScript got String.prototype.codePointAt and the for (const ch of s) iterator: the old API operates on the wrong unit, and they could not break it.
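The surrogate arithmetic itself is short: subtract 0x10000, then split the remaining 20 bits in half. Below is a sketch with a hypothetical toSurrogates helper, checked against the standard library's unicode/utf16.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// toSurrogates splits a supplementary code point (U+10000-U+10FFFF) into
// its high and low surrogate by hand. (Hypothetical helper; callers must
// pass a code point in that range.)
func toSurrogates(r rune) (hi, lo uint16) {
	v := r - 0x10000              // 20 bits remain
	hi = 0xD800 + uint16(v>>10)   // top 10 bits
	lo = 0xDC00 + uint16(v&0x3FF) // bottom 10 bits
	return
}

func main() {
	hi, lo := toSurrogates('🚀')
	fmt.Printf("%04X %04X\n", hi, lo) // D83D DE80

	// The standard library produces the same pair: two code units.
	units := utf16.Encode([]rune("🚀"))
	fmt.Printf("%04X, %d code units\n", units, len(units)) // [D83D DE80], 2 code units
}
```

Those two code units are what "🚀".length counts in JavaScript, and charAt(0) returns the first of them.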

6. Where each encoding shows up

System	Encoding
Web (HTML, JSON over HTTP)	UTF-8 (mandated for JSON)
Linux filenames	UTF-8 (by convention)
macOS filenames	UTF-8 (NFD-normalized)
Windows filenames	UTF-16
Go `string`	UTF-8 (bytes, decoded with `range`)
Rust `String`	UTF-8 (validated)
Java `String`, JavaScript string	UTF-16 (code units)
Python 3 `str`	abstract code points (internal repr varies)
Python 2 `str`	bytes (you had to know the encoding yourself; a major reason Py3 happened)
SQLite, PostgreSQL	UTF-8
MySQL `utf8`	not actually full UTF-8! Only the BMP. Use `utf8mb4` for real UTF-8.

The MySQL trap is famous: the historic utf8 charset in MySQL only stores 1-3 byte UTF-8, which misses most emoji and the rarer CJK characters that live outside the BMP. Inserting 🚀 into a utf8 column silently truncates or errors depending on mode. The fix is to use utf8mb4 everywhere; the gotcha is that older schemas still say utf8. If you see "characters get cut off after a name" or "emoji disappear from comments" in a MySQL-backed product, this is your first suspect.
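When auditing data that has to pass through a legacy utf8 column, the check is simply "is any code point above U+FFFF?". A minimal sketch, with a made-up function name:

```go
package main

import "fmt"

// needsUTF8MB4 reports whether s contains a code point above U+FFFF, that is,
// one that takes four bytes in UTF-8 and does not fit a 3-byte-only charset.
// (Hypothetical helper for this illustration.)
func needsUTF8MB4(s string) bool {
	for _, r := range s {
		if r > 0xFFFF {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsUTF8MB4("café 中"))    // false: everything is in the BMP
	fmt.Println(needsUTF8MB4("ship it 🚀")) // true: U+1F680 needs four bytes
}
```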

7. Production bugs the encoding model creates

7.1 Slicing strings by byte

s := "café"
fmt.Println(s[:3])   // "caf"
fmt.Println(s[:4])   // "caf\xc3"  ← BROKEN! cut mid-character

Slicing UTF-8 mid-character produces an invalid byte sequence. Any function that "shortens a string" by byte offset (truncating to fit a UI, splitting at a fixed width, breaking long messages) is a latent bug. Use utf8.DecodeRuneInString to step character-by-character, or []rune(s) if you want to slice by code point.
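A byte-budget truncation that never cuts mid-sequence can be sketched with utf8.RuneStart, which reports whether a byte is not a continuation byte. The function name truncateBytes is an assumption for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateBytes returns the longest prefix of s that is at most max bytes
// long and does not end in the middle of a UTF-8 sequence.
// (Hypothetical helper for this illustration.)
func truncateBytes(s string, max int) string {
	if max >= len(s) {
		return s
	}
	if max < 0 {
		max = 0
	}
	// Back up while the byte at the cut point is a continuation byte.
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max]
}

func main() {
	fmt.Printf("%q\n", truncateBytes("café", 4)) // "caf": the cut backs up before é
	fmt.Printf("%q\n", truncateBytes("café", 5)) // "café"
}
```

This keeps code points whole but can still split a grapheme cluster, for example between a base letter and its combining accent.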

7.2 Normalization

é can be U+00E9 (one code point) or e + U+0301 (combining acute, two code points). They render identically. A == comparison sees two different strings, and they sort and hash differently too. Two visually identical filenames can fail to match because one was created on macOS (NFD - decomposed) and one was typed into a Linux terminal (NFC - composed).

Production code that compares user-supplied text for equality must normalize first. Go: golang.org/x/text/unicode/norm. Java: Normalizer.normalize(s, Form.NFC). The standard is to normalize to NFC (composed) before any comparison or storage.

7.3 Grapheme clusters and "length"

"👨‍👩‍👧‍👦"  (family emoji)

How many "characters" is this? One to a human - it is rendered as a single glyph. It is seven code points (four people emoji and three U+200D zero-width joiners). In UTF-8 it is 25 bytes. In UTF-16 it is 11 code units. There is no single "right" answer; there are different answers for different questions.
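The three numbers can be checked with the standard library alone. The sketch spells the emoji with escapes so the seven code points are visible; the names family and measure are made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// family is man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
const family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

// measure returns the UTF-8 byte count, the code point count, and the
// UTF-16 code unit count of s. (Hypothetical helper.)
func measure(s string) (bytes, codePoints, utf16Units int) {
	return len(s), utf8.RuneCountInString(s), len(utf16.Encode([]rune(s)))
}

func main() {
	b, cp, u16 := measure(family)
	fmt.Println(b, cp, u16) // 25 7 11
}
```

None of the three is 1, the answer a user would give; that one needs a grapheme-cluster segmenter.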

If you need "what would the user call one character?", you want a grapheme-cluster iterator: rivo/uniseg in Go (neither the standard library nor golang.org/x/text ships one), Swift's Character type, JavaScript's Intl.Segmenter. Reaching for s.length here will give you the wrong answer in seven different ways.

7.4 The Byte Order Mark (BOM)

UTF-8 has an optional BOM (the bytes EF BB BF) at the start of a file to declare "this is UTF-8." Most tools tolerate it; some break on it. PowerShell writes BOMs by default; Linux shell tools usually do not. A #!/bin/bash line preceded by a BOM is not recognized as a shebang and the script fails with "command not found" pointing at the BOM as if it were part of the path. UTF-16 BOMs (FE FF or FF FE) declare endianness - they are necessary because UTF-16 has byte-order ambiguity that UTF-8 (single-byte units) does not.

Rule of thumb for new files: do not write a UTF-8 BOM unless something downstream demands it.
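On the reading side, dropping a leading UTF-8 BOM before parsing is a one-liner. A minimal sketch, assuming the whole file is already in memory; stripUTF8BOM and utf8BOM are names invented for this example.

```go
package main

import (
	"bytes"
	"fmt"
)

// utf8BOM is U+FEFF encoded as UTF-8.
var utf8BOM = []byte{0xEF, 0xBB, 0xBF}

// stripUTF8BOM removes a leading UTF-8 BOM if one is present and
// returns the input unchanged otherwise. (Hypothetical helper.)
func stripUTF8BOM(data []byte) []byte {
	return bytes.TrimPrefix(data, utf8BOM)
}

func main() {
	script := []byte("\xEF\xBB\xBF#!/bin/bash\necho hi\n")

	// With the BOM in front, the file does not start with "#!".
	fmt.Println(bytes.HasPrefix(script, []byte("#!")))               // false
	fmt.Println(bytes.HasPrefix(stripUTF8BOM(script), []byte("#!"))) // true
}
```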

7.5 Mojibake

When text encoded in encoding A is decoded as encoding B, you get mojibake - garbage that looks like it might be text. Patterns:

Ã© instead of é: UTF-8 bytes decoded as Latin-1 or Windows-1252 (the case from §4.2).
Â£ instead of £: same problem with the pound sign.
? or □ everywhere: characters that exist in the source but cannot be represented in the destination's encoding got replaced with a placeholder.

The fix is always to find the place where bytes were decoded with the wrong encoding and fix it there. Re-encoding mojibake is fragile and usually breaks something else.

8. Advanced: where this gets weird

8.1 Locale-aware case folding

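Lowercasing İ (U+0130, capital I with dot above) under Turkish rules produces a plain i. Under Unicode's default, locale-independent full case mapping (what Python's "İ".lower() applies) it produces i̇: i followed by U+0307 COMBINING DOT ABOVE, two code points. In the other direction, uppercasing i gives İ in Turkish and I everywhere else, and lowercasing I gives dotless ı in Turkish. Case mapping is locale-dependent because Turkish distinguishes dotted and dotless i. Any case-insensitive comparison that does not specify a locale is wrong somewhere; "lowercase the username" is a security bug if the username is ever displayed in a different locale than it was stored in.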

PostgreSQL's LOWER() is locale-aware by default. Go's strings.ToLower is not - it applies Unicode's default, locale-independent case mapping, which is the "no locale" answer (strings.ToLowerSpecial with unicode.TurkishCase is the opt-in Turkish variant). Both are useful; both will surprise you if you assume the other.
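The difference shows up on plain ASCII input. The sketch below wraps the two standard-library calls in small helpers (lowerDefault and lowerTurkish are names chosen for this example) and runs both on the same word.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// lowerDefault applies Go's locale-independent lowercase mapping.
func lowerDefault(s string) string { return strings.ToLower(s) }

// lowerTurkish applies the Turkish/Azeri special-case mapping from the
// unicode package, under which I lowercases to dotless ı.
func lowerTurkish(s string) string {
	return strings.ToLowerSpecial(unicode.TurkishCase, s)
}

func main() {
	fmt.Println(lowerDefault("TITLE")) // title
	fmt.Println(lowerTurkish("TITLE")) // tıtle

	// The two results are different strings, so a lookup keyed on one
	// will miss a value stored under the other.
	fmt.Println(lowerDefault("TITLE") == lowerTurkish("TITLE")) // false
}
```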

8.2 The Han unification controversy

Unicode unified visually similar CJK characters into single code points to save space. The unification was controversial: a glyph drawn in the Japanese style and one drawn in the Chinese style sometimes look meaningfully different to native readers, and Unicode collapsed them. The workaround is "variation selectors" (U+FE00-U+FE0F) and lang= tags in HTML to indicate which language a string is in - they affect rendering, not the code points. If you have ever built a multi-language website and seen Japanese readers complain that the kanji "look Chinese," this is why.

8.3 Bidirectional text

Arabic and Hebrew flow right-to-left. Mixed left-to-right and right-to-left text creates ambiguity that Unicode resolves with the Bidirectional Algorithm (UAX #9). The algorithm runs over a string and produces a "this character at this position renders here" mapping. It is not optional - any text-rendering library must implement it - and it has security implications: malicious right-to-left override characters can make a filename look like one extension and execute as another. CVE-2021-42574 ("Trojan Source") was exactly this attack vector applied to source code.
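A common defensive measure is to flag explicit bidirectional control characters in places where they have no business, such as filenames or source code. The sketch below checks the embedding, override, and isolate controls (U+202A-U+202E and U+2066-U+2069); the function names are made up, and whether to reject or merely warn is a policy choice this example does not make.

```go
package main

import "fmt"

// isBidiControl reports whether r is one of the explicit bidirectional
// embedding, override, or isolate control characters.
func isBidiControl(r rune) bool {
	return (r >= 0x202A && r <= 0x202E) || (r >= 0x2066 && r <= 0x2069)
}

// firstBidiControl returns the byte offset of the first bidi control
// character in s, or -1 if there is none. (Hypothetical helper.)
func firstBidiControl(s string) int {
	for i, r := range s {
		if isBidiControl(r) {
			return i
		}
	}
	return -1
}

func main() {
	// U+202E RIGHT-TO-LEFT OVERRIDE makes the tail render reversed, so this
	// name can display as if it ended in "exe.pdf".
	name := "invoice\u202Efdp.exe"

	fmt.Println(firstBidiControl(name))          // 7
	fmt.Println(firstBidiControl("invoice.pdf")) // -1
}
```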

8.4 Width and East Asian width

CJK characters in fixed-width terminals occupy two cells; ASCII characters occupy one. The Unicode standard tags every character with its "East Asian width" property (F, H, W, Na, A, N). Terminal applications that pad columns need to know this; web pages that use text-align do not. In Go, the golang.org/x/text/width package and mattn/go-runewidth are the usual implementations.

8.5 Emoji ZWJ sequences

The 👨‍👩‍👧‍👦 family emoji from §7.3 is built by joining four person emoji with U+200D ZERO WIDTH JOINER. The platform's emoji renderer recognizes the specific sequence and shows the family glyph; on a renderer that does not recognize it, you see four separate person emoji. Many newer emoji are defined only as sequences of existing code points, not as new code points of their own, so support varies dramatically by platform and time.
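Splitting the sequence on the joiner recovers the four components, which is roughly what a renderer without the combined glyph ends up showing. A small sketch with hypothetical names (family, zwjParts):

```go
package main

import (
	"fmt"
	"strings"
)

// family is man + ZWJ + woman + ZWJ + girl + ZWJ + boy.
const family = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

// zwjParts splits s at every U+200D ZERO WIDTH JOINER.
// (Hypothetical helper for this illustration.)
func zwjParts(s string) []string {
	return strings.Split(s, "\u200D")
}

func main() {
	parts := zwjParts(family)
	fmt.Println(len(parts)) // 4
	for _, p := range parts {
		fmt.Printf("%U\n", []rune(p)) // one person emoji per line
	}
}
```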

8.6 Why Python 3 broke compatibility for this

Python 2's str was bytes; you got encoding errors at random points in your program depending on what data you handled. Python 3's str is "code points" (the abstract Unicode model), and you must explicitly encode to bytes when crossing IO boundaries. This is a strictly better model and the most disruptive Python language change ever. Every other modern language has converged on roughly the same answer: strings are Unicode internally; encoding happens at the edge.

9. The mental model to keep

Code point is a number (an integer ID from Unicode).
Code unit is a storage unit of an encoding (1 byte in UTF-8, 2 bytes in UTF-16, 4 bytes in UTF-32).
Grapheme cluster is what a human calls "one character" on screen.
UTF-8 is the right default for new systems: ASCII-compatible, no endianness, resynchronizable, dominant on the web and Linux.
len(s) answers a different question depending on the language: bytes (Go), code points (Python 3), code units (Java, JavaScript). Always ask which.
Slice strings by byte boundaries only after you know you are at a character boundary. Otherwise iterate by code point or by grapheme.
Normalize text (NFC) before comparison or storage. Precomposed é (U+00E9) and decomposed é (e + U+0301) are not ==.
utf8mb4, not utf8, in MySQL. Always.

The day character encodings stop tripping you up is the day you instinctively ask "in what encoding?" when someone says "the string is corrupted."

10. Further reading

The Unicode standard, chapters 2 and 3 - the model section is dense but explains the code point / code unit / grapheme distinction with care.
Joel Spolsky, "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" (2003) - still the best beginner essay.
ICU (International Components for Unicode) - the reference implementation of pretty much every Unicode algorithm.
Go's golang.org/x/text subpackages - the standard library does the basics; this is where the deep work lives.
Endianness - prerequisite for the UTF-16 BOM discussion.