Tokenization
LLMs don't read characters or words — they read tokens. Understanding tokenization explains many model quirks.
A tokenizer converts raw text into a sequence of integers (token IDs). The vocabulary is built using Byte-Pair Encoding (BPE) or similar algorithms — frequent subword sequences become single tokens, rare ones split into smaller pieces.
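To make the merge idea concrete, here is a minimal sketch of BPE training on a toy corpus. The corpus, merge count, and the bpe_merges helper are illustrative, not any production tokenizer's API; real tokenizers operate on bytes and learn tens of thousands of merges.

from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start with each word as a tuple of single characters,
    # weighted by how often the word appears in the corpus.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair
        # with a single merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

# Toy corpus: frequent character sequences get merged first.
print(bpe_merges(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6, 4))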
Token Visualizer
The boundaries below indicate how text is split into tokens. Common words become single tokens; rare or technical terms split into multiple.
The | quick | brown | fox | un | expect | edly | ran | through | the | JSON | ify | ();
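You can reproduce these boundaries yourself by decoding each token ID back to its text, which reveals exactly where a given encoding splits a string. A small sketch using tiktoken (introduced fully below); exact splits depend on the encoding, so your output may differ slightly from the illustration above.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the encoding used by gpt-4o
text = "The quick brown fox unexpectedly ran through the JSONify();"
# Decode each token ID on its own to reveal the split boundaries.
pieces = [enc.decode([t]) for t in enc.encode(text)]
print(" | ".join(pieces))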
Why Tokens Matter for Prompting
- 💰 Pricing: APIs charge per token (input + output). A 1000-word document ≈ 1300–1500 tokens.
- 📏 Context limits:
  - Claude 3.5 Sonnet: 200K tokens
  - GPT-4o: 128K tokens
  - Llama 3 70B: 8K tokens (128K for Llama 3.1)
- 🧮 Counting quirks: Spaces, punctuation, and capitalization all affect tokenization. "GPT-4o" may be 3 tokens; "gpt4o" may be 2.
- 🌍 Non-English text: Often tokenizes less efficiently, requiring more tokens than English for the same meaning.
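These constraints combine into a simple pre-flight check before an API call. A minimal sketch, assuming hypothetical per-token prices (always check your provider's current rate card) and the 128K GPT-4o limit listed above:

import tiktoken

# Hypothetical prices in USD per 1M tokens -- check your provider's rates.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00
CONTEXT_LIMIT = 128_000  # e.g. GPT-4o

enc = tiktoken.encoding_for_model("gpt-4o")

def estimate_cost(prompt: str, max_output_tokens: int) -> float:
    input_tokens = len(enc.encode(prompt))
    if input_tokens + max_output_tokens > CONTEXT_LIMIT:
        raise ValueError(f"{input_tokens} input + {max_output_tokens} output "
                         f"tokens exceeds the {CONTEXT_LIMIT}-token context")
    return (input_tokens * INPUT_PRICE_PER_M
            + max_output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

print(f"${estimate_cost('Summarize this report...', max_output_tokens=500):.4f}")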
Token Counting with tiktoken
import tiktoken

# encoding_for_model selects the tokenizer that matches the model
# (o200k_base for gpt-4o).
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Hello, how many tokens am I using?"
tokens = enc.encode(text)
print(f"{len(tokens)} tokens: {tokens}")
# Prints something like: 9 tokens: [9906, 11, 1268, 1690, 11460, 939, 358, 1701, 30]
# (exact IDs depend on the encoding version, so your numbers may differ)
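One practical follow-on: because encode and decode round-trip, you can trim text to a fixed token budget by slicing the token list, a common way to implement context-window truncation. A short sketch reusing enc and text from above:

# Trim text to a token budget by slicing the token list and decoding.
budget = 5
print(enc.decode(enc.encode(text)[:budget]))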