Tokenization
LLMs don't read characters or words — they read tokens. Understanding tokenization explains many model quirks.
A tokenizer converts raw text into a sequence of integers (token IDs). The vocabulary is built using Byte-Pair Encoding (BPE) or similar algorithms — frequent subword sequences become single tokens, rare ones split into smaller pieces.
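To make the merge idea concrete, here is a minimal sketch of BPE training on a toy corpus. The corpus, merge count, and the bpe_merges helper are illustrative, not any production tokenizer's API; real tokenizers operate on bytes and learn tens of thousands of merges.

from collections import Counter

def bpe_merges(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    # Start with each word as a tuple of single characters,
    # weighted by how often the word appears in the corpus.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the most frequent pair
        # with a single merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

# Toy corpus: frequent character sequences get merged first.
print(bpe_merges(["low"] * 5 + ["lower"] * 2 + ["newest"] * 6, 4))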
Token Visualizer
The boundaries below indicate how text is split into tokens. Common words become single tokens; rare or technical terms split into multiple.
The | quick | brown | fox | un | expect | edly | ran | through | the | JSON | ify | ();
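You can reproduce these boundaries yourself by decoding each token ID back to its text, which reveals exactly where a given encoding splits a string. A small sketch using tiktoken (introduced fully below); exact splits depend on the encoding, so your output may differ slightly from the illustration above.

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # the encoding used by gpt-4o
text = "The quick brown fox unexpectedly ran through the JSONify();"
# Decode each token ID on its own to reveal the split boundaries.
pieces = [enc.decode([t]) for t in enc.encode(text)]
print(" | ".join(pieces))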
Why Tokens Matter for Prompting
- 💰 Pricing: APIs charge per token (input + output). A 1000-word document ≈ 1300–1500 tokens.
- 📏 Context limits:
  - Claude 3.5 Sonnet: 200K tokens
  - GPT-4o: 128K tokens
  - Llama 3 70B: 8K tokens (128K for Llama 3.1)
- 🧮 Counting quirks: Spaces, punctuation, and capitalization all affect tokenization. "GPT-4o" may be 3 tokens; "gpt4o" may be 2.
- 🌍 Non-English text: Often tokenizes less efficiently, requiring more tokens than English for the same meaning.
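These constraints combine into a simple pre-flight check before an API call. A minimal sketch, assuming hypothetical per-token prices (always check your provider's current rate card) and the 128K GPT-4o limit listed above:

import tiktoken

# Hypothetical prices in USD per 1M tokens -- check your provider's rates.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00
CONTEXT_LIMIT = 128_000  # e.g. GPT-4o

enc = tiktoken.encoding_for_model("gpt-4o")

def estimate_cost(prompt: str, max_output_tokens: int) -> float:
    input_tokens = len(enc.encode(prompt))
    if input_tokens + max_output_tokens > CONTEXT_LIMIT:
        raise ValueError(f"{input_tokens} input + {max_output_tokens} output "
                         f"tokens exceeds the {CONTEXT_LIMIT}-token context")
    return (input_tokens * INPUT_PRICE_PER_M
            + max_output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000

print(f"${estimate_cost('Summarize this report...', max_output_tokens=500):.4f}")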
Token Counting with tiktoken
import tiktoken

# encoding_for_model selects the tokenizer that matches the model
# (o200k_base for gpt-4o).
enc = tiktoken.encoding_for_model("gpt-4o")

text = "Hello, how many tokens am I using?"
tokens = enc.encode(text)
print(f"{len(tokens)} tokens: {tokens}")
# Prints something like: 9 tokens: [9906, 11, 1268, 1690, 11460, 939, 358, 1701, 30]
# (exact IDs depend on the encoding version, so your numbers may differ)
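One practical follow-on: because encode and decode round-trip, you can trim text to a fixed token budget by slicing the token list, a common way to implement context-window truncation. A short sketch reusing enc and text from above:

# Trim text to a token budget by slicing the token list and decoding.
budget = 5
print(enc.decode(enc.encode(text)[:budget]))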