BPE Tokenizer (GPT-2 Style)
A from-scratch implementation of the Byte Pair Encoding (BPE) algorithm to demystify how Large Language Models (LLMs) process text.
Note for Learning: This project was built for educational purposes to reverse-engineer and understand the BPE algorithm. While functional, it is a Python-based implementation optimized for clarity and learning rather than maximum training speed on massive corpora.
The Story
About 8 months ago, fascinated by the release of powerful LLMs like ChatGPT, I decided that simply using them wasn't enough; I wanted to understand them. I went back to the basics: the tokenizer.
I built this project to reverse-engineer the tokenization process used by models like GPT-2. Instead of relying on pre-built libraries like HuggingFace, I wrote the logic myself in pure Python to fully grasp the mechanics of vocabulary construction, merge rules, and text compression.
Key Features
- Custom Training: Capable of training a unique vocabulary on any raw text dataset (e.g., Shakespeare).
- Persistence: Saves trained models (vocabulary & merge rules) to disk and "lazy loads" them when needed.
- GPT-2 Pre-tokenization: Implements the specific whitespace handling (converting leading spaces to Ġ) used by GPT-2 to preserve sentence structure (see the sketch after this list).
- Object-Oriented Design: Clean, modular class structure suitable for integration into larger projects.
- Efficient Encoding: Optimized to load merge rules into memory once, preventing redundant I/O operations.
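To give a feel for the space-to-Ġ handling, here is a minimal illustrative sketch. It is not the project's (or GPT-2's) exact pre-tokenization code; the pre_tokenize name and the simple regex split are assumptions made for the example.

import re

# GPT-2 keeps a word's leading space attached and replaces it with the
# visible marker "Ġ" (U+0120), so spacing survives detokenization.
def pre_tokenize(text: str) -> list[str]:
    # Split into words while keeping each word's leading space attached.
    chunks = re.findall(r" ?\S+", text)
    # Replace the leading space of each chunk with the Ġ marker.
    return [("Ġ" + c[1:]) if c.startswith(" ") else c for c in chunks]

print(pre_tokenize("I think the noble Duke is good"))
# ['I', 'Ġthink', 'Ġthe', 'Ġnoble', 'ĠDuke', 'Ġis', 'Ġgood']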
Technical Implementation
This project implements Byte Pair Encoding (BPE), an iterative algorithm that compresses text by replacing the most frequent pair of adjacent bytes with a new, single token.
How it works:
- Initialization: Raw text is converted into a list of characters, with spaces preserved as special tokens.
- Frequency Analysis: The algorithm scans the corpus to count all adjacent character pairs.
- Merge & Update: The most frequent pair (e.g., ('t', 'h')) is merged into a new token ('th').
- Iteration: This process repeats until the desired vocabulary size is reached (a minimal sketch of this loop follows the list).
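The training loop can be sketched in a few lines. This is a simplified illustration, not the project's actual train() implementation; the train_bpe name and num_merges parameter are assumptions.

from collections import Counter

def train_bpe(symbols: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Greedily merge the most frequent adjacent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        # Frequency analysis: count every adjacent pair in the corpus.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair, e.g. ('t', 'h')
        merges.append(best)
        # Merge & update: rewrite the corpus with the new token, e.g. 'th'.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges

# Example: learn 3 merges from a tiny character-level corpus.
print(train_bpe(list("the thin thug"), num_merges=3))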
Dataset
This tokenizer was trained on the complete works of William Shakespeare.
To reproduce the training results:
- Download the raw text file from Project Gutenberg.
- Save it as the_complete_work.txt in the project root.
- Run the training script (a small download-and-train sketch follows the note below).
(Note: The dataset itself is not included in this repository to keep it lightweight.)
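The steps above can also be scripted. This is an optional sketch; the Project Gutenberg URL below (ebook #100, the complete works of Shakespeare) is an assumption and may need to be adjusted to the current mirror.

import urllib.request
from tokenizer import Tokenizer

# Assumed plain-text URL for Project Gutenberg ebook #100; verify on
# gutenberg.org before running.
URL = "https://www.gutenberg.org/cache/epub/100/pg100.txt"

urllib.request.urlretrieve(URL, "the_complete_work.txt")  # save into the project root
Tokenizer().train("the_complete_work.txt", ntokens=4000)  # train once, then reuse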
Usage
from tokenizer import Tokenizer
# 1. Initialize
tokenizer = Tokenizer()
# 2. Train on your own dataset (only needs to be done once)
# tokenizer.train("the_complete_work.txt", ntokens=4000)
# 3. Encode text (converts string -> list of integer IDs)
text = "I think the noble Duke is good"
ids = tokenizer.encode(text)
print(f"Token IDs: {ids}")
# Output: [7, 1062, 1959, 315, 1494, 2218]
# 4. Decode (converts IDs -> original string)
decoded_text = tokenizer.decode(ids)
print(f"Decoded: {decoded_text}")
What I Learned
Building this project was a deep dive into Natural Language Processing (NLP) and software engineering fundamentals:
- Algorithm Design: I learned how to translate a theoretical paper (BPE) into working code.
- Data Structures: Gained a deeper appreciation for dictionaries and sets for O(1) lookups during the frequent pair counting process.
- Optimization: My initial version was slow on large datasets. I learned to identify bottlenecks (like file I/O inside loops) and refactored the code to load models into RAM, drastically improving performance (a small lazy-loading sketch follows this list).
- Pythonic Best Practices: Evolved the project from a simple script into a robust Class with Type Hinting, Docstrings, and proper encapsulation.
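The lazy-loading idea can be illustrated with a short sketch. It is not the project's exact code; the LazyMerges class name and the merges.json filename are assumptions.

import json
from pathlib import Path

class LazyMerges:
    """Load merge rules from disk once and keep them in memory afterwards."""

    def __init__(self, path: str = "merges.json"):
        self.path = Path(path)
        self._merges = None          # not loaded yet

    def get(self) -> list:
        # First call hits the disk; every later call reuses the cached list,
        # avoiding repeated file I/O inside the encode loop.
        if self._merges is None:
            with self.path.open(encoding="utf-8") as f:
                self._merges = json.load(f)
        return self._merges

Calling get() repeatedly returns the same in-memory list after the first read, which is the behavior the Efficient Encoding feature above refers to.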
References & Research
This implementation is based on the methods described in the foundational papers that adapted BPE (originally a compression algorithm) for Neural Machine Translation and LLMs.
Primary Reference:
"Neural Machine Translation of Rare Words with Subword Units" > Rico Sennrich, Barry Haddow, Alexandra Birch (2016)
"We introduce a simpler and more effective approach... making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units."
Original Algorithm:
"A New Algorithm for Data Compression" > Philip Gage (1994)
Created by Ahmed Sadik