My recent projects deconstruct LLM components by implementing them myself and explaining the underlying theory:

[Transformer LM] From-scratch implementation featuring RoPE and KV cache, emphasis on code quality and clarity
[RL Course] Teaching RL through rigorous mathematical derivations and implementations from first principles
[BPE Tokeniser] Optimised training implementation (hours → 13s) with detailed technical writeup

I’m also exploring [Mechanistic Interpretability] through toy model experiments: reproducing superposition research and training sparse autoencoders (SAEs) to extract learned features.

Oxford MCompPhil • Computer Science + Philosophy of Mind & Ethics
First Class (2024)

Full CV

Rotary Positional Embeddings

When a language model processes text, it must determine how much the context of one token helps explain the meaning of another. This depends on both content (*dog* should attend more strongly to *puppy* than to *t-shirt*) and relative position: the word *cat* appearing in the same sentence has far more semantic influence than the same word appearing chapters away. Without positional information, Transformers are permutation-invariant and cannot distinguish between these cases.
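
As a minimal sketch of the idea (not the project's actual code; the function name `rope` and the `base` parameter are illustrative), each query/key dimension pair is rotated by an angle proportional to its position, so the attention score between two tokens depends only on their relative offset:

```python
import torch

def rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional embeddings to x of shape (seq_len, d), with d even."""
    d = x.shape[-1]
    # One frequency per pair of dimensions: theta_i = base^(-2i/d)
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)      # (d/2,)
    angles = positions[:, None].float() * inv_freq[None, :]      # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                          # split into 2D pairs
    # Rotate each pair by its position-dependent angle
    return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

# The query-key dot product depends only on the relative offset (here, 2 in both cases):
q, k = torch.randn(1, 8), torch.randn(1, 8)
score_a = rope(q, torch.tensor([3])) @ rope(k, torch.tensor([5])).T
score_b = rope(q, torch.tensor([10])) @ rope(k, torch.tensor([12])).T
print(torch.allclose(score_a, score_b, atol=1e-5))  # True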

Read More

From Hours to Seconds: Optimising BPE Tokeniser Training

Training a tokeniser on even modest datasets can be surprisingly slow. When I implemented the standard chunked BPE algorithm (following Andrej Karpathy’s tutorial) and ran it on a 100MB dataset, the estimated training time was several hours. For context, I’m building a toy language model from scratch. The tokeniser is just a small part of the puzzle, and here I was, watching my CPU churn through what seemed like it should be a straightforward algorithm.
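
For reference, here is a minimal, unchunked baseline sketch of BPE training, not the chunked variant from the tutorial or the optimised implementation from the writeup (the function name `train_bpe_naive` is illustrative). Recounting every adjacent pair from scratch on each merge is what makes the naive approach so slow:

```python
from collections import Counter

def train_bpe_naive(text: str, num_merges: int) -> list[tuple[int, int]]:
    """Baseline BPE training: recount all pairs after every merge (slow, illustrative)."""
    ids = list(text.encode("utf-8"))
    merges = []
    for new_id in range(256, 256 + num_merges):
        # Full recount of adjacent pairs on every iteration: O(merges * corpus)
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with the new token id
        out, i = [], 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == best:
                out.append(new_id)
                i += 2
            else:
                out.append(ids[i])
                i += 1
        ids = out
    return merges
```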

Read More