My recent projects deconstruct LLM components by implementing them myself and explaining the underlying theory:

[Transformer LM] From-scratch implementation featuring RoPE and KV cache, emphasis on code quality and clarity
[RL Course] Teaching RL through rigorous mathematical derivations and implementations from first principles
[BPE Tokeniser] Optimised training implementation (hours → 13s) with detailed technical writeup

I’m also exploring [Mechanistic Interpretability] through toy model experiments: reproducing superposition research and training SAEs to extract learned features.

Oxford MCompPhil • Computer Science + Philosophy of Mind & Ethics
First Class (2024)

Full CV

Wild Chameleons: If Neural Chameleons emerged naturally, how would they work?

A central worry about monitoring as an AI safety mechanism is that a sufficiently intelligent model could strategically evade a monitor while scheming, rendering the monitor useless. This is highly plausible for Chain of Thought (CoT) monitors: a smart model could encode its true reasoning in plain sight using steganography, or a sufficiently large model could reason within its forward pass while presenting an unfaithful CoT to mislead the monitor. A recent paper, Neural Chameleons (McGuinness et al., 2025), shows that it’s also possible for a model to evade model-internals-based monitors.

In this post, I’ll first get you up to speed on the Neural Chameleons paper, then explain why its use of a known trigger represents a best-case scenario for defenders. I’ll then introduce a taxonomy of chameleon types (manual, instinctive, and rational) distinguished by their cognitive mechanisms, and explore through a thought experiment how the latter two might arise in the wild, before closing with avenues for defence.

Read More

Direct Logit Attribution

Transformer language models generate text by producing a probability distribution over the next token. This distribution is derived from a vector of logits (one value per token in the vocabulary) where higher values indicate tokens the model considers more likely. But how does the model arrive at these logits? Which internal components are responsible for promoting or suppressing particular tokens?

Direct Logit Attribution (DLA) is an interpretability method which allows us to answer these questions by decomposing the output logits into a sum of contributions from each component of a transformer (attention layers and MLPs). For a given prompt, we can determine exactly how much each component directly contributes to the logit of any token – revealing, for instance, that a specific MLP directly promotes “Paris” as the next token after “The Eiffel Tower is in the city of”.
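As a flavour of what this looks like in code, here’s a minimal sketch using the TransformerLens library; the model choice, prompt, and variable names are illustrative assumptions rather than the exact code behind the post:

```python
# Illustrative DLA sketch with TransformerLens (not necessarily the post's implementation).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
prompt = "The Eiffel Tower is in the city of"
paris = model.to_single_token(" Paris")

with torch.no_grad():
    _, cache = model.run_with_cache(prompt)

# Decompose the final residual stream into per-component contributions
# (embeddings, each attention layer, each MLP) at the last position.
components, labels = cache.decompose_resid(layer=-1, pos_slice=-1, return_labels=True)
components = cache.apply_ln_to_stack(components, layer=-1, pos_slice=-1)

# Project each contribution onto the " Paris" unembedding direction:
# this is that component's direct contribution to the " Paris" logit.
dla = components[:, 0] @ model.W_U[:, paris]
for label, value in sorted(zip(labels, dla.tolist()), key=lambda x: -x[1]):
    print(f"{label:>12}: {value:+.2f}")
```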

Read More

DeepSeek Sparse Attention

The vanilla attention algorithm¹ scales quadratically in context length due to the pairwise Key/Query dot products. This makes long contexts – a requirement for complex reasoning and long agentic tasks – computationally expensive. However, each token typically attends to only a small fraction of prior tokens. Computing accurate attention scores for these relevant positions requires high numerical precision (BF16) and separate computations per head. But identifying which positions are irrelevant can use a much cheaper approximation: lower precision (FP8) and a single shared score across all heads.

DeepSeek’s new sparse attention mechanism, called DeepSeek Sparse Attention (DSA), introduces a lightweight Lightning Indexer that efficiently identifies the most relevant tokens for each query. The main MLA attention then computes precise attention scores over only this selected subset, dramatically reducing computational cost whilst maintaining model quality.

  1. Vaswani, et al. (2017). Attention Is All You Need. arXiv:1706.03762
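To make the two-stage structure concrete, here’s an illustrative (and heavily simplified) PyTorch sketch. The function name, shapes, and the indexer’s inputs are my own assumptions, and low precision is only noted in comments rather than emulated – this is not DeepSeek’s actual kernel:

```python
import torch
import torch.nn.functional as F

def sparse_attention_step(q, k, v, q_idx, k_idx, top_k=256):
    """
    q:      (heads, d)     query for the current position
    k, v:   (T, heads, d)  cached keys/values for prior positions
    q_idx:  (d_idx,)       cheap index query for the current token (one per token, not per head)
    k_idx:  (T, d_idx)     cheap index keys for prior positions
    """
    T = k.shape[0]
    # Stage 1: lightweight indexer -- one approximate score per prior token,
    # shared across heads (the real indexer runs in low precision, e.g. FP8).
    approx_scores = k_idx @ q_idx                                        # (T,)
    top = torch.topk(approx_scores, k=min(top_k, T)).indices             # positions to keep

    # Stage 2: exact multi-head attention over the selected subset only.
    k_sel, v_sel = k[top], v[top]                                        # (k, heads, d)
    scores = torch.einsum("hd,khd->hk", q, k_sel) / q.shape[-1] ** 0.5   # (heads, k)
    weights = F.softmax(scores, dim=-1)
    return torch.einsum("hk,khd->hd", weights, v_sel)                    # (heads, d)
```

The saving comes from the second stage touching only the selected k positions rather than all T, while the O(T) indexer pass stays cheap because it produces a single low-precision score per token instead of per-head BF16 scores.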

Read More

Deep RL: Baselines, Actor-Critic and GAE

A central challenge in deep RL is the high variance in gradient estimates, leading to unstable training and poor sample efficiency. This blog post explores how baselines, actor-critic methods, and Generalised Advantage Estimation (GAE) tackle this problem.

A surprising result underpins these methods: we can subtract arbitrary baselines from returns without biasing gradient estimates – yet this modification may dramatically reduce variance. We’ll progress from simple constant baselines through to state-dependent baselines (actor-critic), culminating in GAE, which allows precise control of the bias-variance trade-off. Along the way, we’ll examine the effects and guarantees of each method with respect to bias and variance.
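To make the end-point concrete, here’s a minimal sketch of the GAE recursion in plain Python (variable names and the bootstrap convention are my own, not necessarily those used in the post):

```python
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """
    rewards, dones: length-T lists for one rollout
    values:         length-(T+1) list of V(s_t) estimates, including a bootstrap
                    value for the state after the final step
    Returns the GAE advantages A_t.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Exponentially weighted sum of future TD residuals
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```

Setting lam=0 recovers the one-step TD advantage (low variance, more bias), while lam=1 recovers the discounted return minus the value baseline (unbiased, high variance).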

Read More

Rotary Positional Embeddings

When a language model processes text, it must determine how much the context of one token helps explain the meaning of another. This depends on both content (a dog should attend more strongly to puppy than to t-shirt) and relative position – the word cat appearing in the same sentence has far more semantic influence than the same word appearing chapters away. Without positional information, Transformers are permutation-invariant and cannot distinguish between these cases.

Rotary Position Embeddings (RoPE) elegantly solve this by rotating query and key vectors based on their positions before computing attention scores¹. Rather than adding positional embeddings to tokens, RoPE bakes position-awareness directly into the query-key interactions. This approach naturally captures relative positions, extends to arbitrary sequence lengths, and allows different attention heads to learn different position-sensitivity patterns.

This post examines RoPE from first principles. We’ll start with a review of traditional positional embeddings and their limitations, then derive RoPE’s mathematical formulation from 2D rotations to the general case. We’ll explore why it works so well in practice, and finish with implementation details and real-world performance considerations.

  1. Su, et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864
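As a preview of the implementation details, here’s a minimal sketch of the rotation applied to a sequence of query or key vectors. It uses the common “rotate-half” layout rather than the paper’s interleaved pairing, and is a simplification of, not a substitute for, the walkthrough in the post:

```python
import torch

def apply_rope(x, base=10000.0):
    """
    x: (seq_len, d) query or key vectors, d even.
    Rotates the pair (x[:, i], x[:, i + d//2]) at position m by angle m * theta_i,
    so that dot products between rotated queries and keys depend only on
    relative position.
    """
    seq_len, d = x.shape
    half = d // 2
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * theta   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Applied to queries and keys (but not values) before the attention dot product:
# scores = apply_rope(q) @ apply_rope(k).T / d ** 0.5
```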

Read More

From Hours to Seconds: Optimising BPE Tokeniser Training

Training a tokeniser on even modest datasets can be surprisingly slow. When I implemented the standard chunked BPE algorithm (following Andrej Karpathy’s tutorial) and ran it on a 100MB dataset, the estimated training time was several hours. For context, I’m building a toy language model from scratch. The tokeniser is just a small part of the puzzle, and here I was, watching my CPU churn through what seemed like it should be a straightforward algorithm.

This article details the six optimisations I made to the vanilla BPE algorithm – how and why they work, as well as their performance gains. Combined, they bring that training time down to just 13.3s! The full implementations are available on GitHub, so you can follow along with the optimisations as we go.
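For reference, here’s a sketch of the vanilla training loop in question – every merge recounts every adjacent pair and rewrites every word, which is exactly the O(corpus size)-per-merge cost the optimisations attack. This is an illustrative baseline, not the optimised code from the repo:

```python
from collections import Counter

def train_bpe_naive(word_counts: dict[tuple[bytes, ...], int], num_merges: int):
    """Vanilla BPE training: every iteration recounts all adjacent pairs and
    rewrites every word, so each merge costs a full pass over the corpus."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the whole (pre-tokenised) corpus.
        pair_counts = Counter()
        for word, count in word_counts.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Apply the winning merge to every word -- another full pass.
        new_counts: dict[tuple[bytes, ...], int] = {}
        for word, count in word_counts.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            key = tuple(merged)
            new_counts[key] = new_counts.get(key, 0) + count
        word_counts = new_counts
    return merges
```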

Read More