My recent projects deconstruct LLM components by implementing them myself and explaining the underlying theory:

[Transformer LM] From-scratch implementation featuring RoPE and KV cache, emphasis on code quality and clarity
[RL Course] Teaching RL through rigorous mathematical derivations and implementations from first principles
[BPE Tokeniser] Optimised training implementation (hours → 13s) with detailed technical writeup

I’m also exploring [Mechanistic Interpretability] through toy model experiments: reproducing superposition research and training SAEs to extract learned features.

Oxford MCompPhil • Computer Science + Philosophy of Mind & Ethics
First Class (2024)

Full CV

Wild Chameleons: If Neural Chameleons emerged naturally, how would they work?

A central worry about monitoring as an AI safety mechanism is that a sufficiently intelligent model could strategically evade a monitor while scheming, rendering the monitor useless. This is highly plausible for Chain of Thought (CoT) monitors: a smart model could encode its true reasoning in plain sight using steganography, or a sufficiently large model could reason within its forward pass while presenting an unfaithful CoT to mislead the monitor. A recent paper, Neural Chameleons (McGuinness et al., 2025), shows that it’s also possible for a model to evade model-internals-based monitors.

In this post, I’ll first get you up to speed on the Neural Chameleons paper, then explain why its use of a known trigger represents a best-case scenario for defenders. I’ll then introduce a taxonomy of chameleon types (manual, instinctive, and rational) distinguished by their cognitive mechanisms, and explore through a thought experiment how the latter two might arise in the wild, before closing with avenues for defence.

Read More

Direct Logit Attribution

Transformer language models generate text by producing a probability distribution over the next token. This distribution is derived from a vector of logits (one value per token in the vocabulary) where higher values indicate tokens the model considers more likely. But how does the model arrive at these logits? Which internal components are responsible for promoting or suppressing particular tokens?

Direct Logit Attribution (DLA) is an interpretability method which allows us to answer these questions by decomposing the output logits into a sum of contributions from each component of a transformer (attention layers and MLPs). For a given prompt, we can determine exactly how much each component directly contributes to the logit of any token – revealing, for instance, that a specific MLP directly promotes “Paris” as the next token after “The Eiffel Tower is in the city of”.
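As a flavour of what this looks like in code, here’s a minimal sketch using the TransformerLens library; the model choice, prompt, and variable names are illustrative assumptions rather than the exact code behind the post:

```python
# Illustrative DLA sketch with TransformerLens (not necessarily the post's implementation).
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")
prompt = "The Eiffel Tower is in the city of"
paris = model.to_single_token(" Paris")

with torch.no_grad():
    _, cache = model.run_with_cache(prompt)

# Decompose the final residual stream into per-component contributions
# (embeddings, each attention layer, each MLP) at the last position.
components, labels = cache.decompose_resid(layer=-1, pos_slice=-1, return_labels=True)
components = cache.apply_ln_to_stack(components, layer=-1, pos_slice=-1)

# Project each contribution onto the " Paris" unembedding direction:
# this is that component's direct contribution to the " Paris" logit.
dla = components[:, 0] @ model.W_U[:, paris]
for label, value in sorted(zip(labels, dla.tolist()), key=lambda x: -x[1]):
    print(f"{label:>12}: {value:+.2f}")
```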

Read More

DeepSeek Sparse Attention

The vanilla attention algorithm¹ scales quadratically in context length due to the pairwise Key/Query dot products. This makes long contexts – a requirement for complex reasoning and long agentic tasks – computationally expensive. However, each token typically attends to only a small fraction of prior tokens. Computing accurate attention scores for these relevant positions requires high numerical precision (BF16) and separate computations per head. But identifying which positions are irrelevant can use a much cheaper approximation: lower precision (FP8) and a single shared score across all heads.

DeepSeek’s new sparse attention mechanism, called DeepSeek Sparse Attention (DSA), introduces a lightweight Lightning Indexer that efficiently identifies the most relevant tokens for each query. The main MLA attention then computes precise attention scores over only this selected subset, dramatically reducing computational cost whilst maintaining model quality.

  1. Vaswani, et al. (2017). Attention Is All You Need. arXiv:1706.03762
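To make the two-stage structure concrete, here’s an illustrative (and heavily simplified) PyTorch sketch. The function name, shapes, and the indexer’s inputs are my own assumptions, and low precision is only noted in comments rather than emulated – this is not DeepSeek’s actual kernel:

```python
import torch
import torch.nn.functional as F

def sparse_attention_step(q, k, v, q_idx, k_idx, top_k=256):
    """
    q:      (heads, d)     query for the current position
    k, v:   (T, heads, d)  cached keys/values for prior positions
    q_idx:  (d_idx,)       cheap index query for the current token (one per token, not per head)
    k_idx:  (T, d_idx)     cheap index keys for prior positions
    """
    T = k.shape[0]
    # Stage 1: lightweight indexer -- one approximate score per prior token,
    # shared across heads (the real indexer runs in low precision, e.g. FP8).
    approx_scores = k_idx @ q_idx                                        # (T,)
    top = torch.topk(approx_scores, k=min(top_k, T)).indices             # positions to keep

    # Stage 2: exact multi-head attention over the selected subset only.
    k_sel, v_sel = k[top], v[top]                                        # (k, heads, d)
    scores = torch.einsum("hd,khd->hk", q, k_sel) / q.shape[-1] ** 0.5   # (heads, k)
    weights = F.softmax(scores, dim=-1)
    return torch.einsum("hk,khd->hd", weights, v_sel)                    # (heads, d)
```

The saving comes from the second stage touching only the selected k positions rather than all T, while the O(T) indexer pass stays cheap because it produces a single low-precision score per token instead of per-head BF16 scores.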

Read More

Deep RL: Baselines, Actor-Critic and GAE

A central challenge in deep RL is the high variance in gradient estimates, leading to unstable training and poor sample efficiency. This blog post explores how baselines, actor-critic methods, and Generalised Advantage Estimation (GAE) tackle this problem.

A surprising result underpins these methods: we can subtract arbitrary baselines from returns without biasing gradient estimates – yet this modification may dramatically reduce variance. We’ll progress from simple constant baselines through to state-dependent baselines (actor-critic), culminating in GAE, which allows precise control of the bias-variance trade-off. Along the way, we’ll examine the effects and guarantees of each method with respect to bias and variance.
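To make the end-point concrete, here’s a minimal sketch of the GAE recursion in plain Python (variable names and the bootstrap convention are my own, not necessarily those used in the post):

```python
def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    """
    rewards, dones: length-T lists for one rollout
    values:         length-(T+1) list of V(s_t) estimates, including a bootstrap
                    value for the state after the final step
    Returns the GAE advantages A_t.
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]
        # Exponentially weighted sum of future TD residuals
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    return advantages
```

Setting lam=0 recovers the one-step TD advantage (low variance, more bias), while lam=1 recovers the discounted return minus the value baseline (unbiased, high variance).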

Read More

Rotary Positional Embeddings

When a language model processes text, it must determine how much the context of one token helps explain the meaning of another. This depends on both content (a dog should attend more strongly to puppy than to t-shirt) and relative position – the word cat appearing in the same sentence has far more semantic influence than the same word appearing chapters away. Without positional information, Transformers are permutation-invariant and cannot distinguish between these cases.

Rotary Position Embeddings (RoPE) elegantly solve this by rotating query and key vectors based on their positions before computing attention scores¹. Rather than adding positional embeddings to tokens, RoPE bakes position-awareness directly into the query-key interactions. This approach naturally captures relative positions, extends to arbitrary sequence lengths, and allows different attention heads to learn different position-sensitivity patterns.

This post examines RoPE from first principles. We’ll start with a review of traditional positional embeddings and their limitations, then derive RoPE’s mathematical formulation from 2D rotations to the general case. We’ll explore why it works so well in practice, and finish with implementation details and real-world performance considerations.

  1. Su, et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864
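As a preview of the implementation details, here’s a minimal sketch of the rotation applied to a sequence of query or key vectors. It uses the common “rotate-half” layout rather than the paper’s interleaved pairing, and is a simplification of, not a substitute for, the walkthrough in the post:

```python
import torch

def apply_rope(x, base=10000.0):
    """
    x: (seq_len, d) query or key vectors, d even.
    Rotates the pair (x[:, i], x[:, i + d//2]) at position m by angle m * theta_i,
    so that dot products between rotated queries and keys depend only on
    relative position.
    """
    seq_len, d = x.shape
    half = d // 2
    theta = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)      # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * theta   # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Applied to queries and keys (but not values) before the attention dot product:
# scores = apply_rope(q) @ apply_rope(k).T / d ** 0.5
```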

Read More

From Hours to Seconds: Optimising BPE Tokeniser Training

Training a tokeniser on even modest datasets can be surprisingly slow. When I implemented the standard chunked BPE algorithm (following Andrej Karpathy’s tutorial) and ran it on a 100MB dataset, the estimated training time was several hours. For context, I’m building a toy language model from scratch. The tokeniser is just a small part of the puzzle, and here I was, watching my CPU churn through what seemed like it should be a straightforward algorithm.

This article details the six optimisations I made to the vanilla BPE algorithm – how and why they work, as well as their performance gains. Combined, they bring that training time down to just 13.3s! The full implementations are available on GitHub, so you can follow along with the optimisations as we go.
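For reference, here’s a sketch of the vanilla training loop in question – every merge recounts every adjacent pair and rewrites every word, which is exactly the O(corpus size)-per-merge cost the optimisations attack. This is an illustrative baseline, not the optimised code from the repo:

```python
from collections import Counter

def train_bpe_naive(word_counts: dict[tuple[bytes, ...], int], num_merges: int):
    """Vanilla BPE training: every iteration recounts all adjacent pairs and
    rewrites every word, so each merge costs a full pass over the corpus."""
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the whole (pre-tokenised) corpus.
        pair_counts = Counter()
        for word, count in word_counts.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)

        # Apply the winning merge to every word -- another full pass.
        new_counts: dict[tuple[bytes, ...], int] = {}
        for word, count in word_counts.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            key = tuple(merged)
            new_counts[key] = new_counts.get(key, 0) + count
        word_counts = new_counts
    return merges
```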

Read More