How a base language model learns to reason with reinforcement learning — the actual mechanics behind nano-aha-moment, descending from the one-sentence intuition to the gradient that does the work.
Take a raw base model. Ask it to solve a puzzle. If the answer is correct, nudge it to do more of whatever it just did. Repeat a few thousand times.
That's the whole shape of it. The twist that made DeepSeek's R1-Zero famous: the reward isn't a human rating or a learned "reward model"; it's a verifier, a little program that checks the answer. This is called RLVR: reinforcement learning with verifiable rewards.
Train this way long enough and something strange emerges on its own — the model starts to backtrack, second-guess, and re-derive. Nobody taught it to. That spontaneous "wait, let me reconsider" is the aha moment the repo is named for.
The task here is Countdown: given numbers like [19, 36, 55, 7] and a target like 65, write an equation using each number once that hits the target. Perfect for RLVR because correctness is trivially checkable — just evaluate the equation.
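A minimal sketch of such a checker, to show how little code "just evaluate the equation" takes. The function names and the use of Python's `ast` module are assumptions made for illustration; the repo's own verifier may be written differently.

```python
import ast
import operator

# Binary operators the checker accepts; any other syntax is rejected.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def _evaluate(node):
    """Evaluate a parsed arithmetic expression and collect the integers it uses."""
    if isinstance(node, ast.Constant) and type(node.value) is int:
        return float(node.value), [node.value]
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left, left_nums = _evaluate(node.left)
        right, right_nums = _evaluate(node.right)
        return _OPS[type(node.op)](left, right), left_nums + right_nums
    raise ValueError("unsupported syntax")

def check_countdown(equation, numbers, target):
    """Return True if `equation` uses each number exactly once and hits `target`."""
    try:
        value, used = _evaluate(ast.parse(equation, mode="eval").body)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return False
    return sorted(used) == sorted(numbers) and abs(value - target) < 1e-6
```

For the example above, `check_countdown("55 + 36 - 19 - 7", [19, 36, 55, 7], 65)` returns True, while an equation that drops a number or misses the target returns False. Parsing the expression instead of calling `eval` keeps arbitrary model output from running as code.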
To use RL, we relabel text generation in RL's vocabulary. Every token the model emits is an action; the text so far is the state; the model itself is the policy. One full answer is one episode, and the reward only arrives at the very end.
Notice what's missing from a normal RL setup: there's no game world reacting to each move. The "environment" is just a scoring function that runs once the model stops talking. That sparsity is exactly what the next few layers are built to cope with.
A pure "1 if correct, else 0" reward is hopeless early on — a fresh model basically never solves Countdown, so it gets no signal to climb. So the reward is split into two added parts, giving a score from 0 to 2.
Format reward teaches the model to wrap its thinking in <think> tags and its answer in <answer> tags — the partial 0.5 is a stepping stone for "right structure, slightly wrong contents." Equation reward only fires when the math actually evaluates to the target and uses each number exactly once. This is reward shaping: an easy gradient to grab first, a hard one to chase later.
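The two-part reward can be sketched roughly as follows. The regular expressions and function names are hypothetical (the text does not show the repo's exact patterns), and the equation check is passed in as a boolean so the sketch stays focused on how the parts add up.

```python
import re

# Hypothetical patterns: the repo's exact regexes are not shown in the text.
_STRUCTURE = re.compile(r"^<think>.*?</think>\s*<answer>(.*?)</answer>$", re.DOTALL)
_ALLOWED = re.compile(r"^[\d+\-*/().\s]+$")

def format_reward(completion):
    """1.0 for correct tags and a clean equation, 0.5 for correct tags only, else 0.0."""
    match = _STRUCTURE.match(completion.strip())
    if match is None:
        return 0.0
    answer = match.group(1).strip()
    return 1.0 if _ALLOWED.match(answer) else 0.5

def total_reward(completion, equation_is_correct):
    """Add the two parts; `equation_is_correct` comes from a separate verifier."""
    return format_reward(completion) + (1.0 if equation_is_correct else 0.0)
```

A completion with the right tags but prose inside `<answer>` earns 0.5, one with the right tags and a well-formed but wrong equation earns 1.0, and only a correct equation in the right structure reaches 2.0.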
Everything lives inside one repeating loop. The key move — and the reason the next layer matters — happens in step 4.
Generation is the slow part, so it runs on vLLM; the gradient update runs on DeepSpeed; after each update the new weights are pushed back into the generator so the next batch is sampled from the improved policy. Crucially, for each prompt the model produces several answers, not one — that group is about to become the secret to the whole method.
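In outline, the loop looks something like the sketch below. Every name here is hypothetical: `generate`, `score`, `update`, and `sync_weights` stand in for the vLLM sampling call, the reward function, the DeepSpeed step, and the weight push, none of which are reproduced from the repo.

```python
def train(prompts, generate, score, update, sync_weights, iterations, group_size=4):
    """Run the sample -> score -> update -> sync loop with injected callables."""
    mean_rewards = []
    for _ in range(iterations):
        batch = []
        for prompt in prompts:
            # Several answers per prompt, so each prompt forms its own group.
            answers = generate(prompt, group_size)
            rewards = [score(prompt, answer) for answer in answers]
            batch.append((prompt, answers, rewards))
        update(batch)      # gradient step on the trainable copy of the policy
        sync_weights()     # push new weights to the generator before resampling
        all_rewards = [r for _, _, rewards in batch for r in rewards]
        mean_rewards.append(sum(all_rewards) / len(all_rewards))
    return mean_rewards
```

Passing the four stages in as callables keeps the skeleton runnable with toy stand-ins, and makes the ordering explicit: weights are synced after every update, so sampling is always on-policy.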
Classic PPO trains a second network — a critic — just to estimate "how good is an average answer here?" so it can judge each answer relative to that bar. It's expensive and finicky. GRPO's insight: you already generated 4 answers to the same prompt. Their average reward is the bar. No critic needed.
The advantage of each answer is just how far above or below the group average it landed, divided by the group's spread.
# the entire GRPO advantage, verbatim from the repo
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
Positive advantage → "do more of this." Negative → "do less." An answer exactly at the group average gets zero advantage — no information, no update. And every token in a given answer inherits that one number: GRPO does no per-token bookkeeping. Simple, and it works.
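Here is a small standard-library sketch of that normalization and of the "every token inherits one number" step. The function names are made up for illustration, and using the population standard deviation (`pstdev`) is an assumption: whether `rewards.std()` in the repo uses the population or the sample convention depends on the array type, which the text does not show.

```python
from statistics import fmean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Normalize one prompt's rewards by the group mean and spread."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an assumption, see the lead-in
    return [(r - mean) / (std + eps) for r in rewards]

def per_token_advantages(rewards, response_lengths, eps=1e-4):
    """Repeat each response's single advantage once per generated token."""
    advantages = group_advantages(rewards, eps)
    return [[adv] * length for adv, length in zip(advantages, response_lengths)]
```

When every answer in a group earns the same reward, all advantages come out as zero and that prompt contributes nothing to the update. The small `eps` is what keeps that case from dividing by zero.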
The policy gradient loss is -log_prob × advantage. That minus sign and the sign of the advantage do all the steering:
When the advantage is positive, minimizing -log_prob × advantage means maximizing the log-probability of those exact tokens — the model becomes more likely to write them. Flip the advantage negative and the gradient flips too, suppressing them. That's it. That's REINFORCE, the oldest idea in policy-gradient RL, wearing a transformer.
This minimal version takes one gradient step per fresh batch, so it skips PPO's importance-ratio clipping. It's deliberately closer to plain policy gradient than to full PPO — fewer moving parts to understand.
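To see the sign do its work, here is a toy version with a three-token vocabulary and a single action instead of a full response. It is an illustration under those assumptions, not the repo's training step; for a real response the same advantage multiplies the `-log_prob` of every token in it.

```python
import math

def softmax(logits):
    """Convert logits to probabilities, shifting by the max for stability."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pg_step(logits, action, advantage, lr=0.5):
    """One gradient-descent step on loss = -log p(action) * advantage."""
    probs = softmax(logits)
    # d(-log p[action]) / d logit_i = p_i - 1[i == action]
    grads = [
        advantage * (p - (1.0 if i == action else 0.0))
        for i, p in enumerate(probs)
    ]
    return [x - lr * g for x, g in zip(logits, grads)]
```

Starting from uniform logits, a step with a positive advantage raises the sampled token's probability above 1/3, a negative advantage pushes it below, and a zero advantage leaves the logits untouched.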
Everything above (sample, score, normalize, update), run once with N = 2 Countdown prompts and G = 4 responses each. The advantages come from the same formula as the GRPO calculator above.
Same arithmetic as the GRPO calculator, frozen onto two real prompts and carried all the way to the weight update. Notice the positive advantages in Group B are modest (+0.58): when three of four answers are already right, there's little left to teach. The fat gradients live where the spread is widest — Group A's ±1.41.
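The quoted figures can be checked by hand. The reward vectors below are hypothetical, chosen only because they reproduce +0.58 and ±1.41 under a population standard deviation; the 1e-4 in the denominator is dropped since it only shifts the fourth decimal place.

```latex
\begin{aligned}
\text{Group A: } r &= (2, 0, 1, 1), & \mu &= 1, &
  \sigma &= \sqrt{\tfrac{1^2 + 1^2 + 0^2 + 0^2}{4}} \approx 0.707, &
  A &\approx (+1.41,\ -1.41,\ 0,\ 0) \\
\text{Group B: } r &= (2, 2, 2, 0), & \mu &= 1.5, &
  \sigma &= \sqrt{\tfrac{3(0.5)^2 + (1.5)^2}{4}} \approx 0.866, &
  A &\approx (+0.58,\ +0.58,\ +0.58,\ -1.73)
\end{aligned}
```

Under these assumed rewards, Group B's single failure carries the largest magnitude in the batch (about -1.73), while the three successes share a small positive push.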
Left alone, a policy will happily reward-hack, collapsing into degenerate text that scores well but reads like noise. So a KL penalty tethers the trained policy to a frozen copy of the original model (the reference).
# the low-variance "k3" KL estimator, between policy and reference
ref_logratio = ref_logps - logps
kl_penalty = torch.exp(ref_logratio) - 1 - ref_logratio
loss = (policy_loss + KL_COEFFICIENT * kl_penalty).sum() / total_response_len
The repo's default is a feather-light 0.001 — just enough to keep the model fluent and on-distribution while it explores. The reference model never trains; it sits frozen (parked on CPU to save memory) purely as the "don't forget who you were" anchor.
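The same estimator in plain Python, on per-token log-probabilities held in lists, shows why it works as a penalty. The function names are illustrative, and any masking of prompt or padding tokens that a real implementation needs is left out.

```python
import math

def k3_kl(logp, ref_logp):
    """k3 estimate of KL(policy || reference) for one sampled token."""
    ref_logratio = ref_logp - logp
    return math.exp(ref_logratio) - 1.0 - ref_logratio

def penalized_loss(policy_losses, logps, ref_logps, kl_coefficient=0.001):
    """Average over tokens of policy loss plus the scaled KL penalty."""
    terms = [
        pl + kl_coefficient * k3_kl(lp, rlp)
        for pl, lp, rlp in zip(policy_losses, logps, ref_logps)
    ]
    return sum(terms) / len(terms)
```

Because exp(x) - 1 - x is never negative and equals zero only at x = 0, the penalty vanishes when policy and reference agree on a token and grows as they drift apart in either direction.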
GRPO's blunt instrument gives every token in an answer the same advantage. But a 300-token chain of reasoning surely has good moves and bad moves. VinePPO (the experimental path) estimates a real value at checkpoints along the response — by literally rolling out several Monte-Carlo continuations from each point and averaging their rewards — then assigns each segment the change in expected value across it.
Each VinePPO bar is V(next state) − V(this state): a segment that raised the odds of eventually being right gets positive credit; one that hurt gets negative. It's the same advantage idea as a critic, but the value is sampled rather than learned — finer credit assignment, paid for with a lot of extra rollouts.
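A rough sketch of that bookkeeping follows. The `rollout` callable is hypothetical and stands for "sample one continuation from this checkpoint and return its reward"; treating the finished response's value as its actual reward is also an assumption of this sketch.

```python
def mc_value(state, rollout, num_rollouts=8):
    """Estimate V(state) as the mean reward of sampled continuations."""
    return sum(rollout(state) for _ in range(num_rollouts)) / num_rollouts

def segment_advantages(checkpoints, final_reward, rollout, num_rollouts=8):
    """Credit each segment with V(next state) - V(this state)."""
    values = [mc_value(state, rollout, num_rollouts) for state in checkpoints]
    values.append(final_reward)  # the finished response has a known outcome
    return [nxt - cur for cur, nxt in zip(values, values[1:])]
```

The per-segment credits telescope: they always sum to the final reward minus the value of the first checkpoint. The cost is visible in the signature, since each checkpoint needs `num_rollouts` extra generations.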
It's policy gradient at the core, with a variance-reducing baseline that's either the group mean (GRPO) or a Monte-Carlo value estimate (VinePPO), plus a KL leash to a frozen reference, all fed by a verifiable, shaped reward.
Everything else in those 1000 lines is engineering around these five ideas. Understand the sign of the advantage and the role of the baseline, and you understand the method.