Paper Reading: Training Compute-Optimal Large Language Models

On March 29, 2022, a team of researchers from DeepMind uploaded a paper to arXiv (a preprint server where researchers can publish papers without waiting for journal peer review): Training Compute-Optimal Large Language Models.

The first author is Jordan Hoffmann, with co-authors including Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, and many others — all at DeepMind at the time. Arthur Mensch would later co-found Mistral AI, one of Europe’s most prominent AI companies.

The paper is often called the “Chinchilla paper,” after the 70-billion-parameter model the team trained to validate their findings. That name stuck — not the paper’s title, but the animal. In AI circles, “Chinchilla scaling” became shorthand for the paper’s central claim.

And that claim was simple, bold, and uncomfortable for most of the industry: many of the biggest language models of 2022 were not “too small” — they were significantly undertrained given their compute budgets.

1. The Question

By early 2022, the AI community had internalized a clear lesson from Kaplan et al. (2020): bigger models are predictably better. The scaling laws paper had shown that performance follows power laws, and that for a given compute budget, you should make the model as large as possible.

The industry took that advice to heart. By spring 2022, GPT-3 had 175 billion parameters trained on 300 billion tokens. DeepMind’s own Gopher had 280 billion parameters trained on 300 billion tokens. Google would soon release PaLM with 540 billion parameters. The trend was clear: crank up the parameter count.

But there was a problem hiding in plain sight. Kaplan et al. had concluded that when you scale compute, most of the budget should go to model size (N ∝ C^0.73) and relatively little to training data (D ∝ C^0.27). This meant: make the model huge, train it on a moderate amount of data.

Hoffmann’s team asked a simple question: is that actually right?

2. Three Independent Approaches, One Answer

What makes this paper unusually convincing is its methodology. The team did not rely on a single experiment. They approached the same question from three completely independent angles, and all three converged on the same answer.

Approach 1: Fix the compute, vary the split. They trained over 400 models ranging from 70 million to over 16 billion parameters, each with a different allocation between model size and training data, but with the same total compute. For each compute level, they found which model size minimized loss.

Approach 2: IsoFLOP profiles. They trained models of 9 different sizes (from 70M to 10B parameters) on varying amounts of data, specifically designed so each group of runs used approximately the same total compute. Then they fit curves to find the optimal model size for each compute level.

Approach 3: Fit a parametric loss function. They fit the following equation to all their training runs:

\hat{L}(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

Where E is the irreducible loss (the entropy of natural language — no model can do better), A/N^α captures the model-size bottleneck, and B/D^β captures the data bottleneck. From the fitted parameters, they derived optimal N and D as functions of compute.

All three approaches agreed:

N_{\mathrm{opt}} \propto C^a, \quad D_{\mathrm{opt}} \propto C^b, \quad a \approx 0.50, \quad b \approx 0.50

1
def optimal_scaling(compute: float) -> tuple[float, float]:
2
    a = 0.50
3
    b = 0.50
4
    n_opt = compute ** a
5
    d_opt = compute ** b
6
    return n_opt, d_opt

The exponents a ≈ b ≈ 0.5 mean that as compute grows, model size and training data should scale at approximately the same rate. When compute grows 10x, both should increase by roughly 3.2x; when compute doubles, both increase by roughly 1.4x. In other words, for every doubling of model size, the number of training tokens should also double. This directly contradicts Kaplan et al., who said compute should be spent primarily on model size.

3. Why Kaplan Got It Wrong

This is not a matter of one team getting it wrong. Both teams did rigorous work. The difference lies in experimental setup, which ultimately led to different optimal-allocation conclusions.

Kaplan et al. used a fixed learning rate schedule that did not adjust for training duration. When you train a model for more steps without adjusting the learning rate schedule, performance suffers — not because the model is inherently worse, but because the optimization is suboptimal. This made long training runs look less effective than they actually are, biasing the results toward larger models trained for fewer steps.

Hoffmann’s team adjusted the learning rate schedule for each training run, ensuring each configuration got a fair shot. When you do this, training longer on more data turns out to be much more valuable than Kaplan’s numbers suggested.

1
from dataclasses import dataclass
2
from typing import Literal
3

4

5
@dataclass(frozen=True)
6
class TrainingConfig:
7
    n_params: float
8
    n_tokens: float
9
    schedule: Literal["fixed", "cosine_with_warmup"]
10
    warmup_steps: int
11
    total_steps: int

4. The Parametric Loss Function

The paper’s Approach 3 deserves a closer look because it gives a complete mathematical model of performance:

\hat{L}(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

Where the fitted constants are:

E = 1.69 — the irreducible loss (entropy of natural language)
A = 406.4, α = 0.34 — the model-size term
B = 410.7, β = 0.28 — the data term

The structure of this equation is worth studying. Loss has three components: a floor you can never get below (E), a penalty for having too few parameters (A/N^α), and a penalty for having too little data (B/D^β). The model-size penalty and data penalty are additive — they compete for your attention and your compute budget.

1
def estimated_loss(n_params: float, n_tokens: float) -> float:
2
    e = 1.69
3
    a = 406.4
4
    alpha = 0.34
5
    b = 410.7
6
    beta = 0.28
7
    return e + a / (n_params ** alpha) + b / (n_tokens ** beta)
8

9

10
def optimal_params_and_tokens(compute_flops: float) -> tuple[float, float]:
11
    alpha = 0.34
12
    beta = 0.28
13
    a = beta / (alpha + beta)
14
    b = alpha / (alpha + beta)
15
    g = 2.0
16

17
    base = compute_flops / 6.0
18
    n_opt = g * (base ** a)
19
    d_opt = (1.0 / g) * (base ** b)
20
    return n_opt, d_opt

5. The Damning Table

The paper’s Table 1 lists the actual parameter counts and training tokens for several well-known models. Table 3 gives the compute-optimal token estimates for different model sizes. Put the two tables side by side, and you get something that reads like an audit of the entire industry:

Model	Parameters	Tokens Used	Chinchilla-Optimal Tokens
GPT-3	175B	300B	3.7T
Gopher	280B	300B	5.9T
Jurassic-1	178B	300B	3.7T
MT-NLG	530B	270B	11.0T

Every single model was trained on roughly 300 billion tokens. But according to Chinchilla’s analysis, GPT-3 should have been trained on 3.7 trillion tokens — more than 12 times what it actually saw. Gopher should have seen nearly 6 trillion. MT-NLG, the largest of the bunch at 530 billion parameters, should have been trained on 11 trillion tokens — 40 times its actual training data.

1
from dataclasses import dataclass
2

3

4
@dataclass(frozen=True)
5
class ModelComparison:
6
    name: str
7
    params_billions: float
8
    tokens_used_billions: float
9
    optimal_tokens_billions: float
10

11

12
def industry_models() -> list[ModelComparison]:
13
    return [
14
        ModelComparison("GPT-3", 175.0, 300.0, 3_700.0),
15
        ModelComparison("Gopher", 280.0, 300.0, 5_900.0),
16
        ModelComparison("Jurassic-1", 178.0, 300.0, 3_700.0),
17
        ModelComparison("MT-NLG", 530.0, 270.0, 11_000.0),
18
    ]

The pattern is striking. The entire industry had settled on roughly the same amount of training data — around 300 billion tokens — regardless of model size. It was as if everyone had decided that 300B tokens was “enough” and poured all additional compute into making models bigger. Chinchilla says this was exactly backwards.

6. The Proof: Chinchilla vs. Gopher

To validate their theory, the team trained Chinchilla: a 70-billion-parameter model on 1.4 trillion tokens. Chinchilla used the same compute budget as Gopher (280 billion parameters, 300 billion tokens) — the same total training cost, just allocated differently.

The result was decisive. Chinchilla outperformed Gopher on nearly every benchmark, despite being 4 times smaller:

MMLU (Massive Multitask Language Understanding): Chinchilla 67.6% vs. Gopher 60.0% vs. GPT-3 43.9%
Reading comprehension (RACE-h): Chinchilla 73.3% vs. Gopher 71.6%
Common sense (HellaSwag): Chinchilla 80.8% vs. Gopher 79.2%
BIG-bench: Chinchilla outperformed Gopher on the majority of tasks

1
from dataclasses import dataclass
2

3

4
@dataclass(frozen=True)
5
class ModelConfig:
6
    name: str
7
    params_billions: float
8
    tokens_billions: float
9
    mmlu_accuracy: float
10

11

12
def chinchilla_vs_gopher() -> tuple[float, float]:
13
    gopher = ModelConfig("Gopher", 280.0, 300.0, 60.0)
14
    chinchilla = ModelConfig("Chinchilla", 70.0, 1_400.0, 67.6)
15

16
    gopher_flops = 6.0 * gopher.params_billions * 1e9 * gopher.tokens_billions * 1e9
17
    chinchilla_flops = 6.0 * chinchilla.params_billions * 1e9 * chinchilla.tokens_billions * 1e9
18
    return gopher_flops, chinchilla_flops

A model that is 4 times smaller beating the larger model on nearly every benchmark — using the same compute — is a powerful demonstration. The compute was not wasted; it was simply redirected from parameters to data.

7. The Practical Consequences

The Chinchilla paper had immediate, concrete consequences for the industry.

Smaller models are cheaper to use. Training cost is a one-time expense, but inference cost — the cost of actually running the model to generate text — scales with model size, every single time a user sends a query. A 70B model is 4x cheaper to serve than a 280B model. If the smaller model performs better, the win is double: better quality at lower cost.

Data became the bottleneck. Before Chinchilla, the limiting factor was compute: how many GPUs can you get? After Chinchilla, the limiting factor shifted to data: where do you find trillions of high-quality tokens? This sparked an industry-wide scramble for training data — web scraping at massive scale, dataset curation efforts, and eventually the synthetic data movement.

The LLaMA moment. Meta’s LLaMA (February 2023) was arguably the most direct application of Chinchilla scaling. LLaMA-13B, trained on 1 trillion tokens, outperformed GPT-3 (175B) on most benchmarks. LLaMA-65B, trained on 1.4 trillion tokens, was competitive with Chinchilla and PaLM-540B. Meta explicitly cited the Chinchilla paper and deliberately trained smaller models on far more data than earlier conventions would have suggested.

1
def inference_cost_comparison() -> tuple[float, float]:
2
    gopher_cost_per_token = 280.0
3
    chinchilla_cost_per_token = 70.0
4

5
    queries_per_day = 1_000_000.0
6
    tokens_per_query = 500.0
7

8
    daily_cost_gopher = queries_per_day * tokens_per_query * gopher_cost_per_token
9
    daily_cost_chinchilla = queries_per_day * tokens_per_query * chinchilla_cost_per_token
10
    return daily_cost_gopher, daily_cost_chinchilla

8. My Takeaways

First, this paper is a correction — and a graceful one. It takes Kaplan et al.’s framework, identifies a methodological flaw (fixed learning rate schedules), fixes it, and arrives at a different answer. It does not dismiss the earlier work; it builds on it. The parametric loss function L̂(N, D) = E + A/N^α + B/D^β is a refinement of Kaplan’s formulation, not a replacement. Science at its best is exactly this: someone does careful work, someone else does more careful work, and the field moves forward.

Second, the paper’s most surprising finding is not the math — it is the gap between theory and practice. Everyone in the industry could see that 300 billion tokens was becoming a default. Nobody questioned it seriously until this team ran the numbers. The models were not small; they were starved. The solution was not to build bigger — it was to feed more.

Third, the equal-scaling result (a ≈ b ≈ 0.5) is beautiful in its simplicity. There is no asymmetry between model size and data. If you have more compute, scale both equally. No complicated allocation strategy needed. “Where should I spend my next dollar of compute?” Chinchilla’s answer is not to keep betting on parameter count alone, but to let model size and training data grow at approximately the same rate.

Fourth, the practical legacy is enormous. Before Chinchilla, the path to better AI was “make it bigger.” After Chinchilla, the path became “train it better.” This one shift made powerful models accessible to organizations that could not afford the largest parameter counts but could curate large datasets. LLaMA, Mistral, and the entire open-source LLM ecosystem owe a direct debt to this insight.

The Kaplan paper said: bigger models are predictably better. The Chinchilla paper said: yes, but you have been making them big in the wrong way. Stop hoarding parameters. Start feeding data.

One paper gave the industry permission to scale. The other taught it how.

Paper Reading Series

Sequence to Sequence Learning with Neural Networks — Establishing the encoder-decoder paradigm
Neural Machine Translation by Jointly Learning to Align and Translate — The origin of attention
Attention Is All You Need — Attention takes center stage: the birth of the Transformer
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Establishing the pre-training paradigm
Scaling Laws for Neural Language Models — The mathematics of scale
Language Models are Few-Shot Learners — Larger models, better at eliciting abilities from context