Training Compute-Optimal Large Language Models: What Chinchilla Changed

Chinchilla did not dispute that scale works. It corrected how scale should be allocated under a fixed compute budget. Many large models in 2022 spent heavily on parameters without giving data the same growth. The result was not that the models were too small; they were undertrained.

Training Compute-Optimal Large Language Models is not an argument against large models. It says parameters and data have to consume the compute budget together. Stronger models do not come from size alone. They come from the right ratio between parameters, data, and training steps.

1. The Question

By early 2022, the AI community had internalized a clear lesson from Kaplan et al. (2020): bigger models are predictably better. The scaling laws paper had shown that performance follows power laws, and that for a given compute budget, you should make the model as large as possible.

The industry took that advice to heart. By spring 2022, GPT-3 had 175 billion parameters trained on 300 billion tokens. DeepMind’s own Gopher had 280 billion parameters trained on 300 billion tokens. Google would soon release PaLM with 540 billion parameters. The trend was clear: crank up the parameter count.

But there was a problem hiding in plain sight. Kaplan et al. had concluded that when you scale compute, most of the budget should go to model size (N ∝ C^0.73) and relatively little to training data (D ∝ C^0.27). This meant: make the model huge, train it on a moderate amount of data.

Hoffmann’s team asked a simple question: is that actually right?

2. Three Independent Approaches, One Answer

What makes this paper unusually convincing is its methodology. The team did not rely on a single experiment. They approached the same question from three completely independent angles, and all three converged on the same answer.

Approach 1: Fix the compute, vary the split. They trained over 400 models ranging from 70 million to over 16 billion parameters, each with a different allocation between model size and training data, but with the same total compute. For each compute level, they found which model size minimized loss.

Approach 2: IsoFLOP profiles. They trained models of 9 different sizes (from 70M to 10B parameters) on varying amounts of data, specifically designed so each group of runs used approximately the same total compute. Then they fit curves to find the optimal model size for each compute level.

Approach 3: Fit a parametric loss function. They fit the following equation to all their training runs:

\hat{L}(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

Where E is the irreducible loss (the entropy of natural language — no model can do better), A/N^α captures the model-size bottleneck, and B/D^β captures the data bottleneck. From the fitted parameters, they derived optimal N and D as functions of compute.

All three approaches agreed:

N_{\mathrm{opt}} \propto C^a, \quad D_{\mathrm{opt}} \propto C^b, \quad a \approx 0.50, \quad b \approx 0.50

1
def optimal_scaling(compute: float) -> tuple[float, float]:
2
    a = 0.50
3
    b = 0.50
4
    n_opt = compute ** a
5
    d_opt = compute ** b
6
    return n_opt, d_opt

The exponents a ≈ b ≈ 0.5 mean that as compute grows, model size and training data should scale at approximately the same rate. When compute grows 10x, both should increase by roughly 3.2x; when compute doubles, both increase by roughly 1.4x. In other words, for every doubling of model size, the number of training tokens should also double. This directly contradicts Kaplan et al., who said compute should be spent primarily on model size.

3. Why Kaplan Got It Wrong

This is not a matter of one team getting it wrong. Both teams did rigorous work. The difference lies in experimental setup, which ultimately led to different optimal-allocation conclusions.

Kaplan et al. used a fixed learning rate schedule that did not adjust for training duration. When you train a model for more steps without adjusting the learning rate schedule, performance suffers — not because the model is inherently worse, but because the optimization is suboptimal. This made long training runs look less effective than they actually are, biasing the results toward larger models trained for fewer steps.

Hoffmann’s team adjusted the learning rate schedule for each training run, ensuring each configuration got a fair shot. When you do this, training longer on more data turns out to be much more valuable than Kaplan’s numbers suggested.

1
from dataclasses import dataclass
2
from typing import Literal
3

4

5
@dataclass(frozen=True)
6
class TrainingConfig:
7
    n_params: float
8
    n_tokens: float
9
    schedule: Literal["fixed", "cosine_with_warmup"]
10
    warmup_steps: int
11
    total_steps: int

4. The Parametric Loss Function

The paper’s Approach 3 deserves a closer look because it gives a complete mathematical model of performance:

\hat{L}(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

Where the fitted constants are:

E = 1.69 — the irreducible loss (entropy of natural language)
A = 406.4, α = 0.34 — the model-size term
B = 410.7, β = 0.28 — the data term

The structure of this equation is worth studying. Loss has three components: a floor you can never get below (E), a penalty for having too few parameters (A/N^α), and a penalty for having too little data (B/D^β). The model-size penalty and data penalty are additive — they compete for your attention and your compute budget.

1
def estimated_loss(n_params: float, n_tokens: float) -> float:
2
    e = 1.69
3
    a = 406.4
4
    alpha = 0.34
5
    b = 410.7
6
    beta = 0.28
7
    return e + a / (n_params ** alpha) + b / (n_tokens ** beta)
8

9

10
def optimal_params_and_tokens(compute_flops: float) -> tuple[float, float]:
11
    alpha = 0.34
12
    beta = 0.28
13
    a = beta / (alpha + beta)
14
    b = alpha / (alpha + beta)
15
    g = 2.0
16

17
    base = compute_flops / 6.0
18
    n_opt = g * (base ** a)
19
    d_opt = (1.0 / g) * (base ** b)
20
    return n_opt, d_opt

5. The Damning Table

The paper’s Table 1 lists the actual parameter counts and training tokens for several well-known models. Table 3 gives the compute-optimal token estimates for different model sizes. Put the two tables side by side, and you get something that reads like an audit of the entire industry:

Model	Parameters	Tokens Used	Chinchilla-Optimal Tokens
GPT-3	175B	300B	3.7T
Gopher	280B	300B	5.9T
Jurassic-1	178B	300B	3.7T
MT-NLG	530B	270B	11.0T

Every single model was trained on roughly 300 billion tokens. But according to Chinchilla’s analysis, GPT-3 should have been trained on 3.7 trillion tokens — more than 12 times what it actually saw. Gopher should have seen nearly 6 trillion. MT-NLG, the largest of the bunch at 530 billion parameters, should have been trained on 11 trillion tokens — 40 times its actual training data.

1
from dataclasses import dataclass
2

3

4
@dataclass(frozen=True)
5
class ModelComparison:
6
    name: str
7
    params_billions: float
8
    tokens_used_billions: float
9
    optimal_tokens_billions: float
10

11

12
def industry_models() -> list[ModelComparison]:
13
    return [
14
        ModelComparison("GPT-3", 175.0, 300.0, 3_700.0),
15
        ModelComparison("Gopher", 280.0, 300.0, 5_900.0),
16
        ModelComparison("Jurassic-1", 178.0, 300.0, 3_700.0),
17
        ModelComparison("MT-NLG", 530.0, 270.0, 11_000.0),
18
    ]

The pattern is striking. The entire industry had settled on roughly the same amount of training data — around 300 billion tokens — regardless of model size. It was as if everyone had decided that 300B tokens was “enough” and poured all additional compute into making models bigger. Chinchilla says this was exactly backwards.

6. The Proof: Chinchilla vs. Gopher

To validate their theory, the team trained Chinchilla: a 70-billion-parameter model on 1.4 trillion tokens. Chinchilla used the same compute budget as Gopher (280 billion parameters, 300 billion tokens) — the same total training cost, just allocated differently.

The result was decisive. Chinchilla outperformed Gopher on nearly every benchmark, despite being 4 times smaller:

MMLU (Massive Multitask Language Understanding): Chinchilla 67.6% vs. Gopher 60.0% vs. GPT-3 43.9%
Reading comprehension (RACE-h): Chinchilla 73.3% vs. Gopher 71.6%
Common sense (HellaSwag): Chinchilla 80.8% vs. Gopher 79.2%
BIG-bench: Chinchilla outperformed Gopher on the majority of tasks

1
from dataclasses import dataclass
2

3

4
@dataclass(frozen=True)
5
class ModelConfig:
6
    name: str
7
    params_billions: float
8
    tokens_billions: float
9
    mmlu_accuracy: float
10

11

12
def chinchilla_vs_gopher() -> tuple[float, float]:
13
    gopher = ModelConfig("Gopher", 280.0, 300.0, 60.0)
14
    chinchilla = ModelConfig("Chinchilla", 70.0, 1_400.0, 67.6)
15

16
    gopher_flops = 6.0 * gopher.params_billions * 1e9 * gopher.tokens_billions * 1e9
17
    chinchilla_flops = 6.0 * chinchilla.params_billions * 1e9 * chinchilla.tokens_billions * 1e9
18
    return gopher_flops, chinchilla_flops

A model that is 4 times smaller beating the larger model on nearly every benchmark — using the same compute — is a powerful demonstration. The compute was not wasted; it was simply redirected from parameters to data.

7. The Practical Consequences

The Chinchilla paper had immediate, concrete consequences for the industry.

Smaller models are cheaper to use. Training cost is a one-time expense, but inference cost — the cost of actually running the model to generate text — scales with model size, every single time a user sends a query. A 70B model is 4x cheaper to serve than a 280B model. If the smaller model performs better, the win is double: better quality at lower cost.

Data became the bottleneck. Before Chinchilla, the limiting factor was compute: how many GPUs can you get? After Chinchilla, the limiting factor shifted to data: where do you find trillions of high-quality tokens? This sparked an industry-wide scramble for training data — web scraping at massive scale, dataset curation efforts, and eventually the synthetic data movement.

The LLaMA moment. Meta’s LLaMA (February 2023) was arguably the most direct application of Chinchilla scaling. LLaMA-13B, trained on 1 trillion tokens, outperformed GPT-3 (175B) on most benchmarks. LLaMA-65B, trained on 1.4 trillion tokens, was competitive with Chinchilla and PaLM-540B. Meta explicitly cited the Chinchilla paper and deliberately trained smaller models on far more data than earlier conventions would have suggested.

1
def inference_cost_comparison() -> tuple[float, float]:
2
    gopher_cost_per_token = 280.0
3
    chinchilla_cost_per_token = 70.0
4

5
    queries_per_day = 1_000_000.0
6
    tokens_per_query = 500.0
7

8
    daily_cost_gopher = queries_per_day * tokens_per_query * gopher_cost_per_token
9
    daily_cost_chinchilla = queries_per_day * tokens_per_query * chinchilla_cost_per_token
10
    return daily_cost_gopher, daily_cost_chinchilla

8. What This Paper Changed

Chinchilla’s sharpest lesson is: it is not against large models; it says parameters and data must consume the compute budget together.

The paper did not stop scaling. It corrected the allocation of scaling. Kaplan gave the field confidence to go bigger. Chinchilla asked, under the same compute budget, how big and how long is actually optimal? The answer was not to keep piling on parameters. It was to grow parameter count and training tokens at roughly the same rate.

That pulled “model size” out of single-variable worship. A 280B-parameter model trained on 300B tokens can lose to a 70B-parameter model trained on 1.4T tokens. Capability is not parameter count by itself. It is what happens after compute is absorbed jointly by parameters and data.

The next time you see a large-model release, do not start with parameter count. Ask how many tokens it trained on, how compute was allocated, and whether it is undertrained. Real scaling is not inflating the model. It is putting every unit of compute where it can reduce loss.

Paper Reading Series

Sequence to Sequence Learning with Neural Networks — Establishing the encoder-decoder paradigm
Neural Machine Translation by Jointly Learning to Align and Translate — The origin of attention
Attention Is All You Need — Attention takes center stage: the birth of the Transformer
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding — Establishing the pre-training paradigm
Scaling Laws for Neural Language Models — The mathematics of scale
Language Models are Few-Shot Learners — Larger models, better at eliciting abilities from context