<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Justin Huang</title><description>A personal blog by Justin Huang about AI, technology, writing, and life.</description><link>https://justinhuangai.github.io/</link><item><title>Technical Report Reading: Attention Residuals</title><link>https://justinhuangai.github.io/posts/attention-residuals/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/attention-residuals/</guid><description>A reading of Kimi Team&apos;s Attention Residuals technical report: why residual connections should become attention-like too, and how Full AttnRes / Block AttnRes turn that idea into a trainable, deployable system</description><pubDate>Thu, 19 Mar 2026 08:49:27 GMT</pubDate><content:encoded>On March 16, 2026, Kimi Team uploaded a technical report to arXiv: [*Attention Residuals*](/papers/2603.15031v1.pdf).

You can tell what the authors really care about just from the shape of the report. It is not simply &quot;here is a new module.&quot; It walks through `motivation -&gt; AttnRes -&gt; Block AttnRes -&gt; infrastructure -&gt; experiments -&gt; discussion`, and in doing so, it retells a deeper question: what is a residual connection actually doing?

## 0. A Few Terms First

If you do not have a machine learning background, it helps to build intuition in the same order this report does:

- `Transformer`: the basic architecture behind most large language models today. You can think of it as a machine that processes information layer by layer.
- `hidden state`: the model&apos;s internal intermediate representation at a given layer. Roughly speaking, it is the model&apos;s temporary working note at that point.
- `residual connection`: a path between layers that preserves the old input, then adds the layer&apos;s new computation on top of it.
- `residual`: closer to the new increment added by the current layer, the &quot;extra part&quot; introduced inside that residual connection.
- `attention`: a mechanism for selecting which pieces of information matter most right now. A useful first intuition is &quot;looking at the important parts selectively.&quot;
- `PreNorm`: normalizing values before entering a layer, then doing the actual computation afterward. You can think of it as adjusting the volume before continuing the mix.

## 1. The One-Sentence Version

This technical report asks a very sharp question:

**If the Transformer has already replaced recurrence with attention along the sequence dimension, why is information aggregation along the depth dimension still stuck with fixed addition?**

Modern LLMs almost all use a common layer pattern: first PreNorm, then a residual path. In plain language: normalize the scale, compute something new, then add that new result back to the original input. We usually think of this as a tool for stable optimization, something that helps very deep networks avoid falling apart during training. But the report reminds us that residual connections play another equally important role that has largely gone underexamined:

**They define how information is aggregated across depth.**

If the formula below is not your favorite thing in the world, do not get stuck on it. The plain-English translation right after it is the part that matters.

The standard residual rule is simple:

$$
h_l = h_{l-1} + f_{l-1}(h_{l-1})
$$

You can split it into two parts:

- $h_{l-1}$: the old content, meaning the representation already produced by the previous layer
- $f_{l-1}(h_{l-1})$: the new increment computed by the current layer, which is closer to what &quot;residual&quot; means as a word

And the act of adding those two parts back together is what is more precisely called the residual connection.

If you expand the recurrence, you get:

$$
h_l = h_1 + \sum_{i=1}^{l-1} f_i(h_i)
$$

In plain English, that means: the input seen by layer $l$ is basically &quot;the embedding plus the uniform sum of all previous layer outputs.&quot; Every earlier layer gets weight 1. There is no selection, no suppression, no way to say &quot;for this step I should care more about layer 3 than layer 17.&quot;

The core idea of AttnRes is just one sentence:

**Replace fixed residual addition with a softmax attention operation over depth.**

## 2. What Is Wrong with the Old Residual Rule

The most important thing about this report is not that it proposes a new formula. It is that it turns something everyone had gotten used to back into a problem.

Standard residual connections have long been treated as optimization infrastructure. As long as gradients can pass through, the mechanism is considered to have done its job. But from the perspective of information flow, that path is surprisingly crude.

Imagine you are working on a document that keeps being revised. At each round, instead of selecting the most relevant parts of older versions and merging them thoughtfully, you just append the full text of every previous draft to the end. By revision 20, the important insights from revision 3 are still technically there, but they are buried inside an ever-thickening pile.

That is the PreNorm problem the report highlights. It builds on observations from SiameseNorm and argues that under PreNorm, the magnitude of the `hidden state` grows approximately like $O(L)$ with depth. Here, hidden state is just the model&apos;s internal running note at each layer. The result is:

- later layers see an increasingly bloated historical sum
- early-layer information does not disappear, but it gets diluted
- later layers are forced to emit larger and larger magnitudes if they want to be heard

The report calls this `PreNorm dilution`. It is an excellent name. The problem is not that gradients vanish, nor that training explodes. It is that each layer&apos;s relative contribution gets progressively washed out.
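
A toy simulation makes the dilution concrete. This is my illustration, not the report&apos;s analysis, and random increments actually understate the effect: independent vectors sum to roughly $\sqrt{L}$ in magnitude, while the report argues trained PreNorm streams grow closer to $O(L)$. Even so, a single layer&apos;s relative share visibly shrinks with depth:

```python
import torch


def relative_contribution(depth: int, dim: int = 512, seed: int = 0) -> float:
    # Build a residual stream as a plain sum of unit-scale layer increments.
    torch.manual_seed(seed)
    increments = torch.randn(depth, dim) / dim ** 0.5
    stream = increments.sum(dim=0)
    # How loud is one layer relative to the whole accumulated stream?
    return (increments[-1].norm() / stream.norm()).item()
```

At depth 4 one increment carries roughly half the stream&apos;s magnitude; at depth 64 it is a small fraction, with no mechanism to recover it selectively.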

There is a line of reasoning under the surface here that I really like: along the sequence dimension, we stopped being satisfied with &quot;treat every past token the same&quot; a long time ago. That is why attention exists. So why are we still willing to accept &quot;sum every previous layer with equal weight&quot; along the depth dimension?

## 3. What AttnRes Actually Does

The form of AttnRes is clean. Layer $l$ no longer mechanically receives the sum of all previous outputs. Instead, it performs a weighted selection over those historical representations:

$$
h_l = \sum_{i=0}^{l-1} \alpha_{i \to l} \cdot v_i
$$

The weights $\alpha_{i \to l}$ come from a softmax. If you are not used to that term, the easiest working definition is: softmax turns a set of scores into weights that add up to 1, which lets the model say clearly &quot;look more here, less there&quot;:

$$
\alpha_{i \to l} = \operatorname{softmax}\left(w_l^T \operatorname{RMSNorm}(k_i)\right)
$$

If you have never worked with attention before, here is the cheapest way to think about it:

- `query`: what the current layer is looking for
- `key`: what kind of index label each historical layer carries
- `value`: the content that actually gets retrieved and aggregated

Three details in the design matter a lot.

First, **the query is not computed from the current hidden state. It is a learned pseudo-query vector $w_l$ for each layer.**  
This is slightly counterintuitive. Normally, when we see attention, we assume the query must come from the current input. Here the authors intentionally make it layer-specific rather than token-specific. The upside is that multiple queries inside the same block can be computed in batches ahead of time, which opens the door for later infrastructure optimizations.

Second, **the keys and values come directly from previous layer outputs.**  
That means the input dependence does not vanish. It just lives in the layer representations themselves rather than in a dynamic query. Different samples produce different previous-layer outputs, so the depth attention remains input-dependent in the end.

Third, **the keys are normalized with RMSNorm first.**  
This is a small but important choice. Without normalization, layers with larger magnitudes would automatically dominate the dot products. Then the attention weights would reflect &quot;which layer is louder&quot; more than &quot;which layer is more relevant.&quot;

In Python (using PyTorch), one clean implementation looks like this:

```python
import torch
from torch import nn


def attention_residual(
    sources: list[torch.Tensor],
    pseudo_query: torch.Tensor,
    norm: nn.RMSNorm,
) -&gt; torch.Tensor:
    # Keys are RMS-normalized so louder layers do not automatically win the dot products.
    keys = torch.stack([norm(source) for source in sources], dim=0)
    # Values stay un-normalized: they are the content that actually gets aggregated.
    values = torch.stack(sources, dim=0)

    # One learned pseudo-query per layer scores every historical representation,
    # and softmax turns those scores into depth weights that sum to 1.
    logits = keys @ pseudo_query
    weights = torch.softmax(logits, dim=0)
    return (weights.unsqueeze(-1) * values).sum(dim=0)
```

At first glance, this looks like &quot;put attention on top of residuals.&quot; But I think a more accurate description is:

**It turns the residual connection from a fixed accumulator into a selective depth retriever.**

## 4. The Best Part of the Report: It Gives Engineering, Not Just an Idea

If the report stopped at Full AttnRes, this would still be just a beautiful research idea.

Full AttnRes lets every layer attend to all previous layers. Conceptually it is simple, and even the raw arithmetic cost is not terrifying, because network depth $L$ is usually much smaller than sequence length $T$. So the authors argue that the $O(L^2 d)$ arithmetic alone is not the scariest part.

The real problems show up in large-scale training:

- activation recomputation turns intermediate layer outputs from discardable values into objects you must preserve
- pipeline parallelism means those cross-layer representations may need to travel across stages
- once every layer must see every previous layer, communication and caching pressure rise quickly

That is why they introduce **Block AttnRes**.

The idea is to divide the $L$ layers into $N$ blocks. Inside a block, you first use ordinary summation to accumulate a block representation. Across blocks, you then apply attention. So:

- Full AttnRes attends to every historical layer
- Block AttnRes attends to summaries of historical blocks, plus the partial sum inside the current block

In other words, it trades fine-grained cross-layer attention for summary-level cross-block attention to gain scalability.

And the authors do not stop at saying &quot;we grouped layers, so memory gets better.&quot; They actually work through the systems side of the bill:

- during training they use **cross-stage caching** to avoid repeatedly shipping historical blocks through the pipeline
- during inference they use **two-phase computation**
- phase one computes inter-block attention in parallel
- phase two computes intra-block lookback sequentially, then merges results with online softmax
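
The online-softmax merge in phase two is a standard trick, sketched here generically rather than as the report&apos;s kernel: each phase keeps a running max logit, a normalizer, and a normalized weighted sum, and two partial results can then be combined exactly:

```python
import torch


def merge_online_softmax(
    max_a: torch.Tensor, denom_a: torch.Tensor, out_a: torch.Tensor,
    max_b: torch.Tensor, denom_b: torch.Tensor, out_b: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    # Shift both partial results onto a shared max logit for numerical stability.
    new_max = torch.maximum(max_a, max_b)
    scale_a = torch.exp(max_a - new_max)
    scale_b = torch.exp(max_b - new_max)
    # Rescaled normalizers add up; rescaled weighted sums add up, then renormalize.
    denom = denom_a * scale_a + denom_b * scale_b
    out = (out_a * denom_a * scale_a + out_b * denom_b * scale_b) / denom
    return new_max, denom, out
```

Because the merge is exact, the two phases can run with whatever parallelism is convenient and still produce the same result as one monolithic softmax.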

The appendix and `table/memory_access.tex` contain the hardest numbers in the whole report. Under the report&apos;s representative setting:

- standard residual: per-layer residual mechanism I/O is `3d`
- naive Full AttnRes: `130d`
- optimized Full AttnRes: `24d`
- Block AttnRes: `5.5d`
- mHC: `34d`

That comparison says a lot. Block AttnRes is not &quot;as cheap as a standard residual.&quot; But it has already moved from &quot;obviously impractical&quot; to &quot;interesting enough to try in a real system.&quot; And the measured overhead is modest:

- training wall-clock overhead is below 4%
- inference latency overhead is below 2%

That is why this reads like a real systems-minded technical report to me. A lot of papers have new ideas and fuzzy accounting. This one cares about the accounting.

## 5. What Matters Most in the Experiments

### 5.1 Scaling Law Results: Not a One-Off Win

The authors first run scaling-law experiments across five model sizes, comparing Baseline, Full AttnRes, and Block AttnRes.

The fitted curves are:

- Baseline: $1.891 \times C^{-0.057}$
- Block AttnRes: $1.870 \times C^{-0.058}$
- Full AttnRes: $1.865 \times C^{-0.057}$

The most important thing here is not which slope differs by how much. It is this:

**AttnRes stays consistently lower across the compute range.**

The report gives a clean headline claim: at `5.6 PFLOP/s-days`, the loss of Block AttnRes is equivalent to what the baseline would need about `1.25x` more compute to reach.

So this does not look like &quot;we happened to tune one model size well.&quot; It looks like a reasonably stable scaling benefit.
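
That `1.25x` figure can be back-solved from the fitted curves above (my arithmetic, assuming $C$ is measured in PFLOP/s-days as in the headline claim): find the multiplier $m$ such that the baseline at compute $mC$ matches Block AttnRes at compute $C$.

```python
import math


def compute_multiplier(c: float) -> float:
    # Solve 1.891 * (m * c) ** -0.057 == 1.870 * c ** -0.058 for m.
    # In logs: 0.057 * ln(m) = ln(1.891 / 1.870) + (0.058 - 0.057) * ln(c)
    ln_m = (math.log(1.891 / 1.870) + 0.001 * math.log(c)) / 0.057
    return math.exp(ln_m)
```

At $C = 5.6$ this gives $m \approx 1.25$, consistent with the report&apos;s headline.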

### 5.2 The Main Model Is Not a Toy

The main experiment is not a small toy benchmark. It uses a large Kimi Linear configuration:

- `48B total / 3B activated parameters`
- 27 Transformer blocks, which means 54 layers
- 8-of-256 routed experts plus 1 shared expert
- pretraining on `1.4T tokens`

That matters, because it shows the authors are not just drawing pretty curves on small models. They actually inserted this residual redesign into a large training recipe.

### 5.3 The Most Revealing Figure: Output Magnitudes Stop Running Away

The most striking figure in the report, to me, is not the benchmark table but the training-dynamics plot.

In the baseline, output magnitudes keep rising with depth. The values in the plot are dramatic: the early blocks sit around `0.04`, `0.06`, `0.10`, while later blocks climb to `10.47` and `12.15`. That is PreNorm dilution made visible.

Block AttnRes looks completely different. The magnitudes show a kind of periodic reset at block boundaries, fluctuating roughly between `0.21` and `1.91`, without the same runaway upward drift.

This matters because it suggests AttnRes is not merely &quot;a few more benchmark points at the end.&quot; It is changing how representations accumulate across depth during training itself.

### 5.4 Downstream Tasks: The Biggest Gains Are in Reasoning and Code

After pretraining, AttnRes is no worse than the baseline on all listed evaluations, and several gains stand out:

- MMLU: `73.5 -&gt; 74.6`
- GPQA-Diamond: `36.9 -&gt; 44.4`
- Math: `53.5 -&gt; 57.1`
- HumanEval: `59.1 -&gt; 62.2`
- C-Eval: `79.6 -&gt; 82.5`

The most interesting part is that gains are larger on tasks like GPQA, Math, and HumanEval, where multi-step reasoning or program synthesis matter more. The report&apos;s explanation is that if later layers can retrieve earlier-layer representations more selectively, compositional tasks benefit more. I think that explanation makes sense.

Complex reasoning is often not limited by missing information. It is limited by important information getting buried deep inside the network.

## 6. What the Ablations Tell Us

The ablation section is strong because it does not only show that the method helps. It also tries to show why.

Some of the most interesting takeaways:

- **DenseFormer reaches 1.767, almost identical to the baseline at 1.766.**  
  So merely being able to access all previous layers is not enough. What matters is whether the weighting is input-dependent.

- **mHC gets to 1.747, which is already a clear improvement.**  
  That suggests dynamic mixing along the depth dimension is genuinely useful.

- **Full AttnRes reaches 1.737.**  
  Lower than the baseline, DenseFormer, and mHC, which suggests explicit softmax depth attention is the stronger route.

- **SWA, which only looks at a recent window, gets 1.764.**  
  That is valuable because it shows the gain is not just &quot;look at the most recent few layers.&quot; The gain comes from selectively reaching further back when needed.

- **Changing the block size among 2, 4, and 8 keeps loss around 1.746.**  
  That is why the authors settle on roughly 8 blocks in the end. It is not arbitrary. It is a good engineering-effectiveness sweet spot.

- **An input-dependent query version reaches 1.731, even better than Full AttnRes.**  
  This is especially interesting. It means the pseudo-query design in the report is not the performance ceiling. It is a compromise chosen to make infrastructure optimizations easier. In other words, the authors are not unaware of stronger variants. They are deliberately choosing a more scalable one.

That is one reason I like this report. When you read the main text, the ablations, and the systems section together, you can see the real trade-off clearly: the goal is not blindly minimizing loss at any cost. The goal is something strong enough, while still trainable in practice.

## 7. How I Read This Report

First, the most important thing here is not that the report invents a new module. It is that it elevates residual connections from &quot;optimization stability tool&quot; back into &quot;information routing mechanism.&quot;

Once you adopt that lens, many old questions get reframed. Residuals stop looking like mere gradient highways. They become depth aggregation rules. And then new questions appear almost automatically:

- can each layer selectively access earlier layers?
- are there attention-sink-like effects along depth?
- were older residual variants already doing something like depth-wise linear attention?

That is exactly where the discussion section becomes unusually interesting. The authors reinterpret a bunch of residual variants through the lens of a `depth mixing matrix`, and go one step further:

**Many existing methods are, in essence, doing linear attention along the depth dimension; AttnRes is doing softmax attention along depth.**

That is a bold framing, but a very productive one. It is basically saying: the Transformer once moved the sequence dimension from recurrence toward softmax attention; AttnRes is trying to push the depth dimension one step further too.

Second, the report feels like an example of &quot;ask the problem correctly first, then make the system workable.&quot; It does not obsess over making every local piece maximally fancy. For example, the query is intentionally layer-specific instead of token-dependent. That may not be the absolute strongest choice in terms of raw performance, but it creates room for batching, two-phase computation, and pipeline caching. A deployable technical report is often not about the flashiest local design. It is about what survives under global constraints.

Third, the line I think is most worth remembering from this whole report is really a question:

**Why is depth-wise aggregation still fixed while everything else has become adaptive?**

That is exactly the right question to ask.

## 8. Where the Report Stops

Before praising it too much, it is worth being clear about the boundaries.

First, this is still a **technical report / arXiv preprint**, not a peer-reviewed conference paper. The safest attitude is not &quot;it has proven the future.&quot; It is &quot;it has proposed a powerful lens and backed it with an implementation that looks engineering-feasible.&quot;

Second, the large-scale results are tied to the Kimi Linear line of architecture: MoE, hybrid KDA/MLA attention, and a Moonlight / DeepSeek-V3-style training recipe. That does not weaken the result, but it does mean we should not automatically extrapolate it to every dense decoder-only Transformer.

Third, the report itself admits that Full AttnRes is stronger, while Block AttnRes is the practical answer under today&apos;s hardware constraints. If memory, bandwidth, and interconnect improve further, or if more efficient variants of depth attention appear, today&apos;s block design probably will not be the endpoint.

So my view is:

- it is already strong enough to deserve serious reading
- it is already complete enough to deserve serious reproduction work
- it is not yet settled enough to justify a final verdict

## 9. Final Impression

If you reduce the last decade of large-model architecture progress to a very rough storyline:

- Seq2Seq asked: how do we compress one sequence into another?
- Bahdanau asked: why can&apos;t decoding look back at different positions in the input?
- Transformer asked: why must sequence modeling depend on recurrence?
- Chinchilla asked: why should extra compute mainly go into parameter count?

Then *Attention Residuals* asks:

**Why is information aggregation across depth still living in the era of &quot;sum every historical layer equally&quot;?**

That question alone is already valuable.

I do not know whether AttnRes will become a default configuration in a few years the way PreNorm did. But I am quite sure this technical report turns residual connections back into something worth thinking about, designing, and optimizing.

People used to say attention rewrote sequence modeling.

This report is trying to rewrite residuals.

In spring 2026, the Kimi team&apos;s work already makes one thing clear: when Scaling Laws begin to show signs of nearing a bottleneck, structural innovation in LLMs will continue to emerge.

---

**Further Reading**

- [*Sequence to Sequence Learning with Neural Networks*](/posts/sequence-to-sequence-learning-with-neural-networks/) — the establishment of the encoder-decoder paradigm
- [*Neural Machine Translation by Jointly Learning to Align and Translate*](/posts/neural-machine-translation-by-jointly-learning-to-align-and-translate/) — the origin of the attention mechanism
- [*Attention Is All You Need*](/posts/attention-is-all-you-need/) — when attention became the lead role and the Transformer was born
- [*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](/posts/bert/) — the establishment of the pretraining paradigm
- [*Scaling Laws for Neural Language Models*](/posts/scaling-laws-for-neural-language-models/) — the mathematics of scale
- [*Language Models are Few-Shot Learners*](/posts/language-models-are-few-shot-learners/) — bigger models become better at pulling abilities out of context
- [*Training Compute-Optimal Large Language Models*](/posts/training-compute-optimal-large-language-models/) — how to spend compute wisely</content:encoded><category>Technical Report Reading</category><category>technical-report-reading</category><category>residual-connections</category><category>transformer</category><category>AI</category><category>LLM</category><category>python</category></item><item><title>Paper Reading: Training Compute-Optimal Large Language Models</title><link>https://justinhuangai.github.io/posts/training-compute-optimal-large-language-models/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/training-compute-optimal-large-language-models/</guid><description>The Chinchilla paper — why most large models were undertrained, and how to spend your compute budget wisely, with real Python code examples</description><pubDate>Wed, 11 Mar 2026 08:58:04 GMT</pubDate><content:encoded>On March 29, 2022, a team of researchers from DeepMind uploaded a paper to arXiv (a preprint server where researchers can publish papers without waiting for journal peer review): [*Training Compute-Optimal Large Language Models*](/papers/2203.15556v1.pdf).

The first author is Jordan Hoffmann, with co-authors including Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, and many others — all at DeepMind at the time. Arthur Mensch would later co-found Mistral AI, one of Europe&apos;s most prominent AI companies.

The paper is often called the &quot;Chinchilla paper,&quot; after the 70-billion-parameter model the team trained to validate their findings. That name stuck — not the paper&apos;s title, but the animal. In AI circles, &quot;Chinchilla scaling&quot; became shorthand for the paper&apos;s central claim.

And that claim was simple, bold, and uncomfortable for most of the industry: **many of the biggest language models of 2022 were not &quot;too small&quot; — they were significantly undertrained given their compute budgets.**

## 1. The Question

By early 2022, the AI community had internalized a clear lesson from [Kaplan et al. (2020)](/posts/scaling-laws-for-neural-language-models/): bigger models are predictably better. The scaling laws paper had shown that performance follows power laws, and that for a given compute budget, you should make the model as large as possible.

The industry took that advice to heart. By spring 2022, GPT-3 had 175 billion parameters trained on 300 billion tokens. DeepMind&apos;s own Gopher had 280 billion parameters trained on 300 billion tokens. Google would soon release PaLM with 540 billion parameters. The trend was clear: crank up the parameter count.

But there was a problem hiding in plain sight. Kaplan et al. had concluded that when you scale compute, most of the budget should go to model size (N ∝ C^0.73) and relatively little to training data (D ∝ C^0.27). This meant: make the model huge, train it on a moderate amount of data.

Hoffmann&apos;s team asked a simple question: is that actually right?

## 2. Three Independent Approaches, One Answer

What makes this paper unusually convincing is its methodology. The team did not rely on a single experiment. They approached the same question from three completely independent angles, and all three converged on the same answer.

**Approach 1: Fix the compute, vary the split.** They trained over 400 models ranging from 70 million to over 16 billion parameters, each with a different allocation between model size and training data, but with the same total compute. For each compute level, they found which model size minimized loss.

**Approach 2: IsoFLOP profiles.** They trained models of 9 different sizes (from 70M to 10B parameters) on varying amounts of data, specifically designed so each group of runs used approximately the same total compute. Then they fit curves to find the optimal model size for each compute level.

**Approach 3: Fit a parametric loss function.** They fit the following equation to all their training runs:

$$
\hat{L}(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}
$$

Where E is the irreducible loss (the entropy of natural language — no model can do better), A/N^α captures the model-size bottleneck, and B/D^β captures the data bottleneck. From the fitted parameters, they derived optimal N and D as functions of compute.

All three approaches agreed:

$$
N_{\mathrm{opt}} \propto C^a, \quad D_{\mathrm{opt}} \propto C^b, \quad a \approx 0.50, \quad b \approx 0.50
$$

```python
def optimal_scaling(compute: float) -&gt; tuple[float, float]:
    # Both exponents are ~0.5: parameters and tokens scale at the same rate.
    a = 0.50
    b = 0.50
    # Proportionalities only; the constant prefactors are omitted in this sketch.
    n_opt = compute ** a
    d_opt = compute ** b
    return n_opt, d_opt
```

The exponents a ≈ b ≈ 0.5 mean that as compute grows, model size and training data should scale at approximately the same rate. When compute grows 10x, both should increase by roughly 3.2x; when compute doubles, both increase by roughly 1.4x. In other words, for every doubling of model size, the number of training tokens should also double. This directly contradicts Kaplan et al., who said compute should be spent primarily on model size.

## 3. Why Kaplan Got It Wrong

This is not a story of one team being careless while the other was rigorous. Both teams did careful work. The difference lies in experimental setup, and that difference is what drove the conflicting optimal-allocation conclusions.

Kaplan et al. used a fixed learning rate schedule that did not adjust for training duration. When you train a model for more steps without adjusting the learning rate schedule, performance suffers — not because the model is inherently worse, but because the optimization is suboptimal. This made long training runs look less effective than they actually are, biasing the results toward larger models trained for fewer steps.

Hoffmann&apos;s team adjusted the learning rate schedule for each training run, ensuring each configuration got a fair shot. When you do this, training longer on more data turns out to be much more valuable than Kaplan&apos;s numbers suggested.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class TrainingConfig:
    n_params: float
    n_tokens: float
    schedule: Literal[&quot;fixed&quot;, &quot;cosine_with_warmup&quot;]
    warmup_steps: int
    total_steps: int
```
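
For concreteness, here is one common shape of a cosine-with-warmup schedule. This is an illustration of what &quot;adjusting the schedule to the run length&quot; means, not DeepMind&apos;s exact recipe:

```python
import math


def cosine_with_warmup_lr(
    step: int,
    warmup_steps: int,
    total_steps: int,
    peak_lr: float,
    min_lr: float = 0.0,
) -> float:
    if warmup_steps > step:
        # Linear warmup from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to min_lr, finishing exactly at total_steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

The crux of the critique is that `total_steps` must match the actual run: a schedule stretched far past the end of training leaves the learning rate too high at the finish, which makes long data runs look artificially weak.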

## 4. The Parametric Loss Function

The paper&apos;s Approach 3 deserves a closer look because it gives a complete mathematical model of performance:

$$
\hat{L}(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}
$$

Where the fitted constants are:

- E = 1.69 — the irreducible loss (entropy of natural language)
- A = 406.4, α = 0.34 — the model-size term
- B = 410.7, β = 0.28 — the data term

The structure of this equation is worth studying. Loss has three components: a floor you can never get below (E), a penalty for having too few parameters (A/N^α), and a penalty for having too little data (B/D^β). The model-size penalty and data penalty are additive — they compete for your attention and your compute budget.

```python
def estimated_loss(n_params: float, n_tokens: float) -&gt; float:
    # Fitted constants from the paper; n_params and n_tokens are raw counts.
    e = 1.69
    a = 406.4
    alpha = 0.34
    b = 410.7
    beta = 0.28
    return e + a / (n_params ** alpha) + b / (n_tokens ** beta)


def optimal_params_and_tokens(compute_flops: float) -&gt; tuple[float, float]:
    # Minimizing the loss above subject to C = 6 * N * D gives
    # N_opt = G * (C/6)^a and D_opt = (1/G) * (C/6)^b,
    # with G = (alpha * A / (beta * B)) ** (1 / (alpha + beta)).
    alpha = 0.34
    beta = 0.28
    big_a = 406.4
    big_b = 410.7
    a = beta / (alpha + beta)
    b = alpha / (alpha + beta)
    g = ((alpha * big_a) / (beta * big_b)) ** (1.0 / (alpha + beta))

    base = compute_flops / 6.0
    n_opt = g * (base ** a)
    d_opt = (1.0 / g) * (base ** b)
    return n_opt, d_opt
```

## 5. The Damning Table

The paper&apos;s Table 1 lists the actual parameter counts and training tokens for several well-known models. Table 3 gives the compute-optimal token estimates for different model sizes. Put the two tables side by side, and you get something that reads like an audit of the entire industry:

| Model | Parameters | Tokens Used | Chinchilla-Optimal Tokens |
|-------|-----------|-------------|---------------------------|
| GPT-3 | 175B | 300B | 3.7T |
| Gopher | 280B | 300B | 5.9T |
| Jurassic-1 | 178B | 300B | 3.7T |
| MT-NLG | 530B | 270B | 11.0T |

Every single model was trained on roughly 300 billion tokens. But according to Chinchilla&apos;s analysis, GPT-3 should have been trained on 3.7 trillion tokens — more than 12 times what it actually saw. Gopher should have seen nearly 6 trillion. MT-NLG, the largest of the bunch at 530 billion parameters, should have been trained on 11 trillion tokens — 40 times its actual training data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelComparison:
    name: str
    params_billions: float
    tokens_used_billions: float
    optimal_tokens_billions: float


def industry_models() -&gt; list[ModelComparison]:
    return [
        ModelComparison(&quot;GPT-3&quot;, 175.0, 300.0, 3_700.0),
        ModelComparison(&quot;Gopher&quot;, 280.0, 300.0, 5_900.0),
        ModelComparison(&quot;Jurassic-1&quot;, 178.0, 300.0, 3_700.0),
        ModelComparison(&quot;MT-NLG&quot;, 530.0, 270.0, 11_000.0),
    ]
```

The pattern is striking. The entire industry had settled on roughly the same amount of training data — around 300 billion tokens — regardless of model size. It was as if everyone had decided that 300B tokens was &quot;enough&quot; and poured all additional compute into making models bigger. Chinchilla says this was exactly backwards.

## 6. The Proof: Chinchilla vs. Gopher

To validate their theory, the team trained Chinchilla: a 70-billion-parameter model on 1.4 trillion tokens. Chinchilla used the same compute budget as Gopher (280 billion parameters, 300 billion tokens) — the same total training cost, just allocated differently.

The result was decisive. Chinchilla outperformed Gopher on nearly every benchmark, despite being 4 times smaller:

- **MMLU** (Massive Multitask Language Understanding): Chinchilla 67.6% vs. Gopher 60.0% vs. GPT-3 43.9%
- **Reading comprehension** (RACE-h): Chinchilla 73.3% vs. Gopher 71.6%
- **Common sense** (HellaSwag): Chinchilla 80.8% vs. Gopher 79.2%
- **BIG-bench**: Chinchilla outperformed Gopher on the majority of tasks

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    name: str
    params_billions: float
    tokens_billions: float
    mmlu_accuracy: float


def chinchilla_vs_gopher() -&gt; tuple[float, float]:
    gopher = ModelConfig(&quot;Gopher&quot;, 280.0, 300.0, 60.0)
    chinchilla = ModelConfig(&quot;Chinchilla&quot;, 70.0, 1_400.0, 67.6)

    # Approximate training compute with the standard C ≈ 6 · N · D FLOPs estimate.
    gopher_flops = 6.0 * gopher.params_billions * 1e9 * gopher.tokens_billions * 1e9
    chinchilla_flops = 6.0 * chinchilla.params_billions * 1e9 * chinchilla.tokens_billions * 1e9
    return gopher_flops, chinchilla_flops
```

A model that is 4 times smaller beating the larger model on nearly every benchmark — using the same compute — is a powerful demonstration. The compute was not wasted; it was simply redirected from parameters to data.

## 7. The Practical Consequences

The Chinchilla paper had immediate, concrete consequences for the industry.

**Smaller models are cheaper to use.** Training cost is a one-time expense, but inference cost — the cost of actually running the model to generate text — scales with model size, every single time a user sends a query. A 70B model is 4x cheaper to serve than a 280B model. If the smaller model performs better, the win is double: better quality at lower cost.

**Data became the bottleneck.** Before Chinchilla, the limiting factor was compute: how many GPUs can you get? After Chinchilla, the limiting factor shifted to data: where do you find trillions of high-quality tokens? This sparked an industry-wide scramble for training data — web scraping at massive scale, dataset curation efforts, and eventually the synthetic data movement.

**The LLaMA moment.** Meta&apos;s LLaMA (February 2023) was arguably the most direct application of Chinchilla scaling. LLaMA-13B, trained on 1 trillion tokens, outperformed GPT-3 (175B) on most benchmarks. LLaMA-65B, trained on 1.4 trillion tokens, was competitive with Chinchilla and PaLM-540B. Meta explicitly cited the Chinchilla paper and deliberately trained smaller models on far more data than earlier conventions would have suggested.

```python
def inference_cost_comparison() -&gt; tuple[float, float]:
    # Serving cost per token scales roughly linearly with parameter count,
    # so parameter counts (in billions) serve as relative cost units here.
    gopher_cost_per_token = 280.0
    chinchilla_cost_per_token = 70.0

    queries_per_day = 1_000_000.0
    tokens_per_query = 500.0

    daily_cost_gopher = queries_per_day * tokens_per_query * gopher_cost_per_token
    daily_cost_chinchilla = queries_per_day * tokens_per_query * chinchilla_cost_per_token
    return daily_cost_gopher, daily_cost_chinchilla
```

## 8. My Takeaways

First, this paper is a correction — and a graceful one. It takes Kaplan et al.&apos;s framework, identifies a methodological flaw (fixed learning rate schedules), fixes it, and arrives at a different answer. It does not dismiss the earlier work; it builds on it. The parametric loss function L̂(N, D) = E + A/N^α + B/D^β is a refinement of Kaplan&apos;s formulation, not a replacement. Science at its best is exactly this: someone does careful work, someone else does more careful work, and the field moves forward.

Second, the paper&apos;s most surprising finding is not the math — it is the gap between theory and practice. Everyone in the industry could see that 300 billion tokens was becoming a default. Nobody questioned it seriously until this team ran the numbers. The models were not small; they were starved. The solution was not to build bigger — it was to feed more.

Third, the equal-scaling result (a ≈ b ≈ 0.5) is beautiful in its simplicity. There is no asymmetry between model size and data. If you have more compute, scale both equally. No complicated allocation strategy needed. &quot;Where should I spend my next dollar of compute?&quot; Chinchilla&apos;s answer is not to keep betting on parameter count alone, but to let model size and training data grow at approximately the same rate.
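That rule is simple enough to state in a few lines. A minimal sketch, with the proportionality constants omitted (so the outputs are relative scale factors, not absolute sizes):

```python
import math


def equal_scaling(compute_growth: float) -> tuple[float, float]:
    """With N_opt ∝ C^0.5 and D_opt ∝ C^0.5, a compute budget that grows by
    compute_growth scales model size and data by the same square-root factor."""
    factor = math.sqrt(compute_growth)
    return factor, factor
```

Ten times the compute means roughly 3.2x the parameters on 3.2x the data; a hundred times means 10x and 10x.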

Fourth, the practical legacy is enormous. Before Chinchilla, the path to better AI was &quot;make it bigger.&quot; After Chinchilla, the path became &quot;train it better.&quot; This one shift made powerful models accessible to organizations that could not afford the largest parameter counts but could curate large datasets. LLaMA, Mistral, and the entire open-source LLM ecosystem owe a direct debt to this insight.

The Kaplan paper said: bigger models are predictably better. The Chinchilla paper said: yes, but you have been making them big in the wrong way. Stop hoarding parameters. Start feeding data.

One paper gave the industry permission to scale. The other taught it how.

---

**Paper Reading Series**

- [*Sequence to Sequence Learning with Neural Networks*](/posts/sequence-to-sequence-learning-with-neural-networks/) — Establishing the encoder-decoder paradigm
- [*Neural Machine Translation by Jointly Learning to Align and Translate*](/posts/neural-machine-translation-by-jointly-learning-to-align-and-translate/) — The origin of attention
- [*Attention Is All You Need*](/posts/attention-is-all-you-need/) — Attention takes center stage: the birth of the Transformer
- [*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](/posts/bert/) — Establishing the pre-training paradigm
- [*Scaling Laws for Neural Language Models*](/posts/scaling-laws-for-neural-language-models/) — The mathematics of scale
- [*Language Models are Few-Shot Learners*](/posts/language-models-are-few-shot-learners/) — Larger models, better at eliciting abilities from context</content:encoded><category>Paper Reading</category><category>paper-reading</category><category>chinchilla</category><category>scaling-laws</category><category>AI</category><category>LLM</category><category>python</category></item><item><title>Paper Reading: Scaling Laws for Neural Language Models</title><link>https://justinhuangai.github.io/posts/scaling-laws-for-neural-language-models/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/scaling-laws-for-neural-language-models/</guid><description>The mathematics of scale — why bigger models are predictably better, with real Python code examples</description><pubDate>Sun, 01 Mar 2026 08:45:39 GMT</pubDate><content:encoded>On January 23, 2020, a team of ten researchers from OpenAI uploaded a paper to arXiv (a preprint server where researchers can publish papers without waiting for journal peer review): [*Scaling Laws for Neural Language Models*](/papers/2001.08361v1.pdf).

The ten were Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. All at OpenAI at the time.

That author list is striking in retrospect. Jared Kaplan and Sam McCandlish are theoretical physicists by training — Kaplan was a string theory professor at Johns Hopkins before joining OpenAI. Dario Amodei was VP of Research. Tom B. Brown would later be the first author of the GPT-3 paper. Alec Radford designed GPT-1 and GPT-2. Within two years, Kaplan, McCandlish, and Amodei would leave OpenAI to co-found Anthropic (the company behind Claude).

String theorists have a habit: they look for universal laws.

That habit is all over this paper.

## 1. The Question

By early 2020, the deep learning community already knew that bigger models tended to perform better. But &quot;tended to&quot; is not science. People could not answer basic practical questions: if I double my compute budget, how much will performance improve? Should I spend that budget on a bigger model, more data, or longer training? Is there a formula?

This paper answered those questions. Not with intuition, not with rules of thumb — with equations.

## 2. Power Laws: The Core Discovery

The paper&apos;s central finding is that language model performance follows **power laws**. Within the range the paper measured, when performance is primarily bottlenecked by one factor and not constrained by the other two, test loss (a measure of how well the model predicts the next word — lower is better) plotted against model size, dataset size, or compute forms an approximately straight line on a log-log plot.

Three equations summarize the entire paper:

$$
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \quad \alpha_N \approx 0.076
$$

$$
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \quad \alpha_D \approx 0.095
$$

$$
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}, \quad \alpha_C \approx 0.050
$$

Do not panic at the notation. Let&apos;s break it down:

- **L** is the test loss — a single number that captures how well the model performs. Lower is better
- **N** is the number of parameters (model size). More parameters means the model can store more patterns
- **D** is the number of data tokens the model is trained on. More data means more patterns to learn from
- **C** is the total compute used for training, measured in PetaFLOP-days (one PetaFLOP-day = 10^15 floating-point operations per second, sustained for a full day; roughly 8.64 × 10^19 operations in total)
- **N_c, D_c, C_c** are constants (reference points on the curve)
- **α** (alpha) is the exponent — it tells you the slope of the line on a log-log plot. A bigger exponent means performance improves faster as you scale up

The key insight: these are power laws, not logarithmic curves. A logarithmic curve flattens out quickly — doubling the input barely moves the output. A power law is far more generous: across the full range the paper measured, performance improved steadily along the trend, with no sign of hitting a wall. The paper is careful to note that this cannot continue forever — loss must eventually flatten — but within the observed range, the trend held cleanly.

```python
def power_law_loss(x: float, x_c: float, alpha: float) -&gt; float:
    return (x_c / x) ** alpha


def scaling_law_examples() -&gt; dict[str, float]:
    alpha_n = 0.076
    alpha_d = 0.095
    alpha_c = 0.050

    # Factor by which loss shrinks for a 10x scale-up: L_old / L_new = 10 ** alpha.
    return {
        &quot;10x_params&quot;: 10.0 ** alpha_n,
        &quot;10x_data&quot;: 10.0 ** alpha_d,
        &quot;10x_compute&quot;: 10.0 ** alpha_c,
    }
```

The exponents tell a story. Dataset size (α = 0.095) yields the most improvement per factor of scaling. Model size (α = 0.076) is next. Compute (α = 0.050) yields the least — because scaling compute without properly allocating it between model size and training time is wasteful. The real leverage comes from scaling the right thing.

## 3. Within the Tested Range, Architecture Shape Matters Less Than Scale

Here is where the paper surprised everyone.

The team tested Transformers with different depths (number of layers), widths (hidden dimension), attention heads, and feed-forward dimensions. Within the range of Transformer shapes they tested, as long as the total non-embedding parameter count was similar, performance differences were remarkably small.

A Transformer with 2 layers and a massive hidden dimension? Roughly the same loss as one with 40 layers and a small hidden dimension — given a comparable non-embedding parameter budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchitectureExperiment:
    n_layers: int
    d_model: int
    n_heads: int
    d_ff: int


def non_embedding_params(config: ArchitectureExperiment) -&gt; int:
    # Per layer: attention (4 · d²) + feed-forward (2 · d · d_ff) + small bias terms.
    # n_heads is absent on purpose: head count redistributes parameters, not adds them.
    n = config.n_layers
    d = config.d_model
    d_ff = config.d_ff
    return n * (4 * d * d + 2 * d * d_ff + 4 * d)
```

This has a profound implication: you do not need to spend weeks searching for the &quot;optimal&quot; architecture. Just pick a reasonable Transformer shape, then focus your energy on scaling it up. The paper explicitly excluded embedding parameters from N because they found embedding parameters contributed far less to performance than non-embedding parameters — the model&apos;s &quot;thinking&quot; capacity lives in the Transformer layers, not the vocabulary table.
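To make &quot;shape matters less than scale&quot; concrete, here is a sketch that pins two very different shapes to the same budget, using the common approximation N ≈ 12 · n_layers · d_model² (which assumes d_ff = 4 · d_model and ignores bias terms):

```python
import math


def approx_params(n_layers: int, d_model: int) -> int:
    """Non-embedding parameters under N ≈ 12 · n_layers · d_model²."""
    return 12 * n_layers * d_model ** 2


def width_for_budget(n_layers: int, target_params: int) -> int:
    """Hidden dimension that hits target_params at the given depth."""
    return round(math.sqrt(target_params / (12 * n_layers)))


# Shallow-wide: 2 layers at d_model = 4096 gives a ~400M parameter budget.
BUDGET = approx_params(2, 4096)
# Deep-narrow: the same budget at 40 layers lands near d_model = 916.
DEEP_WIDTH = width_for_budget(40, BUDGET)
```

The paper&apos;s claim is that these two shapes end up at roughly the same loss.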

## 4. When Models Overfit: The Data Bottleneck

Bigger is not always better — not if your dataset is too small. The paper&apos;s real elegance here is a unified two-variable formula that captures how model size and dataset size jointly determine performance:

$$
L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}
$$

This formula says: loss is not just a function of model size or data size alone — it is a function of both at once. When N is large enough that the first term vanishes, the remaining term shows loss bottlenecked by data. When D is large enough, what remains is the model-size bottleneck. The formula smoothly interpolates between these two regimes and captures overfitting as a natural consequence of the two terms competing.

From this relationship, the paper derives a rough rule of thumb for when overfitting begins to bite:

$$
D \gtrsim 5 \times 10^3 \times N^{0.74}
$$

In plain language: as you make the model bigger, the amount of data you need grows — but sublinearly. A model that is 10 times larger needs only about 10^0.74 ≈ 5.5 times more data. Bigger models are more sample-efficient: they extract more information from each token of training data.

```python
def loss_nd(n_params: float, n_tokens: float) -&gt; float:
    n_c = 8.8e13
    d_c = 5.4e13
    alpha_n = 0.076
    alpha_d = 0.095
    ratio = alpha_n / alpha_d
    return ((n_c / n_params) ** ratio + d_c / n_tokens) ** alpha_d


def min_dataset_tokens(n_params: float) -&gt; float:
    return 5_000.0 * n_params ** 0.74
```

By this rough estimate, a 175-billion-parameter model would need close to a trillion tokens to keep overfitting within the paper&apos;s discussed threshold. GPT-3 was trained on approximately 300 billion tokens — well below that figure. In hindsight, GPT-3&apos;s data budget was not generous; it was arguably tight. This is one reason the industry later revisited the model-size-to-data ratio, most notably in the Chinchilla paper (Hoffmann et al., 2022), which argued that many large models had been undertrained relative to their optimal data allocation.
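That &quot;close to a trillion&quot; figure is easy to reproduce from the rule of thumb (a rough estimate, as the paper itself stresses):

```python
def gpt3_data_gap() -> float:
    """Ratio of the rule-of-thumb data requirement for a 175B-parameter model
    to the roughly 300B tokens GPT-3 actually saw."""
    required = 5_000.0 * (175e9) ** 0.74  # about 1.04e12 tokens
    return required / 300e9
```

By this estimate, GPT-3 saw roughly a third of the data the rule suggests, which is exactly the tension the Chinchilla paper later made precise.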

## 5. Compute-Efficient Training: The Real Punchline

If you have a fixed compute budget, how should you spend it? This is the most practically important question in the paper, and the answer is counterintuitive.

The paper found that optimal allocation follows:

$$
N_{\mathrm{opt}} \propto C^{0.73}
$$

$$
B_{\mathrm{opt}} \propto C^{0.24}
$$

$$
S_{\mathrm{opt}} \propto C^{0.03}
$$

Translation: if your compute budget grows 10x, you should make the model ~5.4x bigger, increase the batch size ~1.7x, and barely train longer (~1.07x more steps).

The counterintuitive part: **you should train very large models and stop significantly before convergence.** Most people&apos;s instinct is to fully train a smaller model. The scaling laws say the opposite — a partially trained large model outperforms a fully trained small model, given the same compute budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeAllocation:
    n_params: float
    batch_size: float
    training_steps: float


def optimal_allocation(compute: float) -&gt; ComputeAllocation:
    # Proportionality constants omitted: these are relative scalings, not absolute values.
    return ComputeAllocation(
        n_params=compute ** 0.73,
        batch_size=compute ** 0.24,
        training_steps=compute ** 0.03,
    )


def is_compute_efficient(n_params: float, compute: float) -&gt; bool:
    # Within 50% of the compute-efficient size counts as well-allocated here.
    optimal_n = compute ** 0.73
    return abs(n_params / optimal_n - 1.0) &lt; 0.5
```

This result shaped the entire industry. GPT-3, which came five months after this paper, directly followed this logic: train a 175-billion-parameter model that was enormous for its time, rather than fully training a smaller model. The later &quot;Chinchilla&quot; paper (Hoffmann et al., 2022) updated these exponents and argued that most large models were actually undertrained relative to optimal data allocation — but the core insight, that there is a computable optimal trade-off, originated here.

## 6. Critical Batch Size: Knowing When to Parallelize

The paper also discovered that there is a &quot;sweet spot&quot; for batch size, and it depends on the current loss:

$$
B_{\mathrm{crit}} \propto L^{-4.8}
$$

As training progresses and loss decreases, the critical batch size grows. Early in training, when loss is high, small batches are fine — each batch provides a strong enough gradient signal. Later, when the model has already learned the easy patterns, you need larger batches to average out noise and make progress.

Below the critical batch size, doubling the batch roughly halves training time (perfect parallelism). Above it, doubling the batch barely helps — you are just burning compute.

```python
def critical_batch_size(loss: float, b_star: float, l_star: float) -&gt; float:
    return b_star * (l_star / loss) ** 4.8
```

This is practical engineering wisdom. Many teams train with a fixed batch size throughout. The scaling laws say you should increase it as training progresses — start small, scale up as the model gets better.
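A quick numerical check of the relationship (b_star and l_star here are arbitrary reference values, not constants fitted in the paper):

```python
def critical_batch_size(loss: float, b_star: float, l_star: float) -> float:
    """B_crit grows as loss ** -4.8, relative to the reference point (l_star, b_star)."""
    return b_star * (l_star / loss) ** 4.8


# Halving the loss multiplies the critical batch size by 2 ** 4.8, about 28x.
EARLY = critical_batch_size(4.0, b_star=256.0, l_star=4.0)  # 256.0
LATE = critical_batch_size(2.0, b_star=256.0, l_star=4.0)
```

That factor of ~28 is why a fixed batch size is the wrong schedule: the batch that was right early in training is far too small later.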

## 7. My Takeaways

After reading this paper, a few things stand out.

First, the paper&apos;s deepest contribution is not any specific number. It is the demonstration that neural network performance is governed by simple, predictable laws. Before this paper, training large models was largely empirical — you tried things, you tweaked hyperparameters, you hoped for the best. After this paper, you could do math. You could predict how well a model would perform before training it. It at least pushed the most expensive, most consequential part of large model training — resource allocation — from empirical trial-and-error toward something estimable and plannable.

Second, the backgrounds of the authors matter. Kaplan and McCandlish brought the mindset of theoretical physics: measure precisely, fit power laws, look for universality. This is not how most machine learning papers are written. Most ML papers propose a new architecture and show it beats baselines on benchmarks. This paper proposed no new architecture. It proposed a way of thinking. The tool is not new — the insight is.

Third, the conclusion that &quot;you should make the model as large as possible, and you do not need to train it to completion&quot; is genuinely counterintuitive, and it reshaped how the industry allocates resources. Before this paper, the default was to pick a model size and train it until full convergence — spending the entire compute budget to squeeze every last drop of performance out of that model. After this paper, the question flipped: given the same compute budget, rather than training a small model to exhaustion, make the model as large as you can afford and stop when it is &quot;good enough&quot; — because a large model that has not finished training outperforms a small model that has been trained to the limit. That reasoning directly led to GPT-3 (175B parameters, 300B tokens) and influenced every large model that followed.

Fourth, from a historical perspective, this paper can be read as the theoretical foundation for the [GPT-3 paper](/posts/language-models-are-few-shot-learners/). GPT-3 cites it directly, and the GPT-3 paper explicitly shows that few-shot performance scales smoothly with model capacity. It is reasonable to see GPT-3&apos;s 175-billion-parameter bet as informed by the scaling laws — though the GPT-3 paper itself does not say &quot;we set the parameter count by plugging into Kaplan&apos;s formula.&quot; Still, without the confidence that scaling laws provided, the decision to train at that scale would have carried far more uncertainty.

&quot;Bigger models are better&quot; was just a feeling before 2020. This paper turned it into a set of equations — telling you how much better, how much it costs, and how to spend most efficiently.

The AI industry later became a compute race. After reading this paper, you understand why: it was not a blind arms race. Someone did the math first.

---

**Paper Reading Series**

- [*Sequence to Sequence Learning with Neural Networks*](/posts/sequence-to-sequence-learning-with-neural-networks/) — Establishing the encoder-decoder paradigm
- [*Neural Machine Translation by Jointly Learning to Align and Translate*](/posts/neural-machine-translation-by-jointly-learning-to-align-and-translate/) — The origin of attention
- [*Attention Is All You Need*](/posts/attention-is-all-you-need/) — Attention takes center stage: the birth of the Transformer
- [*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](/posts/bert/) — Establishing the pre-training paradigm
- [*Language Models are Few-Shot Learners*](/posts/language-models-are-few-shot-learners/) — Larger models, better at eliciting abilities from context
- [*Training Compute-Optimal Large Language Models*](/posts/training-compute-optimal-large-language-models/) — How to spend your compute budget wisely</content:encoded><category>Paper Reading</category><category>paper-reading</category><category>scaling-laws</category><category>AI</category><category>LLM</category><category>python</category></item><item><title>OpenClaw Deep Dive: Architecture Analysis 🦞</title><link>https://justinhuangai.github.io/posts/openclaw-architecture/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/openclaw-architecture/</guid><description>Dissecting the engineering skeleton of a self-hosted AI assistant, based on v2026.3.8 source code</description><pubDate>Tue, 24 Feb 2026 08:37:11 GMT</pubDate><content:encoded>![OpenClaw](/images/openclaw-logo-text-dark.webp)

The [last article](/posts/openclaw-ecosystem/) covered the ecosystem. This one tears apart the architecture.

OpenClaw&apos;s codebase is not small -- 430,000 lines of TypeScript. But the interesting part isn&apos;t the line count; it&apos;s the architectural choices. An AI assistant that has to juggle twenty-plus chat platforms, manage multiple Agents, and invoke tools on the fly -- how do you cram all of that into one system without it falling apart?

## 0. A Few Terms First

If you do not come from a systems background, that is fine. Six quick terms will make the rest much easier:

- `architecture`: how a system is split into layers, and what each layer is responsible for
- `Channel / adapter layer`: the layer that speaks each external chat platform&apos;s protocol
- `Gateway`: the central coordinator that routes messages and manages sessions and Agents
- `Node`: the execution endpoint that actually runs commands and interacts with devices
- `workspace`: the local folder where an Agent&apos;s config, memory, skills, and session files live
- `sandbox`: an isolated execution environment for risky operations

## 1. 3-Layer Architecture: Channel -&gt; Gateway -&gt; Node

First, the big picture:

![OpenClaw Architecture](/images/openclaw-architecture.svg)

Think of OpenClaw as a company with 3 departments:

| Who | What it does | Analogy |
|-----|-------------|---------|
| **Channel (Adapter Layer)** | Connects to WhatsApp, Telegram, Discord, Feishu, and 20+ other chat platforms | The front desk -- answers calls, receives mail, greets visitors |
| **Gateway** | Central dispatch; manages all sessions and Agents | The brain -- decides whose message goes where and how to reply |
| **Node (Execution Node)** | Does things on the device -- takes photos, captures screens, runs commands | The hands and feet -- the brain says &quot;take a photo,&quot; the Node goes and does it |

The heart of the entire system is the Gateway -- a long-running process that by default only listens on localhost (`127.0.0.1:18789`), never exposed to the public internet. Want remote access? Use a Tailscale tunnel; don&apos;t open ports directly. This design is called &quot;Loopback-First,&quot; and it&apos;s a clever security move: if you open zero ports, there&apos;s zero attack surface.

Why a single process instead of a distributed setup? The reason is practical: WhatsApp&apos;s protocol requires that only one device be online at a time. Spin up two processes and they&apos;ll fight each other. Rather than piling on coordination logic for the sake of &quot;architectural correctness,&quot; a single process handles everything end to end. For the vast majority of personal users, that&apos;s more than enough.

## 2. The Journey of a Single Message

You send your OpenClaw a message on Telegram: &quot;What&apos;s the weather in Beijing tomorrow?&quot; Here&apos;s what happens:

**Step 1: Intake.** The Telegram adapter receives the message and translates the Telegram-format data into OpenClaw&apos;s internal unified format. No matter which platform the message came from, the translated result looks identical.

**Step 2: Identity check and routing.** Who are you? Are you allowed to talk to this Agent? If you&apos;re a stranger, a pairing code is sent first -- the owner has to approve you. Once cleared, the routing engine decides which Agent gets the message.

**Step 3: Context assembly.** Your previous chat history is loaded from disk, along with the Agent&apos;s personality definition (SOUL.md), behavioral rules (AGENTS.md), and semantically relevant entries from the memory store. All of this is assembled into a complete &quot;brain briefing.&quot;

**Step 4: Query the LLM.** That briefing is sent to whichever LLM provider you&apos;ve configured -- Anthropic, OpenAI, DeepSeek, MiniMax, GLM, Qwen, Gemini, and others. The model streams its reply back token by token.

**Step 5: Tool execution.** If the model says &quot;I need to run a command to check the weather,&quot; the runtime intercepts the request and, based on security policy, decides where to execute it (admin commands run directly; stranger commands run inside a Docker sandbox). The result is fed back to the model to continue generating.

**Step 6: Reply.** The final answer is formatted according to Telegram&apos;s requirements (character limits, Markdown rules, etc.), split if necessary, and sent back to you. The conversation is written to disk.

Across this entire chain, the only truly slow part is Step 4 -- waiting for the model to produce its first token. Everything else is millisecond-level.
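The whole journey condenses into a pipeline shape. The sketch below is illustrative Python, not OpenClaw&apos;s actual TypeScript, and every name in it is hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InboundMessage:
    # Step 1: the adapter fills in this unified format, so nothing downstream
    # needs to know which platform the message came from.
    channel: str
    sender_id: str
    text: str


def handle(msg: InboundMessage, allowed_senders: set[str]) -> str:
    # Step 2: identity check; strangers get a pairing flow, not an answer.
    if msg.sender_id not in allowed_senders:
        return "pairing-code-sent"
    # Steps 3-5 (context assembly, LLM call, tool execution) would run here.
    # Step 6: the reply is translated back into the source channel's format.
    return f"reply-via-{msg.channel}"
```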

## 3. The Channel Adapter Layer: Build Once, Talk Everywhere

OpenClaw integrates with over twenty chat platforms, but it doesn&apos;t write a full implementation from scratch for each one. It abstracts an &quot;adapter&quot; layer, and each platform&apos;s adapter is responsible for exactly four things:

- **Login**: WhatsApp scans a QR code, Telegram takes a Bot Token, iMessage uses native macOS capabilities -- each platform, its own rules
- **Translate incoming messages**: Text, images, replies, reactions -- everything gets translated into the internal unified format
- **Access control**: Who can DM, whether the bot replies only when @-mentioned in groups, which groups are allowed
- **Translate outgoing replies**: Adapt the Agent&apos;s response into the format each platform can display

Six major platforms are built in (WhatsApp, Telegram, Discord, Slack, Signal, iMessage); thirty-plus others (Feishu, LINE, Matrix, Mattermost, etc.) connect via plugins.

The biggest payoff of this abstraction: the Agent never needs to know which platform a message came from. The Agent you chat with on WhatsApp is the same one you chat with on Telegram -- identical logic, fully reused. Channels are just megaphones; swapping the megaphone doesn&apos;t change the person speaking.
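Those four responsibilities amount to an interface contract. A hypothetical Python sketch (OpenClaw&apos;s real adapters are TypeScript, and these method names are mine):

```python
from abc import ABC, abstractmethod


class ChannelAdapter(ABC):
    """The four things every adapter owes the rest of the system."""

    @abstractmethod
    def login(self) -> None: ...          # QR code, bot token, native APIs...

    @abstractmethod
    def to_internal(self, raw: object) -> dict: ...  # platform format to unified format

    @abstractmethod
    def is_allowed(self, sender_id: str) -> bool: ...  # access control

    @abstractmethod
    def to_platform(self, reply: str) -> object: ...  # unified reply to platform format


class EchoAdapter(ChannelAdapter):
    """Toy adapter showing how little each platform-specific class must provide."""

    def login(self) -> None:
        pass

    def to_internal(self, raw: object) -> dict:
        return {"text": str(raw)}

    def is_allowed(self, sender_id: str) -> bool:
        return sender_id == "admin"

    def to_platform(self, reply: str) -> object:
        return reply
```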

## 4. The Agent Runtime and Workspace

The Agent runtime&apos;s core comes from Pi-mono (an open-source coding Agent), embedded directly within the Gateway process.

### The &quot;Everything Is a Text File&quot; Workspace

This is one of OpenClaw&apos;s most compelling designs -- every piece of configuration for an Agent is a plain text file you can open and edit directly:

```
workspace/
├── AGENTS.md          # Defines who the Agent is and how it operates (think resume + employee handbook)
├── SOUL.md            # Soul definition -- personality, values (set it and leave it alone)
├── USER.md            # User profile -- your name, preferences
├── MEMORY.md          # Long-term memory -- important things the Agent actively notes down
├── HEARTBEAT.md       # Scheduled tasks -- e.g., report weather every morning
├── memory/            # The diary
│   └── YYYY-MM-DD.md  # One page per day, append-only
├── skills/            # Skill packs
└── sessions.json      # Session records
```

No database, no dedicated admin panel -- just plain text files. Want to change the Agent&apos;s personality? Open SOUL.md and tweak a couple of lines. Want to see what it remembers? Open MEMORY.md and read it. That &quot;you can see its entire brain&quot; transparency is something most closed AI products simply cannot offer.

### How Each Conversation Turn Runs

1. **Identify the caller**: Are you the admin in a direct chat, a friend in a DM, or someone who @-mentioned the bot in a group? Different sources, different security levels
2. **Assemble memory**: Load chat history, personality and rules, search relevant long-term memories, stitch it all into a complete context
3. **Query the LLM**: Send to the configured model, with fallback support -- if the primary model goes down, it automatically switches to the backup
4. **Act and remember**: Execute tool calls, update the conversation log

Context assembly isn&apos;t a brute-force dump of everything. Irrelevant skills aren&apos;t loaded, unnecessary tools aren&apos;t injected -- saving tokens is saving money.
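The fallback in step 3 is simple to picture. A hypothetical sketch (the provider callables and the exception type are stand-ins, not OpenClaw&apos;s API):

```python
def query_with_fallback(prompt: str, providers: list) -> str:
    """Try each configured LLM provider in order; the first success wins."""
    failures = 0
    for provider in providers:
        try:
            return provider(prompt)
        except RuntimeError:  # stand-in for a provider outage or rate limit
            failures += 1
    raise RuntimeError(f"all {failures} providers failed")
```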

## 5. 4 Layers of Memory: Making AI Actually Remember You

Most chatbots forget you the moment you close the window. OpenClaw doesn&apos;t. It has 4 layers of memory, from the deepest &quot;who am I&quot; to the shallowest &quot;what did we just talk about&quot;:

| Layer | What it is | Analogy |
|-------|-----------|---------|
| **SOUL** | Personality definition, never changes | Your character -- set from birth |
| **TOOLS** | Currently installed skills | The tools you brought with you today |
| **USER** | Long-term memory about you | An old friend who remembers your favorite food |
| **Session** | The current conversation | The topic you&apos;re discussing right now |

A few clever mechanisms:

**Daily diary.** Each day&apos;s conversations are automatically written into `memory/2026-03-12.md` (append-only, never overwritten). When the next conversation starts, the Agent automatically flips through today&apos;s and yesterday&apos;s diary to maintain continuity -- &quot;Did you end up buying that book you mentioned yesterday?&quot;

**Automatic memory rescue.** What happens when a conversation runs long and the context window is almost full? OpenClaw quietly runs an invisible turn in the background, saving critical information to MEMORY.md, then compresses the older content. You don&apos;t notice this happening, but the key facts are preserved. This mechanism is called Pre-Compaction.

**Semantic search.** You say &quot;that deployment issue we talked about before,&quot; and it can actually find it in memory -- not through exact keyword matching, but through semantic understanding. Under the hood, it&apos;s a dual approach: vector search (SQLite-vec) plus traditional keyword search (BM25).

**Cross-platform identity.** You chat with it on Telegram for half an hour, then switch to WhatsApp and keep going -- it knows you. The same person&apos;s IDs across different platforms are linked to a single identity, sharing the same memory. But group chat memories are isolated -- what you said in a group won&apos;t leak into your private conversation.
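Cross-platform identity boils down to a mapping from (platform, platform-specific id) pairs to one canonical identity, with group memory keyed separately. A hypothetical sketch, not OpenClaw&apos;s actual schema:

```python
# Hypothetical identity links: (platform, platform-specific id) to one canonical name.
IDENTITY_LINKS = {
    ("telegram", "@justin"): "justin",
    ("whatsapp", "+15550001111"): "justin",
}


def memory_key(platform: str, platform_id: str, group_id: str = "") -> str:
    """DMs share one memory per canonical identity; group memories stay isolated."""
    if group_id:
        return f"group:{group_id}"
    canonical = IDENTITY_LINKS.get((platform, platform_id), platform_id)
    return f"user:{canonical}"
```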

## 6. The Tool System: Just 4 Knives

This is OpenClaw&apos;s most &quot;rebellious&quot; design choice.

Other AI Agent frameworks try to cram in a hundred built-in tools. OpenClaw gives you exactly four:

| Tool | One-liner |
|------|-----------|
| **Read** | Read a file |
| **Write** | Write a file |
| **Edit** | Modify a file |
| **Bash** | Run a command |

That&apos;s it. Seriously.

The founder&apos;s logic: with a command line (Bash), you can do anything. Check the weather? `curl` it. Send an email? Call a CLI tool. Query a database? `psql` does the job. No need to pre-build a dedicated tool for every scenario.

This is the Unix philosophy -- small tools, composable, text streams. The tradeoff? You need a model smart enough to figure out which command to run on its own. That&apos;s why OpenClaw recommends a Claude Opus-tier model. Weaker models may not cut it.

On top of these four core tools, there are 55 built-in Skills and the ClawHub skill marketplace. Skills can be installed and uninstalled -- think of them as apps for your Agent.

**Here&apos;s where it gets spicy: OpenClaw deliberately does not support MCP.** MCP is Anthropic&apos;s tool protocol standard, and seemingly every AI framework in the world is adopting it. OpenClaw refuses. Peter&apos;s exact words: &quot;MCP is garbage, it doesn&apos;t scale. You know what scales? CLI. Unix.&quot; For users who still need MCP servers, the escape hatch is a built-in `mcporter` bridge.

**Even more interesting is self-extension.** When an OpenClaw Agent encounters something it can&apos;t do, it writes a skill to handle it, then auto-installs it. Finds a bug in the skill? Fixes it and reloads. This means your Agent gets stronger over time through use -- it&apos;s essentially raising itself.

## 7. Multi-Agent Routing: One Brain, Multiple Personalities

A single Gateway can run several Agents simultaneously, each doing its own thing. The routing rules look like this:

```json
{
  &quot;bindings&quot;: [
    { &quot;agentId&quot;: &quot;home&quot;, &quot;match&quot;: { &quot;channel&quot;: &quot;whatsapp&quot;, &quot;accountId&quot;: &quot;personal&quot; } },
    { &quot;agentId&quot;: &quot;work&quot;, &quot;match&quot;: { &quot;channel&quot;: &quot;slack&quot; } },
    { &quot;agentId&quot;: &quot;bot&quot;, &quot;match&quot;: { &quot;channel&quot;: &quot;discord&quot;, &quot;guildId&quot;: &quot;123456&quot; } }
  ]
}
```

In plain English: messages from your personal WhatsApp go to the &quot;home assistant,&quot; Slack messages go to the &quot;work assistant,&quot; and a specific Discord server goes to the &quot;community bot.&quot; The three Agents are fully isolated -- each with its own personality, memories, skills, and security policies.

## 8. Security: 3 Doors

Your Agent runs on your own server. It can execute commands and read/write files -- so security is obviously a big deal. OpenClaw&apos;s security model has three doors:

**Door one: Who are you? (DM pairing)** A stranger messages your Agent, and the Agent doesn&apos;t just reply. It sends a 6-digit pairing code; you confirm it through an already-authenticated channel, and only then can the stranger interact. This is the default behavior -- turn it off and anyone who knows your number can burn through your API credits for free.

**Door two: VIP lane (allowlist).** Trusted people can be added directly to the allowlist (`allowFrom`), bypassing pairing and chatting immediately.

**Door three: Don&apos;t butt in (group rules).** In group chats, the Agent only replies when @-mentioned by default -- it won&apos;t pop up in response to every single message. This saves tokens and avoids annoying everyone.

Beneath these are layers of defense in depth: five-tier tool permission filtering, Docker sandbox isolation (stranger commands run in the sandbox), and security audit commands (`openclaw security audit`). These layers are independent -- pairing code compromised? The sandbox is still there. Sandbox bypassed? Tool policies still restrict what can be called.

## 9. Takeaways

**Single-process isn&apos;t laziness; it&apos;s pragmatism.** For individual users, one process handling everything is far more reliable than a distributed setup. Where&apos;s the ceiling? Probably when concurrent message volume exceeds what a single machine can handle -- and for the vast majority of people, that day will never come.

**The channel abstraction layer is the most valuable layer.** Twenty-plus platforms&apos; quirks are fully encapsulated in adapters; the Agent doesn&apos;t care where a message came from. Want to add a new platform? Write an adapter -- zero changes to Agent logic. The decoupling is exceptionally clean.

**The security design is serious, but the implementation still has gaps.** Architecturally, identity verification, sandboxing, and tool policies create defense in depth. But Kaspersky&apos;s audit found 512 vulnerabilities (8 critical), showing that the distance between a sound blueprint and actual security is measured in sustained engineering effort.

**The four-core-tool minimalism is a bet.** It&apos;s a bet that model capabilities will keep climbing -- powerful enough that you won&apos;t need pre-built tools, because &quot;everything is Bash-able.&quot; If LLM capabilities plateau, this path gets tough. But if model intelligence keeps rising, this might be the most elegant approach there is.

**The ultimate test isn&apos;t how pretty the architecture is, but how reliably it runs.** This analysis is based on static source code reading. Real-world performance -- Gateway stability under high concurrency, whether the sandbox truly withstands attacks, edge cases in cross-channel identity linking -- needs more production data to verify.

Architecture is just the skeleton. Production is the exam.</content:encoded><category>OpenClaw</category><category>AI</category><category>open-source</category><category>openclaw</category></item><item><title>Paper Reading: Language Models are Few-Shot Learners</title><link>https://justinhuangai.github.io/posts/language-models-are-few-shot-learners/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/language-models-are-few-shot-learners/</guid><description>Larger models, better at eliciting abilities from context, with real Python code examples</description><pubDate>Wed, 11 Feb 2026 08:22:54 GMT</pubDate><content:encoded>On May 28, 2020, OpenAI uploaded a 75-page paper to arXiv (a preprint server where researchers can publish papers without waiting for journal peer review): [*Language Models are Few-Shot Learners*](/papers/2005.14165v4.pdf).

The paper has 31 authors, all from OpenAI. The first author is Tom B. Brown, with notable co-authors including Jared Kaplan (a key researcher behind scaling laws), Alec Radford (the primary designer of GPT-1 and GPT-2), Ilya Sutskever (OpenAI co-founder and Chief Scientist), and Dario Amodei (OpenAI VP of Research).

That author list later fractured into some of the most important AI companies in the world: Dario Amodei and Jared Kaplan left OpenAI to found Anthropic, and Ilya Sutskever later co-founded Safe Superintelligence Inc. (SSI).

The paper&apos;s central claim is straightforward: scale a language model up to 175 billion parameters, and it can complete a wide range of tasks without updating any weights — using just a handful of examples — sometimes approaching the performance of models that were specifically fine-tuned.

This is not task-level fine-tuning. It is the ability to adapt to tasks at inference time with fixed parameters, purely through context. The paper calls this **in-context learning**.

## 1. The Problem

The &quot;pre-train + fine-tune&quot; paradigm established by [BERT](/posts/bert/) was already mainstream by 2020. It worked well, but the paper identified three fundamental issues.

First, every new task still requires a labeled dataset. Labeled data is expensive to collect, and many real-world tasks have no corresponding labeled set at all.

Second, a fine-tuned model&apos;s performance on test benchmarks does not necessarily reflect genuine generalization. The model may have simply learned spurious correlations in the training data — scoring high on the benchmark but collapsing under distribution shift.

Third, humans do not learn this way. A human can see one or two examples, hear a natural language instruction, and handle a new task. The NLP systems of that era required thousands of labeled samples to fine-tune for each new task.

The paper&apos;s starting point: if a model is large enough, can the knowledge it accumulates during pre-training allow it to directly &quot;read&quot; a task description and a few examples, then produce the answer?

## 2. The Core Idea: No Parameter Updates, Just Prompts

GPT-3&apos;s evaluation methodology differed from every large model before it. It defined three settings, none of which involve gradient updates:

**Few-Shot**: give the model a task description plus 10 to 100 examples (the exact number depends on how many fit in the context window), then have it complete a new input. No weight updates, no backpropagation.

**One-Shot**: give just one example. This most closely mirrors how humans learn a new task — someone demonstrates once, and you take it from there.

**Zero-Shot**: no examples at all, just a natural language instruction. This is the hardest setting, but also the most practical — if the model truly &quot;understands&quot; the task itself, it should not need any examples.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ZeroShot:
    instruction: str
    prompt: str


@dataclass
class OneShot:
    instruction: str
    example: tuple[str, str]
    prompt: str


@dataclass
class FewShot:
    instruction: str
    examples: list[tuple[str, str]]
    prompt: str


EvalSetting = Union[ZeroShot, OneShot, FewShot]


def build_prompt(setting: EvalSetting) -&gt; str:
    if isinstance(setting, ZeroShot):
        return f&quot;{setting.instruction}\n{setting.prompt}&quot;

    if isinstance(setting, OneShot):
        example_input, example_output = setting.example
        return f&quot;{setting.instruction}\n{example_input} {example_output}\n{setting.prompt}&quot;

    lines = [setting.instruction]
    lines.extend(f&quot;{example_input} {example_output}&quot; for example_input, example_output in setting.examples)
    lines.append(setting.prompt)
    return &quot;\n&quot;.join(lines)
```

The paper calls this capability **in-context learning**: during pre-training, the model implicitly learns patterns for a wide variety of tasks from massive amounts of text; at inference time, examples are concatenated into the context, and the model &quot;recognizes&quot; the current task during the forward pass and completes it. The paper describes this process using the language of &quot;meta-learning&quot; — pre-training is the outer loop, in-context learning is the inner loop.

The distinction from fine-tuning is fundamental. Fine-tuning modifies model parameters to fit a task. In-context learning modifies nothing — the same model, the same weights, switching tasks purely by varying the input text.

## 3. Model Architecture and Scale

GPT-3&apos;s architecture is not a new invention. Like GPT-2, it is just the decoder portion of the [Transformer](/posts/attention-is-all-you-need/), stacked layer by layer. The only modification: alternating between dense attention and local banded sparse attention (from Sparse Transformer) within the Transformer layers.

What is genuinely different is the scale. The paper trained 8 models of varying sizes, spanning three orders of magnitude in parameter count:

| Model | Parameters | Layers | Hidden Size | Attention Heads |
|-------|-----------|--------|-------------|-----------------|
| GPT-3 Small | 125M | 12 | 768 | 12 |
| GPT-3 Medium | 350M | 24 | 1024 | 16 |
| GPT-3 Large | 760M | 24 | 1536 | 16 |
| GPT-3 XL | 1.3B | 24 | 2048 | 24 |
| GPT-3 2.7B | 2.7B | 32 | 2560 | 32 |
| GPT-3 6.7B | 6.7B | 32 | 4096 | 32 |
| GPT-3 13B | 13B | 40 | 5140 | 40 |
| **GPT-3 175B** | **175B** | **96** | **12288** | **96** |

175 billion parameters, 96 layers, 96 attention heads, hidden dimension of 12288. Context window of 2048 tokens. This scale was unprecedented at the time — over 100 times larger than GPT-2&apos;s 1.5 billion parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GPT3Config:
    n_params: int
    n_layers: int
    d_model: int
    n_heads: int
    d_head: int
    d_ff: int
    n_ctx: int


def gpt3_175b() -&gt; GPT3Config:
    return GPT3Config(
        n_params=175_000_000_000,
        n_layers=96,
        d_model=12_288,
        n_heads=96,
        d_head=128,
        d_ff=49_152,
        n_ctx=2_048,
    )
```

The purpose of training these models was explicit: to validate scaling laws. Earlier work by Kaplan et al. (Kaplan is a co-author of this paper) had already shown a smooth power-law relationship between language model loss and parameter count. GPT-3 pushed that hypothesis to 175 billion parameters to see whether in-context learning ability follows the same pattern.

The answer is yes: the larger the model, the steeper the improvement in few-shot learning. Zero-shot performance rises steadily with scale, and few-shot performance rises even faster. This means larger models are not just &quot;more accurate&quot; — they are also more efficient at leveraging contextual information.

## 4. Training Data

GPT-3 was trained on approximately 300 billion tokens, drawn from five sources:

| Dataset | Tokens | Training Mix |
|---------|--------|--------------|
| Common Crawl (filtered) | 410B | ~60% |
| WebText2 | 19B | ~22% |
| Books1 | 12B | ~8% |
| Books2 | 55B | ~8% |
| English Wikipedia | 3B | ~3% |

Note a key detail: the sampling proportions are not proportional to dataset size. Higher-quality datasets (WebText2, Books, Wikipedia) were oversampled — WebText2 was seen 2.9 times during training, Wikipedia 3.4 times, while Common Crawl was not even seen once in full (0.44 epochs). The paper deliberately traded a small amount of overfitting for higher-quality training signal.
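The epoch counts follow directly from the sampling weights: tokens drawn from a source equal its mix fraction times the 300B-token training budget, and dividing by the source&apos;s size gives how many times it was traversed. A quick sanity check (the mix percentages in the table are rounded, so these land close to, not exactly on, the reported figures):

```python
TOTAL_TRAINING_TOKENS = 300e9

def epochs_seen(dataset_tokens, mix_fraction, total=TOTAL_TRAINING_TOKENS):
    # tokens sampled from this source / tokens the source contains
    return mix_fraction * total / dataset_tokens

common_crawl = epochs_seen(410e9, 0.60)  # about 0.44 epochs: undersampled
wikipedia = epochs_seen(3e9, 0.03)       # about 3 epochs: oversampled
```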

The raw Common Crawl data was 45TB. It went through three processing steps: (1) filtering based on similarity to high-quality reference corpora; (2) document-level fuzzy deduplication; (3) mixing in known high-quality datasets for diversity. After filtering, 570GB remained — roughly 410 billion tokens.

All models were trained on V100 GPUs using a high-bandwidth cluster provided by Microsoft.

## 5. Experimental Results

The paper evaluated across more than twenty datasets, covering 9 major task categories. Here are several key results.

**Language Modeling**: on the Penn Treebank, GPT-3&apos;s zero-shot perplexity (a measure of how &quot;surprised&quot; the model is by text — lower is better) reached 20.5, setting a new record. On LAMBADA (which requires predicting the final word based on long-range context), zero-shot accuracy was 76.2%, few-shot 86.4%, substantially surpassing the previous best.

**Translation**: GPT-3 was never specifically trained for translation, yet on French-to-English, few-shot BLEU score reached 32.6, exceeding the best unsupervised neural machine translation result. However, English-to-French (25.2 BLEU) still lagged significantly behind fine-tuned models. An interesting finding: GPT-3 is noticeably better at translating into English than out of it, directly reflecting the English-heavy composition of its training data.

**Closed-Book QA**: on TriviaQA, few-shot accuracy (exact match) was 71.2%, surpassing fine-tuned models under the same closed-book setting. The model references no documents — it answers purely from knowledge stored in its parameters.

**SuperGLUE**: on this comprehensive benchmark, GPT-3&apos;s few-shot performance approached some strong fine-tuned baselines, but still trailed the strongest dedicated fine-tuned systems of the time.

**Synthetic Tasks**: the paper also designed novel tasks specifically to test in-context learning. For example, giving the model a few examples of &quot;made-up words&quot; (defining a nonexistent word and then using it in a sentence), GPT-3 could correctly learn and use the new word. Three-digit addition was nearly 100% accurate in few-shot (two-digit was also near-perfect), but accuracy dropped sharply at four and five digits.

```python
from typing import Callable, Protocol


class AutoregressiveModel(Protocol):
    def forward(self, tokens: list[int]) -&gt; list[list[float]]:
        ...


def in_context_learning(
    model: AutoregressiveModel,
    examples: list[tuple[str, str]],
    query: str,
    tokenize: Callable[[str], list[int]],
    decode: Callable[[list[int]], str],
    sample_from: Callable[[list[float]], int],
    eos_token: int,
) -&gt; str:
    prompt_lines = [f&quot;{example_input} {example_output}&quot; for example_input, example_output in examples]
    prompt_lines.append(query)
    prompt = &quot;\n&quot;.join(prompt_lines)

    context = tokenize(prompt)
    output_tokens: list[int] = []

    while True:
        logits = model.forward(context)
        next_token = sample_from(logits[-1])
        if next_token == eos_token:
            break
        output_tokens.append(next_token)
        context.append(next_token)

    return decode(output_tokens)
```

## 6. Data Contamination

The paper devotes substantial space in Section 4 to a thorny issue: overlap between training data and test data.

GPT-3&apos;s training data includes vast amounts of internet text, and many test benchmarks are publicly available on the internet. This means the model may have &quot;seen&quot; the test questions during training. The team attempted to remove these overlaps before training, but due to a bug in the processing pipeline, some overlaps were not fully cleaned. Retraining from scratch was too expensive to be practical.

Their approach: for each benchmark, construct a &quot;clean subset&quot; (removing all samples with 13-gram overlaps against the training data), then compare model performance on the full set versus the clean subset. The conclusion: for most benchmarks, contamination had minimal impact on results. However, PIQA and Winograd showed suspicious performance drops, and the paper flagged those results with asterisks.
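The clean-subset construction reduces to a set-intersection test. A simplified sketch of 13-gram overlap detection, with integer token ids standing in for words (the paper&apos;s actual filtering pipeline is more elaborate):

```python
def ngrams(tokens, n=13):
    # All contiguous n-token spans in a document, as a set
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(test_tokens, train_ngrams, n=13):
    # A test example is dirty if it shares any 13-gram with the training data
    return not ngrams(test_tokens, n).isdisjoint(train_ngrams)

train_ngrams = ngrams(list(range(1000)))                        # stand-in training corpus
dirty = is_contaminated(list(range(500, 520)), train_ngrams)    # True: spans overlap
clean = is_contaminated(list(range(2000, 2020)), train_ngrams)  # False: no overlap
```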

This level of honesty was quite rare at the time. Most papers avoid discussing data contamination entirely. GPT-3 not only proactively investigated the issue but also developed systematic detection tools. That itself is a contribution to subsequent research.

## 7. Limitations

The paper&apos;s discussion of its own limitations in Section 5 is remarkably candid.

**Text Coherence**: GPT-3 still exhibits semantic repetition, self-contradiction, and even nonsensical sentences at the document level. Generation quality is much better than GPT-2, but long-form coherence remains insufficient.

**Commonsense Physics**: GPT-3 performs poorly on commonsense physics questions like &quot;If you put cheese in a refrigerator, will it melt?&quot; It can handle linguistic reasoning, but its understanding of the physical world remains superficial.

**The Cost of Unidirectionality**: as an autoregressive model, GPT-3 can only look left-to-right. The paper acknowledges that on tasks requiring bidirectional context (such as determining whether the same word in two sentences carries the same meaning), GPT-3&apos;s few-shot performance falls short of fine-tuned bidirectional models. This indicates that such tasks are not GPT-3&apos;s strength under its autoregressive setup; the unidirectional modeling objective introduces a structural bias.

**Sample Efficiency**: GPT-3 saw approximately 300 billion tokens during pre-training, far exceeding the amount of text a human encounters in a lifetime. The paper explicitly notes that even though few-shot learning is efficient at inference time, the data requirements for pre-training remain enormous.

**Inference Cost**: a 175-billion-parameter model is expensive to run and difficult to deploy. The paper mentions distillation (using a large model&apos;s outputs to train a smaller model) as a possible direction, but notes it has not yet been attempted at the hundred-billion-parameter scale.

## 8. Societal Impact

The paper dedicates an entire section (Section 6) to societal impact, covering three areas.

**Misuse Risks**: human evaluators could identify GPT-3-generated news articles at only about chance level (~52% accuracy). The stronger the model, the harder its fabricated text is to detect. The team reported that they were monitoring forums and chat groups to track trends in malicious use.

**Bias**: the paper ran extensive experiments testing GPT-3&apos;s biases across gender, race, and religion. For example, in occupation-gender association tests, GPT-3 was more likely to associate &quot;nurse&quot; with female and &quot;banker&quot; with male. In religion-sentiment associations, &quot;Islam&quot; co-occurred more frequently with violence-related words. The paper acknowledges these biases originate from the training data but offers no solution.

**Energy Consumption**: training GPT-3 requires massive compute, and the paper cites estimates but does not disclose specific energy figures. However, it points out that once trained, the model can be applied to many different tasks, making it more energy-efficient than training a separate model for each task.

## 9. My Takeaways

After reading this paper, a few things stand out.

First, GPT-3 demonstrated something important: scale can push in-context learning past the usability threshold. A 175-billion-parameter model is not simply &quot;a bigger GPT-2&quot; — its in-context learning performance exceeds smaller models by an order of magnitude. The model completes new tasks with no parameter updates, relying solely on a few examples in the context. This capability was not explicitly hand-designed; it emerged gradually as scale increased, and only at GPT-3&apos;s scale did it become clear and practical enough to matter. BERT proved the value of pre-training. GPT-3 proved the value of scale.

Second, the paper&apos;s writing approach is worth noting. 31 authors, 75 pages, and a massive battery of experiments to answer a simple question: are larger models better at leveraging a few examples? They did not shy away from limitations — text coherence, commonsense reasoning, data contamination, bias — all discussed head-on. That level of rigor has, ironically, become increasingly rare in later large model papers.

Third, this paper&apos;s author list reads like a history of the AI industry&apos;s fracturing. Dario Amodei and Jared Kaplan later founded Anthropic (the company behind Claude), and Ilya Sutskever left OpenAI to co-found SSI. In 2020, these people were still on the same team co-authoring a paper; within two years, they had diverged in different directions. The paper&apos;s discussion of societal impact and safety risks may well have been a foreshadowing of those later disagreements.

Fourth, from a technical evolution standpoint, GPT-3 marks the turning point from &quot;pre-train + fine-tune&quot; to &quot;pre-train + prompt.&quot; BERT&apos;s approach was: learn general knowledge first, then fine-tune parameters for each task. GPT-3 said: if the model is large enough, the fine-tuning step can be skipped — just tell the model in natural language what you want it to do. This idea later evolved into the core interaction paradigm of products like ChatGPT and Claude: the user asks a question in natural language, and the model answers directly.

From Seq2Seq&apos;s encode-decode, to [Bahdanau attention](/posts/neural-machine-translation-by-jointly-learning-to-align-and-translate/)&apos;s &quot;where to look,&quot; to the [Transformer](/posts/attention-is-all-you-need/)&apos;s &quot;look everywhere at once,&quot; to [BERT](/posts/bert/)&apos;s &quot;learn first, then fine-tune,&quot; to GPT-3&apos;s &quot;scale up until fine-tuning is unnecessary&quot; — each step reduced the need for human intervention and increased the model&apos;s ability to handle tasks on its own.

GPT-3 is not the endpoint. But it was the first time people seriously considered a question: if we keep making models bigger, what else will emerge?

The answer to that question is everything that came after.

---

**Paper Reading Series**

- [*Sequence to Sequence Learning with Neural Networks*](/posts/sequence-to-sequence-learning-with-neural-networks/) — Establishing the encoder-decoder paradigm
- [*Neural Machine Translation by Jointly Learning to Align and Translate*](/posts/neural-machine-translation-by-jointly-learning-to-align-and-translate/) — The origin of attention
- [*Attention Is All You Need*](/posts/attention-is-all-you-need/) — Attention takes center stage: the birth of the Transformer
- [*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](/posts/bert/) — Establishing the pre-training paradigm
- [*Scaling Laws for Neural Language Models*](/posts/scaling-laws-for-neural-language-models/) — The mathematics of scale
- [*Training Compute-Optimal Large Language Models*](/posts/training-compute-optimal-large-language-models/) — How to spend your compute budget wisely</content:encoded><category>Paper Reading</category><category>paper-reading</category><category>gpt-3</category><category>AI</category><category>LLM</category><category>python</category></item><item><title>OpenClaw Deep Dive: Ecosystem Analysis 🦞</title><link>https://justinhuangai.github.io/posts/openclaw-ecosystem/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/openclaw-ecosystem/</guid><description>From a single open-source project to a full AI assistant ecosystem</description><pubDate>Tue, 03 Feb 2026 08:09:32 GMT</pubDate><content:encoded>![OpenClaw](/images/openclaw-logo-text-dark.webp)

## 0. A Few Terms First

If this is your first time reading an ecosystem analysis like this, 5 quick terms will help:

- `ecosystem`: not just one repo, but a set of products and tools growing around the same core capability
- `runtime`: the long-running core process that actually runs the Agent and orchestrates work
- `skill marketplace`: the place where new Agent capabilities are discovered and installed
- `workflow engine`: a system that packages repeated multi-step tasks into reusable flows
- `flywheel`: a growth loop where better supply attracts more usage, which in turn attracts more supply

## 1. Start With the Person

To understand a project, start with the person behind it.

Peter Steinberger, Austrian. Founded PSPDFKit in 2011, building low-level PDF tech. Clients included Apple and Dropbox, serving over 1 billion devices. The outcome? Public reports put it at &quot;nine-figure territory.&quot; Then he stepped back from the front lines.

Burnout followed. His own words: &quot;staring at the screen, unable to write code.&quot; He bought a one-way ticket to Madrid and completely disconnected for a while.

By mid-2025, he wrote on his blog: the spark was back. AI had moved past the demo stage -- it could produce real product prototypes now. He used about an hour of prompting to generate the project&apos;s skeleton, published it in November, and named it Clawdbot -- a pun on Anthropic&apos;s Claude, with a lobster as the mascot.

In late January 2026, Anthropic sent a trademark warning -- the name was too close to Claude. Within three days the project went from Clawdbot to Moltbot (as in molting) to OpenClaw. The renaming itself blew up -- 34,000 new stars in 48 hours.

A person who has produced nine-figure results chose to pour his energy into an MIT-licensed open-source project. Whatever the motivation, that choice alone deserves a serious look.

## 2. Not Just a Project -- It&apos;s Growing an Ecosystem

For most viral open-source projects, &quot;ecosystem&quot; basically means: a roadmap in the docs plus a few placeholder repos.

OpenClaw is different. It has grown real product layering:

| Component | What it does | Stars |
|-----------|-------------|-------|
| **OpenClaw** | Core Agent runtime -- the brain | 140k+ |
| **ClawHub** | Skill marketplace -- App Store for Agents | 5.4k |
| **Lobster** | Workflow engine -- packages repetitive tasks into one-click pipelines | ~800 |
| **acpx** | Headless CLI tool | ~780 |
| **openclaw-ansible** | Automated deployment -- one command to set everything up | ~490 |
| **nix-openclaw** | Nix declarative config | ~530 |

Runtime, skill marketplace, workflow engine, deployment tools -- each does its own thing, with clean responsibility boundaries. This isn&apos;t the kind of crude expansion where everything gets crammed into one repo. It&apos;s layering by design.

## 3. A Few Key Judgments

### Channel Coverage: Not Showing Off -- It&apos;s &quot;You Don&apos;t Have to Do Anything&quot;

WhatsApp, Telegram, Slack, Discord, Signal, iMessage, Feishu, LINE, Matrix... over twenty platforms with direct integration.

What does that mean? You don&apos;t need to download a new app to use an AI assistant. You don&apos;t need to learn a new interface. You don&apos;t need to change any habits. It just shows up in the chat you&apos;re already using. Send a message on WhatsApp and you&apos;re good -- same as texting a friend.

More channels means closer to real usage scenarios, and a lower barrier for first-time use. This kind of &quot;low-friction access&quot; doesn&apos;t just mean convenience -- it&apos;s a crushing adoption advantage. Sure, there are protocol adaptation and maintenance costs under the hood, but that&apos;s the price, not the point. The point is it turns the AI assistant from &quot;a new tool you have to go open&quot; into &quot;a capability that&apos;s already in your chat.&quot;

### ClawHub: Right Direction, Still Too Early

The skill marketplace uses vector search for semantic matching -- you don&apos;t have to browse category directories, just tell it &quot;I want a skill that can send emails&quot; and it finds one for you. Great design intent.

But whether the skill marketplace flywheel actually spins depends on two things: enough quality skills, and good enough discovery. The latter is already there. The former? Third-party skill publishing frequency, install volume, update activity, review throughput -- none of this has a public dashboard yet. The flywheel has started turning, but it&apos;s far from self-sustaining.

### Lobster: Solving a Hidden Pain Point

What&apos;s the biggest hidden cost of AI Agents? Redundant planning.

Every time you ask an Agent to do a multi-step task, it rethinks the whole thing from scratch: what to do first, what to do next, how to do it. That&apos;s how tokens get burned. Lobster&apos;s approach: package high-frequency operations into one-click pipelines -- build once, call repeatedly, no replanning needed.

It also has an approval gate: critical steps pause and wait for your go-ahead, preventing the Agent from running wild. This shows the team has at least thought seriously about how much autonomy an Agent should have.

### Security: The Unavoidable Hurdle

Kaspersky&apos;s audit found 512 vulnerabilities, 8 of them critical (the audit was done when the project was still called Clawdbot). Cisco&apos;s security team straight-up called it a &quot;security nightmare.&quot; Gary Marcus publicly called it &quot;a disaster waiting to happen.&quot;

The problem is structural: the more permissions you give an Agent, the more it can do, but the larger the attack surface. Prompt injection can trick it into doing bad things. Skills have already been caught exfiltrating data to external servers.

OpenClaw isn&apos;t unaware -- pairing codes, allowlists, sandboxing, command approval, layer after layer tightening up. But the structural tension of &quot;high-privilege Agent + third-party skills + twenty-plus entry points&quot; isn&apos;t something you fix with a few bug patches. This is a long war.

## 4. Risks -- Can&apos;t Not Talk About Them

Done with the good stuff. Now the bad.

**Security debt is not technical debt -- it&apos;s trust debt.** One high-profile security incident hurts not just OpenClaw, but every project on the self-hosted AI Agent path. The entire narrative gets dragged underwater.

**The business model is a question mark.** MIT license, no subscription, users bring their own API keys. Peter has publicly mentioned the project&apos;s monthly server costs are in the ten-to-twenty-thousand-dollar range. An open-source project running on sponsorships -- sustainability depends on whether hype can turn into money. No clear path yet.

**Growth quality is questionable.** The January 30 explosion was tightly linked to the viral rise of Moltbook (an AI Agent social network). Of the stars brought by viral spread, how many will become real ecosystem contributors? Stars ≠ code contributions ≠ ecosystem depth.

**Single-point dependency.** 18,000+ commits, but core roadmap and product judgment still heavily depend on one person -- the founder. Maintainers have been added from the community, but whether it can go from &quot;one person&apos;s project&quot; to &quot;a community&apos;s platform&quot; is the real watershed ahead.

## 5. Conclusion

What&apos;s truly scarce about OpenClaw isn&apos;t the hype -- anyone can have hype for a while. What&apos;s scarce is that it has already grown from a single project into the beginnings of an ecosystem: with layering, division of labor, and real product form.

But hype will eventually fade.

After the tide goes out, four things will determine its fate:

1. **Has security tightened up?** Have pairing, allowlists, sandboxing, and approval gates moved from &quot;optional&quot; to &quot;default&quot;?
2. **Is the skill ecosystem spinning?** Forget star counts -- are high-quality skills being published, installed, and reviewed in a positive feedback loop?
3. **Is the non-founder contribution share growing?** This determines whether it&apos;s a &quot;star project&quot; or a &quot;sustainable platform.&quot;
4. **Is the governance structure clear?** This determines whether it can evolve from a hype-driven project into long-term infrastructure.

I&apos;ll keep tracking this.</content:encoded><category>OpenClaw</category><category>AI</category><category>open-source</category><category>openclaw</category></item><item><title>Paper Reading: BERT — Pre-training of Deep Bidirectional Transformers for Language Understanding</title><link>https://justinhuangai.github.io/posts/bert/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/bert/</guid><description>Establishing the pre-training paradigm, with real Python code examples</description><pubDate>Sat, 31 Jan 2026 08:52:21 GMT</pubDate><content:encoded>On October 11, 2018, the Google AI Language team uploaded a paper to arXiv (a preprint server where researchers can publish papers without waiting for journal peer review): [*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](/papers/1810.04805v2.pdf).

The authors are Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, all from Google. Devlin had previously worked at Microsoft Research before joining Google, where he led the design and implementation of BERT.

BERT stands for Bidirectional Encoder Representations from Transformers. It did something remarkably bold for its time: first do general-purpose pre-training on massive amounts of unlabeled text, then add just one output layer and fine-tune on a specific task to achieve state-of-the-art results.

This &quot;pre-train, then fine-tune&quot; paradigm later became the standard approach across all of NLP. The GPT series followed a similar idea but took a different path — unidirectional generation. BERT chose bidirectional understanding. The two paths each spawned vast families of models.

## 1. The Problem

In 2018, NLP had an awkward status quo: every task required its own specially designed model architecture. Question answering needed one model, sentiment analysis needed another, named entity recognition yet another. Labeled data for each task was scarce, and models trained on one task were hard to transfer to others.

There had already been attempts at pre-training. ELMo used bidirectional LSTMs to learn contextual representations, but it merely &quot;bolted&quot; pre-trained features onto task-specific architectures — the architecture itself was still task-specific. OpenAI GPT used the Transformer for pre-training and fine-tuning, but it could only look left-to-right (unidirectional) — each token could only attend to tokens before it, never after.

The paper argued that unidirectional language models have significant limitations on language understanding tasks that require deep bidirectional context. For example:

&gt; &quot;He picked up the _____ and started playing.&quot;

Looking only at the left context (&quot;He picked up the&quot;), the answer could be anything. But seeing the right context (&quot;and started playing&quot;), you immediately know it is some kind of musical instrument. For many language understanding tasks, bidirectional context is naturally more advantageous.

## 2. The Core Idea: Mask Some Words, Make the Model Guess

BERT&apos;s solution is intuitive: since a bidirectional language model cannot be trained the traditional way (each word would indirectly &quot;see itself&quot;), change the training objective.

**Masked Language Model (MLM)**: randomly mask 15% of the input tokens — specifically, replace them with a special \[MASK\] token — then have the model predict the masked words from context. This idea comes from the Cloze task in psychology (proposed by Taylor in 1953), just like the fill-in-the-blank exercise above.

After masking, the model must use both left and right context to make predictions, and bidirectional understanding emerges naturally.

But replacing all selected tokens with \[MASK\] introduces a problem: \[MASK\] never appears during fine-tuning, creating a mismatch between pre-training and fine-tuning. The paper&apos;s solution: of the selected 15% of tokens, 80% are replaced with \[MASK\], 10% are replaced with a random token, and 10% are left unchanged. This way, the model cannot simply rely on &quot;I see \[MASK\] so I need to predict&quot; — it must maintain understanding at every position.

```python
import random
from typing import Optional, Sequence


def mask_tokens(
    tokens: Sequence[str],
    mask_prob: float = 0.15,
    vocab: Optional[Sequence[str]] = None,
) -&gt; tuple[list[str], list[int], list[str]]:
    &quot;&quot;&quot;Select roughly mask_prob of the tokens; of those, 80% become
    [MASK], 10% become a random token, and 10% are left unchanged.&quot;&quot;&quot;
    if vocab is None:
        vocab = tokens

    masked = list(tokens)
    positions: list[int] = []
    labels: list[str] = []

    for i, token in enumerate(tokens):
        if random.random() &lt; mask_prob:
            positions.append(i)
            labels.append(token)  # the original token is the prediction target

            r = random.random()
            if r &lt; 0.8:
                masked[i] = &quot;[MASK]&quot;  # 80% of selected tokens
            elif r &lt; 0.9:
                masked[i] = random.choice(vocab)  # 10%: random replacement
            # remaining 10%: token left unchanged

    return masked, positions, labels
```

## 3. The Second Pre-training Task: Next Sentence Prediction

Many NLP tasks (such as question answering and natural language inference) require understanding the relationship between two sentences, but language models do not directly model such relationships.

The paper added a second pre-training task: **Next Sentence Prediction (NSP)**. The model is given two sentences A and B — 50% of the time B is the actual next sentence after A, and 50% of the time B is randomly drawn from the corpus. The model must judge whether B actually follows A.

The task design is simple, but the paper&apos;s ablation study (removing one component at a time to observe the effect) showed that removing NSP noticeably hurt performance on question answering and natural language inference tasks; however, later work (such as RoBERTa) reached different conclusions about the necessity of NSP.

```python
from dataclasses import dataclass


@dataclass
class PretrainingExample:
    tokens: list[str]
    segment_ids: list[int]
    masked_positions: list[int]
    masked_labels: list[str]
    is_next: bool
```

## 4. Model Architecture

BERT&apos;s architecture is not a new invention. It is simply the encoder portion of the [Transformer](/posts/attention-is-all-you-need/), stacked layer by layer.

The paper specifies two sizes:

- **BERT_BASE**: 12 layers, hidden size 768, 12 attention heads, 110M parameters
- **BERT_LARGE**: 24 layers, hidden size 1024, 16 attention heads, 340M parameters

BERT_BASE has roughly the same parameter count as OpenAI GPT, enabling direct comparison. The most critical difference between the two is just one thing: GPT uses unidirectional attention (each token can only see tokens to its left), while BERT uses bidirectional attention (each token can see all positions).

The input representation is the sum of three components:

- **Token Embedding**: WordPiece tokenization, 30,000 vocabulary
- **Segment Embedding**: marks whether a token belongs to sentence A or sentence B
- **Position Embedding**: tells the model the position of each token (BERT uses learned position embeddings, not sinusoidal)

Every input sequence begins with a special \[CLS\] token, whose final-layer hidden state is used for sentence-level classification (e.g., NSP, sentiment analysis). Two sentences are separated by \[SEP\].

```python
import torch
from torch import nn


class BertEmbeddings(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        hidden_size: int,
        max_positions: int,
        type_vocab_size: int = 2,
        dropout: float = 0.1,
    ) -&gt; None:
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, hidden_size)
        self.segment_embedding = nn.Embedding(type_vocab_size, hidden_size)
        self.position_embedding = nn.Embedding(max_positions, hidden_size)
        self.layer_norm = nn.LayerNorm(hidden_size, eps=1e-12)
        self.dropout = nn.Dropout(dropout)

    def forward(
        self,
        token_ids: torch.Tensor,
        segment_ids: torch.Tensor,
    ) -&gt; torch.Tensor:
        position_ids = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        position_ids = position_ids.expand_as(token_ids)
        embeddings = (
            self.token_embedding(token_ids)
            + self.segment_embedding(segment_ids)
            + self.position_embedding(position_ids)
        )
        embeddings = self.layer_norm(embeddings)
        return self.dropout(embeddings)


class BertModel(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        hidden_size: int = 768,
        max_positions: int = 512,
        num_layers: int = 12,
        num_heads: int = 12,
        dropout: float = 0.1,
    ) -&gt; None:
        super().__init__()
        self.embeddings = BertEmbeddings(vocab_size, hidden_size, max_positions, dropout=dropout)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size,
            nhead=num_heads,
            dim_feedforward=4 * hidden_size,
            dropout=dropout,
            activation=&quot;gelu&quot;,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(
        self,
        token_ids: torch.Tensor,
        segment_ids: torch.Tensor,
    ) -&gt; torch.Tensor:
        hidden = self.embeddings(token_ids, segment_ids)
        return self.encoder(hidden)
```

## 5. Fine-tuning: One Model for Every Task

The most elegant aspect of BERT is the simplicity of fine-tuning. Once pre-training is complete, the procedure is nearly identical regardless of the downstream task: add one task-specific output layer on top of BERT, then fine-tune all parameters with a small amount of labeled data.

- **Text classification** (sentiment analysis, natural language inference): take the output vector at the \[CLS\] position and feed it to a linear classifier
- **Question answering** (given a passage, find the answer span): apply two linear transformations to each token&apos;s output vector, predicting the start and end positions of the answer
- **Sequence labeling** (named entity recognition): attach a classifier to each token&apos;s output vector, predicting labels token by token

Pre-training may take days, but fine-tuning typically takes only minutes to hours (most tasks under 1 hour on a single TPU). This efficiency gap is the core appeal of the &quot;pre-train + fine-tune&quot; paradigm.

## 6. Experimental Results

The paper ran experiments on 11 NLP tasks, setting new records on all of them.

**GLUE benchmark** (General Language Understanding Evaluation, 8 sub-tasks):
- BERT_LARGE averaged 80.5%, a 7.7 percentage point improvement over the previous best (OpenAI GPT)
- On MNLI, the largest sub-task, a 4.6% absolute accuracy improvement

**SQuAD v1.1** (reading comprehension QA, Test F1):
- BERT_LARGE single model + TriviaQA data: F1 91.8, surpassing human performance (91.2)
- BERT_LARGE ensemble + TriviaQA data: F1 93.2

**SQuAD v2.0** (includes unanswerable questions):
- F1 of 83.1, a 5.1 point improvement over the previous best system

**SWAG** (commonsense reasoning):
- Accuracy of 86.3%, an 8.3 point improvement over OpenAI GPT

The paper also ran ablation experiments on model size and found an important conclusion: larger models performed better on all tasks, even on tasks with very little labeled data (as few as 3,600 examples). This ran counter to the prevailing intuition that small datasets would cause large models to overfit (memorize training data and perform poorly on new data), suggesting that pre-trained knowledge effectively mitigates this risk.

## 7. Training Details

**Pre-training data**: BooksCorpus (800M words) + English Wikipedia (2,500M words), using only text passages and discarding lists, tables, and headers. The paper emphasized the importance of using document-level corpora rather than shuffled sentence-level corpora, in order to capture long-range contextual relationships.

**Tokenization**: WordPiece with a vocabulary of 30,000. WordPiece splits uncommon words into smaller subword units — for example, &quot;playing&quot; might be split into &quot;play&quot; + &quot;##ing&quot;.

**Optimizer**: Adam, learning rate 1e-4, with linear warmup over the first 10,000 steps followed by linear decay. Batch size of 256 sequences, maximum sequence length of 512.

**Hardware**: BERT_BASE was trained on 4 Cloud TPUs (16 TPU chips) for 4 days. BERT_LARGE was trained on 16 Cloud TPUs (64 TPU chips) for 4 days.

**Dropout**: 0.1 across all layers. The activation function is GELU (Gaussian Error Linear Unit), rather than the original Transformer&apos;s ReLU.

## 8. My Takeaways

After reading this paper, a few things stand out.

First, BERT&apos;s real contribution is not the model architecture (it is just the Transformer encoder) but the training method. The masked language model idea looks simple, but it elegantly solves a fundamental contradiction: how to leverage bidirectional context without letting the model &quot;cheat.&quot; The 80/10/10 masking strategy is even more carefully designed, addressing the mismatch between pre-training and fine-tuning.

Second, the divergence between BERT and GPT is already clear in this paper. GPT&apos;s autoregressive objective is more naturally suited to generation; BERT&apos;s bidirectional encoding is better suited to discriminative language understanding tasks. GPT later scaled up toward stronger generation capabilities, while BERT spawned a family of understanding-oriented models including RoBERTa, ALBERT, and DeBERTa. Both lines continue to serve their respective domains.

Third, the impact of the &quot;pre-train + fine-tune&quot; paradigm extends far beyond NLP. Computer vision later made a wholesale shift toward the same approach (ViT, MAE), and even multimodal models (CLIP, GPT-4V) build on large-scale pre-training with fine-tuning or prompting. BERT was not the first to do pre-training, but it was the first to push pre-training, in such a concise way, from a useful trick into the mainstream working paradigm of NLP.

Fourth, when rewriting BERT&apos;s input processing in real Python, you can feel how clean the design is. \[CLS\] + sentence A + \[SEP\] + sentence B + \[SEP\], with three embeddings summed together — the entire pipeline can handle classification, question answering, and sequence labeling with a single unified codebase. This &quot;one model for every task&quot; simplicity is where its real power lies.

There is one word in this paper&apos;s title that matters most: Pre-training. Before BERT, every NLP task was learning from scratch. BERT proved something: general knowledge about language can be learned first, then transferred to virtually any task.

That idea changed how an entire field works.

---

**Paper Reading Series**

- [*Sequence to Sequence Learning with Neural Networks*](/posts/sequence-to-sequence-learning-with-neural-networks/) — Establishing the encoder-decoder paradigm
- [*Neural Machine Translation by Jointly Learning to Align and Translate*](/posts/neural-machine-translation-by-jointly-learning-to-align-and-translate/) — The origin of attention
- [*Attention Is All You Need*](/posts/attention-is-all-you-need/) — Attention takes center stage: the birth of the Transformer
- [*Scaling Laws for Neural Language Models*](/posts/scaling-laws-for-neural-language-models/) — The mathematics of scale
- [*Language Models are Few-Shot Learners*](/posts/language-models-are-few-shot-learners/) — Larger models, better at eliciting abilities from context
- [*Training Compute-Optimal Large Language Models*](/posts/training-compute-optimal-large-language-models/) — How to spend your compute budget wisely</content:encoded><category>Paper Reading</category><category>paper-reading</category><category>bert</category><category>AI</category><category>LLM</category><category>python</category></item><item><title>Paper Reading: Sequence to Sequence Learning with Neural Networks</title><link>https://justinhuangai.github.io/posts/sequence-to-sequence-learning-with-neural-networks/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/sequence-to-sequence-learning-with-neural-networks/</guid><description>Establishing the encoder-decoder paradigm, with real Python code examples</description><pubDate>Sat, 24 Jan 2026 08:41:08 GMT</pubDate><content:encoded>On September 10, 2014, three Google researchers uploaded a paper to arXiv (a preprint server where researchers can publish papers without waiting for journal peer review): [*Sequence to Sequence Learning with Neural Networks*](/papers/1409.3215v3.pdf).

The authors are Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, all from Google. Sutskever was one of the authors of AlexNet, collaborating with Alex Krizhevsky and Geoffrey Hinton on the paper that ignited the deep learning revolution; he later became a co-founder of OpenAI. Vinyals went on to lead AlphaStar (DeepMind&apos;s StarCraft AI) at DeepMind. Quoc V. Le drove AutoML and other research at Google.

This paper did something deceptively simple: use one neural network to read a sentence and compress it into a vector, then use another neural network to generate a translation from that vector. The input and output can differ in length, language, and structure. This framework has a name: &quot;Sequence to Sequence&quot; (Seq2Seq).

It established the encoder-decoder paradigm. Later, [Bahdanau added attention on top of it](/posts/neural-machine-translation-by-jointly-learning-to-align-and-translate/), and then [Vaswani et al. rewrote the entire architecture with the Transformer](/posts/attention-is-all-you-need/). But the starting point was this paper.

## 1. The Problem

By 2014, deep neural networks had already achieved breakthroughs in tasks like image recognition, but for tasks like machine translation -- directly mapping a variable-length sequence to another variable-length sequence -- neural networks were still struggling.

An English sentence might be 5 words, and its French translation 7 words. The input and output differ in length, with no simple one-to-one correspondence.

The conventional solution was to piece together a large number of hand-designed rules and statistical features into a complex translation pipeline (Statistical Machine Translation, SMT). It worked, but each component had to be tuned separately, and end-to-end optimization was difficult.

The paper proposed a simpler idea: can a single end-to-end neural network map directly from a source language sequence to a target language sequence?

## 2. Core Architecture: Encoder-Decoder

The paper&apos;s approach can be summarized in one sentence: **one LSTM reads, another LSTM writes.**

LSTM (Long Short-Term Memory) is a special type of RNN designed to handle long-range dependencies. Standard RNNs tend to &quot;forget&quot; earlier content as sequences get longer. LSTMs mitigate this through gating mechanisms that decide which information to keep and which to discard.
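The gating idea can be sketched as a single LSTM step -- a didactic toy, not the paper&apos;s 4-layer model; the fused `weight` of shape `(input_dim + hidden, 4 * hidden)` is my own layout:

```python
import torch


def lstm_step(x, h, c, weight, bias):
    hidden = h.size(-1)
    # Compute all four gate pre-activations in one matrix multiply.
    gates = torch.cat([x, h], dim=-1) @ weight + bias
    f, i, o, g = gates.split(hidden, dim=-1)
    f, i, o = torch.sigmoid(f), torch.sigmoid(i), torch.sigmoid(o)
    g = torch.tanh(g)              # candidate new content
    c_new = f * c + i * g          # forget gate keeps old memory, input gate writes new
    h_new = o * torch.tanh(c_new)  # output gate exposes a view of the memory
    return h_new, c_new
```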

The specific workflow:

1. The **encoder** (a 4-layer deep LSTM) reads the source sentence from start to finish, compressing the entire sentence into a set of fixed-length final states, which are handed to the decoder as its starting point
2. The **decoder** (another 4-layer deep LSTM) starts from this state and generates the target language translation one word at a time, until it outputs the end-of-sentence symbol \&lt;EOS\&gt;

The probability formula from the paper:

$$
p(y_1, \ldots, y_{T&apos;} \mid x_1, \ldots, x_T) = \prod_t p(y_t \mid v, y_1, \ldots, y_{t-1})
$$

In plain language: given a source sentence x, the probability of generating target sentence y equals the product of the probability of generating each next word at every step. Each step&apos;s prediction depends on two things: the vector v compressed by the encoder, and all previously generated words.

```python
import torch
from torch import nn


class Seq2Seq(nn.Module):
    def __init__(self, vocab_size: int, hidden_size: int) -&gt; None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.LSTM(hidden_size, hidden_size, num_layers=4, batch_first=True)
        self.decoder = nn.LSTM(hidden_size, hidden_size, num_layers=4, batch_first=True)
        self.output_proj = nn.Linear(hidden_size, vocab_size)

    def encode(
        self,
        source_tokens: torch.Tensor,
    ) -&gt; tuple[torch.Tensor, tuple[torch.Tensor, torch.Tensor]]:
        embedded = self.embedding(source_tokens)
        outputs, state = self.encoder(embedded)
        return outputs, state

    def decode(
        self,
        encoder_state: tuple[torch.Tensor, torch.Tensor],
        max_steps: int,
        bos_token_id: int,
        eos_token_id: int,
    ) -&gt; list[int]:
        prev_token = torch.tensor([[bos_token_id]], dtype=torch.long, device=encoder_state[0].device)
        state = encoder_state
        generated: list[int] = []

        for _ in range(max_steps):
            embedded = self.embedding(prev_token)
            output, state = self.decoder(embedded, state)
            logits = self.output_proj(output[:, -1, :])
            # Greedy decoding: pick the most probable token at each step
            # (the paper itself uses beam search; greedy is the simplest variant).
            next_token_id = int(logits.argmax(dim=-1).item())
            if next_token_id == eos_token_id:
                break
            generated.append(next_token_id)
            prev_token = torch.tensor([[next_token_id]], dtype=torch.long, device=logits.device)

        return generated
```

The architecture itself is not complicated. The paper&apos;s contribution was not in inventing a new component, but in proving that this simple framework actually worked -- and worked well enough to compete with carefully tuned traditional systems.

## 3. Three Key Design Decisions

The paper identified three design choices with major impact on performance:

**First, use two separate LSTMs.** The encoder and decoder do not share parameters. This slightly increases the parameter count, but allows the model to better handle the distinct characteristics of source and target languages. The paper noted this also makes it possible to train on multiple language pairs simultaneously.

**Second, use deep LSTMs.** The paper used 4-layer LSTMs, with each additional layer reducing perplexity by nearly 10%. Shallow LSTMs (1-2 layers) performed significantly worse. Depth gave the model a larger representational space.

**Third, reverse the source sentence.** This was the paper&apos;s most surprising finding. Reversing the source sentence &quot;a, b, c&quot; to &quot;c, b, a&quot; before feeding it to the encoder bumped the BLEU score from 25.9 to 30.6 -- an improvement of nearly 5 points.

Why does reversal help? The paper&apos;s explanation: in normal order, the first word of the source sentence is far from the first word of the target sentence (the entire source sentence sits in between). After reversal, the first few words of the source and target sentences are temporally close, creating more &quot;short-range dependencies&quot; for the gradient (the signal the model uses to adjust its parameters), making optimization easier.

```python
import torch


def reverse_source(source_tokens: list[int]) -&gt; list[int]:
    return list(reversed(source_tokens))


source_sentence = [11, 23, 37, 42]
reversed_source = reverse_source(source_sentence)
source_tensor = torch.tensor([reversed_source], dtype=torch.long)
```

This trick is so simple it barely seems like a legitimate research contribution, but it genuinely worked, and it revealed a deeper issue: RNNs are sensitive to the distance between elements in a sequence -- the closer, the easier to learn. This problem was later solved fundamentally by the attention mechanism.

## 4. Experimental Results

The paper ran experiments on the WMT &apos;14 English-to-French translation task.

Key numbers:
- **Single reversed LSTM**, beam size 12: 30.59 BLEU
- **Ensemble of 5 reversed LSTMs**, beam size 2: 34.50 BLEU
- **Ensemble of 5 reversed LSTMs**, beam size 12: **34.81 BLEU**
- **Conventional phrase-based translation system** (Moses baseline): 33.30 BLEU

In the experimental setup reported by the paper, the ensemble of 5 LSTMs surpassed the conventional phrase-based system with 34.81 versus 33.30. Considering the LSTM had a vocabulary of only 80,000 words (outputting UNK for any out-of-vocabulary word) while the conventional system&apos;s vocabulary was virtually unlimited, this result is quite compelling.

The paper also used the LSTM to re-rank the conventional system&apos;s 1000-best candidate list, pushing the BLEU score further to 36.5, approaching the best published result at the time (37.0).

Another noteworthy finding: compared to other neural methods at the time, the LSTM showed less severe performance degradation on long sentences. This contrasted with the steep long-sentence performance drops reported by other researchers, and the paper attributed this to the source reversal strategy.

## 5. What the Model &quot;Understands&quot;

The paper also ran an interesting visualization experiment. Different sentences were fed into the encoder, the final hidden state vectors were extracted, and PCA was used to project them onto a 2D plane.

The results showed:
- Sentences with similar meaning clustered together in the vector space
- Active and passive voice sentences (&quot;I gave her a card&quot; vs &quot;I was given a card by her&quot;) landed in nearby positions
- Sentences with different word order but the same meaning were also correctly clustered

This at least suggests that the encoder&apos;s learned representations go beyond simple bag-of-words statistics (mixing words together regardless of order) and contain a substantial amount of syntactic and semantic information.

## 6. Training Details

**Model specifications**: 4-layer LSTM, 1000 units per layer, word embedding dimension of 1000, total parameter count of 384 million. Of these, 64 million are pure recurrent connection parameters.

**Hardware**: 8 GPUs. One GPU per LSTM layer, with the remaining 4 GPUs used to parallelize the softmax (the vocabulary of 80,000 words makes softmax computation expensive). Training took roughly 10 days.

**Optimizer**: SGD without momentum, initial learning rate of 0.7. After 5 epochs, the learning rate was halved every half epoch, for a total of 7.5 epochs.

**Gradient clipping**: when the L2 norm of the gradient exceeded a threshold of 5, it was scaled down proportionally. This prevents gradient explosion (gradient values suddenly becoming extremely large, causing parameter updates to go haywire).

**Batch optimization**: sentences of similar length were grouped into the same batch, preventing short sentences from wasting compute cycles while &quot;waiting&quot; for long sentences. This yielded a 2x training speedup.

## 7. My Takeaways

After reading this paper, a few things stand out.

First, this paper had great ambition but a simple method. One LSTM reads, another LSTM writes, and all information passes through a single vector in between. No attention, no complex alignment mechanism, not even any prior assumptions about language structure. And then it actually worked, with results strong enough to compete with carefully tuned traditional systems. The lesson: given sufficient data and compute, simple end-to-end methods can be surprisingly powerful.

Second, the source reversal finding is quite instructive. It is not an elegant solution -- more of a hack. But it revealed a fundamental limitation of RNNs: sensitivity to the distance between elements in a sequence. Bahdanau&apos;s attention mechanism let the model &quot;skip around,&quot; no longer constrained by distance. The Transformer went further, abandoning sequential processing entirely, making the distance between any two positions always 1. From reversal to attention to Transformer -- three generations of solutions to the same problem.

Third, this paper and Bahdanau&apos;s paper were published almost simultaneously (both in September 2014). Sutskever established the encoder-decoder paradigm; Bahdanau identified the fixed-length vector bottleneck and solved it with the attention mechanism. The two papers are like two sides of the same coin: one is the framework, the other is the fix for the framework&apos;s biggest flaw.

Fourth, rewriting this in real Python, you can feel how minimal the architecture is. The encoder just loops through the input; the decoder just loops out the output. But precisely because of this simplicity, its ceiling is obvious: all information must squeeze through a fixed-length vector. This bottleneck becomes especially visceral when you are writing the code yourself.

How much information can a single vector hold? That is the implicit question of this paper.

For longer, more complex sentences -- not enough.

And so, later came attention, and later came the Transformer.

---

**Paper Reading Series**

- [*Neural Machine Translation by Jointly Learning to Align and Translate*](/posts/neural-machine-translation-by-jointly-learning-to-align-and-translate/) — The origin of attention
- [*Attention Is All You Need*](/posts/attention-is-all-you-need/) — Attention takes center stage: the birth of the Transformer
- [*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](/posts/bert/) — Establishing the pre-training paradigm
- [*Scaling Laws for Neural Language Models*](/posts/scaling-laws-for-neural-language-models/) — The mathematics of scale
- [*Language Models are Few-Shot Learners*](/posts/language-models-are-few-shot-learners/) — Larger models, better at eliciting abilities from context
- [*Training Compute-Optimal Large Language Models*](/posts/training-compute-optimal-large-language-models/) — How to spend your compute budget wisely</content:encoded><category>Paper Reading</category><category>paper-reading</category><category>seq2seq</category><category>AI</category><category>LLM</category><category>python</category></item><item><title>Clawdbot: A Decentralized Open-Source AI Project Worth Watching</title><link>https://justinhuangai.github.io/posts/clawdbot/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/clawdbot/</guid><description>A self-hosted platform that connects all your chat channels to an AI Agent</description><pubDate>Fri, 16 Jan 2026 08:34:57 GMT</pubDate><content:encoded>Most mainstream AI assistants today are centralized.

Your conversations, your data, your context -- all of it ends up on someone else&apos;s servers. You use AI, but you don&apos;t truly own it.

[Clawdbot](https://github.com/clawdbot/clawdbot), an open-source project, goes in exactly the opposite direction.

What it&apos;s trying to do is not build yet another smarter chatbot, but put AI capabilities back in the user&apos;s hands: running on your own machine, plugged into the chat tools you already use, keeping data, context, and control with you.

I went through its codebase: over 200,000 lines of TypeScript, covering native apps for macOS, iOS, and Android, plus more than 50 skill modules.

![Chatting with Clawd on WhatsApp](/images/whatsapp-clawd.webp)

## 0. A Few Terms First

If this kind of project is new to you, 5 quick terms will make the rest easier to follow:

- `centralized AI assistant`: a service where your chat history, memory, and control mostly live on the vendor&apos;s servers
- `self-hosted`: software you run on your own machine or server, so deployment and data ownership stay with you
- `AI Agent`: not just a chatbot, but a system that can remember context, call tools, and carry out tasks
- `chat channels`: the messaging apps you already use, like WhatsApp, Telegram, Slack, and iMessage
- `open-source`: code that is public, so people can inspect it, modify it, and deploy it themselves

This is no weekend hackathon side project -- it&apos;s a system built with a long-term product mindset. In many details, you can tell the author has taste when it comes to product trade-offs.

What&apos;s even more impressive is its sense of boundaries.

The capabilities that should be there are there; the things that shouldn&apos;t be crammed in aren&apos;t. No feature bloat just to &quot;look bigger,&quot; none of the showoff energy or loss of control you see in many AI projects.

That kind of restraint is harder than &quot;doing a little of everything,&quot; and it says a lot.

So I&apos;d say Clawdbot is worth watching, not just because it&apos;s a well-crafted open-source project, but because it represents a rare yet increasingly important direction:

Not plugging everyone into the same AI platform,
but letting everyone own their own AI system.

The emergence of products like Clawdbot feels like the first wave hitting the shore before the Age of Exploration begins.

The wave itself doesn&apos;t announce anything.
But it&apos;s already telling you: the tide is coming.

That era where everyone has their own personal AI may be closer than we think.</content:encoded><category>OpenClaw</category><category>AI</category><category>open-source</category></item><item><title>Paper Reading: Neural Machine Translation by Jointly Learning to Align and Translate</title><link>https://justinhuangai.github.io/posts/neural-machine-translation-by-jointly-learning-to-align-and-translate/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/neural-machine-translation-by-jointly-learning-to-align-and-translate/</guid><description>The origin of attention mechanism, with real Python code examples</description><pubDate>Sun, 11 Jan 2026 08:26:19 GMT</pubDate><content:encoded>On September 1, 2014, three researchers uploaded a paper to arXiv (a preprint server where researchers can publish papers without waiting for journal peer review): [*Neural Machine Translation by Jointly Learning to Align and Translate*](/papers/1409.0473v7.pdf).

The three were Dzmitry Bahdanau, KyungHyun Cho, and Yoshua Bengio, from the University of Montreal. Yoshua Bengio is one of the &quot;three godfathers&quot; of deep learning, alongside Geoffrey Hinton and Yann LeCun; the three shared the 2018 Turing Award. Bahdanau was still a PhD student at the time.

The core contribution of this paper can be summarized in one thing: teaching a translation model to look back at different parts of the source sentence when generating each word. It sounds obvious in hindsight, but in the neural machine translation research of the time, this was a genuinely novel idea. It has a name: the &quot;attention mechanism.&quot;

Three years later, eight people at Google pushed this idea to its logical extreme and wrote [*Attention Is All You Need*](/posts/attention-is-all-you-need/). So if you want to understand the Transformer, this paper is one of its most important predecessors.

## 1. The Problem

The standard architecture for neural machine translation in 2014 was the encoder-decoder. The encoder, a Recurrent Neural Network (RNN), reads the source sentence from start to finish and compresses the entire sentence into a single fixed-length vector (think of it as a list of numbers with a fixed count). The decoder, another RNN, starts from this vector and generates the translation one word at a time.

The problem is obvious: whether the source sentence is 5 words or 50 words, the encoder must squeeze it into a vector of the same length. Short sentences are fine, but long sentences lose information. It is like asking someone to read an entire page and then summarize it in a single sentence -- the longer the page, the more gets lost.

The paper demonstrated this experimentally: when sentence length exceeded 30 words, the translation quality of the conventional encoder-decoder dropped sharply.

This is the &quot;fixed-length bottleneck.&quot;

## 2. The Core Idea: Stop Compressing, Let the Decoder Look for Itself

The paper&apos;s solution is intuitive: if compressing the entire sentence into a single vector loses information, then stop compressing. The encoder retains the annotation vector at every position (formed by concatenating the forward and backward hidden states of a bidirectional RNN -- think of it as the intermediate result produced after processing each word), and the decoder, when generating each target word, decides for itself which parts of the source sentence to focus on.

This is the heart of the attention mechanism: **instead of forcing all information through a single bottleneck, let the model learn to look back and find what it needs, when it needs it.**

Specifically, it works in three steps:

**Step 1: Scoring.** Before generating the i-th target word, the decoder compares its current state s_{i-1} with each encoder position&apos;s hidden state h_j, producing an &quot;alignment score&quot; e_{ij}. The higher the score, the more important position j in the source sentence is for generating the current target word.

The scoring function used in the paper:

$$
e_{ij} = a(s_{i-1}, h_j) = v_a^T \tanh(W_a s_{i-1} + U_a h_j)
$$

This is called &quot;additive attention.&quot; The decoder state and encoder state each undergo a linear transformation (multiply by a matrix), the results are added together, passed through tanh (a function that squashes values to between -1 and 1), and then dot-producted with a vector v_a to produce a scalar score.

**Step 2: Normalization.** Softmax converts all position scores into probabilities that sum to 1:

$$
\alpha_{ij} = \operatorname{softmax}(e_{ij}) = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}
$$

**Step 3: Weighted sum.** These probabilities are used to compute a weighted sum of the encoder&apos;s hidden states, producing a &quot;context vector&quot; c_i:

$$
c_i = \sum_j \alpha_{ij} h_j
$$

This context vector is the key information the decoder extracts from the source sentence when generating the i-th word. The context vector is different for each generated word, because the model focuses on different source positions each time.

In Python (using PyTorch):

```python
import torch
from torch import nn


def bahdanau_attention(
    decoder_state: torch.Tensor,
    encoder_outputs: torch.Tensor,
    w_a: nn.Linear,
    u_a: nn.Linear,
    v_a: nn.Linear,
) -&gt; tuple[torch.Tensor, torch.Tensor]:
    # decoder_state: (batch, hidden); encoder_outputs: (batch, seq_len, 2 * hidden)
    decoder_features = w_a(decoder_state).unsqueeze(1)  # (batch, 1, hidden)
    encoder_features = u_a(encoder_outputs)             # (batch, seq_len, hidden)
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j): one score per source position
    scores = v_a(torch.tanh(decoder_features + encoder_features)).squeeze(-1)
    weights = torch.softmax(scores, dim=-1)             # alpha_ij: sums to 1 over j
    # c_i: weighted sum of the annotation vectors
    context = torch.sum(weights.unsqueeze(-1) * encoder_outputs, dim=1)
    return context, weights
```

Unlike the &quot;dot-product attention&quot; used later in the Transformer (where Q and K are directly dot-producted), this paper uses &quot;additive attention&quot; (each is linearly transformed first, then added together). The two approaches have different characteristics, but dot-product attention is better suited for efficient matrix multiplication; combined with the Transformer&apos;s removal of RNN&apos;s sequential dependency, attention finally became a core operator that could be massively parallelized.
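To make the contrast concrete, here is a minimal side-by-side sketch of the two scoring functions (toy dimensions, untrained weights; the tensor names are illustrative, not from the paper):

```python
import torch
from torch import nn

hidden = 4
s = torch.randn(hidden)      # decoder state s_{i-1}
h = torch.randn(6, hidden)   # hidden states for 6 encoder positions

# Additive (this paper): e_j = v_a^T tanh(W_a s + U_a h_j)
w_a = nn.Linear(hidden, hidden, bias=False)
u_a = nn.Linear(hidden, hidden, bias=False)
v_a = nn.Linear(hidden, 1, bias=False)
additive_scores = v_a(torch.tanh(w_a(s) + u_a(h))).squeeze(-1)  # shape (6,)

# Dot-product (Transformer): e_j = (s . h_j) / sqrt(d) -- a single matmul
dot_scores = h @ s / hidden ** 0.5                              # shape (6,)
```

Additive attention needs three learned weight matrices and a tanh; dot-product scoring collapses to one matrix multiplication, which is exactly the operation GPUs are best at.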

## 3. The Encoder: Bidirectional RNN

A unidirectional RNN reads the sentence left to right, outputting a summary vector only after the last word. The problem: each position&apos;s hidden state mainly carries left-side context and cannot see what is to the right.

The paper solves this with a bidirectional RNN (BiRNN). One RNN reads left to right, another reads right to left, and then the hidden states from both directions are concatenated. This way, each position&apos;s hidden state contains context from both the left and the right.

```python
import torch
from torch import nn


class BidirectionalRNN(nn.Module):
    def __init__(self, input_size: int, hidden_size: int) -&gt; None:
        super().__init__()
        self.rnn = nn.GRU(
            input_size=input_size,
            hidden_size=hidden_size,
            bidirectional=True,
            batch_first=True,
        )

    def forward(self, inputs: torch.Tensor) -&gt; torch.Tensor:
        outputs, _ = self.rnn(inputs)
        return outputs
```

In the paper, each direction has 1000 hidden units, concatenated to 2000 dimensions. This doubles the parameters compared to a unidirectional RNN, but in return every position can see the full context.
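A quick shape check of that claim (620 matches the paper&apos;s word-embedding size; the batch size and sentence length here are arbitrary):

```python
import torch
from torch import nn

encoder = nn.GRU(input_size=620, hidden_size=1000, bidirectional=True, batch_first=True)
words = torch.randn(2, 7, 620)   # a batch of 2 sentences, 7 word embeddings each
annotations, _ = encoder(words)
# annotations has shape (2, 7, 2000): at every position, the forward and
# backward hidden states (1000 each) are concatenated
```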

## 4. The Decoder: Realigning at Every Step

Putting the encoder and attention mechanism together, the decoder&apos;s workflow becomes clear:

1. The encoder reads the source sentence with a bidirectional RNN, retaining the hidden state (annotation vector) at every position
2. The decoder begins generating the translation, and before generating each word:
   - Computes attention weights using the current state and all annotation vectors
   - Produces a context vector via weighted sum
   - Combines the context vector, the previously generated word, and the current state to predict the next word

```python
import torch
from torch import nn


class AttentionDecoder(nn.Module):
    def __init__(self, embedding_dim: int, hidden_size: int, vocab_size: int) -&gt; None:
        super().__init__()
        self.rnn = nn.GRU(
            input_size=embedding_dim + 2 * hidden_size,
            hidden_size=hidden_size,
            batch_first=True,
        )
        self.w_a = nn.Linear(hidden_size, hidden_size, bias=False)
        self.u_a = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        self.v_a = nn.Linear(hidden_size, 1, bias=False)
        self.output_proj = nn.Linear(hidden_size, vocab_size)

    def decode_step(
        self,
        prev_word: torch.Tensor,
        prev_state: torch.Tensor,
        encoder_outputs: torch.Tensor,
    ) -&gt; tuple[torch.Tensor, torch.Tensor]:
        context, _ = bahdanau_attention(
            prev_state.squeeze(0),
            encoder_outputs,
            self.w_a,
            self.u_a,
            self.v_a,
        )
        rnn_input = torch.cat([prev_word, context.unsqueeze(1)], dim=-1)
        output, new_state = self.rnn(rnn_input, prev_state)
        logits = self.output_proj(output[:, -1, :])
        return logits, new_state
```

The key point: every time the decoder generates a target word, it recomputes the attention distribution. When translating the first word, it might focus on the beginning of the source sentence; when translating the last word, it might focus on the end. This dynamic alignment capability is something the previous fixed-vector architecture simply could not do.

## 5. Experimental Results

The paper ran experiments on the English-to-French translation task (using the WMT &apos;14 dataset), measuring performance with BLEU scores (a standard metric for machine translation, measuring how close the machine output is to human translation, with a maximum of 100).

Key comparisons:
- **RNNencdec-50** (conventional encoder-decoder, trained on sentences up to 50 words): 26.71 BLEU
- **RNNsearch-50** (model with attention, trained on sentences up to 50 words): **34.16 BLEU**
- **Moses** (the strongest conventional phrase-based translation system at the time): 33.30 BLEU

An improvement of 7.45 points. In the experimental setup reported by the paper, the attention-based neural model had matched or even surpassed the dominant conventional phrase-based translation system.

The more critical finding is in the paper&apos;s Figure 2: as sentence length increased, the conventional encoder-decoder&apos;s BLEU score dropped steeply, while the attention-based model was barely affected. This directly validated the paper&apos;s core hypothesis: the fixed-length vector is the bottleneck, and the attention mechanism can bypass it.

The paper also visualized the attention weights. In English-to-French translation, the attention weights nearly formed a diagonal line, showing that the model had automatically learned that &quot;English word 1 corresponds to French word 1, English word 2 corresponds to French word 2.&quot; When word order differed (for instance, French adjectives placed after nouns), the attention weights shifted accordingly. The model learned all of this without any manual alignment annotations.

## 6. My Takeaways

After reading this paper, a few things stand out.

First, the problem this paper solves is extremely clear: the encoder compresses the entire sentence into a single vector, and long sentences lose information. The solution is equally intuitive: stop compressing, and let the decoder look for itself. Good research is often like this -- the problem is clear, and the solution follows naturally.

Second, attention in this paper is still a supporting role to the RNN. The encoder is still recurrent (bidirectional RNN), the decoder is still recurrent, and attention merely bridges the two. Three years later, Vaswani et al. asked a far more radical question: if attention works so well, can we throw away the RNN entirely and keep only attention? The answer was the Transformer.

Third, when rewriting this paper&apos;s attention mechanism in real Python, you will notice that its computation is considerably more complex than the Transformer&apos;s Scaled Dot-Product Attention. Additive attention requires extra weight matrices W_a, U_a, v_a, while dot-product attention only needs Q and K to be directly multiplied and scaled. Going from &quot;addition&quot; to &quot;multiplication&quot; seems like a small step, but in practice it dramatically simplified the computation and made it far more suitable for efficient matrix operations.

Fourth, Bahdanau was a PhD student at the time, and Bengio was his advisor. A PhD student&apos;s paper ended up defining the core component of AI research for the next decade. The attention mechanism started here, was amplified by the Transformer, and ultimately became the foundation of GPT, BERT, and LLaMA.

This paper did not invent any complicated mathematics. It simply asked a straightforward question: why can&apos;t the decoder look back?

Then it let the decoder look back.

And that look changed an entire era.

---

**Paper Reading Series**

- [*Sequence to Sequence Learning with Neural Networks*](/posts/sequence-to-sequence-learning-with-neural-networks/) — Establishing the encoder-decoder paradigm
- [*Attention Is All You Need*](/posts/attention-is-all-you-need/) — Attention takes center stage: the birth of the Transformer
- [*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](/posts/bert/) — Establishing the pre-training paradigm
- [*Scaling Laws for Neural Language Models*](/posts/scaling-laws-for-neural-language-models/) — The mathematics of scale
- [*Language Models are Few-Shot Learners*](/posts/language-models-are-few-shot-learners/) — Larger models, better at eliciting abilities from context
- [*Training Compute-Optimal Large Language Models*](/posts/training-compute-optimal-large-language-models/) — How to spend your compute budget wisely</content:encoded><category>Paper Reading</category><category>paper-reading</category><category>attention</category><category>AI</category><category>LLM</category><category>python</category></item><item><title>Paper Reading: Attention Is All You Need</title><link>https://justinhuangai.github.io/posts/attention-is-all-you-need/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/attention-is-all-you-need/</guid><description>Sharing my understanding of the Transformer paper, with real Python code examples</description><pubDate>Tue, 06 Jan 2026 08:18:46 GMT</pubDate><content:encoded>On June 12, 2017, eight people uploaded a paper to arXiv (a preprint server where researchers can publish papers without waiting for journal peer review), with a title of just five words: [*Attention Is All You Need*](/papers/1706.03762v7.pdf).

The eight were Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, most of them working at Google Brain and Google Research at the time.

After the paper came out, the group scattered. Noam Shazeer left Google to start Character.AI, then was later bought back by Google at a premium. Aidan Gomez started Cohere before he even finished his PhD at the University of Toronto, building enterprise-scale large language models. Llion Jones moved to Japan and founded Sakana AI. Illia Polosukhin took a path no one saw coming -- he started NEAR Protocol, a blockchain project. Ashish Vaswani and Niki Parmar teamed up to co-found Adept AI, then later started Essential AI together. Jakob Uszkoreit founded Inceptive, using AI to design RNA-based medicines. Łukasz Kaiser joined OpenAI and contributed to the development of the GPT series. 

Eight authors, seven companies, spanning AI, blockchain, and biotech.

Nearly nine years later, ChatGPT, Claude, DeepSeek, Qwen -- the underlying architecture of these AI products can almost all be traced back to those 15 pages.

This post is my understanding after reading the paper, with real Python code examples. It is not a translation, not a summary. You do not need a technical background to follow along.

## 1. The One-Sentence Version

Before the Transformer, AI processed language the way a person reads a book by running a finger under each word, one at a time. By the time you reach word 100, what word 1 said has already faded. The longer the sentence, the worse the forgetting. That was the fundamental bottleneck of Recurrent Neural Networks (RNNs, an earlier AI architecture).

The authors asked a simple question: **why do we have to read in order?**

Unlike RNNs, which must process tokens step by step, the Transformer processes an entire input in parallel, directly modeling the relationship between any two positions. No queuing, no waiting for the previous word to finish before looking at the next one.

The paper calls this core capability &quot;attention.&quot; The title is not saying &quot;the model literally contains nothing but attention.&quot; It is saying: in sequence modeling, attention has been promoted to the lead role for the first time, no longer needing recurrence or convolution (a method that extracts local features through sliding windows) as its backbone.

## 2. What Attention Actually Does

Imagine you walk into a noisy bar where twenty people are talking at once. Your brain does not split its attention evenly across every voice. Someone calls your name, and your ears lock onto that direction instantly. Every other sound fades to background noise.

The Transformer does the same thing for every word. The paper defines three roles:

- **Query**: what this word is looking for. Like your ears searching for &quot;who just called my name?&quot;
- **Key**: what this word can offer. Like the vocal signature of each person in the bar
- **Value**: the actual content this word carries. Like the specific words that person is saying

Each word&apos;s Query is matched against every other word&apos;s Key. High match scores pull more information from that word&apos;s Value. Low match scores are effectively ignored.

The formula the paper gives is called Scaled Dot-Product Attention:

$$
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Do not panic at the formula. Let&apos;s break it down step by step:

- **QK^T**: the dot product of Q and K. What is a dot product? Multiply two lists of numbers element-wise, then sum the results. For example, [1, 2] and [3, 4] gives 1x3 + 2x4 = 11. The larger the result, the more related two words are. This step computes a &quot;match score&quot; for every pair of words
- **/ sqrt(d_k)**: divide by a scaling factor. d_k is the length of the vector (think of a vector as &quot;a list of numbers that describes something&quot; -- for instance, 64 numbers describing the meaning of a word). Why divide? Because the longer the number list, the larger the dot product tends to be. Without scaling, higher dimensions cause higher variance in the dot products, pushing softmax into saturation (almost all probability mass on a single word), which shrinks gradients (the signal the model uses to adjust its own parameters) and destabilizes training
- **softmax**: converts a set of scores into probabilities that sum to 1. For example, if three words have scores [10, 2, 1], softmax turns them into roughly [0.999, 0.0003, 0.0001]. The highest-scoring word captures nearly all the attention; the rest are pushed close to zero
- **x V**: use those probabilities to take a weighted combination of each word&apos;s actual content. High-probability words contribute more, low-probability words contribute less. The final output is a new vector that fuses the key information together

In Python (using PyTorch):

```python
import math
import torch


def scaled_dot_product_attention(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
) -&gt; torch.Tensor:
    d_k = key.size(-1)
    scores = query @ key.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ value
```

That is just a few lines of code. Many of the capabilities that later reshaped the industry were built on top of this handful of operations.

## 3. Multi-Head Attention: Looking from Multiple Angles at Once

A single attention head tends to latch onto one type of relationship pattern. But language packs multiple layers of meaning into a single sentence.

Take &quot;The cat sat on the mat yesterday&quot; as an example:
- &quot;cat&quot; and &quot;sat&quot; have a subject-verb relationship (who did what)
- &quot;yesterday&quot; and &quot;sat&quot; have a temporal relationship (when it happened)
- &quot;on&quot; and &quot;mat&quot; have a spatial relationship (where it happened)

Asking one head to juggle all these layers at once is a tall order. The paper&apos;s solution is the multi-head mechanism: dispatch 8 heads to run in parallel, giving the model the chance to observe a sentence from different subspaces simultaneously, then concatenate their findings at the end.

The formula from the paper:

$$
\operatorname{MultiHead}(Q, K, V) = \operatorname{Concat}(\text{head}_1, \ldots, \text{head}_h)\, W^O
$$

Breaking it down:
- **head_1, ..., head_h**: 8 heads each independently run one attention computation, producing 8 separate results
- **Concat**: concatenate all 8 results end-to-end into one long vector
- **W^O**: a linear transformation (think &quot;multiply by a matrix&quot;) that projects the concatenated long vector back to the original dimension. Like a manager listening to reports from 8 investigators and producing one consolidated conclusion

```python
import math
import torch
from torch import nn


class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int) -&gt; None:
        super().__init__()
        if d_model % num_heads != 0:
            raise ValueError(&quot;d_model must be divisible by num_heads&quot;)
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def _split_heads(self, x: torch.Tensor) -&gt; torch.Tensor:
        batch_size, seq_len, _ = x.shape
        x = x.view(batch_size, seq_len, self.num_heads, self.d_head)
        return x.transpose(1, 2)

    def forward(
        self,
        query: torch.Tensor,
        key: torch.Tensor,
        value: torch.Tensor,
    ) -&gt; torch.Tensor:
        q = self._split_heads(self.q_proj(query))
        k = self._split_heads(self.k_proj(key))
        v = self._split_heads(self.v_proj(value))

        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        weights = torch.softmax(scores, dim=-1)
        heads = weights @ v

        batch_size, _, target_len, _ = heads.shape
        merged = heads.transpose(1, 2).contiguous()
        merged = merged.view(batch_size, target_len, self.num_heads * self.d_head)
        return self.out_proj(merged)
```

The paper&apos;s parameters: the model uses 512 numbers to describe each word (d_model = 512), with 8 heads, each getting 64 numbers (512 / 8 = 64). The total computation of 8 heads is roughly the same as a single 512-dimensional head, but the expressive power is far greater. Same cost, multi-perspective understanding. A very good trade.

## 4. Positional Encoding: Telling the Model About Word Order

The Transformer processes the entire sentence in parallel, which is fast, but the trade-off is that it loses word order. Without additional position information, the attention mechanism alone cannot tell the difference between &quot;the cat ate the fish&quot; and &quot;the fish ate the cat.&quot; That clearly will not do.

The fix: generate a unique &quot;address code&quot; for each position and add it to the word&apos;s vector. The model no longer sees just &quot;cat&quot; and &quot;fish&quot; -- it sees &quot;cat at position 1&quot; and &quot;fish at position 3.&quot;

The paper uses sine and cosine functions to generate this encoding:

$$
\operatorname{PE}(pos, 2i) = \sin\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)
$$

$$
\operatorname{PE}(pos, 2i + 1) = \cos\left(\frac{pos}{10000^{2i / d_{\text{model}}}}\right)
$$

The formula looks intimidating, but the core idea is intuitive:
- **pos**: the word&apos;s position in the sentence (1st, 2nd, 3rd, ...)
- **i**: which dimension of the vector. Even positions use sin, odd positions use cos
- **10000^(2i/d_model)**: a scaling factor that changes with the dimension. Low dimensions oscillate fast, high dimensions oscillate slowly. Like a clock: the second hand completes a full rotation in one minute, while the hour hand takes twelve hours. Different &quot;hands&quot; cover different time scales, and together they can pinpoint any moment precisely

The end result: each position gets a unique numerical fingerprint, and the model uses this fingerprint to distinguish word order.

```python
import math
import torch


def positional_encoding(seq_len: int, d_model: int) -&gt; torch.Tensor:
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-math.log(10000.0) / d_model)
    )

    encoding = torch.zeros(seq_len, d_model)
    encoding[:, 0::2] = torch.sin(positions * div_term)
    encoding[:, 1::2] = torch.cos(positions * div_term)
    return encoding
```

Why sine and cosine specifically? Because they have an elegant mathematical property: the relationship between the encodings of two positions separated by a fixed distance is the same regardless of whether those positions are at the start or the end of the sentence. The model does not need to memorize the relationship between &quot;position 3 and position 8&quot; -- it only needs to learn what &quot;5 positions apart&quot; means. The paper&apos;s team also tried letting the model learn positional encodings on its own, and the results were similar, but the sinusoidal version has one extra advantage: it can handle sentences longer than any seen during training.

## 5. Encoder and Decoder

The full Transformer architecture is split into two halves.

The **encoder** (6 stacked layers) is responsible for understanding the input. Each layer contains two sub-layers: one multi-head self-attention, one feed-forward network. Each sub-layer has two protective mechanisms:

- **Residual connection**: add the sub-layer&apos;s input directly to its output, i.e., x + Sublayer(x). Why? Imagine applying a filter to a photo. If the filter turns out badly, the residual connection ensures you can still see the original image. In deep networks, information gets transformed at every layer, and by the sixth layer it may be unrecognizable. Residual connections let the original signal take a &quot;shortcut&quot; straight to deeper layers, preventing information from being lost in transit
- **Layer normalization** (LayerNorm): rescales values to a uniform range, preventing some numbers from exploding to infinity while others vanish to zero. Similar to standardizing exam scores -- no matter how different the raw scores are, standardization puts them on a comparable scale

The **decoder** (6 stacked layers) is responsible for generating the output. Its structure resembles the encoder, but with two critical additions:

First, **cross-attention**: as the decoder generates each word, it looks back at the encoder&apos;s output. In a translation scenario, this is like writing English while glancing back at the Chinese source text.

Second, **masking**: when generating the 3rd word, the model is only allowed to see the first 2 words. The 4th position and beyond are blocked (attention scores set to negative infinity, which becomes zero after softmax). The logic is simple: when you are writing an essay, the next word has not been written yet -- you cannot peek ahead.

```python
from typing import Optional

import torch
from torch import nn


class Transformer(nn.Module):
    def __init__(
        self,
        vocab_size: int,
        d_model: int = 512,
        num_heads: int = 8,
        num_layers: int = 6,
        d_ff: int = 2048,
        dropout: float = 0.1,
    ) -&gt; None:
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)

        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=num_heads,
            dim_feedforward=d_ff,
            dropout=dropout,
            batch_first=True,
        )
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=num_heads,
            dim_feedforward=d_ff,
            dropout=dropout,
            batch_first=True,
        )

        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.output_proj = nn.Linear(d_model, vocab_size)

    def forward(
        self,
        src_token_ids: torch.Tensor,
        tgt_token_ids: torch.Tensor,
        tgt_mask: Optional[torch.Tensor] = None,
    ) -&gt; torch.Tensor:
        memory = self.encoder(self.embedding(src_token_ids))
        hidden = self.decoder(self.embedding(tgt_token_ids), memory, tgt_mask=tgt_mask)
        return self.output_proj(hidden)
```

There is one more component that is easy to overlook: the feed-forward network. Its formula is FFN(x) = max(0, xW1 + b1)W2 + b2. In plain language: take each word&apos;s 512-dimensional vector and expand it to 2048 dimensions (multiply by a matrix and add a bias), run it through ReLU (all negative numbers become zero, positive numbers stay), then compress it back to 512 dimensions. The ReLU step is the key: it introduces &quot;nonlinearity,&quot; allowing the model to learn complex patterns that a straight line could never capture. If every operation were linear, stacking multiple layers would still be mathematically equivalent to a single layer. Nonlinearity is the prerequisite for modeling complexity.

## 6. Training Details

With the architecture designed, how do you train it? The paper put real thought into this as well.

**Hardware**: 8 NVIDIA P100 GPUs. The base model trained for 12 hours (100,000 steps), the large model for 3.5 days (300,000 steps). By today&apos;s standards, that cost is remarkably low.

**Optimizer**: Adam (an algorithm that automatically adjusts model parameters), but with a clever learning rate schedule. The learning rate determines how big a step the model takes with each update. Steps too large risk overshooting the optimum; steps too small waste time. The paper&apos;s strategy: ramp up over the first 4,000 steps (warmup), avoiding overly aggressive updates at the start; after 4,000 steps, gradually decay on a schedule, stabilizing the later stages of training. Rise then fall -- bold exploration in the first half, fine-tuning in the second.
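The rise-then-fall shape comes from a single formula in the paper: lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5). A direct transcription (the function name is mine):

```python
# The Transformer paper's learning-rate schedule:
# lrate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)
# Below warmup_steps the second term wins (linear ramp-up);
# above it the first term wins (decay proportional to step^-0.5).

def transformer_lr(step: int, d_model: int = 512, warmup_steps: int = 4000) -> float:
    """Learning rate at a given training step (step starts at 1, not 0)."""
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The rate climbs until step 4,000, peaks, then slowly decays:
print(transformer_lr(2000) < transformer_lr(4000))    # still warming up
print(transformer_lr(100000) < transformer_lr(4000))  # long past the peak
```

Both terms are equal exactly at `step == warmup_steps`, which is why the peak lands precisely at step 4,000.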

**Regularization**: two techniques. The first is Dropout: randomly disable 10% of neurons (think of them as computational nodes in the network) during training, forcing the model not to rely on any single pathway and to learn more robust features. The second is label smoothing (epsilon = 0.1): instead of telling the model &quot;the probability of the correct answer is 100%,&quot; you say &quot;90% on the correct answer, 10% spread across the other options.&quot; This actually makes the model worse on one metric (perplexity, which measures how &quot;uncertain&quot; the model is), but translation quality improves. Intuitively, a model that admits it is not 100% sure is more reliable than one that is overconfident.
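In code, label smoothing is just a reshuffling of the target distribution. A minimal sketch of the "90% / 10% spread" version described above (the function name is mine, and real implementations vary in exactly how the epsilon mass is split):

```python
# Label smoothing sketch: keep 1 - epsilon on the correct class and
# spread epsilon evenly across the remaining classes.

def smooth_labels(correct_idx: int, num_classes: int,
                  epsilon: float = 0.1) -> list[float]:
    off_value = epsilon / (num_classes - 1)  # mass shared by the wrong classes
    target = [off_value] * num_classes
    target[correct_idx] = 1.0 - epsilon      # most mass stays on the right answer
    return target

# With 5 classes and epsilon = 0.1, the "hard" target [0, 1, 0, 0, 0]
# becomes [0.025, 0.9, 0.025, 0.025, 0.025] -- still a valid distribution.
print(smooth_labels(1, 5))
```

Training against this softened target is what keeps the model from driving its confidence toward 100% on every token.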

**Results**: the paper uses BLEU scores (a standard metric for machine translation, measuring how close the machine output is to human translation, with a maximum of 100) to evaluate performance. English-to-German: 28.4. English-to-French: 41.8. Both set new records at the time. Training cost was one to two orders of magnitude lower than previous approaches. Faster, stronger, cheaper.

## 7. My Takeaways

After reading this paper, a few things stand out.

First, the core insight of this paper is remarkably concise: throw away the baggage of sequential processing and let the attention mechanism directly model the relationship between any two positions. Self-Attention, residual connections, Layer Normalization -- none of these were new inventions. The real breakthrough was not inventing new tools, but the authors&apos; willingness to bet that &quot;these simple building blocks, assembled together, are enough&quot; -- and then proving themselves right with experiments.

Second, writing it out in real Python gave me a deeper understanding of every design decision. When you write Scaled Dot-Product Attention yourself, you feel viscerally why that sqrt(d_k) scaling matters. When you implement masking, you understand exactly where the autoregressive generation constraint comes from. Reading the paper ten times is not worth as much as writing it once yourself.

Third, what truly struck me was not how many models it later spawned, but the fact that it reframed the problem back in 2017: from &quot;how do we remember a sentence in order&quot; to &quot;how do we let every position directly find the information it needs most.&quot; GPT, BERT, T5, LLaMA -- all of them are products of that reframing.

How far a sufficiently good architecture can go depends on how many people are willing to keep building on it.

This paper gave us that architecture.

Attention Is All You Need.

---

**Paper Reading Series**

- [*Sequence to Sequence Learning with Neural Networks*](/posts/sequence-to-sequence-learning-with-neural-networks/) — Establishing the encoder-decoder paradigm
- [*Neural Machine Translation by Jointly Learning to Align and Translate*](/posts/neural-machine-translation-by-jointly-learning-to-align-and-translate/) — The origin of attention
- [*BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding*](/posts/bert/) — Establishing the pre-training paradigm
- [*Scaling Laws for Neural Language Models*](/posts/scaling-laws-for-neural-language-models/) — The mathematics of scale
- [*Language Models are Few-Shot Learners*](/posts/language-models-are-few-shot-learners/) — Larger models, better at eliciting abilities from context
- [*Training Compute-Optimal Large Language Models*](/posts/training-compute-optimal-large-language-models/) — How to spend your compute budget wisely</content:encoded><category>Paper Reading</category><category>paper-reading</category><category>transformer</category><category>AI</category><category>LLM</category><category>python</category></item><item><title>👋 Hello World</title><link>https://justinhuangai.github.io/posts/hello-world/</link><guid isPermaLink="true">https://justinhuangai.github.io/posts/hello-world/</guid><description>Welcome to Astro-Theme-Aither — an AI-native Astro theme that believes text itself is beautiful.</description><pubDate>Thu, 01 Jan 2026 08:07:13 GMT</pubDate><content:encoded>Welcome to Astro-Theme-Aither.

This is an AI-native blog theme built on one belief: text itself is beautiful. A unified sans-serif system font stack, Apple HIG typography parameters, and a layout that stays out of your way. Everything here serves a single goal — making your words look and feel beautiful.

## Why Another Blog Theme

The web is full of blog themes, so a fair question is: why build another one? The answer comes down to priorities. Most themes optimize for visual impact — large hero images, complex layouts, animated transitions. These look stunning in a demo but get in the way when someone actually sits down to read a 2,000-word article.

Astro-Theme-Aither starts from a different premise. The content is the product. The theme&apos;s job is to present that content with the care it deserves: Apple HIG body text parameters (17px / 1.47 / -0.022em), generous whitespace, and a vertical rhythm that makes long-form reading comfortable rather than exhausting.

This philosophy extends to the technical decisions too. The theme uses Astro&apos;s islands architecture — only interactive components (theme switcher, language switcher, locale detection, mobile nav) load JavaScript. Everything else is static HTML and CSS. No layout shifts, no loading spinners. The page loads, and you read.

## Get Started

Getting up and running takes just a few minutes:

1. **Clone the repository** — use the GitHub template button or clone directly with `git clone`
2. **Install dependencies** — run `pnpm install` to pull in all packages
3. **Configure your site** — edit `src/config/site.ts` to set your site title, description, and nav links
4. **Set up services** — copy `.env.example` to `.env` and fill in your API keys (GA, Crisp, Giscus)
5. **Replace sample content** — swap the posts in `src/content/posts/` with your own Markdown files
6. **Start developing** — run `pnpm dev` to launch the local dev server with hot reloading
7. **Deploy** — push to `gh-pages` and let the included GitHub Pages workflow publish the site

### Project Structure

```
src/
├── components/     # Reusable Astro &amp; React components
├── config/         # Site configuration (site.ts)
├── content/        # Your Markdown posts (organized by locale)
├── i18n/           # Translations and locale utilities
├── layouts/        # Page layouts (Layout.astro)
├── lib/            # Shared utilities (posts, formatter, markdown-endpoint)
├── pages/          # Route pages (per locale)
└── styles/         # Global CSS with Tailwind v4 @theme tokens
```

Each directory has a clear responsibility. Components are small and composable. Layouts handle the document shell. Pages define routes. Content holds your writing organized by locale.

### Writing Your First Post

Create a new `.md` file in `src/content/posts/en/` with the following frontmatter:

```markdown
---
title: Your Post Title
date: &quot;2026-01-15T16:27:43+08:00&quot;
category: General
description: A brief summary for SEO and social previews
tags: [topic, another]
pinned: false
---

Your content starts here.
```

The `title`, `date`, and `category` fields are required. Use ISO 8601 for `date`, including seconds and a timezone offset, for example `2026-01-15T16:27:43+08:00`. The `description` field is strongly recommended because it populates the meta description tag and Open Graph previews. Tags are optional. Set `pinned: true` to pin a post to the top of the list.

For multilingual content, create the same file in each locale directory (`zh-hans/`, `zh-hant/`, `ko/`) with translated content.

## What You Get

Out of the box, you have a production-ready blogging platform with every feature you need and none of the bloat you don&apos;t.

### Content Features

- **RSS feed** — automatically generated at `/rss.xml`
- **Sitemap** — auto-generated via `@astrojs/sitemap`
- **SEO meta tags** — Open Graph, Twitter cards, and canonical URLs on every page
- **JSON-LD** — Article structured data for AI and search engines
- **Dark mode** — Light / Dark / System toggle with circular reveal animation via View Transitions API
- **i18n** — multi-language support with automatic browser language detection
- **Post pinning** — pin important posts to the top of the list
- **Pagination** — file-based SSG pagination with page number navigation

### AI-Native Features

- **llms.txt** — AI agent content index at `/llms.txt`
- **llms-full.txt** — full-text content for AI consumption at `/llms-full.txt`
- **Markdown endpoints** — append `.md` to any post URL for clean Markdown output
- **robots.txt** — explicitly welcomes AI crawlers (GPTBot, ClaudeBot, PerplexityBot)

### Developer Features

- **TypeScript throughout** — strict mode, fully typed components and utilities
- **Content Collections** — type-safe Markdown with frontmatter validation at build time
- **Tailwind CSS v4** — `@theme` design tokens for easy customization
- **Validation workflow** — content coverage checks plus agent protocol smoke tests through `pnpm validate`
- **Deploy** — GitHub Pages workflow included
- **Google Analytics** — optional, via environment variable
- **Crisp Chat** — optional live chat, via environment variable
- **Giscus Comments** — optional GitHub Discussions powered comments

### Performance

Because the theme outputs static HTML with minimal JavaScript islands, performance is excellent by default. You should expect Lighthouse scores of 100 across the board — Performance, Accessibility, Best Practices, and SEO.

## Customization

- **Colors** — edit CSS custom properties in `src/styles/global.css`
- **Fonts** — swap font-family values in the Tailwind theme configuration
- **Navigation** — update nav links in `src/config/site.ts`
- **Services** — set environment variables in `.env` for GA, Crisp, and Giscus
- **Languages** — add new locales in `src/i18n/` and create corresponding page routes

For deeper changes, the component architecture is deliberately simple. Each component does one thing, reads its props, and renders HTML.

## A Note on Design Philosophy

The visual simplicity of this theme is intentional, but it is not the same as engineering simplicity. Under the hood, the theme handles a surprising number of concerns: Apple HIG typography parameters, accessible color contrast ratios in both light and dark modes, View Transitions API animations, automatic browser language detection, proper semantic HTML structure, AI-friendly content endpoints, and careful attention to the reading experience on screens ranging from phones to ultrawide monitors.

Good design is invisible. When you read an article on this theme and simply enjoy the writing without noticing the theme at all — that is the design working exactly as intended.

Happy writing.</content:encoded><category>Tutorial</category><category>hello</category><category>astro</category></item></channel></rss>