
What sits between your prompt and the code that appears on your screen is not a language model. It is a system, and the system has more moving parts than most users suspect. This article traces those parts through Claude Code, OpenAI Codex, and the open-source alternatives, with data from academic papers, source code analysis, and production benchmarks. Framework: set-core. Author: setcode.dev.
In Part 2 (coming tomorrow), we follow the evidence to a harder question: why so many teams see nothing from these tools, while a handful build million-line codebases with zero human-written code.
Bottom line
A frontier coding agent has three layers, each contributing something a local Ollama model cannot replicate on its own.
Layer 1: the model. Not a base LLM. Months of post-training: RLHF from production developers (not crowdsource annotators), tool-use fine-tuning on tens of thousands of structured examples, Constitutional AI alignment, inference-time compute that provably expands computational capacity. Safety classifiers consume a meaningful share of total inference compute. You do not download this from Hugging Face.
Layer 2: the scaffold. Claude Code is 512,000 lines of TypeScript. By code volume, 1.6% calls the model; the other 98.4% is deterministic infrastructure: 54 tools, an 8-layer permission system, a 5-layer context compaction pipeline, 2,592 lines of bash security checks, and a system prompt conditionally assembled from 110+ sections. This is not 1.6% importance: the model makes every decision, the infrastructure makes the system reliable. But the infrastructure matters: swapping only the scaffold produces a 22+ point swing on the same benchmark with the same model. The same model in a thin wrapper (pi.dev, Aider) works for interactive use. The 512K lines are what you need to remove the human from the loop.
Layer 3: the infrastructure. Prompt caching with KV-cache tensors in GPU VRAM (92% prefix reuse, ~$0.50 vs $5.00 per million tokens). PagedAttention eliminating 60-80% of memory waste. Continuous batching at 10-20x throughput. Speculative decoding cutting latency in half. MoE expert parallelism across GPU clusters. None of this is visible to the user. All of it is necessary for the system to work at the speed and price point it does.
The question we set out to answer (“can you replace a frontier model with a local one?”) turns out to be three questions, one per layer. The answer to each is different.
The model is not what you think
When you call Claude Opus 4.6 or GPT-5.5, you are not calling a pretrained language model. You are calling the output of a multi-stage post-training pipeline that has been running for years.
What’s baked into the weights
RLHF with production coders. Anthropic’s RLHF pipeline uses “specialized annotation for code… requiring feedback from people who write production code, not just people who can recognize valid syntax” (Anthropic RLHF paper). Questions about edge cases, error handling, scalability, and idiomatic patterns require working developers to evaluate. The annotator agreement rate is 63%, barely above chance, which tells you how subjective code quality judgments are, and how much data you need to make the signal useful. The pipeline runs on a weekly cadence with fresh feedback, iterated online rather than batch.
Anthropic partnered with Surge AI for data collection. OpenAI’s GPT-4 Technical Report reveals almost nothing about their process (arXiv:2303.08774). DeepSeek took a different path entirely: GRPO bypasses supervised fine-tuning and uses compiler test-suite pass/fail as a binary reward signal. No human annotators. Monte Carlo rollouts estimate advantages instead of a learned critic. The result, DeepSeek R1, achieves competitive performance at a fraction of the training cost.
Tool-use fine-tuning. Frontier models are specifically trained to produce structured JSON tool calls. The ToolACE paper (ICLR 2025) describes generating 26,507 synthetic APIs and training on 5,000-50,000 prompt-response pairs per model. The training data must “slightly exceed the model’s current capability” to be effective; too easy and the model learns nothing, too hard and it collapses. Some tool-use behaviors start as system prompt instructions and later get baked into weights via post-training. The line between “prompted” and “trained” is deliberately blurry.
Constitutional AI and alignment. Anthropic’s Constitutional AI uses a two-phase process: the model generates responses, self-critiques against a written constitution, revises, and the revised outputs become training data. The second phase (RLAIF) has the model evaluate its own outputs pairwise, generating synthetic preference data for reward model training. Anthropic’s Synthetic Document Fine-tuning paper showed that training on just 14 million tokens of fictional stories portraying aligned AI behavior reduced misalignment from 96% to 0%. The leaked “soul document” (a 14,000-token training document defining Claude’s personality) was confirmed authentic by Anthropic (The Decoder).
None of this is reproducible from the base weights on Hugging Face. An open-weight model like Qwen 3.6-27B has its own post-training, but the budget, the data quality, and the iteration velocity differ by orders of magnitude. Anthropic has hundreds of people working on RLHF data quality. DeepSeek has a different advantage: radical cost efficiency. Both produce results that a single team downloading Ollama cannot match.
What happens at inference time
Extended thinking is not a formatting trick. A 2024 ICLR paper proved that “with T steps of chain-of-thought, constant-depth transformers can solve any problem solvable by boolean circuits of size T.” CoT expands the computational class the model can reach. This is genuine computation, not a formatting trick.
In practice, frontier models deploy something closer to MCTS-like search: branching into multiple solution paths, evaluating which branches look promising, allocating more compute to the promising ones, backtracking on dead ends. OpenAI’s o3 discovered strategies like “write a brute-force solution, then use it to verify an optimized solution.” No one programmed that; the model found it through reinforcement learning (Interconnects analysis).
Token efficiency varies. Claude 3.7 Sonnet spends roughly 3x more tokens than o3-mini, which spends 2x more than o1, while all three perform in a similar accuracy range. The Think tool, distinct from extended thinking, lets Claude pause mid-response to reason about new information from tool results. On tau-bench, this single addition improved performance by 54%. On SWE-bench, 1.6% (p < .001). Small absolute numbers, large when compounded across a multi-step agent loop.
What runs on the server you don’t see
Anthropic deploys a fleet of safety classifiers (“prompted or specially fine-tuned Claude models”) that run in parallel, each monitoring a different harm category (Building Safeguards for Claude). A significant fraction of total compute goes to safety reasoning. Response steering can inject additional instructions into Claude’s system prompt mid-conversation. False positive rates have been reduced 10x since classifiers were first deployed.
On the serving side: PagedAttention (vLLM) solves KV cache memory fragmentation; traditional approaches waste 60-80% of memory. Continuous batching delivers 10-20x throughput over static batching. Speculative decoding uses a small draft model to generate candidate tokens verified by the large model in parallel, cutting latency 2-3x. GPT-5.5 reportedly rewrote OpenAI’s own serving infrastructure before launch, writing custom load-balancing heuristics that increased generation speed by 20%.
Prompt caching stores KV-cache tensors directly in GPU VRAM, hash-indexed. Claude Code achieves a 92% prefix cache reuse rate. At $0.50 per million cached tokens versus $5.00 for uncached, this is a 10x cost reduction on repeat context. The 5-minute cache TTL is the constraint, and one we’ve written about before.
The scaffold: 98.4% infrastructure, 1.6% AI logic, but not 1.6% importance
Boris Cherny, Head of Claude Code at Anthropic, described the system as “the thinnest possible wrapper over the model” (Pragmatic Engineer). He also noted that roughly 90% of Claude Code’s codebase was written by Claude itself.
A VILA-Lab analysis of the leaked v2.1.88 source (512,000 lines of TypeScript) measured that only 1.6% of the code is AI decision logic. The remaining 98.4% is deterministic infrastructure spread across five layers, seven safety systems, and 54 tools. The core is an async function* queryLoop() generator, a reactive while-loop implementing the ReAct pattern. The agent calls the model, gets tool-use instructions, executes tools, feeds results back, repeats.
That loop is 1,729 lines. Everything else is what makes the loop survivable in production.
A critical distinction: 1.6% is a code volume metric, not an importance metric. The 1.6% of code that calls the model is where every consequential decision happens: what approach to take, what code to write, how to recover from an error, when to stop. The 98.4% provides the tools, permissions, context, and safety rails, but makes no decisions on its own. If you replaced Opus with Haiku inside the same 512K-line scaffold, the infrastructure would still work perfectly. The decisions would be worse. The code output would degrade. The gates would catch more failures. The retry budget would exhaust faster.
This is the F1 car analogy: the chassis, aerodynamics, suspension, and electronics are 98% of the parts count. The engine is 2%. But put a lawnmower engine in that chassis and you don’t get 98% of the performance. The engine is what moves the car. The chassis is what lets the engine operate at its limits safely.
People who use pi.dev or Aider with Opus and get excellent results prove the point: a thin wrapper plus a strong model is enough for a single developer working interactively. The 512K lines become necessary when you remove the human from the loop: parallel agents, crash recovery, security sandboxing, 45-minute autonomous sessions. The infrastructure does not make the model smarter. It makes the system reliable enough to run without supervision.
The tool system
54 built-in tools: 19 always active, 35 behind feature flags. Two execution paths: StreamingToolExecutor begins executing tools while the model response is still streaming; runTools() is the fallback. Read-only operations run in parallel; state-modifying operations serialize. Tool results are budgeted at 50K characters per message; oversized outputs get replaced with content references. When Tool Search is enabled, schemas load on demand rather than occupying the system prompt, reducing token overhead by ~85%.
The permission system
Eight independent layers, any of which can block execution. Build-time gating physically removes internal tools from external builds. Feature-flag kill switches enable server-side deactivation without client updates. An ML classifier (a separate Sonnet 4.6 model call) evaluates each tool invocation in auto mode against the user’s stated intent. Bash security spans 2,592 lines with 23 numbered checks, from Unicode zero-width space injection to zsh equals expansion to IFS null-byte attacks. Users approve 93% of permission prompts, which Anthropic interprets as approval fatigue rather than safety, hence the shift toward sandbox boundaries rather than more warnings.
Context management
Five compaction layers, applied cheapest-first, totaling 3,960 lines of TypeScript:
- Budget reduction. Per-message size limits on tool results. Always active, zero cost.
- Snip. Lightweight temporal trim of older history. Zero cost, high information loss.
- Microcompact. Selective tool result clearing with cache-aware pinning to preserve prompt cache hit rates.
- Context collapse. Read-time projection over conversation history; summaries persist across turns, originals preserved. ~90% compression.
- Auto-compact. Full model-generated summary when all else fails. Circuit breaker halts after 3 consecutive failures.
This pipeline exists because, as Anthropic’s context engineering guide puts it, “context must be treated as a finite resource with diminishing marginal returns.” Context rot (accuracy degrading as token count rises) is the practical limit on agent session length. Claude Code’s 99.9th percentile session duration doubled from 25 minutes to 45+ minutes between October 2025 and January 2026 (Measuring Agent Autonomy). The compaction pipeline is what made that possible.
Sub-agents
Three execution models. Fork: spawned agents inherit parent context as byte-identical copies. Because of prompt caching, spawning 5 agents costs barely more than 1. Worktree: each agent gets an isolated git worktree, preventing filesystem contamination. Teams (experimental): independent agents with their own context windows communicate via a filesystem-based mailbox system.
Multi-agent Opus 4 lead with Sonnet 4 sub-agents outperformed single-agent Opus 4 by 90.2% on browsing tasks. Token usage alone explained 80% of the variance.
The scaffold matters, with numbers
The central question in agentic AI is how much the scaffold contributes versus the model. Both sides have evidence.
The scaffold dominates
| Experiment | What changed | Effect | Source |
|---|---|---|---|
| Same model, generic scaffold → optimized scaffold | Scaffold only | 42% → 78% (+36 pt) | Particula Tech |
| SWE-bench Pro, same model, 4 scaffolds | Scaffold only | 22+ point spread | Scale AI |
| Grok Code Fast, edit tool format change | One tool parameter | 6.7% → 68.3% (10x) | Particula Tech |
| LangChain Terminal-Bench, harness tuning | Harness only | Outside Top 30 → Top 5 | LangChain blog |
| Vercel, 17 tools → 2 tools | Fewer tools | 80% → 100%, 3.5x faster | Anthropic tool design |
“The scaffold accounts for a 22+ point swing while model swaps account for roughly 1 point.” Six top models now cluster within 0.8 points on SWE-bench Verified. Model choice at the frontier is noise; scaffold quality is the signal.
The Confucius Code Agent case (Meta + Harvard): custom-scaffolded Claude Sonnet 4.5 scored 52.7% on SWE-bench Pro, beating Claude Opus 4.5 at 52.0% with a different scaffold. A weaker, cheaper model outperformed the flagship because the scaffolding was better.
The model dominates
mini-swe-agent, about 100 lines of Python with only bash as a tool, achieves >74% on SWE-bench Verified. It is now the standard evaluation scaffold, chosen because it puts the burden on the model rather than the harness. Scale AI’s SWE-Atlas found that Opus 4.6 scored only 2.5 points better in Claude Code than in the generic SWE-Agent framework, within error margin.
Boris Cherny: “All the secret sauce is in the model.” Noam Brown (OpenAI): “As reasoning models improve, those scaffolds will also just be replaced.”
The synthesis
A 2026 paper from the “Architecture Without Architects” team identifies the resolution: some prompt-architecture couplings are contingent (they will weaken as models improve) while others are fundamental (they persist regardless of model capability). Tool orchestration, error recovery, and context management appear to be fundamental. Fine-grained planning instructions appear to be contingent.
In our own orchestration framework, we’ve seen this concretely. In an earlier article, we showed how Opus 4.7 decomposed a 198-line spec into 12 changes (2.17x token cost, scope wandering, sibling-spec breakage) while Opus 4.6 produced 5 changes (clean merges, zero retries). The model’s “creativity,” a virtue in single-shot use, became a liability under orchestration. The scaffold’s job is to channel that creativity productively. The better the model, the more the scaffold matters for constraint enforcement, not less.
But the scaffold multiplies; it does not substitute
The data above might suggest that the model barely matters. That you could plug any model into Claude Code’s 512K-line framework and get comparable results. You cannot.
The scaffold evidence shows that swapping scaffolds while keeping the model fixed produces a 22+ point swing. But the inverse test (swapping models while keeping the scaffold fixed) shows something equally important: the model is the floor that the scaffold multiplies.
The Aider polyglot leaderboard uses a single, identical scaffold for every model. Same tools, same prompts, same edit format. The only variable is the model:
| Model | Aider score | Type |
|---|---|---|
| GPT-5 (high) | 88.0% | Frontier |
| Gemini 2.5 Pro | 83.1% | Frontier |
| DeepSeek V3.2-Exp (Reasoner) | 74.2% | Open weight |
| Qwen3 235B | 59.6% | Open weight |
| Kimi K2 | 59.1% | Open weight |
Same scaffold. 28-point spread from model choice alone. The scaffold contribution (22+ points when measured independently) and the model contribution (28+ points when measured independently) are both large. Neither is negligible.
On SWE-bench Pro, the model spread is 24 points (Mythos 77.8% vs. Qwen 3.6-27B 53.5%). Even accounting for scaffold variation between submissions, this gap is too large to attribute to scaffolding alone. Model differences do not vanish.
What happens if you route a local model through Claude Code? Claude Code’s system prompt is 2,300-3,600 tokens of carefully engineered instructions, conditional on 110+ sections. Its tool definitions add 14,000-17,000 tokens. The entire prompt design assumes a model that can reliably follow multi-constraint instructions, produce valid JSON tool calls, maintain state across dozens of turns, and self-correct when a tool result contradicts its plan. A model that drops constraints from long prompts, generates malformed JSON, or loses coherence after 3-4 steps will hit every failure mode in the scaffold simultaneously, not because the scaffold is bad, but because the scaffold is designed for a model class that the local model does not belong to.
A useful mental model, though simplified: the scaffold acts as a multiplier on the model’s base capability. On SWE-bench Pro (SEAL track), Opus 4.5 scores ~46% with the standardized scaffold and ~57% with its own optimized scaffold. That is a ~1.2x multiplier from scaffold alone. Apply the same scaffold improvement to a model with a lower base, and the absolute gap persists even as both improve. The scaffold lifts both, but it cannot close a 15-point base gap. This is an approximation, not a derived constant: the actual scaffold impact varies by model, task type, and which specific features the scaffold provides.
This is why Claude Code’s own model resolver uses 13 distinct roles with different model assignments. The planner needs a model that can hold a full spec in context and reason about decomposition tradeoffs. The reviewer needs a model that can judge code quality against a rubric. The classifier needs speed, not depth. Matching the model capability to the role demand is itself an orchestration skill, and it only works if you have access to models across the capability spectrum.
Why “a few percent” on a benchmark is not “a few percent” in production
A benchmark score is an average across independent, single-turn tasks. An agent loop is a chain of dependent, multi-turn decisions. The difference matters because errors compound.
Consider a 20-step agent task (read files, plan approach, write code, run tests, fix errors, iterate). Each step requires the model to produce a correct tool call and a useful response. If the per-step success rate is 95% (Opus-class), the chain completes 36% of the time. If it drops to 85% (strong mid-tier), chain completion falls to 4%. At 75% (Haiku-class), it is 0.3%.
| Per-step success rate | 5-step chain | 10-step chain | 20-step chain |
|---|---|---|---|
| 95% (Opus-class) | 77% | 60% | 36% |
| 90% (strong mid-tier) | 59% | 35% | 12% |
| 85% (weaker mid-tier) | 44% | 20% | 4% |
| 75% (Haiku-class) | 24% | 6% | 0.3% |
A 20-point benchmark gap between Opus-class (95% per step) and Haiku-class (75% per step) models looks like a leaderboard detail. In a 20-step chain it is the difference between 36% completion and 0.3%. The Opus-class model is 120x more likely to finish. Even a 10-point gap (95% vs 85%) produces a 9x difference at 20 steps.
This is why Claude Code’s model resolver does not use Haiku for the agent role. Haiku is fast and cheap, and on single-turn code generation it scores within 15-20 points of Opus. But orchestration is not single-turn code generation. It is a 10-50 step chain where each step depends on the previous one. The gap that looks like “just 15 points on a benchmark” becomes the gap between a pipeline that runs overnight and produces merged code, and one that exhausts its retry budget on the third change.
The same compounding applies to tool-call reliability. If the model produces valid JSON tool calls 98% of the time (frontier), a 20-call sequence has a 67% chance of completing without a malformed call. At 90% (typical for open-weight models without native tool-call training), the same sequence completes 12% of the time. One in eight instead of two in three.
This is why the “just a few percent on the benchmark” framing is misleading for agent use cases. Benchmarks measure average single-task performance. Agent pipelines need reliable multi-step chains. A 20-point gap that looks modest on a leaderboard becomes 120x in production. The model quality that seems marginal in a bar chart is the difference between a pipeline that runs overnight and produces merged code, and one that exhausts its retry budget on the third change.
The bottom line on model vs scaffold
The scaffold is the differentiator between teams using the same model tier. The model is the differentiator between capability tiers. Both matter. Claiming “just the scaffold” or “just the model” misreads the evidence. The teams getting the best results have both: a strong model channeled through well-designed orchestration.
The gap: frontier vs. local
Here are the benchmark numbers as of May 2026. SWE-bench Verified tests against 500 real GitHub issues (Python only). SWE-bench Pro uses 1,865 decontaminated multi-language tasks with a standardized scaffold.
SWE-bench Verified
| Model | Score | Type |
|---|---|---|
| Claude Mythos Preview | 93.9% | Closed frontier |
| Claude Opus 4.7 (Adaptive) | 87.6% | Closed frontier |
| GPT-5.3 Codex | 85.0% | Closed frontier |
| Claude Opus 4.5 | 80.9% | Closed frontier |
| DeepSeek V4 Pro (Max) | 80.6% | Open weight |
| Qwen 3.6-27B | 77.2% | Open, runs on 1 GPU |
SWE-bench Pro (decontaminated)
| Model | Score | Type |
|---|---|---|
| Claude Mythos Preview | 77.8% | Closed frontier |
| Kimi K2.6 | 58.6% | Open weight |
| Claude Opus 4.5 | 57.1% | Closed frontier |
| DeepSeek V4 Pro (Max) | 55.4% | Open weight |
| Qwen 3.6-27B | 53.5% | Open, 1 GPU |
| Claude Opus 4.6 | 53.4% | Closed frontier |
One number stands out: on SWE-bench Pro, Qwen 3.6-27B (53.5%) scores within noise margin of Claude Opus 4.6 (53.4%). A 0.1-point difference on ~1,800 tasks is not a meaningful gap. A free model running on a single RTX 4090 matches a $5/$25 per million token API on this benchmark. The gap between the best frontier model (Mythos, 77.8%) and the best local model (Qwen, 53.5%) is still 24 points. But the gap between mid-tier frontier and best-local is closing.
Two important caveats. First, SWE-bench Verified is contaminated: OpenAI flagged that every frontier model can reproduce verbatim gold patches for some tasks. Second, SWE-bench Pro has two tracks: the SEAL standardized scaffold (identical tooling for every model) and custom scaffold submissions. Scores from different tracks mix model and scaffold variation. The Qwen and Opus numbers above may come from different scaffold tracks, making the comparison approximate rather than controlled. The cleaner signal is SEAL-track scores, where the scaffold is held constant.
Where the gap is huge
The benchmark numbers flatten real differences. In a benchmark by kunalganglani.com, Qwen 2.5-Coder-32B scored 2.8/5.0 on multi-file context tasks while Claude Sonnet 4 scored 4.5/5.0, a 60% advantage for the frontier model. Single-file generation was close (4.1 vs 4.4). The gap is not in code generation. It is in codebase-scale reasoning.
Concrete failure modes when local models run in agent loops:
- Malformed tool calls. JSON syntax errors, wrong parameter names. One bad response derails a 20-step chain.
- Hallucinated tool outputs. The model generates a fake response instead of actually calling the tool.
- State loss after 3-4 steps. The model forgets what it already tried and repeats failed strategies.
- Context cliff. Quality degrades sharply well before the advertised context window limit, especially under quantization.
- Self-correction failure. Frontier models adapt their approach after a failed subtask. Local models tend to retry the same strategy.
These failures compound. In single-file, single-turn code generation, local models are within striking distance. In multi-step agent loops across a real codebase, the gap widens.
The 80/20 pattern
Real-world experience converges on a consistent pattern: local models handle 60-80% of daily coding tasks adequately. The remaining 20-40% (complex debugging, multi-file refactors, architecture decisions spanning 20+ files, long agent loops) still require frontier models. The productive setup is both: local for volume, API for the hard problems. “Every week spent evaluating whether to switch models is a week you could have spent building better tool orchestration” (DEV Community consensus).
What this means for orchestration builders
We’ve spent the past year building set-core, an orchestration framework that decomposes specs into parallel changes, dispatches agents to isolated git worktrees, runs 10+ quality gates per change, and merges serially through an integration queue. We’ve traced the anatomy of those runs, measured the cost, and compared model behavior within the pipeline.
What this research tells us is that the work we’re doing (context engineering, tool design, gate verification, error recovery) is the work that differentiates outcomes. Not the model choice.
Four observations from the data:
The scaffold is the differentiator. At the frontier, six models score within 0.8 points of each other on SWE-bench Verified. The scaffold swing is 22+ points. For orchestration framework builders, this is the clearest possible signal: the engineering value is in the harness, not in the model dropdown.
All systems converge on four tool primitives. Read, search, edit, execute. Every coding agent (Claude Code, Codex, Aider, SWE-agent, OpenHands, Goose) implements these four categories despite radically different architectures. Everything else is context management and error recovery. Anthropic put it directly: “More tools don’t always lead to better outcomes” (Writing Tools for Agents).
Some scaffold components are permanent, others are temporary. Tool orchestration, error recovery, and context management appear to be fundamental: they persist regardless of model improvement. Planning decomposition instructions, output format enforcement, and retry heuristics appear to be contingent: they will likely be absorbed into the model. Build the permanent parts. Accept that the temporary parts will be deleted.
The market is pricing scaffolding, not models. Cursor at $2B ARR, Claude Code at $2.5B ARR, Copilot at 20M users. These are scaffold products. The models behind them are increasingly interchangeable. The value is in the integration layer.
Caveats
SWE-bench numbers are moving targets. Leaderboard positions shift weekly. The structural patterns (frontier-local gap, scaffold impact) are more stable than the specific percentages.
The “1.6% AI logic” statistic is from a single paper analyzing one version of Claude Code. The ratio may differ in other systems or newer versions.
We have not tested local models in our own orchestration pipeline. The multi-file reasoning gap and tool-call reliability data come from third-party benchmarks and reports, not our production runs.
“90% written by Claude” is Boris Cherny’s characterization. We have no way to verify the percentage independently.
Inference infrastructure details (PagedAttention, speculative decoding, MoE routing) are known from published research. The specific implementations at Anthropic and OpenAI are proprietary and may differ from the academic versions.
Glossary
RLHF (Reinforcement Learning from Human Feedback). Post-training technique where human evaluators rank model outputs, creating preference data that trains a reward model. The reward model then guides further training via reinforcement learning. Frontier labs run this on weekly cadences with domain-specific annotators (production coders for code models). Anthropic RLHF paper. See also: RLAIF.
RLAIF (Reinforcement Learning from AI Feedback). Variant of RLHF where the model evaluates its own outputs against a written constitution rather than relying on human judges. Central to Anthropic’s Constitutional AI approach. Scales better than human annotation but trades off some signal quality.
Constitutional AI (CAI). Anthropic’s alignment method. The model generates responses, self-critiques against explicit principles (the “constitution”), and revises. The revised outputs become training data. The constitution draws from the UN Declaration of Human Rights, Anthropic’s own principles, and other sources. Paper.
GRPO (Group Relative Policy Optimization). DeepSeek’s alternative to RLHF that bypasses supervised fine-tuning entirely. Uses binary rewards from compilers and test suites (pass/fail) instead of human preference. Monte Carlo rollouts estimate advantages instead of a learned critic. Dramatically cheaper than RLHF. Paper.
SFT (Supervised Fine-Tuning). Training a pretrained model on curated input-output pairs (e.g., instruction-response examples) to teach it a specific output format or behavior. Typically 5,000-50,000 examples for tool-use training. The first step in most post-training pipelines before RLHF/DPO.
DPO (Direct Preference Optimization). Preference alignment technique that skips the reward model entirely, learning directly from preference pairs. “Stable, performant, and computationally lightweight, exceeding PPO-based RLHF” in many settings. Paper.
CoT (Chain-of-Thought). Inference technique where the model generates intermediate reasoning steps before the final answer. Provably expands the computational class of constant-depth transformers (ICLR 2024). In production systems, implemented as “extended thinking” (Anthropic) or “reasoning tokens” (OpenAI).
MoE (Mixture of Experts). Architecture where only a fraction of model parameters are active per token. A router network selects which “expert” subnetworks process each token. Allows larger total capacity without proportional compute cost. DeepSeek V4 Flash: 284B total, 13B active. Qwen3-Coder-Next: 80B total, 3B active.
PagedAttention. Memory management technique for KV cache in transformer inference (vLLM). Traditional approaches waste 60-80% of GPU memory due to fragmentation. PagedAttention stores KV cache in non-contiguous pages (like OS virtual memory), enabling 2-4x more concurrent requests.
Speculative Decoding. Inference acceleration where a small “draft” model generates candidate tokens and a large model verifies them in a single forward pass. Mathematically guaranteed to produce the same output distribution as the large model alone. 2-3x latency reduction in practice.
SWE-bench. Benchmark for software engineering agents. Tests whether an AI system can resolve real GitHub issues from open-source Python repositories. Verified (500 tasks, human-validated): widely used but contaminated. Pro (1,865 tasks, multi-language, decontaminated): has both a standardized scaffold track (SEAL) and custom scaffold submissions. swebench.com.
ReAct (Reasoning + Acting). Agent loop pattern where the model alternates between reasoning about what to do and taking actions (tool calls). The core of Claude Code, Codex, and most coding agents. Implemented as a while-loop: call model, execute tools, feed results back, repeat.
Scaffold / Harness. The deterministic infrastructure surrounding an LLM in an agentic system. Includes tool routing, context management, permission gates, error recovery, security checks. Claude Code’s scaffold is 512K lines of TypeScript (98.4% of the codebase). The OpenDev paper distinguishes scaffolding (pre-prompt assembly, relatively stable) from harness (runtime orchestration, more model-dependent).
Context Engineering. The practice of curating what goes into the LLM’s context window to maximize output quality. Successor to “prompt engineering.” Includes: project-specific instruction files (CLAUDE.md, AGENTS.md), just-in-time tool loading, conversation compaction, sub-agent summarization. Term popularized by Karpathy and Tobi Lutke (June 2025).
KV Cache. Key-value cache storing intermediate attention computations from previous tokens. Reusing the cache avoids recomputing attention over the full conversation history on each new token. Prompt caching (Anthropic) stores these tensors in GPU VRAM across API calls, giving a 90% input cost discount on cache hits.
LoRA (Low-Rank Adaptation). Fine-tuning technique that freezes the base model and trains small adapter matrices (typically 0.1-1% of total parameters). Enables domain-specific fine-tuning on a single GPU in hours. Standard approach for adapting open-weight models to proprietary codebases.
RAG (Retrieval-Augmented Generation). Architecture where the model retrieves relevant documents from an external knowledge base before generating a response. For code: retrieves relevant files, functions, or documentation to include in the prompt context. FSE 2025 data: RAG (53.8% exact match) outperforms fine-tuning (44.2%) on proprietary codebases.
Part 2 (coming tomorrow) follows the data to the demand side: why 93% of developers use AI tools but only 10% see productivity gains, why experienced developers measured 19% slower with AI assistance, and what separates the teams reporting 5-10x output from the teams reporting net-negative results.