Consumer and orchestration Claude: the two branches have separated


A write-up of why we flipped our orchestration framework’s default Claude model from claude-opus-4-7 back to claude-opus-4-6 on May 2, 2026. Framework: set-core. Author: setcode.dev.

Bottom line

Claude 4.7 is optimized for the consumer context. It infers a lot from a small input, fills in missing details, decomposes finely. These are virtues in single-shot use cases.

In multi-agent orchestration, the same traits become liabilities. On the same spec, the planner LLM cuts the work into 12 chunks instead of 5. A feature agent breaks the spec test of another already-merged change because it “decided” the build pipeline needed touching too. Output token spend ends up at 2.17× for the same merged output.

The 4.6 is not a more advanced model. It is more conservative. More predictable. Less prone to “guessing what you meant.” In production orchestration that is the desired behavior.

The new shape of the framework’s codebase (13 roles + 5-tier resolver + 40+ hardcoded-fallback purge) is the proof that this is not a default-flip problem. The consumer stack and the production-orchestration stack share the same API, but they no longer solve the same problem.

How we got here

The rollback wasn’t sudden. For days we’d been watching token spend climb across orchestration runs, and the symptom was always the same: the planner cutting work finer than necessary, individual agents wandering outside their declared scope, verify-gate retries spreading instead of concentrating. Each run on its own looked anecdotal, so we kept going.

The May 1-2 micro-web runs were the last straw. Two back-to-back runs on the same 198-line spec (one on 4.7 default, one on 4.6 default) gave us a clean before/after on a single workload. Same input, same merged output, two very different cost and stability profiles. That was the run pair that justified the rollback to ourselves; the rollback itself had been overdue.

A 198-line spec like this is not a whole product. It’s a single feature slice: one task within a larger orchestration. A production run chews through many such slices. What’s shown below is the cost on one of them; in real workloads it compounds across the portfolio. Putting two runs side by side here is not the whole story; it shows what one slice looks like, and you can multiply from there.

What follows is the data from those two reference runs, plus what we changed in the framework on top of the alias flip.

Reference data: 2 clean micro-web E2E runs, both on spec.md sha=ef1006c4 (198 lines, one task slice). One on 4.7 default (May 1, 18:05), one on 4.6 default (May 2, 11:04, post-rollback). Numbers below are backed by orchestration-state.json and journal sources; an explicit caveats section at the end records what the data does not support.

On May 2 at 11:04 AM (commit db82ebb0) we re-mapped the opus model alias from claude-opus-4-7 to claude-opus-4-6. Alongside the alias flip we purged hardcoded "opus" fallbacks from 40+ call sites and introduced a 13-leaf-role, 5-tier resolver chain (covered later in this article).

TL;DR

The 198-line spec below is one feature task within a larger orchestration pipeline, not a whole product. Run twice with different model defaults:

With the 4.7 default model (opus alias→4-7) the planner decomposed it into 12 chunks, and the full run cost 1.40 million output tokens.
With the 4.6 default the planner decomposed the same spec into 5 chunks, and the full run cost 642 thousand output tokens.

Same output. 2.17× more token spend on the 4.7 side.

The difference is not per-token efficiency (per-chunk median token count is actually higher on the 4.6 side, 136k vs 76k). The difference is the creativity of the planner LLM’s decision-making: the 4.7 cuts the work into finer pieces than necessary, producing 12 dispatch points, 12 verify-gate cycles, 12 merge surfaces, not 5.

Within that, 407,505 tokens (29.2% of the 4.7 run) went to changes nobody asked for: the 4.7 planner generated 4 standalone test/E2E change records on the side, while E2E testing is already the responsibility of the framework’s verify gates (and the spec itself prescribed the tests at feature level). The 4.6 planner did not do this once on the same spec.

The change-level “wandering” is tangible: one change in the 4.7 run (blog-list-with-filter) burned 292,086 tokens, and along the way broke the spec test of another already-merged change it had no business touching. The 4.6 run’s analogous, larger-scope change (blog-pages: blog list + detail combined) closed in 136,158 tokens, with no retries, on a direct planning → merging path.

Same 198-line spec, two model defaults: Opus 4.7 split it into 12 leaf changes (and 1.40M output tokens), Opus 4.6 into 5 (and 642k). The 4.7 result includes 4 standalone test/E2E change records the spec did not ask for.

The two reference runs

Run ID	Started	Default model	Outcome
`micro-web-run-20260501-1805`	2026-05-01 18:05	4.7 (opus→4-7)	12/12 merged
`micro-web-run-20260502-1104`	2026-05-02 11:04	4.6 (post-rollback)	5/5 merged, 0 paused

Both runs operated on the SHA-256-identical spec; both merged cleanly through the framework’s verify gates. The only variable is the default model.

A handful of other runs sat alongside these two on the same spec, in the same window: some cut short, some with manual intervention. The pattern there is consistent with what’s below, but those don’t isolate the model variable cleanly enough to be reference data, so they’re not used here.

Decomposition divergence: same spec, two different plans

The planner phase uses an LLM for spec → change decomposition (decompose_brief, decompose_domain, decompose_merge roles). On the same 198-line spec:

4.7 planner output (run 0501-1805): 12 leaf changes

#	Name	Complexity	Model
1	foundational-scaffold-and-shell	M	sonnet
2	test-infrastructure-setup	S	sonnet
3	site-header-and-mobile-nav	M	opus
4	home-page-and-newsletter-sheet	S	opus
5	about-page	S	opus
6	blog-list-with-filter	M	opus
7	contact-page-and-wizard	M	opus
8	blog-detail-with-progress-and-reactions	M	opus
9	command-palette	M	opus
10	validation-and-contact-wizard-tests	S	opus
11	pages-smoke-and-blog-filter-e2e	S	opus
12	cmdk-and-mobile-nav-e2e	S	opus

4.6 planner output (run 0502-1104): 5 leaf changes

#	Name	Complexity	Model
1	foundation-navigation	M	opus
2	home-about-pages	M	opus
3	blog-pages	M	opus
4	contact-wizard	M	opus
5	blog-reactions-and-acceptance	S	opus

Observation

The 4.7 planner created standalone changes just for the E2E specs (pages-smoke-and-blog-filter-e2e, cmdk-and-mobile-nav-e2e, validation-and-contact-wizard-tests): three changes for work that, on the 4.6 side, was just the “Ship E2E tests at e2e/...” section of each feature change. The 4.7 also fragmented navigation into three chunks (site-header-and-mobile-nav, command-palette, cmdk-and-mobile-nav-e2e) where the 4.6 used a single one, foundation-navigation.

This is not a wrong decision in the classical sense. In a single-agent vibe-coding context, finer chunking can even be an advantage (“smaller context window, more focused agent”). But in multi-agent orchestration every chunk boundary is a failure surface: dispatch, verify-gate sequence, merge conflict, sibling-spec drift. The 4.7 builds 12 such surfaces against the 4.6’s 5.

Two concrete wandering patterns

The spec decomposition came out “more creative” on 4.7 in two distinct ways.

1. Standalone test and E2E change records nobody asked for

The framework handles E2E and unit tests at the gate level: every change’s verify pipeline includes test, e2e, e2e_coverage, lint, build, design-fidelity, and scope_check gates. If a change ships feature code but forgets the tests, the gate flags it. There is no need for an LLM to create a separate change record for “testing tasks.”
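A minimal sketch of the gate-level idea described above, not the framework's actual implementation. The gate names are taken from the text. The `GateResult` type, the `run_verify_pipeline` helper, the "run every gate and collect failures" policy, and the file-path rule inside `e2e_coverage_gate` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Gate names as listed in the article; the ordering is an assumption.
GATE_ORDER = ["test", "e2e", "e2e_coverage", "lint", "build",
              "design-fidelity", "scope_check"]


@dataclass
class GateResult:
    gate: str
    passed: bool
    detail: str = ""


Gate = Callable[[dict], GateResult]


def run_verify_pipeline(change: dict, gates: dict[str, Gate]) -> list[GateResult]:
    """Run every gate in order and return the failures (empty list = mergeable)."""
    failures = []
    for name in GATE_ORDER:
        result = gates[name](change)
        if not result.passed:
            failures.append(result)
    return failures


def e2e_coverage_gate(change: dict) -> GateResult:
    # Hypothetical rule: a change that ships feature files must also ship e2e specs.
    has_feature = any(not path.startswith("e2e/") for path in change["files"])
    has_e2e = any(path.startswith("e2e/") for path in change["files"])
    ok = has_e2e or not has_feature
    return GateResult("e2e_coverage", ok, "" if ok else "feature code without e2e spec")


def always_pass(name: str) -> Gate:
    # Placeholder for gates this sketch does not model.
    return lambda change: GateResult(name, True)


gates = {name: always_pass(name) for name in GATE_ORDER}
gates["e2e_coverage"] = e2e_coverage_gate

missing_tests = {"name": "blog-pages", "files": ["app/blog/page.tsx"]}
with_tests = {"name": "blog-pages", "files": ["app/blog/page.tsx", "e2e/blog.spec.ts"]}

print([r.gate for r in run_verify_pipeline(missing_tests, gates)])
print(run_verify_pipeline(with_tests, gates))
```

The point of the sketch: a feature change that forgets its tests is caught inside its own verify pipeline, so the planner does not have to schedule testing as a separate orchestration unit.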

The 4.7 planner did this anyway. Of the 12 changes in the 0501-1805 run, 4 are test/infra-only:

Change record	Type	Tokens	Counterpart on 4.6 side?
`test-infrastructure-setup`	infrastructure	34,462	– (configured by feature changes)
`validation-and-contact-wizard-tests`	feature	115,559	– (part of `contact-wizard` feature)
`pages-smoke-and-blog-filter-e2e`	feature	59,268	– (part of `blog-pages`)
`cmdk-and-mobile-nav-e2e`	feature	198,216	– (part of `foundation-navigation`)
Total		407,505	0 standalone test changes

The full 4.7 run produced 1,394,032 output tokens, of which 407,505 (29.2%) went to changes the 4.6 planner did not create at all. The spec states REQ-TEST-001..006 requirements; the 4.6 folded these into the feature changes’ scope (“Ship E2E tests at e2e/…spec.ts”), which is exactly the form the framework’s testing conventions prescribe. The 4.7 instead manufactured separate orchestration units around them, without the spec asking for it.
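The 407,505 total and the 29.2% share can be re-derived from the table. A small check, using only the numbers quoted in this section (the variable names are ours):

```python
# Token counts copied from the table above (4.7 run, test/infra-only changes).
standalone_test_changes = {
    "test-infrastructure-setup": 34_462,
    "validation-and-contact-wizard-tests": 115_559,
    "pages-smoke-and-blog-filter-e2e": 59_268,
    "cmdk-and-mobile-nav-e2e": 198_216,
}

# Full-run output tokens reported for the 4.7 run.
RUN_OUTPUT_TOKENS_4_7 = 1_394_032

standalone_total = sum(standalone_test_changes.values())
share = standalone_total / RUN_OUTPUT_TOKENS_4_7

print(f"{standalone_total:,} tokens = {share:.1%} of the run")
```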

On the 4.6 side this kind of standalone test-change does not appear at all.

2. v0.app design drift (observed, not measured)

The micro-web stack is built against a v0-export/ reference (components generated in v0.app). The framework runs a design-fidelity gate between planner and verify-gate, which reports design-token mismatches as [WARNING] and critical contract drift as [CRITICAL].

In these runs the gate output was a placeholder (a 32-character fixed string), so the v0.app drift frequency cannot be quantified from this dataset. Anecdotally (from the contents of the worktrees, and from the fact that the design-pipeline.md and design-bridge.md rules address exactly this), the 4.7 was more prone to improvising component variants not present in the v0-export: extra Card layering, non-token-based spacing. This is a qualitative observation; concrete numbers will come from a future run with a substantive design-fidelity output.

Token spend, on the same spec

Output tokens summed across merged changes. Of the 4.7 run’s 1.40M, 407k (29.2%, in red) went to standalone test/E2E change records — work the 4.6 planner folded into the feature changes themselves. Same input, same merged output, 2.17× the spend.

The tokens_used field is the per-change cumulative token counter (input + output + cache + retries all inclusive). Computed over merged changes:

Metric	4.7 (1805)	4.6 (1104)	Ratio
Changes (all merged)	12	5	—
Sum output tokens	1,394,032	642,603	2.17×
Sum input tokens	205,363,639	115,059,327	1.78×
Sum cache-read tokens	205,360,830	115,057,537	1.78×
Sum cache-create tokens	11,448,986	5,143,953	2.23×
Cache-hit ratio	48.64%	48.91%	≈

Output tokens are the main billable line item on the Claude API. Here it’s a 2.17× difference for the same output. Cache-hit ratio is the same on both sides (~49%), so prompt-cache efficiency did not regress; the agents simply did more total work.
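For readers who want to recompute the headline figures, here is a short sketch over the table's totals. The cache-hit formula, cache-read divided by (input + cache-read + cache-create), is an assumption on our part; it is used here because it reproduces the percentages in the table. The `RunTotals` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTotals:
    output_tokens: int
    input_tokens: int
    cache_read_tokens: int
    cache_create_tokens: int

    def cache_hit_ratio(self) -> float:
        # Assumed definition; it matches the table's 48.64% / 48.91%.
        denominator = (self.input_tokens + self.cache_read_tokens
                       + self.cache_create_tokens)
        return self.cache_read_tokens / denominator


# Totals copied from the table above.
run_4_7 = RunTotals(1_394_032, 205_363_639, 205_360_830, 11_448_986)
run_4_6 = RunTotals(642_603, 115_059_327, 115_057_537, 5_143_953)

output_ratio = run_4_7.output_tokens / run_4_6.output_tokens

print(f"output-token ratio: {output_ratio:.2f}x")
print(f"cache-hit 4.7: {run_4_7.cache_hit_ratio():.2%}")
print(f"cache-hit 4.6: {run_4_6.cache_hit_ratio():.2%}")
```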

Per-change median

Metric	4.7 (1805)	4.6 (1104)
Median tokens/change	76,550	136,158
Mean tokens/change	116,403	128,879
Max tokens/change	292,086 (blog-list-with-filter)	185,226 (contact-wizard)
Mean duration / change	22.5 min	21.9 min

Per-chunk median is higher on the 4.6 side because it produces fewer, larger chunks. Per-spec total is higher on the 4.7 side. The two numbers together tell the story: the 4.7 makes many small chunks, the 4.6 makes few big ones, and “many small chunks × per-chunk overhead” is what makes the 4.7 run roughly 2× more expensive.

Stuck-loop, paused, verify-retry: run stability

Metric	4.7 (1805)	4.6 (1104)
`stuck_loop_count` (sum)	1	0
Paused/pending changes	0	0
Changes with verify-retry	3 / 12	1 / 5
Build-fix attempts	1	0

Both runs merged everything. The 4.7 1805 run got there with 1 stuck-loop along the way and verify-retries on 3 different changes, spread thin. The 4.6 1104 run took 4 verify-retries on a single change (foundation-navigation) and zero on the rest. Stubbornness concentrated on one change, but deterministic, is preferable to instability spread across many.

The concrete story: `blog-list-with-filter` (4.7) vs `blog-pages` (4.6)

The most expensive change in the 4.7 1805 run: blog-list-with-filter. It used 292,086 tokens, run time ~44 minutes (19:12 → 19:56). Journal file blog-list-with-filter.jsonl, 64 events.

4.7 step transitions

2026-05-01 19:12:18  → planning
2026-05-01 19:23:11  → fixing       ← stuck here
2026-05-01 19:56:07  → merging
2026-05-01 19:56:08  → archiving
2026-05-01 19:56:09  → done

The fixing step generated two retry_context messages. The first:

E2E gate failed with exit_code=1 but Playwright did not emit a failure list. This usually means the suite crashed before completing — check the worktree for stack traces, OOM kills, webServer startup errors…

The second, the critical one:

Integration smoke gate FAILED: 1 of 2 inherited sibling spec(s) failed. (…) Failing spec files: tests/e2e/foundational-scaffold-and-shell.spec.ts

That is: the blog-list-with-filter agent did something that broke the spec test of the foundational-scaffold-and-shell change, another change already merged into main. The test output: ERR_CONNECTION_REFUSED at http://localhost:4093/ and ENOENT: no such file or directory, mkdir '...wt-blog-list-with-filter/.next'. The agent had touched something in the build process / scaffold contract that it had no business touching.

This is the journal-line-level fingerprint of “scope wandering.”

4.6 equivalent

The 4.6 1104 run’s blog-pages change covers the blog list and the blog detail (on the 4.7 side these were 2 separate changes). It used 136,158 tokens, run time ~23 minutes (11:50 → 12:13). Journal: 32 events (half of the 4.7 side’s 64).

2026-05-02 11:50:16  → planning
2026-05-02 12:13:25  → merging
2026-05-02 12:13:26  → archiving
2026-05-02 12:13:27  → done

No fixing step. No retry_context. verify_retry_count=0. Straight from planning to merging.
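The two transition lists can be compared mechanically. The sketch below uses only the timestamps quoted above and assumes that a change stays in a step until the next transition is logged. The tuple layout and the `summarize` helper are illustrative and do not reflect the real journal schema.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Step transitions copied from the two journals quoted above.
BLOG_LIST_WITH_FILTER_4_7 = [
    ("2026-05-01 19:12:18", "planning"),
    ("2026-05-01 19:23:11", "fixing"),
    ("2026-05-01 19:56:07", "merging"),
    ("2026-05-01 19:56:08", "archiving"),
    ("2026-05-01 19:56:09", "done"),
]
BLOG_PAGES_4_6 = [
    ("2026-05-02 11:50:16", "planning"),
    ("2026-05-02 12:13:25", "merging"),
    ("2026-05-02 12:13:26", "archiving"),
    ("2026-05-02 12:13:27", "done"),
]


def summarize(transitions):
    """Return (total_minutes, seconds_in_step) for one change."""
    parsed = [(datetime.strptime(ts, FMT), step) for ts, step in transitions]
    seconds_in_step = {}
    # Pair each transition with the next one to get the time spent in that step.
    for (entered, step), (left, _next_step) in zip(parsed, parsed[1:]):
        elapsed = int((left - entered).total_seconds())
        seconds_in_step[step] = seconds_in_step.get(step, 0) + elapsed
    total_minutes = (parsed[-1][0] - parsed[0][0]).total_seconds() / 60
    return total_minutes, seconds_in_step


total_4_7, steps_4_7 = summarize(BLOG_LIST_WITH_FILTER_4_7)
total_4_6, steps_4_6 = summarize(BLOG_PAGES_4_6)

print(f"4.7: {total_4_7:.0f} min total, fixing = {steps_4_7.get('fixing', 0)} s")
print(f"4.6: {total_4_6:.0f} min total, fixing = {steps_4_6.get('fixing', 0)} s")
```

On these timestamps the 4.7 change spends roughly 33 of its ~44 minutes in the fixing step, while the 4.6 change never enters it.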

The parallel

	4.7 `blog-list-with-filter`	4.6 `blog-pages`
Scope	blog list + filter only	blog list + filter + blog detail
Tokens	292,086	136,158
Time	~44 min	~23 min
Journal events	64	32
`fixing` step	yes, twice	no
Broke a sibling spec	yes (foundational scaffold)	no

The 4.7 did the smaller scope, used 2.15× more tokens doing it, and broke another already-merged change in the process. This is the concrete shape of “too creative, wanders off.”

The split shows in the architecture too: three commits

The model rollback wasn’t just a default flip. It took three commits over ~2.5 hours to turn “model alias” from a single string into a per-role, 5-tier resolver chain. All on main, May 2, 2026.

1. `8cdcbd9f` — `feat(model-config): unified models block + opus-4-6 default + foundational→opus` (09:45)

lib/set_orch/config.py schema extended with a top-level models: directive block containing 13 leaf roles:

agent, agent_small, digest,
decompose_brief, decompose_domain, decompose_merge,
review, review_escalation,
spec_verify, spec_verify_escalation,
classifier, supervisor, canary

…and a 4-key trigger sub-dict (integration_failed, non_periodic_checkpoint, terminal_state, default).

New module: lib/set_orch/model_config.py, with a resolve_model(role, *, project_dir, cli_override) function that implements a 5-tier chain:

1. CLI override                         (--model … flag)
2. SET_ORCH_MODEL_<ROLE>                env var
3. orchestration.yaml → models.<role>   per-project config
4. profile.model_for(role)              per-stack plugin override
5. DIRECTIVE_DEFAULTS                   framework-level fallback

Trigger sub-roles are addressable via dotted paths: trigger.integration_failed → SET_ORCH_MODEL_TRIGGER_INTEGRATION_FAILED. Any layer with an unknown role or an invalid model name → ValueError, fail-loud.
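The chain above can be sketched roughly as follows. This is a simplified illustration, not the code in lib/set_orch/model_config.py. The function name `resolve_model_sketch`, the dict-based parameters standing in for orchestration.yaml, the profile plugin and DIRECTIVE_DEFAULTS, and the `VALID_MODELS` allow-list are assumptions. It also validates only the requested role and the winning value, whereas the text says every layer is validated.

```python
import os

LEAF_ROLES = {
    "agent", "agent_small", "digest",
    "decompose_brief", "decompose_domain", "decompose_merge",
    "review", "review_escalation",
    "spec_verify", "spec_verify_escalation",
    "classifier", "supervisor", "canary",
}
TRIGGER_ROLES = {
    f"trigger.{key}"
    for key in ("integration_failed", "non_periodic_checkpoint",
                "terminal_state", "default")
}
KNOWN_ROLES = LEAF_ROLES | TRIGGER_ROLES

# Hypothetical allow-list; the real set of accepted names is not shown in the article.
VALID_MODELS = {"opus", "sonnet", "claude-opus-4-6", "claude-opus-4-7"}


def env_var_for(role: str) -> str:
    """trigger.integration_failed -> SET_ORCH_MODEL_TRIGGER_INTEGRATION_FAILED"""
    return "SET_ORCH_MODEL_" + role.replace(".", "_").upper()


def resolve_model_sketch(role, *, cli_override=None, env=None,
                         project_models=None, profile_models=None,
                         defaults=None):
    """Return the first configured model for `role`, highest-priority tier first."""
    if role not in KNOWN_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    env = os.environ if env is None else env
    tiers = [
        ("cli", cli_override),
        ("env", env.get(env_var_for(role))),
        ("orchestration.yaml", (project_models or {}).get(role)),
        ("profile", (profile_models or {}).get(role)),
        ("defaults", (defaults or {}).get(role)),
    ]
    for source, value in tiers:
        if value is None:
            continue
        if value not in VALID_MODELS:
            # Fail loud instead of silently falling through to a lower tier.
            raise ValueError(f"invalid model {value!r} from {source} for role {role!r}")
        return value
    raise ValueError(f"no model configured for role {role!r}")


print(resolve_model_sketch(
    "review",
    env={},
    project_models={"review": "sonnet"},
    defaults={"review": "opus"},
))
```

In this sketch a per-project value beats the framework default, and a typo in any tier raises rather than quietly resolving to something else.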

Plus a PRESETS dict for the --model-profile shortcut: default, all-opus-4-6, all-opus-4-7, cost-optimized.

2. `db82ebb0` — `fix(model-config): purge hardcoded model fallbacks; opus alias → 4-6` (11:04)

The root-cause fix. An interim run between the two reference runs revealed that the alias flip was not actually taking effect: agents were still running on claude-opus-4-7, despite the new default. Two reasons:

_MODEL_MAP["opus"] = claude-opus-4-7 (the latest family alias). The “default” role’s new value was opus, which resolved back to 4-7.
40+ call sites had hardcoded `or "opus"` fallbacks: cli.py, builder.py, planner.py, investigator.py, category_resolver.py, dispatcher.sh, digest.sh, etc. These bypassed the new resolver.

The fix:

# subprocess_utils.py + bin/set-common.sh (kept in sync)
_MODEL_MAP["opus"] = "claude-opus-4-6"   # was: claude-opus-4-7

…and every hardcoded fallback rewritten to resolve_model("<role>").
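To make the two failure modes concrete, here is a hedged sketch. The dictionaries and helper names (`expand_alias`, `pick_model_legacy`, `pick_model_via_resolver`) are hypothetical stand-ins. Only the alias values and the general idea of a literal fallback at the call site come from the text.

```python
# Alias tables before and after commit db82ebb0 (values from the article).
MODEL_MAP_BEFORE = {"opus": "claude-opus-4-7"}
MODEL_MAP_AFTER = {"opus": "claude-opus-4-6"}


def expand_alias(name, model_map):
    """Map a short alias to a full model ID; full IDs pass through unchanged."""
    return model_map.get(name, name)


def pick_model_legacy(configured):
    # The purged pattern: a literal fallback that never consults the resolver.
    return configured or "opus"


def pick_model_via_resolver(configured, role, resolve):
    # The replacement pattern: fall back to the per-role resolver instead.
    return configured or resolve(role)


def fake_resolver(role):
    # Hypothetical resolver result: this project runs `digest` on sonnet.
    return {"digest": "sonnet"}.get(role, "opus")


# Failure mode 1: the new default "opus" still expanded to 4-7 under the old map.
print(expand_alias("opus", MODEL_MAP_BEFORE))
print(expand_alias("opus", MODEL_MAP_AFTER))

# Failure mode 2: the literal fallback ignores per-role configuration.
print(pick_model_legacy(None))
print(pick_model_via_resolver(None, "digest", fake_resolver))
```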

3. `fd9583be` — `fix(model-config): legacy directive defaults shadowed unified models block` (12:10)

Another bug. DIRECTIVE_DEFAULTS["review_model"] = "opus", a legacy default, leaked through into project state (state.directives), and verifier.py:_execute_review_gate read that before the result of resolve_model("review"). The verify-review gate was running on opus the whole time, even though we’d configured it to sonnet. Surfaced via the Tokens UI panel showing the review-call actually ran on opus.

Plus: the LLM_CALL event used to log with the alias ("opus"), now it logs with the resolved full ID (claude-opus-4-6). This is a self-observability fix; until now the UI was lying about which model ran.
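The shadowing bug and the logging fix can be illustrated with a small sketch. This is a reconstruction of the kind of precedence error described, not the code in verifier.py. The function names, the event fields, and the stub resolver are hypothetical.

```python
ALIASES = {"opus": "claude-opus-4-6"}  # post-rollback alias value from the article


def resolve_review(role):
    # Hypothetical resolver result: the project configured `review` as sonnet.
    return "sonnet"


def review_model_buggy(state_directives, resolve):
    # The legacy key is consulted first, so a leaked default shadows the resolver.
    return state_directives.get("review_model") or resolve("review")


def review_model_fixed(resolve):
    # The resolver result is the only source of truth.
    return resolve("review")


def llm_call_event(role, model_name, aliases):
    """Build an LLM_CALL event that records the resolved ID, not the alias."""
    return {"event": "LLM_CALL", "role": role,
            "model": aliases.get(model_name, model_name)}


leaked_state = {"review_model": "opus"}  # legacy default that leaked into state

print(review_model_buggy(leaked_state, resolve_review))
print(review_model_fixed(resolve_review))
print(llm_call_event("agent", "opus", ALIASES))
```

Logging the resolved ID is what makes this class of bug visible: with only the alias in the event, the panel cannot show which concrete model version a call used.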

Significance of the three commits

In a consumer-model single-shot context, “model name” = string. In an orchestration-model context, model name = 13 roles × 5-tier resolution priority × explicit fail-loud invalid-input handling. The two branches no longer solve the same problem, and the shape of the codebase is what shows it.

What the data does not say (caveats)

An honest list of what shouldn’t be inferred from these numbers:

There is no 1-to-1 identical change pair. The planner decomposed differently on the two models, so there is no single change for which you can say “this took X tokens on 4.7, Y on 4.6.” The comparison is clean at the spec level (same input → same expected output), but not at the level of an individual change.
One spec, two reference runs. This is one task type (Next.js + shadcn/ui + tailwind + Playwright micro-site). On a different stack (e.g., Python backend, mobile, fintech) the ratios may be different. Don’t extrapolate the numbers here.
The ITERATION_END event’s reason field was unfilled on these runs (all ?). So there is no structured answer to “what caused the iteration”; only the story reconstructed from journal step transitions and retry_context messages.
Code quality is not measured. The outputs (Next.js apps) merged on both sides through the verify gates, meaning lint + build + test + design-fidelity all passed. Whether the 4.7 or 4.6 produces better-quality code is not what these numbers say. They say only how many tokens and iterations it took to push it through the same gates.
The “too creative” interpretation for blog-list-with-filter (breaking a sibling spec) is concrete, journal-recorded evidence. But it does not logically follow that every 4.7 change wanders the same way. One sample, hard evidence; population-level claim, no. The 12-vs-5 decomposition and the 2.17× output-token total, on the other hand, are the full-run-level sample, and that’s the stronger claim.
Cost estimates are deliberately omitted. The exact output-token / input-token / cache-read pricing formula is date- and version-specific; we don’t want to misinform the reader with wrong dollar figures. The token ratio alone is enough.
v0.app design drift is unquantified. The design-fidelity gate’s output in these runs was a placeholder. The “4.7 is more prone to off-export component variants” observation is valid only at a qualitative level.

Reproducibility

Every number in this article comes from orchestration-state.json and journal files emitted by the two reference end-to-end runs. The methodology (same-spec sha + alias-only flip + per-change token accounting from journals) is reproducible on any orchestration platform that emits structured run telemetry. Spec sha-256 across both runs: ef1006c4b448… (verified identical).
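One way to verify the "same spec" precondition before comparing two runs is to hash both spec files. A minimal sketch using Python's standard hashlib; the helper names are ours, and the file paths are whatever your runs use.

```python
import hashlib


def sha256_of(path):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_spec(path_a, path_b):
    """True when both runs were fed byte-identical spec files."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If `same_spec` returns False, the two runs are not comparable under this methodology, because the model default would no longer be the only variable.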