From spec to merged: anatomy of one SET orchestration run


A walk through one orchestration run in set-core — what a spec turns into between “I run the planner” and “the change is merged into main”. Author: setcode.dev.

Bottom line

An orchestration run is five named stages: decompose, dispatch, verify, merge, archive. Every transition is a typed JSON event in a journal you can replay. The planner runs as one LLM call by default; very large specs route to a three-phase pipeline. Each change runs in its own git worktree. Gates produce structured pass/fail outputs that the verify-retry loop and the merge queue both consume. The merge queue is serial on purpose.

The shape of the run is what makes parallel agents survivable. The shape is also what this article describes — not a particular model, not a particular project type.

TL;DR

One spec → planner → N change records. For typical specs the planner is one LLM call (decompose_brief); very large specs route to a three-phase pipeline (decompose_brief → decompose_domain per domain → decompose_merge).
Each change gets its own git worktree (git worktree add …). N agents can run at the same time, bounded by max_parallel (default 3).
The universal gate stack for a feature change is build, test, scope_check, test_files, review, spec_verify, rules. Project profiles register extras — for the web profile: e2e, lint, design-fidelity, i18n_check, required-components. Most gates are exit-code based; review and spec_verify are graded by a reviewer LLM.
Merges are serial. The queue integrates fresh main into the change branch and runs the integration gate stack before each fast-forward.
Every state transition is a JSONL event (DISPATCH, LLM_CALL, VERIFY_GATE, STATE_CHANGE, CHANGE_INTEGRATION_FAILED, …). The event log is the source of truth for what happened.

The shape of one orchestration run. Decompose is one LLM call by default (decompose_brief); very large specs route to a three-phase pipeline (1 + N + 1 calls, dashed sub-boxes). Dispatch fans the changes out into N git worktrees, each with its own agent and verify gate stack. Verified changes converge into a serial merge queue with an integration gate against fresh main. Archive syncs the change’s delta specs into the project’s main OpenSpec tree.

The five stages

spec.md ─► decompose ─► dispatch ─► verify ─► merge ─► archive
            (planner)   (worktree)  (gates)  (queue)  (openspec)

Each stage has one job and a clear handoff. The interesting choice is that the handoff is always state plus event, not a function return value. A change moves from one stage to the next by changing its status in orchestration-state.json and emitting a typed event into the JSONL log. Anything watching the run — the dashboard, the supervisor, a re-run after a crash — reads the same two surfaces.

This sounds like an implementation detail, but it is the core constraint: the run can be paused, resumed, replayed, and inspected from outside, because nothing important lives only inside a Python function call.
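A minimal sketch of the state-plus-event handoff may make this concrete. The article does not show the real schemas, so the `transition` helper, the `changes` map, and the `type` / `ts` / `change` / `data` event fields are all hypothetical; only the two file roles (a JSON state file and a JSONL event log) come from the text.

```python
import json
import os
import tempfile
import time
from pathlib import Path


def transition(state_path: Path, log_path: Path, change: str, new_status: str) -> dict:
    """Move one change to a new status and record the move as a typed event.

    Hypothetical helper: field names are illustrative, not set-core's schema.
    """
    if state_path.exists():
        state = json.loads(state_path.read_text())
    else:
        state = {"changes": {}}
    old_status = state["changes"].get(change, {}).get("status", "pending")
    state["changes"].setdefault(change, {})["status"] = new_status

    # Write the new state to a temp file and rename it into place, so a crash
    # mid-write never leaves a half-written state file behind.
    fd, tmp_name = tempfile.mkstemp(dir=state_path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp, indent=2)
    os.replace(tmp_name, state_path)

    # Append exactly one JSON object per line to the event log.
    event = {
        "type": "STATE_CHANGE",
        "ts": time.time(),
        "change": change,
        "data": {"from": old_status, "to": new_status},
    }
    with log_path.open("a") as log:
        log.write(json.dumps(event) + "\n")
    return event
```

Because both surfaces are plain files, an outside observer needs no handle on the orchestrator process: it reads the state file for "where is everything now" and the log for "how did it get there".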

Stage 1 — Decompose: one call, or three

The planner is in lib/set_orch/planner.py. It reads a spec (plus the project’s existing OpenSpec specs, conventions file, and any in-flight changes) and produces a structured plan: an ordered list of changes, with dependencies and roles.

For a typical spec the planner does this in one LLM call (decompose_brief). The brief alone is the plan: domain priorities, resource ownership, cross-cutting work, a phasing strategy, and the leaf change list. That output is what the rest of the run consumes.

For specs whose estimated input would blow past a token threshold, the planner switches to a three-phase pipeline:

decompose_brief — same as above, one call. Output: the JSON brief.
decompose_domain — input: one domain’s summary, its requirements, the brief from Phase 1, and the test plan. Output: that domain’s list of changes. One call per domain.
decompose_merge — input: all domain plans concatenated, plus the brief and the dependency map. Output: a single unified plan with a topological order, deduplicated cross-cutting work, and final change names. One call.

Total LLM calls in the three-phase mode: 1 + N + 1, where N is the number of domains. The routing decision is in _resolve_planner_strategy — serial and parallel are the two endpoints, auto (the default) chooses based on a token estimate against planner.single_call_max_input_tokens.
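The auto routing and the call count can be sketched in a few lines. This is an illustration under stated assumptions, not the body of `_resolve_planner_strategy`: the function names, the `"single_call"` / `"three_phase"` labels, and the threshold value used in the example are hypothetical, and the explicit `serial` and `parallel` settings are left out because the article does not say how they map onto the two modes.

```python
def choose_mode(estimated_input_tokens: int, single_call_max_input_tokens: int) -> str:
    """Pick a planner mode from a token estimate (the 'auto' behaviour, sketched)."""
    if estimated_input_tokens <= single_call_max_input_tokens:
        return "single_call"
    return "three_phase"


def planned_llm_calls(mode: str, n_domains: int) -> int:
    """Number of planner LLM calls: 1 for a single call, 1 + N + 1 for three phases."""
    if mode == "single_call":
        return 1
    # decompose_brief (1) + decompose_domain per domain (N) + decompose_merge (1)
    return 1 + n_domains + 1


# Example with a made-up threshold of 100_000 tokens and 4 domains.
mode = choose_mode(estimated_input_tokens=250_000, single_call_max_input_tokens=100_000)
calls = planned_llm_calls(mode, n_domains=4)
```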

Why two modes. A single call is cheaper, more cache-friendly, and avoids the orchestration overhead of the multi-phase fan-out. It works as long as the spec fits comfortably into one prompt with all its context. Past that size, the model starts to struggle with global consistency — picking a phasing strategy late, dropping cross-cutting concerns, returning a poorly ordered plan. The three-phase pipeline is a deliberate trade: pay for more LLM round-trips to keep the per-call inputs small enough that the model can reliably commit to the structure.

The output of the planner — whichever mode — lands in orchestration-plan.json with a list of named changes. The directory under openspec/changes/<name>/ is materialized later, by the agent itself, during dispatch.

Stage 2 — Dispatch: a worktree per change

Once the plan exists, the dispatcher (lib/set_orch/dispatcher.py) walks it. For each change whose dependencies are met:

Lock the change record. Inside an atomic state lock, the dispatcher flips the change’s status from pending to dispatched. Without the lock, two dispatcher polls could pick the same change up.
Create a worktree. git worktree add <path> <branch> gives the change a separate working tree on a separate branch. Same .git directory, different files on disk.
Launch an agent inside it. The dispatcher starts a set-loop (Ralph) process in the worktree — a retry loop wrapping the Claude CLI. The change scope, the OpenSpec roadmap item, and (for projects that have one) a design context are passed in as the iteration’s inputs.
Emit DISPATCH and AGENT_SESSION_DECISION. The journal records the change scope and the session decision (fresh start vs. re-attach to an existing session ID).

Worktrees are the isolation primitive. Ten agents writing to ten worktrees do not collide on the filesystem and do not touch each other’s branches. The .git directory is shared, but git itself is the synchronisation point — a git fetch from one worktree is visible in all of them. When an agent finishes, its worktree gets torn down (or kept around for re-dispatch on a verify retry, which is cheaper than starting from scratch).
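For illustration, here is one way the worktree step could be driven from Python. The helper names, the `create_branch` flag, and the `base` default are hypothetical; the git invocations themselves (`git worktree add <path> <branch>` and `git worktree add -b <new-branch> <path> <commit-ish>`) are standard git.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, worktree_path: Path, branch: str,
                         create_branch: bool = False, base: str = "main") -> list:
    """Build the git command that gives one change its own working tree."""
    cmd = ["git", "-C", str(repo), "worktree", "add"]
    if create_branch:
        # Create the branch from `base` and check it out in the new worktree.
        cmd += ["-b", branch, str(worktree_path), base]
    else:
        # Check out an existing branch in the new worktree.
        cmd += [str(worktree_path), branch]
    return cmd


def create_worktree(repo: Path, worktree_path: Path, branch: str,
                    create_branch: bool = False) -> None:
    """Run the command; raises CalledProcessError if git refuses."""
    cmd = worktree_add_command(repo, worktree_path, branch, create_branch)
    subprocess.run(cmd, check=True)
```

Git refuses to check out the same branch in two worktrees at once, which is one more reason the one-branch-per-change rule keeps agents apart.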

Parallelism is bounded by one knob: max_parallel. The dispatcher counts in-flight changes and stops launching new ones when the cap is hit. Topological order is enforced separately — a change does not get dispatched until everything in depends_on is in a terminal state.
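The two rules (a parallelism cap and a dependency check) compose into a small selection function. The sketch below is not `dispatch_ready_changes`; the status sets are assumptions, in particular the `verifying` status name and the choice to treat only `merged` and `done` as satisfying a dependency.

```python
IN_FLIGHT = {"dispatched", "verifying"}  # assumed names for "currently occupying a slot"
SATISFIED = {"merged", "done"}           # assumed: statuses that satisfy depends_on


def ready_changes(changes: dict, max_parallel: int = 3) -> list:
    """Return the names of pending changes that may be dispatched right now."""
    in_flight = sum(1 for c in changes.values() if c["status"] in IN_FLIGHT)
    free_slots = max(0, max_parallel - in_flight)
    ready = [
        name
        for name, c in changes.items()
        if c["status"] == "pending"
        and all(changes[dep]["status"] in SATISFIED for dep in c.get("depends_on", []))
    ]
    # The cap limits how many are launched; plan order decides which ones.
    return ready[:free_slots]


plan = {
    "auth": {"status": "merged"},
    "blog-list": {"status": "dispatched", "depends_on": ["auth"]},
    "blog-detail": {"status": "pending", "depends_on": ["blog-list"]},
    "footer": {"status": "pending"},
    "search": {"status": "pending", "depends_on": ["auth"]},
    "theme": {"status": "pending"},
}
launch_now = ready_changes(plan, max_parallel=3)
```

With one change already in flight and a cap of 3, two slots are free, so `launch_now` holds `footer` and `search`; `blog-detail` waits for its dependency and `theme` waits for a slot.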

There is no “pool of workers.” There are N agents, N worktrees, N branches. When an agent crashes, only its own branch and worktree are left in an inconsistent state; the other changes are unaffected.

Stage 3 — Verify: gates produce pass/fail, not opinions

After the agent reports done, the change goes into verify (lib/set_orch/verifier.py, lib/set_orch/gate_profiles.py). A gate stack runs against the change’s worktree.

The universal stack, by change type, is in gate_profiles.py:UNIVERSAL_DEFAULTS. For a feature change it is:

Gate	Mode	What it checks
`build`	run	The project builds (compile / bundler exit code).
`test`	run	Unit tests pass.
`scope_check`	run	The change branch contains implementation code — i.e., the diff against merge-base is more than just OpenSpec proposal / task artifacts.
`test_files`	run	The change ships test files commensurate with the requirements it covers.
`review`	run	A reviewer LLM reads the diff against the spec and signs off, with a structured rubric.
`spec_verify`	run	The implementation actually satisfies each requirement in the change’s delta spec.
`rules`	warn	Project rules (the `.claude/rules/` set) are respected.

Each gate’s mode is one of run (blocking on failure), warn / soft (non-blocking), or skip. Profile plugins register additional gates. The web project profile, for example, adds e2e (Playwright suite), lint (linter), design-fidelity (Playwright + pixel diff against a v0.app reference), i18n_check, and required-components (presence of expected shadcn primitives). Profiles for other project types (voice agents, mobile, backend) register their own.

Most gates are deterministic — they call out to a shell command and read the exit code. Two are LLM-graded: review and spec_verify. The distinction matters: a deterministic gate’s “pass” means a process produced the expected exit. An LLM-graded gate’s “pass” means a separate model said yes against a structured prompt. Both produce the same pass/fail/warn output that the next stage can act on.
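A deterministic gate reduces to "run a command, map the exit code to a result". The sketch below assumes a `GateResult` shape and a `run_command_gate` helper, both hypothetical, and assumes that a non-blocking mode downgrades a failure to `warn`; the real logic lives in verifier.py and gate_profiles.py and may differ.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass
class GateResult:
    name: str
    result: str       # "pass", "fail", "warn", or "skip"
    detail: str = ""  # tail of the command output, kept for retry_context


def run_command_gate(name: str, command: list, mode: str = "run",
                     cwd: Optional[str] = None) -> GateResult:
    """Run one exit-code gate inside a worktree and return a structured result."""
    if mode == "skip":
        return GateResult(name, "skip")
    proc = subprocess.run(command, cwd=cwd, capture_output=True, text=True)
    if proc.returncode == 0:
        return GateResult(name, "pass")
    # Assumption: warn/soft gates report the failure without blocking the change.
    result = "warn" if mode in ("warn", "soft") else "fail"
    return GateResult(name, result, detail=(proc.stderr or proc.stdout)[-2000:])


# Stand-in commands: a real gate would call the project's build or test runner.
ok = run_command_gate("build", [sys.executable, "-c", "raise SystemExit(0)"])
broken = run_command_gate("test", [sys.executable, "-c", "raise SystemExit(1)"])
```

An LLM-graded gate would return the same `GateResult` shape; only the way the verdict is produced differs.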

When a gate fails, the verifier emits a VERIFY_GATE event with the failure detail and builds a retry_context payload — the structured artefact the agent needs to fix the failure: a stack trace, a failing requirement ID, a missing test file, a pixel diff. The agent re-runs with that context. Retries are bounded by a per-change retry budget; when the budget is exhausted, the change transitions to one of the terminal failed states (failed:retry_budget_exhausted and friends) and stops blocking the queue.

The point of structured gate output is that the agent does not have to introspect why it’s stuck. The gate output is the failure description, in machine-readable form. The next iteration’s prompt has the retry_context baked in.
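The verify-retry loop described above can be sketched as a bounded loop over two callbacks. The function name, the callback signatures, the `verified` status, and the `verify_retry` key placement are assumptions made for illustration; `retry_context` and `failed:retry_budget_exhausted` are the article's own terms.

```python
from typing import Callable, Optional


def verify_with_retries(run_gates: Callable[[], Optional[dict]],
                        run_agent: Callable[[dict], None],
                        retry_budget: int) -> str:
    """Run the gate stack, feeding each failure back to the agent until the budget runs out.

    run_gates returns None when every blocking gate passed, otherwise a
    retry_context dict describing the first failure.
    """
    retries = 0
    while True:
        retry_context = run_gates()
        if retry_context is None:
            return "verified"
        if retries >= retry_budget:
            return "failed:retry_budget_exhausted"
        retries += 1
        # The gate output becomes part of the agent's next prompt.
        run_agent({**retry_context, "verify_retry": retries})


# Simulated run: design-fidelity fails once, then the stack passes.
outcomes = iter([{"gate": "design-fidelity", "detail": "token-mismatch"}, None])
prompts = []
final_status = verify_with_retries(lambda: next(outcomes), prompts.append, retry_budget=2)
```

In this simulated run the agent is re-invoked once with the gate's failure as input, and the change ends up `verified`.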

Stage 4 — Merge: serial, with an integration gate

Verified changes go into the merge queue. The merger (lib/set_orch/merger.py, execute_merge_queue) drains it one change at a time. Serially. On purpose.

Per change:

Dependency check. Every entry in depends_on must be in a terminal merged state. If not, the change is parked as dep-blocked and the queue moves on.
Integrate fresh main. The merger updates the change’s branch with the current main. This is where two merges that were independently green can collide. If the integration fails (build break, test regression caused by interaction with a sibling change merged a minute ago), the change is marked integration-failed and goes back to verify with the integration’s diagnostic as retry_context.
Integration gate stack. The merged-state branch runs build, test, and (where the profile registers it) e2e. The web profile’s integration e2e step has an optional smoke sub-phase that runs first against inherited sibling specs, giving a fast fail signal if a previous merge already broke the suite. If integration passes, the merger continues; if not, the change takes the same path as a failed integration in the previous step: it is marked integration-failed and goes back to verify with the diagnostic as retry_context.
Fast-forward into main. git merge --ff-only. No merge commits. By the time the merge runs, the branch already contains the latest main, so a fast-forward is always possible — and the absence of merge commits keeps the history of main linear.
Emit STATE_CHANGE with status merged.

Why serial. Two parallel merges that both pass their integration gate against the same main can still produce a broken main after both land. Serialising the queue means each merge sees the previous one’s actual result on main before its integration gate runs. The cost is throughput at the merge step. The benefit is that main is, by construction, a state every gate has signed off on against the actual neighbours.

The work that takes time — the agent’s implementation, the verify stack — has already happened in parallel. The merge step is what the queue serialises, and that step is fast.
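The per-change steps can be condensed into a sketch of a serial drain loop. This is not `execute_merge_queue`: the function name, the three callbacks, and the in-memory `changes` map are hypothetical stand-ins for the git and gate work; the status names `dep-blocked`, `integration-failed`, and `merged` come from the text.

```python
from collections import deque
from typing import Callable


def drain_merge_queue(queue: list, changes: dict,
                      integrate_main: Callable[[str], bool],
                      run_integration_gates: Callable[[str], bool],
                      fast_forward: Callable[[str], None]) -> dict:
    """Process verified changes one at a time; return the outcome per change."""
    outcomes = {}
    pending = deque(queue)
    while pending:
        name = pending.popleft()
        deps = changes[name].get("depends_on", [])
        if any(changes[d]["status"] not in ("merged", "done") for d in deps):
            outcomes[name] = "dep-blocked"  # parked; the queue moves on
            continue
        # Each change integrates the main that the previous merge just produced.
        if not integrate_main(name) or not run_integration_gates(name):
            changes[name]["status"] = "integration-failed"
            outcomes[name] = "integration-failed"
            continue
        fast_forward(name)  # in a real run: git merge --ff-only <branch>
        changes[name]["status"] = "merged"
        outcomes[name] = "merged"
    return outcomes


state = {
    "a": {"status": "verified"},
    "b": {"status": "verified", "depends_on": ["c"]},
    "c": {"status": "pending"},
    "d": {"status": "verified"},
}
landed = []
result = drain_merge_queue(
    ["a", "b", "d"], state,
    integrate_main=lambda name: True,
    run_integration_gates=lambda name: name != "d",  # pretend d breaks the build
    fast_forward=landed.append,
)
```

The loop has no concurrency at all, which is the point: the only state a merge can observe is the one left by the merge before it.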

Stage 5 — Archive: the spec is updated, not just the code

A merged change is not done until OpenSpec is updated.

archive_change (in merger.py) does three things, the first two delegated to the openspec archive <name> CLI:

Move openspec/changes/<name>/ to a timestamped archive/ subdirectory.
Sync the change’s delta specs into the project’s main openspec/specs/ tree — so the next planner run sees the new state of the world.
Stage and commit the spec update.

Until the archive step runs, a freshly merged change’s delta exists in openspec/changes/, and the next planner run would see it as still in flight. After archive, the change is part of the spec base, and the planner treats its requirements as known.

This is the loop closure. The next spec the planner reads is the one this change just helped write.

What the journal actually looks like

A reconstructed journal for one change, from dispatch to done. Fifteen typed events: state transitions (blue), LLM calls (yellow), gates passing (green) and failing (red), the gate registry expanding when the change touched UI code (purple). The design-fidelity gate failed at 12:02:11; the verify-retry loop fed the gate output back to the agent as retry_context, and the next iteration passed at 12:06:33. Every line is a JSON object in the per-change journal file.

The event log is one file: orchestration-events.jsonl, one JSON object per line. Per-change journals live at <state-dir>/journals/<change_name>.jsonl and contain the same events filtered to one change.
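Since a per-change journal is the main log filtered to one change, the relationship can be shown with a short filter. The `change` field name is an assumption (the article does not give the event schema), and `events_for_change` is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def events_for_change(lines: Iterable[str], change_name: str) -> Iterator[dict]:
    """Yield the events that belong to one change, in log order."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        if event.get("change") == change_name:
            yield event


sample_log = [
    '{"type": "DISPATCH", "change": "add-blog-list"}',
    '{"type": "DISPATCH", "change": "add-footer"}',
    '{"type": "VERIFY_GATE", "change": "add-blog-list", "data": {"name": "build", "result": "pass"}}',
    '{"type": "MONITOR_HEARTBEAT"}',
]
journal = list(events_for_change(sample_log, "add-blog-list"))
```

Passing an open file object instead of a list works the same way, because the function only iterates over lines.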

A working subset of the event types you will see:

DISPATCH — a change went into a worktree.
AGENT_SESSION_DECISION — fresh session, or re-attach to an existing session ID.
LLM_CALL — every model call, with role (decompose_brief, agent, review, spec_verify, …), input/output token counts, cache reads, and the resolved model ID.
VERIFY_GATE — one gate ran, with name and result.
GATE_SET_EXPANDED — the gate registry added gates based on the change’s content (e.g. it touched UI code, so design-fidelity was attached).
STATE_CHANGE — a change moved from one status to another.
CHANGE_INTEGRATION_FAILED — the merge queue’s integration gate failed for a change.
MONITOR_HEARTBEAT / WATCHDOG_HEARTBEAT — supervisor liveness pings.

A stripped-down trace for one change, in event order:

STATE_CHANGE          name=add-blog-list   status: pending → dispatched
DISPATCH              data={scope: ...}
AGENT_SESSION_DECISION  decision=fresh
LLM_CALL              role=agent           tokens_out=14820
LLM_CALL              role=agent           tokens_out=22104
GATE_SET_EXPANDED     added=[e2e, lint, design-fidelity]
VERIFY_GATE           name=build           result=pass
VERIFY_GATE           name=test            result=pass
VERIFY_GATE           name=design-fidelity result=fail   retry_context=token-mismatch
LLM_CALL              role=agent           tokens_out=8740   verify_retry=1
VERIFY_GATE           name=design-fidelity result=pass
VERIFY_GATE           name=review          result=pass
VERIFY_GATE           name=spec_verify     result=pass
STATE_CHANGE          status: dispatched → merged
STATE_CHANGE          status: merged → done

Each line is timestamped. The whole trace is replayable, diffable, and survives crashes — because nothing about it lives only in process memory.
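"Replayable" has a simple operational meaning: folding the log from the top reproduces the current status of every change. A minimal sketch, assuming hypothetical `type`, `change`, and `data.to` fields on STATE_CHANGE events:

```python
import json
from typing import Iterable


def replay_status(lines: Iterable[str]) -> dict:
    """Rebuild the latest status of each change from STATE_CHANGE events alone."""
    status = {}
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") == "STATE_CHANGE":
            # Later events overwrite earlier ones, so the last transition wins.
            status[event["change"]] = event["data"]["to"]
    return status


trace = [
    '{"type": "STATE_CHANGE", "change": "add-blog-list", "data": {"from": "pending", "to": "dispatched"}}',
    '{"type": "VERIFY_GATE", "change": "add-blog-list", "data": {"name": "build", "result": "pass"}}',
    '{"type": "STATE_CHANGE", "change": "add-blog-list", "data": {"from": "dispatched", "to": "merged"}}',
    '{"type": "STATE_CHANGE", "change": "add-blog-list", "data": {"from": "merged", "to": "done"}}',
]
current = replay_status(trace)
```

Replaying only a prefix of the log gives the status as of that point, which is what makes the trace useful for post-mortems as well as recovery.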

What this shape buys you

Three concrete things, in order of importance:

Crash recovery is cheap. State lives in orchestration-state.json. Per-change progress lives in the per-change journal. Worktrees live on disk. If the orchestrator process dies, restarting it picks up exactly where it stopped; no in-flight change forgets which step it was on. The supervisor process can also tell, from the same state, which agents it needs to restart.

Parallelism is bounded by isolation, not by coordination. Ten worktrees on ten branches do not need to coordinate writes; they need to merge in order. The hard part — making sure two parallel changes do not produce a broken main — is concentrated in one place (the merge queue’s integration gate), not smeared across the whole pipeline.

Failure has a shape, not a story. Every gate produces a structured pass/fail with the artefacts the agent needs to act on: a stack trace, a failing requirement ID, a missing test file, a pixel diff. Those artefacts get handed back as retry_context on the next iteration. The agent does not narrate its own failures — the gate does, in machine-readable form, and the verify-retry loop is what turns that into the next prompt.

What this article deliberately does not specify

The exact threshold between single-call and three-phase planner. The two modes are stable; the threshold (planner.single_call_max_input_tokens) and the auto-routing heuristic are tunable, project-specific, and still being adjusted based on real production runs.
Specific anomaly rules in the supervisor. The supervisor’s anomaly-detection rules are still being hardened from real production runs. The supervisor’s role (watch, restart, replay) is stable; the rule set is not.
Project-type-specific gate sets. Web, voice-agent, and other project profiles register their own gates and configure modes differently. The article describes the universal stack and the idea of profile extension, not the per-profile detail.
Exact retry budgets and timeouts. These are knobs, configured per project.

Reproducibility

Every claim in this article maps to a file in github.com/tatargabor/set-core:

Decompose: lib/set_orch/planner.py (_resolve_planner_strategy, _phase1_planning_brief, _decompose_single_domain, _phase3_merge_plans).
Dispatch: lib/set_orch/dispatcher.py (dispatch_change, dispatch_ready_changes).
Gate stack: lib/set_orch/gate_profiles.py (UNIVERSAL_DEFAULTS), lib/set_orch/verifier.py.
Merge queue: lib/set_orch/merger.py (execute_merge_queue, archive_change).
Event log and journals: lib/set_orch/events.py (EventBus), lib/set_orch/state.py.
Model resolver (13 roles, 5-tier chain): lib/set_orch/model_config.py.

The shape described here is what the code does today. The knobs are documented separately in the project’s README and OpenSpec specs.