Article · ITLine

2026-05-08   ·   10 min read

From spec to merged: anatomy of one SET orchestration run

What actually happens between a markdown spec and a merged main branch when set-core runs one orchestration. Five named stages, one planner call by default (a three-phase pipeline for very large specs), N parallel worktrees, a gate stack, a serial merge queue, and a JSONL event log that records every transition.

A walk through one orchestration run in set-core — what a spec turns into between “I run the planner” and “the change is merged into main”. Author: setcode.dev.

Bottom line

An orchestration run is five named stages: decompose, dispatch, verify, merge, archive. Every transition is a typed JSON event in a journal you can replay. The planner runs as one LLM call by default; very large specs route to a three-phase pipeline. Each change runs in its own git worktree. Gates produce structured pass/fail outputs that the verify-retry loop and the merge queue both consume. The merge queue is serial on purpose.

The shape of the run is what makes parallel agents survivable. The shape is also what this article describes — not a particular model, not a particular project type.

TL;DR

[Figure: One spec, one merged main: the five stages. Panels: DECOMPOSE (planner.py), DISPATCH · VERIFY (dispatcher.py, verifier.py, gate_profiles.py), MERGE (merger.py), ARCHIVE (openspec delta → main specs). Every transition emits a typed JSONL event; the run is replayable from the journal.]
The shape of one orchestration run. Decompose is one LLM call by default (decompose_brief); very large specs route to a three-phase pipeline (1 + N + 1 calls, dashed sub-boxes). Dispatch fans the changes out into N git worktrees, each with its own agent and verify gate stack. Verified changes converge into a serial merge queue with an integration gate against fresh main. Archive syncs the change’s delta specs into the project’s main OpenSpec tree.

The five stages

spec.md ─► decompose ─► dispatch ─► verify ─► merge ─► archive
            (planner)   (worktree)  (gates)  (queue)  (openspec)

Each stage has one job and a clear handoff. The interesting choice is that the handoff is always state plus event, not a function return value. A change moves from one stage to the next by changing its status in orchestration-state.json and emitting a typed event into the JSONL log. Anything watching the run — the dashboard, the supervisor, a re-run after a crash — reads the same two surfaces.

This sounds like an implementation detail, but it is the core constraint: the run can be paused, resumed, replayed, and inspected from outside, because nothing important lives only inside a Python function call.
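A minimal sketch of a "state plus event" handoff, to make the idea concrete. The two file names come from the article; the JSON layout, the field names (`ts`, `type`, `change`, `from`, `to`) and the `transition` helper are assumptions made for illustration, not set-core's actual schema or API.

```python
import json
import time
from pathlib import Path

# File names as given in the article; placing both in one directory is assumed.
STATE_FILE = "orchestration-state.json"
EVENTS_FILE = "orchestration-events.jsonl"


def transition(state_dir: Path, change: str, new_status: str) -> dict:
    """Hypothetical helper: update a change's status and log the move."""
    state_path = state_dir / STATE_FILE
    if state_path.exists():
        state = json.loads(state_path.read_text(encoding="utf-8"))
    else:
        state = {"changes": {}}

    old_status = state["changes"].get(change, {}).get("status", "pending")
    state["changes"].setdefault(change, {})["status"] = new_status

    # Write to a temp file, then rename, so readers never see a partial file.
    tmp_path = state_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    tmp_path.replace(state_path)

    event = {
        "ts": time.time(),
        "type": "STATE_CHANGE",
        "change": change,
        "from": old_status,
        "to": new_status,
    }
    # Append-only log: one JSON object per line.
    with (state_dir / EVENTS_FILE).open("a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")
    return event
```

Because both surfaces are plain files, a dashboard or a restarted process can read them without any handle on the process that wrote them.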


Stage 1 — Decompose: one call, or three

The planner is in lib/set_orch/planner.py. It reads a spec (plus the project’s existing OpenSpec specs, conventions file, and any in-flight changes) and produces a structured plan: an ordered list of changes, with dependencies and roles.

For a typical spec the planner does this in one LLM call (decompose_brief). The brief alone is the plan: domain priorities, resource ownership, cross-cutting work, a phasing strategy, and the leaf change list. That output is what the rest of the run consumes.

For specs whose estimated input would blow past a token threshold, the planner switches to a three-phase pipeline:

  1. decompose_brief — same as above, one call. Output: the JSON brief.
  2. decompose_domain — input: one domain’s summary, its requirements, the brief from Phase 1, and the test plan. Output: that domain’s list of changes. One call per domain.
  3. decompose_merge — input: all domain plans concatenated, plus the brief and the dependency map. Output: a single unified plan with a topological order, deduplicated cross-cutting work, and final change names. One call.

Total LLM calls in the three-phase mode: 1 + N + 1, where N is the number of domains. The routing decision is in _resolve_planner_strategy: serial and parallel are the two endpoints, and auto (the default) chooses based on a token estimate against planner.single_call_max_input_tokens.
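The auto routing and the call arithmetic can be sketched as follows. This is an illustration only: the mode labels `single_call` and `three_phase`, the four-characters-per-token estimate, and both function names are assumptions, and the sketch does not model the explicit serial and parallel settings, since the article does not spell out what each one forces.

```python
def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly four characters per token.
    return len(text) // 4


def choose_mode(spec_text: str, context_text: str,
                single_call_max_input_tokens: int) -> str:
    """Pick a planner mode by comparing an input estimate to a threshold."""
    estimated = estimate_tokens(spec_text) + estimate_tokens(context_text)
    if estimated <= single_call_max_input_tokens:
        return "single_call"
    return "three_phase"


def planner_call_count(mode: str, domain_count: int) -> int:
    """One call by default; brief + one per domain + merge otherwise."""
    if mode == "single_call":
        return 1
    return 1 + domain_count + 1
```

For a spec split into four domains, the three-phase mode would cost `planner_call_count("three_phase", 4)`, which is six calls.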

Why two modes. A single call is cheaper, more cache-friendly, and avoids the orchestration overhead of the multi-phase fan-out. It works as long as the spec fits comfortably into one prompt with all its context. Past that size, the model starts to struggle with global consistency — picking a phasing strategy late, dropping cross-cutting concerns, returning a poorly ordered plan. The three-phase pipeline is a deliberate trade: pay for more LLM round-trips to keep the per-call inputs small enough that the model can reliably commit to the structure.

The output of the planner — whichever mode — lands in orchestration-plan.json with a list of named changes. The directory under openspec/changes/<name>/ is materialized later, by the agent itself, during dispatch.


Stage 2 — Dispatch: a worktree per change

Once the plan exists, the dispatcher (lib/set_orch/dispatcher.py) walks it. For each change whose dependencies are met:

  1. Lock the change record. Inside an atomic state lock, the dispatcher flips the change’s status from pending to dispatched. Without the lock, two dispatcher polls could pick the same change up.
  2. Create a worktree. git worktree add <path> <branch> gives the change a separate working tree on a separate branch. Same .git directory, different files on disk.
  3. Launch an agent inside it. The dispatcher starts a set-loop (Ralph) process in the worktree — a retry loop wrapping the Claude CLI. The change scope, the OpenSpec roadmap item, and (for projects that have one) a design context are passed in as the iteration’s inputs.
  4. Emit DISPATCH and AGENT_SESSION_DECISION. The journal records the change scope and the session decision (fresh start vs. re-attach to an existing session ID).

Worktrees are the isolation primitive. Ten agents writing to ten worktrees do not collide on the filesystem and do not touch each other’s branches. The .git directory is shared, but git itself is the synchronisation point — a git fetch from one worktree is visible in all of them. When an agent finishes, its worktree gets torn down (or kept around for re-dispatch on a verify retry, which is cheaper than starting from scratch).
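As a rough illustration of the worktree lifecycle, the sketch below only builds the git command lines rather than running them. `git worktree add -b` and `git worktree remove` are standard git subcommands; the directory layout, the `change/<name>` branch naming, and the helper names are hypothetical, not taken from set-core.

```python
from pathlib import Path


def worktree_path(repo_root: Path, change: str) -> Path:
    # Hypothetical layout: a sibling directory next to the main checkout.
    return repo_root.parent / f"{repo_root.name}-wt-{change}"


def worktree_add_argv(repo_root: Path, change: str, base: str = "main") -> list[str]:
    """Command that creates a new branch and a working tree for it."""
    branch = f"change/{change}"  # hypothetical branch naming
    return [
        "git", "-C", str(repo_root),
        "worktree", "add", "-b", branch,
        str(worktree_path(repo_root, change)), base,
    ]


def worktree_remove_argv(repo_root: Path, change: str) -> list[str]:
    """Command that tears the working tree down once the change is finished."""
    return [
        "git", "-C", str(repo_root),
        "worktree", "remove", str(worktree_path(repo_root, change)),
    ]
```

Passing either list to `subprocess.run(argv, check=True)` would execute it; keeping command construction separate makes the naming scheme easy to inspect and test.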

Parallelism is bounded by one knob: max_parallel. The dispatcher counts in-flight changes and stops launching new ones when the cap is hit. Topological order is enforced separately — a change does not get dispatched until everything in depends_on is in a terminal state.
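The two constraints above (a parallelism cap and a dependency check) can be sketched in a few lines. The status names in `IN_FLIGHT` and `DEPENDENCY_MET`, the shape of the `changes` mapping, and the function name are assumptions for illustration; in particular, which terminal states satisfy a dependency is a guess here.

```python
# Assumed status vocabularies; the real set-core names may differ.
IN_FLIGHT = {"dispatched", "verifying"}
DEPENDENCY_MET = {"merged", "done"}


def next_to_dispatch(changes: dict[str, dict], max_parallel: int) -> list[str]:
    """Return the pending changes that may be launched right now."""
    in_flight = sum(1 for c in changes.values() if c["status"] in IN_FLIGHT)
    free_slots = max(0, max_parallel - in_flight)

    ready = [
        name
        for name, change in changes.items()
        if change["status"] == "pending"
        and all(
            changes[dep]["status"] in DEPENDENCY_MET
            for dep in change.get("depends_on", [])
        )
    ]
    # Plan order is preserved because dicts keep insertion order.
    return ready[:free_slots]
```

The cap and the dependency filter are independent: a change can be ready but wait for a slot, or a slot can be free while nothing is ready.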

There is no “pool of workers.” There are N agents, N worktrees, N branches. When an agent crashes, only its branch is in a weird state.


Stage 3 — Verify: gates produce pass/fail, not opinions

After the agent reports done, the change goes into verify (lib/set_orch/verifier.py, lib/set_orch/gate_profiles.py). A gate stack runs against the change’s worktree.

The universal stack, by change type, is in gate_profiles.py:UNIVERSAL_DEFAULTS. For a feature change it is:

Gate          Mode   What it checks
build         run    The project builds (compile / bundler exit code).
test          run    Unit tests pass.
scope_check   run    The change branch contains implementation code, i.e. the diff against merge-base is more than just OpenSpec proposal / task artifacts.
test_files    run    The change ships test files commensurate with the requirements it covers.
review        run    A reviewer LLM reads the diff against the spec and signs off, with a structured rubric.
spec_verify   run    The implementation actually satisfies each requirement in the change’s delta spec.
rules         warn   Project rules (the .claude/rules/ set) are respected.

Each gate’s mode is one of run (blocking on failure), warn / soft (non-blocking), or skip. Profile plugins register additional gates. The web project profile, for example, adds e2e (Playwright suite), lint (linter), design-fidelity (Playwright + pixel diff against a v0.app reference), i18n_check, and required-components (presence of expected shadcn primitives). Profiles for other project types (voice agents, mobile, backend) register their own.
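To show how the modes differ in effect, here is a small sketch of a gate stack evaluator. The `Gate` dataclass, the result labels, and the choice to run every gate instead of stopping at the first failure are assumptions; only the mode names (run, warn, soft, skip) come from the text above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Gate:
    name: str
    mode: str                    # "run", "warn", "soft", or "skip"
    check: Callable[[], bool]    # True means the check succeeded


def run_stack(gates: list[Gate]) -> dict[str, str]:
    """Evaluate each gate and map it to pass / fail / warn / skip."""
    results: dict[str, str] = {}
    for gate in gates:
        if gate.mode == "skip":
            results[gate.name] = "skip"   # the check is never invoked
            continue
        if gate.check():
            results[gate.name] = "pass"
        elif gate.mode == "run":
            results[gate.name] = "fail"   # blocking failure
        else:
            results[gate.name] = "warn"   # warn / soft: recorded, non-blocking
    return results


def is_blocked(results: dict[str, str]) -> bool:
    """A change is held back only by a failure from a blocking gate."""
    return "fail" in results.values()
```

A failing `rules` gate in warn mode would therefore show up in the results without holding the change back.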

Most gates are deterministic — they call out to a shell command and read the exit code. Two are LLM-graded: review and spec_verify. The distinction matters: a deterministic gate’s “pass” means a process produced the expected exit. An LLM-graded gate’s “pass” means a separate model said yes against a structured prompt. Both produce the same pass/fail/warn output that the next stage can act on.

When a gate fails, the verifier emits a VERIFY_GATE event with the failure detail and builds a retry_context payload — the structured artefact the agent needs to fix the failure: a stack trace, a failing requirement ID, a missing test file, a pixel diff. The agent re-runs with that context. Retries are bounded by a per-change retry budget; when the budget is exhausted, the change transitions to one of the terminal failed states (failed:retry_budget_exhausted and friends) and stops blocking the queue.

The point of structured gate output is that the agent does not have to introspect why it’s stuck. The gate output is the failure description, in machine-readable form. The next iteration’s prompt has the retry_context baked in.
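The verify-retry loop described above might look roughly like this. The callables `run_gates` and `rerun_agent`, the `verified` status label, and the payload keys are hypothetical; `failed:retry_budget_exhausted` and the idea of a retry_context payload are taken from the article.

```python
from typing import Callable, Optional


def verify_with_retries(
    run_gates: Callable[[], Optional[dict]],
    rerun_agent: Callable[[dict], None],
    retry_budget: int,
) -> str:
    """Run gates after the agent reports done; retry within a fixed budget.

    run_gates returns None when every blocking gate passes, otherwise a
    dict describing the first failure (gate name plus diagnostic detail).
    """
    retries = 0
    while True:
        failure = run_gates()
        if failure is None:
            return "verified"  # assumed label for "ready to merge"
        if retries >= retry_budget:
            return "failed:retry_budget_exhausted"
        retries += 1
        # The gate output itself becomes the next iteration's input.
        rerun_agent({"verify_retry": retries, **failure})
```

With a budget of two, a change that never passes gets two reruns and then leaves the loop in a terminal failed state.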


Stage 4 — Merge: serial, with an integration gate

Verified changes go into the merge queue. The merger (lib/set_orch/merger.py, execute_merge_queue) drains it one change at a time. Serially. On purpose.

Per change:

  1. Dependency check. Every entry in depends_on must be in a terminal merged state. If not, the change is parked as dep-blocked and the queue moves on.
  2. Integrate fresh main. The merger updates the change’s branch with the current main. This is where two merges that were independently green can collide. If the integration fails (build break, test regression caused by interaction with a sibling change merged a minute ago), the change is marked integration-failed and goes back to verify with the integration’s diagnostic as retry_context.
  3. Integration gate stack. The merged-state branch runs build, test, and (where the profile registers it) e2e. The web profile’s integration e2e step has an optional smoke sub-phase that runs first against inherited sibling specs — a fast fail signal if a previous merge already broke the suite. If integration passes, the merger continues; if not, same path as #2.
  4. Fast-forward into main. git merge --ff-only. No merge commits. By the time the merge runs, the branch already contains the latest main, so a fast-forward is always possible — and the absence of merge commits keeps the history of main linear.
  5. Emit STATE_CHANGE with status merged.

Why serial. Two parallel merges that both pass their integration gate against the same main can still produce a broken main after both land. Serialising the queue means each merge sees the previous one’s actual result on main before its integration gate runs. The cost is throughput at the merge step. The benefit is that main is, by construction, a state every gate has signed off on against the actual neighbours.

The work that takes time — the agent’s implementation, the verify stack — has already happened in parallel. The merge step is what the queue serialises, and that step is fast.
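A compact sketch of a serial merge queue, under stated assumptions: `integrate` stands in for "update the branch with fresh main and run the integration gates", `fast_forward` stands in for `git merge --ff-only`, and the data shapes and function name are invented for illustration. The status labels dep-blocked, integration-failed and merged follow the article.

```python
from collections import deque
from typing import Callable

# Dependency states treated as satisfied (merged, or done after archive).
LANDED = {"merged", "done"}


def drain_merge_queue(
    queue: deque,
    changes: dict[str, dict],
    integrate: Callable[[str], bool],
    fast_forward: Callable[[str], None],
) -> list[str]:
    """Process verified changes strictly one at a time, in queue order."""
    merged: list[str] = []
    while queue:
        name = queue.popleft()
        deps = changes[name].get("depends_on", [])
        if any(changes[dep]["status"] not in LANDED for dep in deps):
            changes[name]["status"] = "dep-blocked"
            continue
        # Each integration runs against main as left by the previous merge.
        if not integrate(name):
            changes[name]["status"] = "integration-failed"
            continue
        fast_forward(name)
        changes[name]["status"] = "merged"
        merged.append(name)
    return merged
```

Because the loop never starts a second integration before the first change has landed, no two changes are ever validated against the same stale main.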


Stage 5 — Archive: the spec is updated, not just the code

A merged change is not done until OpenSpec is updated.

archive_change (in merger.py) does three things, the first two delegated to the openspec archive <name> CLI:

  1. Move openspec/changes/<name>/ to a timestamped archive/ subdirectory.
  2. Sync the change’s delta specs into the project’s main openspec/specs/ tree — so the next planner run sees the new state of the world.
  3. Stage and commit the spec update.

Until the archive step runs, a freshly merged change’s delta exists in openspec/changes/, and the next planner run would see it as still in flight. After archive, the change is part of the spec base, and the planner treats its requirements as known.

This is the loop closure. The next spec the planner reads is the one this change just helped write.


What the journal actually looks like

One change in the event log: 15 lines, dispatch to done
<state-dir>/journals/add-blog-list.jsonl

TIME      EVENT                   DETAIL
11:50:16  STATE_CHANGE            status: pending → dispatched
11:50:16  DISPATCH                data={scope: …}
11:50:17  AGENT_SESSION_DECISION  decision: fresh
11:52:04  LLM_CALL                role=agent tokens_out=14820 cache_hit=0.71
11:55:31  LLM_CALL                role=agent tokens_out=22104 cache_hit=0.83
11:58:42  GATE_SET_EXPANDED       added: [e2e, lint, design-fidelity]
11:59:18  VERIFY_GATE             name=build result=pass
12:00:42  VERIFY_GATE             name=test result=pass
12:02:11  VERIFY_GATE             name=design-fidelity result=fail retry_context=token-mismatch
12:04:50  LLM_CALL                role=agent tokens_out=8740 verify_retry=1
12:06:33  VERIFY_GATE             name=design-fidelity result=pass
12:08:14  VERIFY_GATE             name=review (LLM-graded) result=pass
12:09:05  VERIFY_GATE             name=spec_verify (LLM-graded) result=pass
12:09:48  STATE_CHANGE            status: dispatched → merged (ff-only into main)
12:09:50  STATE_CHANGE            status: merged → done (post-archive)

Legend: state, llm, gate pass, gate fail, gate registry
A reconstructed journal for one change, from dispatch to done. Fifteen typed events: state transitions (blue), LLM calls (yellow), gates passing (green) and failing (red), the gate registry expanding when the change touched UI code (purple). The design-fidelity gate failed at 12:02:11; the verify-retry loop fed the gate output back to the agent as retry_context, and the next iteration passed at 12:06:33. Every line is a JSON object in the per-change journal file.

The event log is one file: orchestration-events.jsonl, one JSON object per line. Per-change journals live at <state-dir>/journals/<change_name>.jsonl and contain the same events filtered to one change.

A working subset of the event types you will see: STATE_CHANGE, DISPATCH, AGENT_SESSION_DECISION, LLM_CALL, GATE_SET_EXPANDED, and VERIFY_GATE.

A stripped-down trace for one change, in event order:

STATE_CHANGE          name=add-blog-list   status: pending → dispatched
DISPATCH              data={scope: ...}
AGENT_SESSION_DECISION  decision=fresh
LLM_CALL              role=agent           tokens_out=14820
LLM_CALL              role=agent           tokens_out=22104
GATE_SET_EXPANDED     added=[e2e, lint, design-fidelity]
VERIFY_GATE           name=build           result=pass
VERIFY_GATE           name=test            result=pass
VERIFY_GATE           name=design-fidelity result=fail   retry_context=token-mismatch
LLM_CALL              role=agent           tokens_out=8740   verify_retry=1
VERIFY_GATE           name=design-fidelity result=pass
VERIFY_GATE           name=review          result=pass
VERIFY_GATE           name=spec_verify     result=pass
STATE_CHANGE          status: dispatched → merged
STATE_CHANGE          status: merged → done

Each line is timestamped. The whole trace is replayable, diffable, and survives crashes — because nothing about it lives only in process memory.
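Replaying a journal amounts to folding its events, in order, into a current view of the change. The sketch below assumes each line is a JSON object with `type`, `change`, and either `to` (for STATE_CHANGE) or `name` and `result` (for VERIFY_GATE); those field names and the function itself are illustrative assumptions, not the real journal schema.

```python
import json
from typing import Iterable


def replay_change(lines: Iterable[str], change: str) -> dict:
    """Rebuild one change's status and latest gate results from JSONL lines."""
    status = "pending"
    gates: dict[str, str] = {}
    gate_failures = 0
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        event = json.loads(raw)
        if event.get("change") != change:
            continue  # the shared log interleaves many changes
        if event["type"] == "STATE_CHANGE":
            status = event["to"]
        elif event["type"] == "VERIFY_GATE":
            gates[event["name"]] = event["result"]  # latest result wins
            if event["result"] == "fail":
                gate_failures += 1
    return {"status": status, "gates": gates, "gate_failures": gate_failures}
```

The same function works on the combined log or on a per-change journal, since it filters by change name either way.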


What this shape buys you

Three concrete things, in order of importance:

Crash recovery is cheap. State lives in orchestration-state.json. Per-change progress lives in the per-change journal. Worktrees live on disk. If the orchestrator process dies, restarting it picks up exactly where it stopped: no in-flight change forgets which step it was on. The supervisor process can also tell, from the same state, which agents it needs to restart.

Parallelism is bounded by isolation, not by coordination. Ten worktrees on ten branches do not need to coordinate writes; they need to merge in order. The hard part — making sure two parallel changes do not produce a broken main — is concentrated in one place (the merge queue’s integration gate), not smeared across the whole pipeline.

Failure has a shape, not a story. Every gate produces a structured pass/fail with the artefacts the agent needs to act on: a stack trace, a failing requirement ID, a missing test file, a pixel diff. Those artefacts get handed back as retry_context on the next iteration. The agent does not narrate its own failures — the gate does, in machine-readable form, and the verify-retry loop is what turns that into the next prompt.


What this article deliberately does not specify


Reproducibility

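Every claim in this article maps to a file in github.com/tatargabor/set-core. The files named above are lib/set_orch/planner.py (decompose), lib/set_orch/dispatcher.py (dispatch), lib/set_orch/verifier.py and lib/set_orch/gate_profiles.py (verify), and lib/set_orch/merger.py (merge and archive).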

The shape described here is what the code does today. The knobs are documented separately in the project’s README and OpenSpec specs.