A specification language for human-agent boundaries. Define modes, validators, and disagreement policies — then enforce them structurally, not by convention.
Analysis of AI-generated code finds consistent security weaknesses across Python and JavaScript — persistent across model generations, not a temporary capability gap.
Newer models produce code that fails to perform as intended but runs without syntax errors or obvious crashes. Hidden costs accumulate through delayed discovery.
Autonomous code generation introduces audit complications that go beyond code quality into regulatory territory. What decisions were made? What validation occurred?
Single agents reviewing their own output validate against the same flawed model that produced the implementation. Independent review requires structural separation.
Operational stages with defined tool sets. Producer cannot merge; reviewer cannot edit. WF1 enforces disjoint capabilities at the MCP boundary: at the infrastructure level, not by convention.
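The capability split can be sketched as a small check over tool sets. This is a minimal illustration, not the project's implementation: the `Mode` class, the tool names, and the `overlapping_tools` helper are hypothetical, and only the rule itself (producer cannot merge, reviewer cannot edit) comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mode:
    """An operational stage with a fixed set of permitted tools (names are hypothetical)."""
    name: str
    tools: frozenset

    def can(self, tool: str) -> bool:
        # A tool is callable in this mode only if it was declared for it.
        return tool in self.tools


def overlapping_tools(a: Mode, b: Mode) -> frozenset:
    """Return the tools both modes may call; an empty set means the modes are disjoint."""
    return a.tools & b.tools


# Assumed tool names, chosen only to mirror "producer cannot merge; reviewer cannot edit".
producer = Mode("producer", frozenset({"edit_file", "run_tests"}))
reviewer = Mode("reviewer", frozenset({"read_diff", "merge"}))

print(producer.can("merge"))                       # False
print(reviewer.can("edit_file"))                   # False
print(len(overlapping_tools(producer, reviewer)))  # 0
```

A real deployment would apply the same check where tools are exposed to the agent, so that a disallowed call is never offered rather than merely refused.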
Independent evaluators with orthogonal criteria — security, architecture, quality. Each assesses without seeing others' verdicts. Pre- and post-execute phases declared in the protocol.
Unanimous, majority, quorum, or any. Turns variable validator outputs into deterministic decisions: proceed, escalate, or halt. Any blocker is non-overridable — INV3.
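One way to picture how a disagreement policy turns validator verdicts into a single decision is sketched below. The verdict names, the `decide` function, the default quorum size, and the choice to escalate when a policy is not met are assumptions made for illustration; the source states only the four policies, the three outcomes, and that a blocker cannot be overridden.

```python
from enum import Enum


class Verdict(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    BLOCK = "block"  # a blocker: assumed to force a halt under every policy


def decide(verdicts: list, policy: str, quorum: int = 2) -> str:
    """Map validator verdicts to 'proceed', 'escalate', or 'halt' (hypothetical semantics)."""
    if not verdicts:
        raise ValueError("at least one validator verdict is required")
    # Blockers are checked first so that no policy can override them.
    if Verdict.BLOCK in verdicts:
        return "halt"
    approvals = sum(v is Verdict.APPROVE for v in verdicts)
    required = {
        "unanimous": len(verdicts),
        "majority": len(verdicts) // 2 + 1,
        "quorum": quorum,
        "any": 1,
    }[policy]
    # Assumption: an unmet policy escalates to a human rather than halting.
    return "proceed" if approvals >= required else "escalate"


votes = [Verdict.APPROVE, Verdict.APPROVE, Verdict.REJECT]
print(decide(votes, "majority"))   # proceed
print(decide(votes, "unanimous"))  # escalate
print(decide([Verdict.APPROVE, Verdict.BLOCK], "any"))  # halt
```

Because the function is pure, the same verdicts and policy always yield the same decision, which is the sense of "deterministic" used above.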
JWT-backed single-use credentials issued on validator quorum. MCP tools require a valid token for any mutating operation. Validation cannot be skipped, allowed to drift, or bypassed.
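The single-use property can be illustrated with a small sketch. Note the substitution: this uses a plain HMAC-signed string from the Python standard library as a stand-in, not a real JWT, and the `TokenIssuer` class, its methods, and the token layout are hypothetical rather than the project's API.

```python
import hashlib
import hmac
import secrets
from typing import Optional


class TokenIssuer:
    """Issues signed, single-use tokens once a quorum is reached (illustrative only)."""

    def __init__(self, key: bytes) -> None:
        self._key = key
        self._unspent = set()  # nonces of tokens issued but not yet redeemed

    def _sign(self, payload: str) -> str:
        return hmac.new(self._key, payload.encode(), hashlib.sha256).hexdigest()

    def issue(self, task_id: str, approvals: int, quorum: int) -> Optional[str]:
        # No token exists unless validation actually reached quorum.
        if approvals < quorum:
            return None
        nonce = secrets.token_hex(16)
        payload = f"{task_id}.{nonce}"
        self._unspent.add(nonce)
        return f"{payload}.{self._sign(payload)}"

    def redeem(self, token: str, task_id: str) -> bool:
        """Return True exactly once for a genuine token bound to task_id."""
        parts = token.rsplit(".", 2)
        if len(parts) != 3:
            return False
        tid, nonce, sig = parts
        expected = self._sign(f"{tid}.{nonce}")
        if not hmac.compare_digest(sig.encode(), expected.encode()):
            return False  # forged or tampered token
        if tid != task_id or nonce not in self._unspent:
            return False  # wrong task, or token already spent
        self._unspent.discard(nonce)
        return True


issuer = TokenIssuer(secrets.token_bytes(32))
token = issuer.issue("task-1", approvals=2, quorum=2)
print(issuer.redeem(token, "task-1"))  # True
print(issuer.redeem(token, "task-1"))  # False: already spent
```

A mutating tool would call something like `redeem` before doing any work, so an operation without a fresh token has no path to execute.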
Latency of the full governed task path, compared with representative LLM inference latency. The overhead of structural enforcement is negligible by comparison.
In a 5-stage pipeline, behavioural compliance fails 45% of the time as corrupted state compounds. Structural enforcement: 8%.
The gap widens past an order of magnitude as validators diversify. Even at 20 stages — where behavioural failure hits 91% — structural enforcement holds at 32%.
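The two behavioural figures are consistent with a simple compounding reading, in which every stage fails independently with the same probability p, so a pipeline of n stages fails with probability 1 - (1 - p)^n. That reading is an assumption made here for illustration; the source's own failure-rate model (a weighted product of agent and validator rates) is not reproduced, and the structural figures are taken from it as given.

```python
def pipeline_failure(p_stage: float, stages: int) -> float:
    """Probability that at least one of `stages` independent stages fails."""
    return 1.0 - (1.0 - p_stage) ** stages


def per_stage_rate(p_pipeline: float, stages: int) -> float:
    """Invert pipeline_failure: the per-stage rate implied by a pipeline-level rate."""
    return 1.0 - (1.0 - p_pipeline) ** (1.0 / stages)


# Per-stage rate implied by the quoted 45% failure at 5 stages.
p = per_stage_rate(0.45, 5)
print(round(p, 3))                        # 0.113

# Carrying that rate forward to 20 stages recovers the quoted behavioural figure.
print(round(pipeline_failure(p, 20), 2))  # 0.91
```

Under this reading, roughly an 11% chance of a lapse per stage is enough to make a 20-stage pipeline fail about nine times in ten.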
Two human-in-control roles plus N specialized agents. Producer mode cannot merge; reviewer mode cannot edit. The minimum of two humans derives from Separation of Duties: a single human cannot independently validate their own work, no matter how many agents are on the team.
The language places no upper bound on modes. Enterprise teams typically add planning, deployment, and operations modes. The 2+N pattern is the minimum viable structure.
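The minimum structure can be expressed as a check on team composition. The sketch below is hypothetical: the `Member` record, the `kind` and `mode` labels, and the `is_two_plus_n` function are invented for illustration, and it encodes only the stated requirement that producer and reviewer be held by two different humans, with any number of agents and extra modes alongside.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    kind: str  # "human" or "agent" (labels are assumptions)
    mode: str  # e.g. "producer", "reviewer", "planning"


def is_two_plus_n(team: list) -> bool:
    """True if some human producer and some *different* human reviewer exist."""
    humans = [m for m in team if m.kind == "human"]
    producers = {m.name for m in humans if m.mode == "producer"}
    reviewers = {m.name for m in humans if m.mode == "reviewer"}
    # Separation of Duties: one person holding both roles does not count.
    return any(p != r for p in producers for r in reviewers)


agents = [Member(f"agent-{i}", "agent", "producer") for i in range(3)]

solo = [Member("ana", "human", "producer"), Member("ana", "human", "reviewer")] + agents
pair = [Member("ana", "human", "producer"), Member("ben", "human", "reviewer")] + agents

print(is_two_plus_n(solo))  # False
print(is_two_plus_n(pair))  # True
```

Adding agents or further modes never changes the result for `solo`, which mirrors the point that agent count cannot substitute for a second human.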
We propose a domain-specific language for specifying AI-SDLC processes, with formal abstract syntax, well-formedness conditions, operational semantics, and enforcement invariants. The language distinguishes policy (declared intent) from mechanism (structural enforcement), enabling implementations to bound process non-determinism through validation tokens and capability boundaries. Three results follow: structural enforcement bounds system failure rates at a weighted product of agent and validator rates; the 2+N team pattern formalizes Separation of Duties for AI-SDLC; and Kleene closure of orchestration loops and reflexive protocol-adherence validation arise as emergent properties of the design. Simulation studies characterise disagreement-policy trade-offs, governance overhead, the failure-rate model, and Byzantine robustness.
Three layers: unit, integration, and end-to-end CLI journeys driven through real subprocess invocation.
Property-based testing of the core invariants — token unforgeability, audit-chain integrity, mode separation, policy halt — against randomized inputs.
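As an illustration of what a randomized check of audit-chain integrity might look like, the sketch below builds a SHA-256 hash chain and asserts that tampering with any single entry is detected. Everything here is an assumption: the chain layout, the function names, and the use of the standard library's `random` module in place of a dedicated property-testing library; it is not the project's test suite.

```python
import hashlib
import random

GENESIS = "0" * 64  # assumed starting value for the chain


def link(prev_hash: str, event: str) -> str:
    """Digest binding an event to everything recorded before it."""
    return hashlib.sha256(f"{prev_hash}|{event}".encode()).hexdigest()


def build_chain(events: list) -> list:
    chain, prev = [], GENESIS
    for event in events:
        prev = link(prev, event)
        chain.append((event, prev))
    return chain


def verify(chain: list) -> bool:
    prev = GENESIS
    for event, digest in chain:
        if link(prev, event) != digest:
            return False
        prev = digest
    return True


def check_tamper_detection(trials: int = 200, seed: int = 0) -> bool:
    """Property: an untouched chain verifies; a chain with one edited event does not."""
    rng = random.Random(seed)
    for _ in range(trials):
        events = [f"event-{rng.randrange(10**6)}" for _ in range(rng.randint(1, 20))]
        chain = build_chain(events)
        if not verify(chain):
            return False
        i = rng.randrange(len(chain))
        event, digest = chain[i]
        chain[i] = (event + "!", digest)  # edit the event, keep the stored digest
        if verify(chain):
            return False
    return True


print(check_tamper_detection())  # True
```

The fixed seed keeps the run reproducible; a property-testing library would add input shrinking and broader generation strategies on top of the same idea.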
Ship-ready reference protocols — solo, team, and 2+N — all expressed in the same DSL, with the library growing toward ~10 configurations.
Every invariant in the specification maps to a concrete enforcement mechanism in the implementation. The protocol doesn't ask agents to behave — the infrastructure makes the non-compliant path impossible.
The system is built under its own rules: it extends its own codebase through the same 2+N protocol it defines — dogfooded end to end.
AGPLv3, developed in the open. Install from PyPI; the full source, protocol templates, and test suite are public.
As foundation models commoditize, the durable engineering asset becomes process design. The protocol is institutional memory; models are commodity infrastructure.