Simpsons Loops: What Two Weeks of Dogfooding Fixed

Two weeks ago I wrote about getting Simpsons Loops running with Speckit and Claude Code. It one-shot a full-stack feature from specs. Magical.

Then I tried to use it on more projects and it kept breaking.

speckit vs. what I built on top

Quick distinction before I get into it.

Speckit is a framework for spec-driven development. You run each step manually: /speckit.specify to write the spec, /speckit.clarify for clarification, /speckit.analyze for analysis, /speckit.implement for implementation. That's by design — control at every stage.

The Ralph Loop concept comes from Geoffrey Huntley — ghuntley.com/ralph. I liked the Simpsons motif and added Homer and Lisa as the clarification and analysis loops. Homer clarifies. Lisa analyzes. Ralph implements. That naming is separate from Speckit as a concept. Speckit is the framework. Homer, Lisa, and Ralph are how I think about the agents running inside it.

Simpsons Loops is the automation layer I built on top. It chains those steps into a single pipeline command. Pass a description, it runs the whole thing, comes back with a branch.

The problems below? Those are mine. Speckit did what it's supposed to do. My orchestration layer was the fragile part.

what I learned was needed

There was no pipeline. Each stage — specify, clarify, plan, tasks, analyze, implement — was a manual process. You ran each one yourself, in order, checking output between steps. It worked well enough to ship features but it was tedious and easy to mess up.

Bootstrapping was fragile too. If you weren't already on a feature branch with a spec.md in the right place, it choked. You had to manually set up the branch, create the spec directory, run /speckit.specify, then kick off the loops. Four steps before automation even started.

Quality gates needed structure. The concept was sound but there was nothing enforcing the gates actually had content. Ralph would find .specify/quality-gates.sh, get exit code 0 on an empty file, and mark tasks complete without running a single test. No error. Just a green checkmark on nothing.

what changed

Spent the last two weeks running Simpsons Loops on itself. Five specs through the full cycle, each one fixing something the previous run exposed.

fresh context per iteration

The old workflow was clunky. Start Claude Code, run the loop command, which does preflight checks and prints a bash command. Exit Claude Code. Run that bash command to do the actual looping. Two steps just to get the thing running, and you had to leave Claude Code to do it.

Replaced all of that with a single orchestrator Claude Code session that spawns fresh sub-agents for each iteration through the Agent tool. Homer iteration 7 doesn't know what Homer iteration 6 did. It reads current file state, runs clarification, fixes one finding, commits, exits.

This wasn't about fixing context drift — the bash loops weren't really causing that. It's a cleaner architecture. One orchestrator with full visibility, spinning up focused sub-agents for each task. The orchestrator can give consistent summaries, track state cleanly, and report what actually happened across the whole run. It's just a better way to build it.

pipeline bootstrapping

The pipeline is entirely new. It auto-detects where to start based on what exists in your repo.

tasks.md with completed tasks? Start at Ralph.
tasks.md with nothing completed? Start at Lisa.
plan.md exists? Generate tasks, keep going.
spec.md exists? Start at Homer.
Nothing? Pass --description "I want XYZ" and it runs /speckit.specify first, creates the feature branch, then continues.
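The detection ladder is simple enough to sketch. This is an illustrative reconstruction, not the actual pipeline code — the function name and the `[x]` completed-task marker are my assumptions:

```shell
#!/usr/bin/env bash
# Illustrative sketch of the auto-detection ladder (not the real
# implementation). Assumes completed tasks are marked "[x]" in
# tasks.md, which is a guess at the format.
detect_start_step() {
  dir="$1"
  if [ -f "$dir/tasks.md" ]; then
    if grep -q '\[x\]' "$dir/tasks.md"; then
      echo ralph        # completed tasks exist: resume implementation
    else
      echo lisa         # tasks generated but untouched: analyze first
    fi
  elif [ -f "$dir/plan.md" ]; then
    echo tasks          # plan exists: generate tasks, keep going
  elif [ -f "$dir/spec.md" ]; then
    echo homer          # spec exists: start clarification
  else
    echo specify        # nothing yet: needs --description
  fi
}
```

The ordering matters: tasks.md implies plan.md and spec.md exist, so the most-derived artifact wins.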

The fix was surprisingly involved. Feature directory resolution assumed you were already on a feature branch. Running from main with --description meant the specify step created the branch after the pipeline already tried to resolve the directory. Added a post-specify re-resolution step. 30+ commits across spec 004 just for that one fix.
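The shape of the fix, sketched (illustrative — the function name and the specs/<branch> layout are my assumptions about how the directories are laid out):

```shell
# Illustrative sketch of feature directory resolution (the
# specs/<branch-name> layout and function name are assumptions,
# not the actual pipeline code).
resolve_feature_dir() {
  # symbolic-ref works even on a freshly created branch with no commits
  branch=$(git symbolic-ref --short HEAD)
  echo "specs/$branch"
}
# The bug: this ran once, up front, before /speckit.specify had a
# chance to create the feature branch. The fix was to call it again
# after the specify step completes.
```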

quality gate validation

Ralph now validates that quality-gates.sh exists and is non-empty before starting. Not just "file exists" but "file contains actual commands, not just comments and whitespace." Empty file, Ralph refuses to run.

I'd rather the pipeline refuse to run than silently skip validation.
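The check is roughly this — my reconstruction, not the actual code, and the function name is made up:

```shell
# Rough sketch of the "real commands, not just comments" check
# (function name is made up; the actual implementation may differ).
# A gates file only counts if something survives after stripping
# comment lines and whitespace.
has_real_commands() {
  [ -f "$1" ] && grep -v '^[[:space:]]*#' "$1" | grep -q '[^[:space:]]'
}

# Demo: a comments-only file is refused, a file with a command passes.
tmp=$(mktemp -d)
printf '#!/bin/sh\n# TODO: add gates\n\n' > "$tmp/empty-gates.sh"
printf '#!/bin/sh\nnpm test\n' > "$tmp/real-gates.sh"
has_real_commands "$tmp/empty-gates.sh" || echo "refusing to run"
has_real_commands "$tmp/real-gates.sh" && echo "gates look real"
```

Note the shebang counts as a comment here, which is exactly what you want: a file containing only `#!/bin/sh` is still an empty gate.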

--stop-after

The pipeline takes a --stop-after flag now. Pass any step name and it stops cleanly after that step, reports status, exits.

Combine it with --from to run just a slice. --from lisa --stop-after lisa runs only the analysis loop. Invalid ranges error before executing anything.
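The range check is cheap to sketch (illustrative — these function names are mine, not the pipeline's):

```shell
# Illustrative sketch of the --from/--stop-after range check
# (function names made up). Steps have a fixed order, and --from
# must not come after --stop-after.
step_index() {
  case "$1" in
    specify) echo 0 ;;  homer) echo 1 ;;  plan)  echo 2 ;;
    tasks)   echo 3 ;;  lisa)  echo 4 ;;  ralph) echo 5 ;;
    *) return 1 ;;      # unknown step name
  esac
}
validate_range() {  # usage: validate_range FROM STOP_AFTER
  from=$(step_index "$1") || { echo "unknown step: $1" >&2; return 1; }
  stop=$(step_index "$2") || { echo "unknown step: $2" >&2; return 1; }
  [ "$from" -le "$stop" ] || { echo "invalid range: $1 after $2" >&2; return 1; }
}
```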

I use --stop-after tasks constantly. Spec, clarify, plan, tasks — then I read everything before handing off to implementation. I don't want Ralph touching code until I've looked at the spec myself.

one finding per iteration

Homer and Lisa used to try fixing all findings at a given severity level in one pass. That worked sometimes. Other times it consumed the entire context window trying to fix 8 MEDIUM findings at once and botched half of them.

Now it's one finding per iteration. Fix the highest severity one. Commit. Exit. Next iteration picks up the next. Slower but reliable.

commits per iteration

Each loop iteration commits its changes, so you can always roll back to any point and retry. This generates a lot of commits though. Spec 004 alone had 82. Required switching to squash merges on PRs to keep main clean.

the dogfooding part

Five specs ran through the full pipeline to fix the pipeline. Spec 001 switched to sub-agents. Spec 002 added re-run capability. Spec 003 made delegation actually work — spawning sub-agents instead of running inline had been the spec, not the reality. Spec 004 fixed bootstrapping. Spec 005 consolidated quality gates.

I threw each one to Ralph and prayed I had enough usage tokens left (again).

Spec 004 alone generated 82 commits. Homer and Lisa found CRITICAL, HIGH, and MEDIUM findings before Ralph started anything. The commit log reads like a conversation between the loops.

what it looks like when it finishes

Here's actual output from a recent pipeline run. 19 tasks across a full refactor — renaming classes, rewriting services, updating schedules, cleaning up dead files.

Pipeline Step Status:
  specify ... skipped
  homer ..... executed
  plan ...... executed
  tasks ..... executed
  lisa ...... executed
  ralph ..... executed

Iteration Counts:
- homer: 2 iterations (+ 1 confirmation)
- lisa: 2 iterations (+ 1 confirmation)
- ralph: 16 iterations

Completion Status: success

All 19 tasks complete, all quality gates pass

Cooked for 1h 25m 53s

One command. Walked away. Came back to a branch with everything done and passing.

the workflow

Here's what I actually do, start to finish.

Open a new Claude Code instance on main branch. Write the spec:

/speckit.specify I want to refactor the notification service
to support multiple providers. Watch out for the existing
webhook callbacks.

That creates a feature branch and generates the spec. I read through it, think about what's missing or wrong.

/clear to reset context. Then clarify:

/speckit.clarify Change the retry logic to exponential
backoff instead of fixed intervals

You can leave /speckit.clarify blank and let it find the gaps itself. Run it as many times as you want. Each time you /clear first so it reads the spec fresh.

When the spec feels right, /clear one more time and kick off the full pipeline:

/speckit.pipeline

That runs everything sequentially, each step in a fresh sub-agent:

/speckit.homer.clarify — up to 30 iterations, one finding per pass
/speckit.plan — generates the implementation plan
/speckit.tasks — generates ordered task list
/speckit.lisa.analyze — up to 30 iterations, one finding per pass
/speckit.ralph.implement — one task per iteration with quality gates

The whole thing can take hours. That's the point. Kick it off and go do something else.

I get involved before, when I write the spec, and after, when I review the output. Not during.

If you want to review before implementation, use --stop-after:

/speckit.pipeline --stop-after tasks

That runs specify through tasks, then stops. You read everything, make sure the plan and tasks look right, then resume:

/speckit.pipeline --from lisa

You can also slice out just one section:

/speckit.pipeline --from homer --stop-after lisa

That runs only Homer and Lisa. Invalid ranges error before executing anything, so you can't accidentally pass --from lisa --stop-after homer.

setup.sh now brings its own CLAUDE.md

One friction point from the original: you had to write your own CLAUDE.md and constitution.md before the pipeline worked well. These aren't optional extras — they're the infrastructure that lets the pipeline work on any project. Homer and Lisa use them to understand your project's standards and constraints. Missing either means the analysis loops have nothing real to check against.

setup.sh now installs template versions of both. Edit them for your project. The templates give you a bare minimum of best practices on day one.

The most important thing to get right in your CLAUDE.md: TDD. Write failing tests first, then implement. The reason this matters so much for Ralph specifically is that without failing tests, it's very easy for the agent to claim it fixed something when it didn't. A test that was already passing doesn't tell you anything. A test that goes from red to green tells you the thing was actually built.

try it

Fair warning: this thing is under active development and the APIs shift. I've broken my own pipelines updating it.

The core loops work. I'm running them on real features.

Repo is at github.com/jnhuynh/spec-kit-simpsons-loops. Needs Speckit and Claude Code CLI.

Run setup.sh from your project root. Edit .specify/quality-gates.sh to match your stack. Edit .specify/memory/constitution.md to match your standards. Then /speckit.pipeline --description "I want XYZ" and let it run.
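For reference, a quality-gates.sh for a Node/TypeScript stack might look like this — an illustrative example, not what the template ships; substitute whatever your project actually runs:

```shell
#!/usr/bin/env bash
# Example .specify/quality-gates.sh for a Node/TypeScript project
# (illustrative; swap in your own stack's commands). A non-zero exit
# from any gate fails the task, so fail fast on the first break.
set -euo pipefail

npx tsc --noEmit      # typecheck
npx eslint .          # lint
npm test              # the red-to-green signal TDD depends on
```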

If the spec is good, the output is good (assuming you have quality gates, but that's a whole other post). Try it on something small.