Tier 4

Tier 4 — Loop Until Done: Make the agent prove it's done, so you can trust the output

Reach for this tier when you can't trust the output without reading every line, and "done" means nothing concrete.

This is the heart of professional agentic engineering, so read the framing first.

Prompts don't make agents reliable. Verification does.

You don't get great output from one prompt — you get it from a loop: the agent acts, checks, improves, repeats.

A loop only works if it has a target it can test itself against — something that reports done / not done without you checking every cycle.

That target is tests. Red → green is the ideal loop condition: unambiguous, automatable, and the agent can run it itself.

flowchart TD
  P[Prompt + executable DoD] --> A[Agent acts]
  A --> V{Oracle: tests / lint / types green?}
  V -->|no, under cap| A
  V -->|no, cap hit| Stop[Stop · summarize blocker]
  V -->|yes| Done[Done · show evidence]

This is why we don't need to invent new techniques for agentic quality — the discipline already exists. TDD and BDD were once highly recommended practices of professional product engineers.

In the agentic era they're mandatory: they're what turns a wandering agent into one that converges instead of declaring success on vibes.

These old practices are now essential — that's what makes it engineering and not just generation.

This is just your Definition of Done, made executable. In agile, the DoD is the shared, explicit checklist a work item must satisfy to count as finished — tests pass, code reviewed, lint clean, types check, docs updated, deployed to staging.

It used to be enforced by team discipline. With agents it becomes the machine-checkable contract the loop runs against: every DoD item that can be a command (tests, lint, typecheck, build, e2e, coverage threshold) becomes part of the bar the agent iterates toward; the few that can't be automated — "does this actually solve the user's problem?" — stay human gates.

The whole skill is writing a DoD precise enough that the agent knows, without you, whether it's done.

The deeper shift: the spec is the new source code. When the loop regenerates the implementation on demand against an executable spec, the code becomes throwaway output and the SPEC.md + tests become the artifact you actually author, version, and review. Your job moves up — from writing syntax to writing the rules the agent executes against.

You decide what to test for; the agent doesn't. The tests that catch real bugs are usually integration and end-to-end, not single units. So treat those as first-class: you say what they must check, the agent writes them to your spec and adds them to CI, and you confirm they check the right things. Build the rails; don't let the agent build its own. (Raised in #1.)

31. Make your Definition of Done executable — a command is what the loop converges to.

Instead of: "Make it work."

Prefer: spell out the DoD as commands the agent runs itself: "Done when: npm test green, npm run lint clean, npm run typecheck passes, and curl /health returns 200. Loop until all four pass; show each output. Flag anything you couldn't verify."

32. Do TDD — the unit-level check.

Instead of: "Implement and test calculateTotal."

Prefer: drive the code with a failing test first.

How:

1. "Write failing tests for calculateTotal (price, discount, tax). No mocks. Run them; show they fail."
2. [commit the failing tests — your safety net]
3. "Implement until green. Do NOT edit the tests."

Committing first means any quiet test-weakening shows up in the diff.

Passing unit tests doesn't mean the app works — the worst bugs usually show up in integration and end-to-end tests, not single units.

33. Use BDD — the behavior-level check.

Instead of: vague acceptance ("it should handle bad input").

Prefer: Given/When/Then scenarios the agent codes against and runs.

How: take the Given/When/Then scenarios you shaped in the spec (Tier 2, Tip 20) and make them the loop's exit check — hand the agent the .feature and a shared gherkin-guidelines.md, then: "Generate step definitions for refund.feature, implement until the scenarios pass, and sanity-check by mutation — break the implementation on purpose, confirm the scenario fails, then revert."

Mental model: TDD = test first (developer level) · BDD = behavior first in business language (acceptance level) · SDD = whole spec first, agent generates code + tests + docs. They nest — Gherkin is the executable middle layer between a user story and unit tests, and all three give the loop a target it can test itself against.

34. Test the UI for real with Playwright MCP — don't eyeball it.

Instead of: "make sure the checkout page looks right."

Prefer: have the agent drive a live browser, perform the flow, and assert.

How: with Playwright MCP connected:

Open localhost:3000, add two items to the cart, go to checkout, verify the
total reads $40 and "Payment received" appears. Save the run as an e2e spec.

It works off the accessibility tree (~200–400 tokens/snapshot), not screenshot guessing. Division of labor: Claude in Chrome for interactive dev, Playwright MCP for the suite/CI, Chrome DevTools MCP for performance.

35. Demand evidence, not a claim of success.

Instead of: accepting "done, it works."

Prefer: "Paste the exact test command and its output before you say it's done."

36. Ask for all findings, not a conservative review.

Instead of: "Only flag high-severity issues." (4.8 obeys and hides real ones)

Prefer: "List every issue with a severity label and line reference. Don't pre-filter — I'll triage."

37. Review with fresh eyes, not the context that wrote it.

Instead of: the same session grading its own code.

Prefer: a fresh subagent/session reviews the diff against the criteria; correctness only, not style. (Stronger still: have a different model review it — see multi-model use.)

38. Run a pre-mortem; treat it like a teammate.

Instead of: "Looks great, ship it."

Prefer: "What could fail here? What did you assume? What's missing?"

39. Iterate UI visually when there's no spec to assert.

Instead of: "Make the dashboard look good."

Prefer: mock → implement → screenshot → compare → fix (2–3 rounds). Override Opus 4.8's default house style (cream, serif, terracotta) with a concrete palette spec.