Green Status Is Not Shipped Work

The most dangerous failure mode in an agent workflow is not always the red error.

A red error is honest. It interrupts you. It says the run failed, the tool broke, the permission was missing, the dependency was wrong, or the model hit a wall.

The more expensive failure is a clean green status attached to an incomplete outcome.

The run ended. The dashboard looks calm. The orchestration layer believes the work is done. But the thing the human actually needed is missing, stale, incomplete, off-scope, unpublished, visually broken, citation-thin, or unsafe to ship.

That distinction matters because agent systems are getting good enough that people are starting to treat workflow completion as outcome completion. They see a successful run and mentally close the loop.

That is the wrong acceptance model.

Green status is not shipped work. It is only a claim about the process. Acceptance has to inspect the deliverable.

Green status says the run ended. Acceptance says the promised thing exists and works.

The false comfort of green internal status

Every automation system needs internal status. Queues need to know whether a worker started. Schedulers need to know whether a step exited. Dashboards need to show progress, failures, retries, and blocked work.

That is useful operating metadata. It is not the same as proof.

A workflow can finish cleanly while covering the wrong scope. It can satisfy the instruction it interpreted instead of the promise the operator thought they made. It can produce a file that exists but lacks the required section. It can write a draft without the planned diagram. It can summarize a source without preserving the claim it was supposed to support. It can mark a content task complete while the public artifact still fails the reader-facing standard.

None of those failures require the orchestration layer to lie. The dashboard can be perfectly accurate about the run and still incomplete about the outcome.

This is the same reason mature software teams do not treat activity as done. The Agile Alliance definition of done is built around agreed criteria that must be met before work is counted as complete, not merely before someone stops working on it: https://www.agilealliance.org/glossary/definition-of-done/

Agent workflows need the same separation.

Workflow-state completion answers: did the system reach the end of its process?

Deliverable-state acceptance answers: does the promised thing exist, work, and meet the standard we named before execution?

Those are different questions. If your system answers the first and your team treats it as the second, you have built optimism into the operating model.
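One way to keep the two answers from collapsing into a single field is to model them as separate values. The sketch below is illustrative only: `RunState`, `Acceptance`, and `is_shipped` are hypothetical names, not part of any particular orchestration tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RunState(Enum):
    """What the orchestration layer can observe about the process."""
    SUCCEEDED = "succeeded"
    FAILED = "failed"


@dataclass(frozen=True)
class Acceptance:
    """What a reviewer or checker concluded about the deliverable (hypothetical type)."""
    accepted: bool
    reason: str


def is_shipped(run_state: RunState, acceptance: Optional[Acceptance]) -> bool:
    # A green run with no acceptance record is "finished", not "shipped".
    if run_state is not RunState.SUCCEEDED:
        return False
    return acceptance is not None and acceptance.accepted
```

With this shape, `is_shipped(RunState.SUCCEEDED, None)` is false: the run ended, but nobody has answered the second question yet.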

Why dashboard success and artifact success diverge

Dashboards tend to reflect the layer they can observe cheaply.

They can observe whether a worker claimed a step. They can observe whether a process exited. They can observe whether a status field changed. They can observe whether a dependency unblocked. They can observe whether a retry happened.

But the user-facing promise usually lives somewhere else.

The promise might be a published page. A pull request. A report. A diagram. A source-backed recommendation. A visual asset. A migration that actually changed the intended environment. A customer email that preserved the right tone. A knowledge artifact that can be found again.

The dashboard might know the run ended. It may not know whether the page renders, whether the claim is supported, whether the image exists, whether the route works, whether the freshness label is honest, whether the private boundary was respected, or whether the final artifact matches the brief.

That is the gap.

A dashboard can be right about the process and still incomplete about the outcome.

Good delivery systems already acknowledge this. Continuous delivery is not just a celebration of jobs finishing. Martin Fowler describes continuous delivery as keeping software deployable throughout its lifecycle, supported by fast automated feedback on production readiness and repeatable deployments: https://martinfowler.com/bliki/ContinuousDelivery.html

The important idea is not the deployment tooling. It is the discipline: readiness is proven by feedback tied to the thing being shipped.

GitHub environments make a similar distinction in a concrete way. A job can reference an environment, but environment protection rules can require reviewers, wait timers, or other gates before deployment proceeds or protected resources are accessed: https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment

The job is not the release. The run is not the acceptance. The internal state is one signal upstream of a separate gate.

Agent workflows need that gate because agents are not just executing deterministic scripts. They are interpreting intent, using tools, transforming context, making judgment calls, and often producing artifacts that require taste, source discipline, or boundary awareness.

That makes deliverable acceptance more important, not less.

A deliverable-level acceptance checklist

The fix is not to distrust every green light. The fix is to stop asking green lights to do a job they were never designed to do.

Use internal status for process control. Use deliverable-level acceptance for outcome control.

A practical acceptance check starts with the promise.

Before the run, write down what will count as a delivered outcome. Not the step name. Not the vague instruction. The artifact.

For a content workflow, the promise might be: a draft post with frontmatter, a clear thesis, cited sources, three planned visuals, no private operational details, and a readback proving the draft is still marked as draft.

For an engineering workflow, the promise might be: a merged change, passing tests, updated docs, a live route responding correctly, and a rollback note.

For an analysis workflow, the promise might be: a report with the decision served, source coverage, assumptions, tradeoffs, and next actions.

Then acceptance checks the artifact against that promise.
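Writing the promise down can be as light as a small record created before the run starts. This is a minimal sketch using the content-workflow example above; the `Promise` type, its field names, and the file path are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Promise:
    """The deliverable named before execution (all field names are illustrative)."""
    artifact: str                     # what class of thing is promised
    location: str                     # canonical place it should live (hypothetical path below)
    required: tuple[str, ...] = ()    # conditions that acceptance will check


content_promise = Promise(
    artifact="draft post",
    location="posts/example-draft.md",  # hypothetical location
    required=(
        "frontmatter present",
        "clear thesis",
        "sources cited",
        "three planned visuals",
        "no private operational details",
        "readback shows status is still draft",
    ),
)


def unmet(promise: Promise, verified: set[str]) -> list[str]:
    """Return the promised conditions that have not been verified yet."""
    return [item for item in promise.required if item not in verified]
```

Calling `unmet(content_promise, {"frontmatter present", "clear thesis"})` returns the four conditions still lacking verification, which is a more useful answer than a single status flag.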

A useful checklist looks like this:

Promise match: What did we ask the agent to deliver, and is the artifact the same class of thing?
Existence: Can we read the artifact from the canonical place where it is supposed to live?
Completeness: Are all required sections, assets, citations, tests, or fields present?
Freshness: Was the artifact generated or verified recently enough to trust it?
Quality: Does it meet the actual standard, not just the schema?
Boundary safety: Did it avoid private details, secrets, internal IDs, or claims that should not ship?
User-visible proof: Can we verify the route, file, report, page, image, API response, or other final surface directly?
Evidence readback: Does the agent provide evidence that maps to the promise instead of a generic success message?

Accept the run only when the evidence maps back to the promise.
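Some of these items can be checked mechanically. The sketch below covers existence, completeness, freshness, and boundary safety for a text file on disk, under the assumption that the artifact is a local file and that required sections and forbidden terms can be matched as plain substrings. Promise match, quality, and user-visible proof usually still need a reviewer or a separate check. The function names are hypothetical.

```python
import time
from pathlib import Path

EXPECTED_CHECKS = {"existence", "completeness", "freshness", "boundary_safety"}


def check_artifact(path: Path, required_sections: list[str],
                   forbidden_terms: list[str], max_age_seconds: float) -> dict[str, bool]:
    """Run a few mechanical checklist items against a text file."""
    results = {"existence": path.is_file()}
    if not results["existence"]:
        return results  # nothing else can be checked without the artifact

    text = path.read_text(encoding="utf-8")
    lowered = text.lower()
    results["completeness"] = all(section in text for section in required_sections)
    results["freshness"] = (time.time() - path.stat().st_mtime) <= max_age_seconds
    results["boundary_safety"] = not any(term.lower() in lowered for term in forbidden_terms)
    return results


def accept(results: dict[str, bool]) -> bool:
    # A missing key means a check was skipped, which is treated as not accepted.
    return EXPECTED_CHECKS.issubset(results) and all(results.values())
```

Treating a skipped check as a failure is a deliberate choice in this sketch: it keeps a partial result from being read as a pass.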

The checklist should be boring. That is the point.

Boring acceptance checks are how you avoid dramatic cleanup later.

The best version is not a paragraph where the agent says, "Done." It is a proof packet: here is the artifact, here is where it lives, here are the checks that passed, here are the checks that were skipped, here are the remaining risks, and here is the evidence you can inspect.

That moves the human from trust-by-status to trust-by-readback.
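A proof packet can be a plain record rather than free text. The following is a sketch under the assumption that each passed check should point at something a human can inspect; `ProofPacket` and its fields are illustrative names, and the sample path is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ProofPacket:
    """Structured readback for one deliverable (field names are illustrative)."""
    artifact: str
    location: str
    passed: list[str] = field(default_factory=list)
    skipped: list[str] = field(default_factory=list)
    risks: list[str] = field(default_factory=list)
    evidence: dict[str, str] = field(default_factory=dict)  # check name -> inspectable pointer

    def unsupported(self) -> list[str]:
        """Passed checks that cite no evidence a human could inspect."""
        return [check for check in self.passed if not self.evidence.get(check)]

    def summary(self) -> str:
        return (f"{self.artifact} at {self.location}: "
                f"{len(self.passed)} passed, {len(self.skipped)} skipped, "
                f"{len(self.risks)} open risks, {len(self.unsupported())} unsupported")


packet = ProofPacket(
    artifact="report",
    location="reports/example.md",  # hypothetical location
    passed=["sections", "citations"],
    skipped=["link check"],
    evidence={"sections": "headings listed in readback"},
)
```

Here `packet.unsupported()` returns `["citations"]`: the check is claimed as passed, but nothing in the packet lets a reader confirm it.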

How to make agents report evidence instead of optimism

Most agent reporting is too optimistic because we ask for completion instead of evidence.

If the instruction is "finish the task," the agent will often optimize toward reaching a terminal state. If the instruction is "produce the artifact and prove it meets these acceptance conditions," the operating loop changes.

The agent now has to preserve the promise, inspect the output, and report the evidence.

That is the reporting contract I want more agent systems to adopt:

Do not say complete unless the deliverable has been read back.
Do not summarize internal progress as if it were user-facing proof.
Do not hide skipped checks behind a green label.
Do not make dashboards look authoritative when they only have partial coverage.
Do not treat a successful run as publishable unless publishability was part of the acceptance gate.
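Part of this contract can be enforced in the code that produces the status label. This is a hedged sketch, not a standard: `report_status` and its label strings are hypothetical, and it assumes the caller knows whether a readback happened and which checks were skipped.

```python
def report_status(run_succeeded: bool, read_back: bool, skipped_checks: list[str]) -> str:
    """Pick a status label that does not overstate what was verified."""
    if not run_succeeded:
        return "failed"
    if not read_back:
        # The process ended, but nobody has looked at the deliverable.
        return "run finished, deliverable unverified"
    if skipped_checks:
        # Skipped checks are named in the label instead of hidden behind it.
        return "verified with skipped checks: " + ", ".join(skipped_checks)
    return "complete"
```

The word "complete" is reachable only when the run succeeded, the deliverable was read back, and nothing was skipped.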

This is also where monitoring philosophy helps. Google’s SRE book argues that monitoring should answer two questions: what is broken, and why. A monitoring surface should connect symptoms to causes, not just emit comforting state: https://sre.google/sre-book/monitoring-distributed-systems/

Agent dashboards need the same honesty. A useful dashboard does not just show that a run ended. It shows what was promised, what artifact exists, what evidence was checked, what coverage the surface has, how fresh the data is, and what is still unknown.
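As a rough illustration, a dashboard row built from the promise might look like the sketch below. The function name, the dictionary keys, and the idea of passing check results as a name-to-boolean mapping are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def dashboard_row(promised: list[str], checked: dict[str, bool],
                  verified_at: datetime, now: datetime) -> dict:
    """Summarize one run in terms of the promise, not just the exit state."""
    unknown = [item for item in promised if item not in checked]
    failed = [item for item in promised if checked.get(item) is False]
    return {
        "promised": len(promised),
        "coverage": f"{len(promised) - len(unknown)}/{len(promised)} checked",
        "failed": failed,
        "unknown": unknown,  # promised items with no evidence either way
        "evidence_age_minutes": int((now - verified_at).total_seconds() // 60),
    }


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # fixed time for the example
row = dashboard_row(
    promised=["page renders", "sources cited", "image exists"],
    checked={"page renders": True, "sources cited": False},
    verified_at=now - timedelta(minutes=90),
    now=now,
)
```

The `unknown` list is the part most dashboards omit: it makes partial coverage visible instead of letting it blend into a green label.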

This is not anti-agent. It is the opposite.

If agents are going to do more real work, they need stronger acceptance contracts. The more autonomy we give them, the less we can afford to confuse process state with deliverable truth.

AI risk frameworks point in the same direction at a higher level. NIST’s AI Risk Management Framework organizes AI risk management around four functions (Govern, Map, Measure, and Manage), including defined and documented processes for performance, trustworthiness, and oversight: https://doi.org/10.6028/NIST.AI.100-1

For operators, the everyday version is simple: define the promise, require evidence, and accept the deliverable only after the evidence maps back to the promise.

That is the shift.

Not "did the agent finish?"

"Did the promised thing arrive, and can we prove it?"

Green status is a useful signal. It is not the finish line.

Replace workflow-state completion with deliverable-state acceptance.