A code migration agent finishes its run, and the pipeline looks green. But several pieces of the code were never compiled, and it took days to catch. That’s not a model failure; that’s an agent deciding it was done before it actually was.
Many enterprises are now finding that production AI agent pipelines fail not because of the models’ abilities but because the model behind the agent decides to stop too early. Several methods to prevent premature task exits are available from LangChain, Google and OpenAI, though these often rely on separate evaluation systems. The newest comes from Anthropic: /goals on Claude Code, which formally separates task execution from task evaluation.
Coding agents work in a loop: they read files, run commands, edit code and then check whether the task is done.
Claude Code /goals essentially adds a second layer to that loop. After a user defines a goal, Claude continues to work turn by turn, but an evaluator model steps in after every turn to review the work and decide whether the goal has been achieved.
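That two-layer shape can be expressed in a few lines. The following is a generic sketch of the pattern, not Anthropic’s implementation: the worker and evaluator are passed in as hypothetical callables, and the evaluator gates every attempt to stop.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TurnResult:
    wants_to_stop: bool   # the worker believes the task is finished
    transcript: str       # what happened this turn (edits, test output)

def run_with_goal(
    goal: str,
    run_turn: Callable[[str], TurnResult],   # worker model: read, run, edit
    is_done: Callable[[str, str], bool],     # independent evaluator model
    max_turns: int = 50,
) -> str:
    transcript = ""
    for turn in range(1, max_turns + 1):
        result = run_turn(goal)
        transcript += result.transcript
        # The worker's own "I'm done" is never trusted on its own:
        # a separate evaluator must confirm against the goal.
        if result.wants_to_stop and is_done(goal, transcript):
            return f"goal met after {turn} turns"
    raise RuntimeError("goal not met within the turn budget")
```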
The two-model split
Orchestration platforms from all three vendors identified the same roadblock, but each approaches it differently. OpenAI leaves the loop alone and lets the model decide when it’s done, though it does let users attach their own evaluators. In LangGraph and Google’s Agent Development Kit, independent evaluation is possible, but developers have to define the critic node, write the termination logic and configure observability themselves.
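In LangGraph, for example, that wiring is explicit. Below is a minimal sketch of a worker/critic loop; the worker and critic functions are placeholders standing in for real model calls, the three-attempt stop rule is arbitrary, and the routing function is the developer-written termination logic the article describes.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class State(TypedDict):
    goal: str
    attempts: int
    done: bool

def worker(state: State) -> dict:
    # Placeholder for the model doing the actual work (edit, run, test).
    return {"attempts": state["attempts"] + 1}

def critic(state: State) -> dict:
    # Placeholder for an independent evaluator model or deterministic check.
    return {"done": state["attempts"] >= 3}

def route(state: State) -> str:
    # Developer-written termination logic: loop until the critic approves.
    return "finish" if state["done"] else "retry"

graph = StateGraph(State)
graph.add_node("worker", worker)
graph.add_node("critic", critic)
graph.set_entry_point("worker")
graph.add_edge("worker", "critic")
graph.add_conditional_edges("critic", route, {"retry": "worker", "finish": END})
app = graph.compile()

print(app.invoke({"goal": "all tests pass", "attempts": 0, "done": False}))
```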
Claude Code /goals makes the independent evaluator the default, however long or short the run. The developer sets the goal completion condition via a prompt, for example: /goal all tests in test/auth pass, and the lint step is clean. Claude Code then runs, and every time the agent attempts to end its work, the evaluation model, Haiku by default, checks the work against that condition. If the condition is not met, the agent keeps running. If it is met, Claude Code logs the achieved condition to the conversation transcript and clears the goal. The evaluator makes only one binary decision, done or not done, which is why the smaller Haiku model works well.
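That check can be a cheap, single-shot classification. Here is a rough illustration of the binary call using the Anthropic Python SDK; the prompt wording, the DONE/NOT_DONE output format and the model alias are illustrative assumptions, not Anthropic’s published internals.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def evaluator_says_done(condition: str, evidence: str) -> bool:
    """Ask a small model for a single binary verdict: done or not done.
    Illustrative sketch, not Claude Code's actual evaluator prompt."""
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",  # assumed alias; pin a real model ID
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                f"Completion condition: {condition}\n"
                f"Evidence from the run:\n{evidence}\n"
                "Reply with exactly DONE or NOT_DONE."
            ),
        }],
    )
    return resp.content[0].text.strip() == "DONE"
```

A function like this could be passed as the is_done callable in the loop sketch above.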
Claude Code makes this possible by separating the model that attempts the task from the evaluator model that verifies it was actually completed, which prevents the agent from conflating what it has already accomplished with what still needs to be done. With this method, Anthropic noted, there’s no need for a third-party observability platform (though enterprises are free to keep using one alongside Claude Code), no need for custom logging, and less reliance on post-mortem reconstruction.
Competitors support similar evaluation patterns: Google’s ADK, for instance, offers a LoopAgent, but developers have to architect that termination logic themselves.
In its documentation, Anthropic said the most successful conditions usually have three things (sketched in code after the list):
One measurable end state: a test result, a build exit code, a file count, an empty queue
A stated check: how Claude should prove it, such as “npm test exits 0” or “git status is clean.”
Constraints that matter: anything that must not change on the way there, such as “no other test file is modified”
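Those three ingredients map naturally onto a deterministic script. The hypothetical verification helper below is built around the checks Anthropic’s examples name (npm test exits 0, git status is clean, no test file modified); the git-diff constraint assumes the agent’s work landed as the most recent commit.

```python
import subprocess

def run(cmd: list[str]) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, capture_output=True, text=True)

def goal_is_met() -> bool:
    # Measurable end state, with its stated check: "npm test exits 0"
    if run(["npm", "test"]).returncode != 0:
        return False
    # Stated check: "git status is clean" (porcelain output is empty)
    if run(["git", "status", "--porcelain"]).stdout.strip():
        return False
    # Constraint that matters: no test file was modified on the way there
    # (assumes the agent's changes are in the latest commit; illustrative)
    diff = run(["git", "diff", "--name-only", "HEAD~1"]).stdout
    if any(name.startswith("test/") for name in diff.splitlines()):
        return False
    return True
```

Checks like these are the “verifiable end-state” case; a model evaluator only needs to step in where the condition can’t be scripted.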
Reliability in the loop
For enterprises already managing sprawling tool stacks, the appeal is a native evaluator that doesn’t add another system to maintain.
This is part of a broader trend in the agentic space, especially as stateful, long-running and self-learning agents become more of a reality. Evaluator models, verification layers and other independent adjudicators are starting to show up in reasoning systems and, in some cases, in coding agents like Devin or SWE-agent.
Sean Brownell, solutions director at Sprinklr, told VentureBeat in an email that there is interest in this kind of loop, where the task and judge are separate, but he feels there is nothing unique about Anthropic’s approach.
“Yes, the loop works. Separating the builder from the judge is sound design because, fundamentally, you can’t trust a model to judge its own homework. The model doing the work is the worst judge of whether it’s done,” Brownell said. “That being said, Anthropic isn’t first to market. The most interesting story here is that two of the world’s biggest AI labs shipped the same command just days apart, but each of them reached entirely different conclusions about who gets to declare ‘done.’”
Brownell said the loop works best “for deterministic work with a verifiable end-state like migrations, fixing broken test suites, clearing a backlog,” but for more nuanced tasks or those needing design judgment, a human making that decision is far more important.
Bringing that evaluator/task split to the agent-loop level shows that companies like Anthropic are pushing agents and orchestration toward more auditable, observable systems.