Caging Agent Chaos
December 15, 2025

Everyone's waiting for the model that finally makes agents work. It came out a year ago.
There's a consensus forming in tech that agents "aren't ready yet." Wait for better reasoning. Wait for the next breakthrough.
We've been shipping agent systems at Structify for months. They explore databases, write code, fix their own mistakes, and produce outputs customers rely on daily. The models aren't the bottleneck. The infrastructure is.
The same LLM that hallucinates wildly in a naive implementation becomes remarkably reliable when you give it the right constraints. Sandboxed execution where a hallucinated rm -rf can't destroy anything. Typed tool interfaces where the agent either produces valid actions or gets actionable errors. Structured state instead of raw chat history that grows until the agent loses the plot.
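To make the typed-tool-interface idea concrete, here's a minimal sketch (not Structify's actual code): a proposed action is validated against a declared schema before it ever touches the sandbox, and anything invalid comes back as a structured, actionable error the agent can read and correct. The tool names and schemas are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical tool registry: each tool declares the argument names and types it accepts.
TOOL_SCHEMAS = {
    "run_sql": {"query": str, "row_limit": int},
    "read_file": {"path": str},
}

@dataclass
class ToolResult:
    ok: bool
    message: str  # on failure, an actionable error the agent can act on

def validate_action(tool: str, args: dict) -> ToolResult:
    """Check a proposed tool call before anything reaches the sandbox."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return ToolResult(False, f"Unknown tool '{tool}'. Available: {sorted(TOOL_SCHEMAS)}")
    missing = [k for k in schema if k not in args]
    if missing:
        return ToolResult(False, f"'{tool}' is missing arguments: {missing}")
    wrong = [k for k, t in schema.items() if k in args and not isinstance(args[k], t)]
    if wrong:
        return ToolResult(False, f"'{tool}' received wrong types for: {wrong}")
    return ToolResult(True, "valid")

# An invalid call yields an error the agent can fix, not a crash:
print(validate_action("run_sql", {"query": "SELECT 1"}).message)
# -> 'run_sql' is missing arguments: ['row_limit']
```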
Here's the insight that changed how we build: nondeterminism compounds. The longer you let an agent spin without external feedback, the more variance accumulates. It's why Claude Code uses Opus for planning and lets smaller models execute each step. The hierarchy introduces checkpoints. Determinism returns at the boundaries.
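Here's a sketch of that shape, assuming nothing about how Claude Code or Structify implement it internally. A planner produces discrete steps, a cheaper executor handles each one, and a deterministic check runs at every boundary; `plan`, `execute`, and `verify` are placeholders you'd supply.

```python
# Checkpointed hierarchy: variance stops compounding at each verify() boundary.
def run_with_checkpoints(task, plan, execute, verify, max_retries=2):
    steps = plan(task)  # one planning call up front
    results = []
    for step in steps:
        for attempt in range(max_retries + 1):
            output = execute(step, results)       # smaller model does the work
            ok, feedback = verify(step, output)   # deterministic check at the boundary
            if ok:
                results.append(output)
                break
            # Retry this one step with concrete feedback instead of drifting further.
            step = f"{step}\nPrevious attempt failed: {feedback}"
        else:
            raise RuntimeError(f"Step failed after {max_retries + 1} attempts: {step}")
    return results
```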
The other half is letting agents fail gracefully. When you demand perfect success, the agent does whatever it takes to please you—even when the task is impossible. This is why LLMs wrap everything in try/except when they generate code. They're optimizing for "didn't crash" rather than "actually worked."
Our agents can say "I don't know." They can surface errors and ask for clarification. They operate in loops where failures feed back into context so the agent can iterate. A frustrating crash becomes a self-correcting cycle.
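A sketch of that loop, under the assumption of a generic model client and sandboxed tool runner (`call_model` and `run_tool` are stand-ins, not a specific API): errors don't end the run, they become the next turn's context, and "I don't know" is a legal outcome.

```python
def agent_loop(task, call_model, run_tool, max_turns=10):
    context = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = call_model(context)
        if action["type"] == "give_up":
            # The agent may ask for clarification instead of inventing an answer.
            return {"status": "needs_clarification", "question": action["content"]}
        if action["type"] == "final_answer":
            return {"status": "done", "answer": action["content"]}
        try:
            observation = run_tool(action["tool"], action["args"])
        except Exception as err:
            # The crash becomes context; next turn, the agent sees exactly what broke.
            observation = f"Tool error: {err}"
        context.append({"role": "assistant", "content": str(action)})
        context.append({"role": "tool", "content": str(observation)})
    return {"status": "gave_up", "reason": "turn limit reached"}
```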
The gap between demo and production isn't intelligence. It's guardrails, recovery loops, and state management. Build those right and today's models are already good enough.
The breakthrough is already here. Cage it.

