The Hard Part Of Agents Is Not How Smart The Model Is

9 min read
DingZhiyu
Southwest Petroleum University

Here is what happened.

On April 7, I spent some time reading LangChain's article, The Anatomy of an Agent Harness.

At first, I was drawn in by the word "harness."

It has been appearing more and more often in the Agent world recently, but it is also a slightly awkward word. If you translate it literally as "tack" or "restraint gear," it sounds strange. If you translate it as "framework," it feels too light, as if it were only a few layers of code wrapping. "Exoskeleton" is a bit closer, but still not exact.

In any case, it points to this thing:

the whole system outside the model that lets an Agent actually do work.

My strongest feeling after reading it was that, in the past, we may have stared too much at the model itself.

Which model is smarter.

Which model reasons better.

Which model writes code better.

These things are of course important. You cannot take a very weak model and insist that engineering alone can make up for it. That is unrealistic.

But if you have really used Claude Code, Codex, Cursor, or built Agents yourself with LangGraph, the OpenAI Agents SDK, or CrewAI, you quickly run into an annoying problem.

A smart model does not equal a reliable system.

It can think of the next step, but that does not mean it knows when to stop.

It can call tools, but that does not mean it knows which tools must not be used carelessly.

It can write files, but that does not mean it knows how to roll back a bad change.

It can read context, but that does not mean it will not tie itself in knots after a dozen turns.

It is like putting a very smart person into a car with no steering wheel, no brakes, and no dashboard.

The brain is good.

The car is not.

What moved me most in the LangChain article was that it made this problem very concrete. It was not vaguely saying that Agents need engineering. It took the harness apart and showed you planning, the filesystem, subagents, stateful middleware, context engineering, tool permissions, and the evaluation loop.

None of these words look exciting.

But every one of them is critical.

Take planning, for example.

Many people may think a plan is just asking the model to write a todo list. Sounds ordinary, right?

But think about what most often goes wrong when an Agent does a complex task. It is not that it completely cannot do the work. It is that it drifts while doing it. It starts out fixing a bug, then halfway through begins refactoring the whole project. It starts out writing an article, then halfway through adds a pile of background material barely related to the topic.

At that moment, a plan is not decoration.

A plan is a rope.

It is not there to restrict the model for the sake of restriction. It is there so that every so often the model can look back and see what it is actually doing.
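To make that concrete, here is a minimal sketch of a plan as a rope: the agent writes the steps down first, and the harness renders the checklist back into the prompt before each action, so drift becomes visible. The class and field names are my own illustration, not any framework's actual API.

```python
# A plan the harness re-shows the model between actions.
# Everything here is illustrative, not a real framework API.
from dataclasses import dataclass, field

@dataclass
class Plan:
    goal: str
    steps: list[str]
    done: list[bool] = field(default_factory=list)

    def __post_init__(self):
        self.done = [False] * len(self.steps)

    def mark_done(self, index: int) -> None:
        self.done[index] = True

    def render(self) -> str:
        """Render the checklist so it can be injected back into the prompt each turn."""
        lines = [f"Goal: {self.goal}"]
        for i, (step, ok) in enumerate(zip(self.steps, self.done)):
            lines.append(f"{i + 1}. [{'x' if ok else ' '}] {step}")
        return "\n".join(lines)

plan = Plan(
    goal="Fix the failing login test",
    steps=["Reproduce the failure", "Locate the bug", "Patch it", "Re-run the test suite"],
)
plan.mark_done(0)
print(plan.render())  # the agent sees this before deciding its next action
```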

The filesystem is the same.

We used to think the longer the context window, the better. The longer the window, the more the model remembers, and the stronger the Agent becomes.

That is only half true.

Long context is useful, but you cannot stuff everything into it. Code repositories, logs, drafts, materials, previous attempts, intermediate results. Once everything goes into the context, the model looks as if it knows everything, but in reality its attention becomes scattered.

The filesystem is another brain.

Write down what should be remembered. Read it back when it should be read. You do not need to stuff the whole world into the prompt. You only need the model to know where things are, and when to retrieve them.
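What that can look like, as a minimal sketch: two homemade tools and a scratch directory, so the prompt only carries pointers. The paths and tool names are assumptions for illustration.

```python
# "The filesystem is another brain": instead of stuffing logs, drafts, and
# past attempts into the context, the agent gets two small tools and keeps
# only pointers in the prompt. Paths and names are illustrative.
from pathlib import Path

SCRATCH = Path("agent_scratch")
SCRATCH.mkdir(exist_ok=True)

def write_note(name: str, content: str) -> str:
    """Tool: persist an intermediate result; return the path, not the content."""
    path = SCRATCH / f"{name}.md"
    path.write_text(content, encoding="utf-8")
    return str(path)

def read_note(name: str) -> str:
    """Tool: retrieve a note only when the current step actually needs it."""
    return (SCRATCH / f"{name}.md").read_text(encoding="utf-8")

# The context only needs to say "the previous attempt is in failed_attempt_1.md",
# not carry the whole thing on every turn.
write_note("failed_attempt_1", "Tried patching auth.py directly; tests still failed.")
print(read_note("failed_attempt_1"))
```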

This has also become one of my strongest feelings recently.

An Agent is not a model that is better at chatting.

An Agent is a model placed into an environment.

Once the environment becomes complex, the problem changes.

The early ReAct pattern had already exposed this issue. The model thinks one step, acts, observes, and then continues thinking. Back then it was still a very simple loop of Thought, Action, and Observation, almost like hand-building a small operating loop inside a prompt.
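That loop really can be hand-built in a few lines. The sketch below uses a scripted stand-in for the model so the loop itself runs; in practice the model call and the output parsing are the fragile parts.

```python
# A hand-built ReAct-style loop: think, act, observe, repeat inside one prompt.
# The "model" here is a scripted toy so the loop can execute as-is;
# swap in a real LLM call in practice.
def toy_model(transcript: str) -> str:
    # Pretend reasoning: search once, then answer.
    if "Observation:" not in transcript:
        return " I need more information. Action: search[agent harness]"
    return " I have enough now. Final Answer: the harness does the steering."

def run_tool(name: str, argument: str) -> str:
    tools = {"search": lambda q: f"(search results for {q!r})"}
    return tools[name](argument)

def react_loop(task: str, max_turns: int = 10) -> str:
    transcript = f"Task: {task}"
    for _ in range(max_turns):
        thought = toy_model(transcript)
        transcript += f"\nThought:{thought}"
        if "Final Answer:" in thought:
            return thought.split("Final Answer:", 1)[1].strip()
        if "Action:" in thought:
            name, _, arg = thought.split("Action:", 1)[1].partition("[")
            observation = run_tool(name.strip(), arg.rstrip("]").strip())
            transcript += f"\nObservation: {observation}"
    return "stopped: turn limit reached"  # the loop, not the model, enforces the stop

print(react_loop("explain what steers an agent"))
```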

Then AutoGPT had its moment. For the first time, many people saw a model decomposing tasks, searching, writing files, and continuing to iterate by itself. It felt astonishing.

And very soon, people saw the other side.

It gets lost.

It falls into loops.

It works very hard, goes in a big circle, and finally hands over something no one knows how to evaluate.

That moment already proved that a model's ability to think is not enough. You still have to give it state, permissions, tool boundaries, error recovery, and places where humans can intervene.
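Here is a sketch of just one of those pieces: a tool gate that puts a human checkpoint in front of dangerous operations and turns failures into observations the model can react to. The tool names, the danger list, and the confirmation mechanism are all assumptions for illustration.

```python
# The harness, not the model, decides which tool calls need a human in the loop.
# Everything named here is illustrative.
DANGEROUS_TOOLS = {"delete_file", "run_shell", "push_to_main"}

def confirm_with_human(tool: str, args: dict) -> bool:
    answer = input(f"Agent wants to call {tool} with {args}. Allow? [y/N] ")
    return answer.strip().lower() == "y"

def execute_tool_call(tool: str, args: dict, registry: dict) -> str:
    if tool not in registry:
        return f"error: unknown tool {tool!r}"             # a boundary, not a crash
    if tool in DANGEROUS_TOOLS and not confirm_with_human(tool, args):
        return "denied: a human declined this operation"   # the agent sees this and replans
    try:
        return registry[tool](**args)
    except Exception as exc:                               # error recovery: report, don't die
        return f"tool failed: {exc}"

registry = {
    "read_file": lambda path: open(path, encoding="utf-8").read(),
    "delete_file": lambda path: "deleted " + path,
}
# A read of a missing file comes back as an observation, not an exception:
print(execute_tool_call("read_file", {"path": "does_not_exist.md"}, registry))
# A dangerous call is routed through a human before it runs:
print(execute_tool_call("delete_file", {"path": "notes.md"}, registry))
```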

Back to the harness.

I think what LangChain's article really wanted to say was not, "We have invented another new concept."

It was more like a reminder that a large part of Agent capability lives outside the model.

There was an interesting example in the article. They said that with the model unchanged, changing the harness moved a coding Agent from Top 30 to Top 5 on Terminal Bench 2.0.

That result should not be mythologized. A benchmark is always only a slice.

But it is enough to make the point.

The same model can perform very differently when placed inside different systems.

This is actually similar to people.

A smart person without schedules, documents, collaboration tools, review mechanisms, or reminders about priorities can also become chaotic. Conversely, someone whose raw ability is nothing extraordinary can still produce steadily with good processes, notes, tools, and feedback systems.

Agents are the same.

The model is the brain.

The harness is the work habit.

At this point, I suddenly understood why products like Claude Code are so compelling.

It does not only give you a model that writes code better. It gives the model a very concrete work setting. It can read a repo, edit files, run tests, inspect errors, confirm dangerous operations with you, and keep moving through a task.

Behind those experiences is the harness.

The OpenAI Agents SDK is moving in the same direction. It brings concepts like agent, handoff, guardrail, session, and tracing into the SDK. Anthropic's MCP is, in a sense, trying to standardize the way models connect to external tools. LangGraph feels more like a controllable Agent runtime for developers, so that state, branching, human intervention, and durable execution do not all have to be hand-built from scratch.

You can see it: everyone says they are talking about Agents.

But the real competition increasingly looks like a competition over whose shell is better.

This matters to ordinary developers too.

If you are only playing around, pick whatever product you like. If Claude Code feels good, use Claude Code. If Cursor suits you, use Cursor. If the OpenAI Agents SDK is convenient, use the OpenAI Agents SDK.

But if you really want to put an Agent into your own business, you cannot only ask whether the model is strong.

You also have to ask where its memory lives.

Who manages tool permissions.

How it recovers from failure.

Whether logs can be replayed.

Whether humans can step in at critical points.

Whether this workflow can be carried over when you change models later.

This question will only become more important.

Because after using an Agent for a long time, the most valuable thing may not be any single answer, but the way of working that accumulates around it. It knows how you write code, how you organize materials, your team's process, which commands are dangerous, and which files should not be touched.

If all of that sinks into a closed-source product, you may feel very comfortable in the short term.

In the long term, you may feel uneasy.

This is not to say closed-source products are bad.

I use them too, and in many cases they are indeed good.

But we need to know what we are handing over.

We used to say that model vendors sell intelligence. Now it seems that, in the future, they may sell an entire work environment. The longer you work inside it, the more it understands you, and the harder it becomes to leave.

These are big times, my friends.

So my mindset when looking at Agents is no longer quite the same.

Before, when I saw a new Agent demo, I would first ask which model it used.

Now I first ask what its harness is.

Does it have a planning system?

Does it have external memory?

Does it have tool permissions?

Does it have observability?

Does it have rollback and evaluation?

Is it a smart brain that can chat, or a small system that can truly do work reliably?

That is the difference.

After reading the article on April 7, one image stayed in my mind.

A model stands at the center of the stage, and everyone is watching how smart it is.

But lighting, sound, teleprompter, backstage coordination, stage machinery, and emergency brakes are all hidden in the dark.

The audience sees the actor.

What really decides whether the performance can finish smoothly is the whole theater.

Agents are the same.

We will of course keep chasing stronger models.

But perhaps from now on, we also need to look carefully at the theater beneath the model's feet.