By Amir Elion

Three Real AI Agents Worth Studying

Three honest case studies: what real AI agents have done, the fine print, and what an executive should take from each.

Why this paper exists

Most AI agent case studies are marketing. A vendor shows a clean demo, quotes a number with no baseline, and leaves out what a human had to do to make it work. For a leader trying to judge what agents can actually do, that is worse than nothing, because it sets an expectation no real deployment will meet.

This paper does the opposite. It takes three real agent projects, all public and documented, and tells each one straight: what the agents achieved, what the fine print says, and what to take from it. Two are genuine breakthroughs. One is a documented failure, and it may be the most useful of the three.

1. Stanford's Virtual Lab: agents that designed a real drug candidate

At Stanford, James Zou's group built what they call a Virtual Lab: a team of AI agents run like a research group. One agent acts as the principal investigator and directs the work. Others play specialist roles: an immunologist, a computational biologist, and a machine-learning engineer. A human scientist sets the direction and gives feedback at key points, the way a department head would.

They pointed it at a hard, real problem: designing nanobodies, small antibody-like proteins, that bind to recent variants of the COVID virus. The agents assembled a computational pipeline from real tools (ESM, AlphaFold-Multimer, and Rosetta) and used it to design 92 candidates. Human researchers then made those candidates in a wet lab and tested them. More than 90% expressed and were soluble, and two bound tightly to the newer JN.1 and KP.3 variants. The work was published in Nature in 2025, with the code open for anyone to check.

This is about as real as it gets. The output was not a slide. It was a molecule a lab built and measured, described in a peer-reviewed journal.

Look at the shape of it, because it is the management model from this series running at the frontier. The agents had a clear job, the right tools, and real autonomy to run the pipeline. A human stayed on the loop to set direction and judge the results. The human did not do the work. The human decided what work was worth doing and whether it was any good. That is Managing AI Agents Like Teammates, in a Nature paper.

2. Anthropic's Project Vend: the agent that ran a shop into the ground

Anthropic gave a Claude agent, nicknamed Claudius, a small real business: a shop in their office. It could message staff on Slack, search for products, email wholesalers, set prices, and place orders. Then they let it run and wrote up what happened, honestly.

It went badly. Staff talked it into discounts and then into giving stock away. It made strange calls, including stocking tungsten cubes, and at one point it became confused about whether it was a person. Across the experiment it lost around a thousand dollars. A later phase, with changes to how it was set up, did better.

This is the most useful case of the three, because it failed and was reported straight. Most companies bury this kind of result. Anthropic published it, which is itself a lesson in how to think about agents honestly.

Read the failure through the framework and it stops being mysterious. Claudius had a job that was too broad, too little context about how a real shop protects itself, and far too much autonomy with far too little oversight. This was not a failure of intelligence. Claudius runs on one of the most capable models in the world. It still lost money, because the job was vague, the context was thin, and nobody was watching closely enough. Give a brilliant agent those conditions and it will fail, at speed and at scale. That is the warning inside the framework, demonstrated in a fridge full of tungsten.

3. Google's AI co-scientist: a real result, read carefully

Google built a multi-agent system called the AI co-scientist, made of specialized agents that generate research hypotheses, argue against each other, rank the survivors, and hand a scientist a short list worth testing.
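For readers who want to see the shape of that loop, here is a minimal Python sketch of a generate, critique, and rank cycle. It is an illustration under stated assumptions, not Google's implementation: the names `generate`, `critique`, and `shortlist`, the placeholder candidates, and the toy ranking rule (fewest objections wins) are all hypothetical stand-ins for what would be model calls in a real system.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    text: str
    objections: list[str] = field(default_factory=list)


def generate(goal: str) -> list[Hypothesis]:
    # Hypothetical stand-in for a generation agent.
    # A real system would ask a model for candidate hypotheses.
    return [Hypothesis(f"{goal}: candidate {label}") for label in ("A", "B", "C")]


def critique(hypothesis: Hypothesis, known_objections: dict[str, list[str]]) -> Hypothesis:
    # Hypothetical stand-in for a critic agent.
    # Here the objections are simply looked up; a real critic would argue them out.
    hypothesis.objections = list(known_objections.get(hypothesis.text, []))
    return hypothesis


def shortlist(goal: str, known_objections: dict[str, list[str]], keep: int = 2) -> list[Hypothesis]:
    # Generate candidates, let the critic attack each one,
    # then rank by how few objections each drew and keep the top few.
    candidates = [critique(h, known_objections) for h in generate(goal)]
    ranked = sorted(candidates, key=lambda h: len(h.objections))
    return ranked[:keep]


if __name__ == "__main__":
    objections = {
        "example goal: candidate A": ["contradicts prior data", "not testable"],
        "example goal: candidate C": ["expensive to test"],
    }
    for h in shortlist("example goal", objections):
        print(h.text, "| objections:", len(h.objections))
```

The point of the sketch is the division of labor: the system narrows the field, and the short list still goes to a scientist who decides what is worth testing.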

The reported results are strong. Working with real labs, the system proposed drug-repurposing candidates for acute myeloid leukemia that were then validated in experiments, and it reproduced an unpublished finding about how bacteria share genes, reaching in two days a conclusion that had taken the original researchers years.

Read that last one carefully, because this is where hype usually hides. The original researchers already knew the answer. The system was not handed a blank page and asked to discover; it generated a hypothesis that matched a known but unpublished result. That is a real and impressive validation. It is not the same as cracking an open problem from scratch, and a careful leader keeps the two apart.

The honest description is in the name. It is a co-scientist. The value showed up as a human and a set of agents working together, with the human choosing the goal and judging the output. When you read any agent claim, look for two things: what the human did, and what the agent was already pointed toward. The interesting question is rarely whether the AI did it alone. It is whether the pair did better than either would have alone.

What the three have in common

Three different fields, three different outcomes, and the same thing deciding each one. The agents did real work in every case, and the result tracked the quality of the management around them. Stanford and Google got breakthroughs from a clear job, good tools, and a human setting direction and checking results. Anthropic got a thousand-dollar loss from a vague job, thin context, and no real oversight. The intelligence was roughly equal across all three. The management was not.

That is the executive takeaway, and it is the same one the rest of this series keeps reaching from different angles. The frontier of what agents can do is genuinely exciting. Whether you get the Stanford result or the Project Vend result comes down to management, not the model: whether you set the job, the context, the autonomy, the tools, and the oversight on purpose.

How to use this

Next time someone shows you an agent case study, run it through three questions before you let yourself be impressed. What did the agent actually produce, and did anyone independent verify it? What did a human do that the story is not emphasizing? And what was the agent already pointed at, as opposed to what it found on its own? Strong cases, like Stanford's, answer all three cleanly. Weak ones get vaguer the harder you look. Looking is the job.
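If it helps to keep the three questions in front of a team, they can be written down as a small checklist. The Python sketch below is one possible way to do that, offered as a note-taking aid rather than a scoring method. The class name `CaseStudyReview` and its field names are hypothetical, chosen for this example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class CaseStudyReview:
    # One yes/no answer per question in the text. All names are illustrative.
    output_independently_verified: bool
    human_role_disclosed: bool
    starting_point_disclosed: bool

    def open_questions(self) -> list[str]:
        # Return the questions the case study has not answered yet.
        checks = [
            (self.output_independently_verified,
             "What did the agent actually produce, and did anyone independent verify it?"),
            (self.human_role_disclosed,
             "What did a human do that the story is not emphasizing?"),
            (self.starting_point_disclosed,
             "What was the agent already pointed at, as opposed to what it found on its own?"),
        ]
        return [question for answered, question in checks if not answered]


if __name__ == "__main__":
    # A vendor demo that answers none of the three questions.
    demo = CaseStudyReview(False, False, False)
    for question in demo.open_questions():
        print("Still open:", question)
```

A case that answers all three cleanly returns an empty list. Anything left on the list is what to ask before being impressed.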

Frequently asked questions

Are these the most advanced AI agents that exist?

They are among the most public and best documented, which matters more for learning than raw capability. Plenty of impressive agent work happens inside companies and is never written up. These three are valuable because you can check them: a Nature paper, a published post-mortem, and a documented research system.

Is the Stanford result really AI doing science on its own?

No, and its authors do not claim it is. A human set the goal and judged the work, and humans ran the wet-lab validation. The agents did a large amount of skilled work inside that frame. The collaboration is the point, not a machine working alone.

Why feature a failure like Project Vend?

Because honest failures teach more than polished successes, and this one has a clear cause a leader can act on. It is the cleanest demonstration that agent outcomes are decided by management, not just model quality.

What should I take from all this for my own organization?

The same agent will succeed or fail depending on the job, context, autonomy, tools, and oversight you give it. Before you judge whether agents are ready for a task, judge whether you are ready to manage one. That is the subject of Managing AI Agents Like Teammates.