When Agents Work in Teams

What changes when you stop managing one agent and start running a team of them, from a teammate in your chat to a panel that deliberates.

Why this paper exists

The rest of this series has been about one agent: how to give it a job, the right context, the right amount of autonomy, the tools it needs, and real oversight. That is the subject of "Managing AI Agents Like Teammates", and it is where every leader should start.

But the more interesting question, and the one organizations are starting to hit in practice, is what happens when there is more than one. Agents are showing up in teams now. They sit in group chats alongside your people, run whole processes in formation, and in research labs they hold meetings with each other. The patterns are real and shipping in 2026. They also quietly break some of the management instincts you spent a career building, because those instincts were shaped by the limits of human attention, and agents do not share those limits.

This paper works through three of these patterns, each a step larger than the last: an agent as a member of your team, a fleet of agents running a process, and a panel of agents deliberating to produce original work. Then it gets to the part that matters most, which is how teamwork itself changes once the people in the room are no longer the scarce resource.

1. An agent as a member of the team

Two products shipped in 2026 that put an agent directly into the space where your people already talk.

Anthropic's Claude Tag gives a Slack channel a permanent Claude. It is multiplayer: within a channel there is one Claude that works with everyone, so anyone can see what it is doing and pick up where the last person left off. It builds context simply by living in the channel over time. It has two voices, one where you tag it with a request and one where, if you allow it, it speaks up on its own to flag something or chase a thread that has gone quiet. It works asynchronously, taking on a task and pursuing it over hours or days. Administrators set which tools and channels it can touch, and there is a log of everything it has done and who asked for each task.

Base44's superagents do a similar thing inside a WhatsApp group. The agent studies the group's context, and you choose whether it answers every message or only when mentioned. It runs scheduled jobs like daily summaries and reminders, and it acts on behalf of members for everyday tasks, booking a table or arranging a meeting.

Notice what happens the moment an agent joins a shared space. It stops being a tool and becomes a member, with all the social questions that implies. When does it speak, and when does it stay quiet? What may it do, and on whose behalf? Who can see what it did? The single most important setting in both products turns out to be its voice, because an agent that responds to everything becomes noise, and one that never speaks up is just a search box. And because it works while nobody is watching, the audit log is doing the job that watching a colleague used to do. Visibility is how you replace the supervision you can no longer give in real time.
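The two settings that matter here, voice and mandate, plus the record that replaces real-time supervision, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how either product is built: the names ChannelAgent, Voice, should_respond, and act are all hypothetical, and "flag-worthy" stands in for whatever judgment a real agent would apply before speaking unprompted.

```python
from dataclasses import dataclass, field
from enum import Enum


class Voice(Enum):
    MENTION_ONLY = "mention_only"  # speaks only when tagged
    PROACTIVE = "proactive"        # may also speak up on its own


@dataclass
class ChannelAgent:
    """Hypothetical channel agent: a voice setting, a tool mandate, a log."""
    name: str
    voice: Voice
    allowed_tools: frozenset
    audit_log: list = field(default_factory=list)

    def should_respond(self, message: str, is_flag_worthy: bool = False) -> bool:
        # A direct mention always gets an answer.
        if f"@{self.name}" in message:
            return True
        # Otherwise the agent speaks only if it was allowed a proactive voice
        # and there is something worth flagging.
        return self.voice is Voice.PROACTIVE and is_flag_worthy

    def act(self, tool: str, requested_by: str, task: str) -> bool:
        # Every attempt is recorded, including the ones that are refused.
        allowed = tool in self.allowed_tools
        self.audit_log.append(
            {"tool": tool, "requested_by": requested_by,
             "task": task, "allowed": allowed}
        )
        return allowed


agent = ChannelAgent("helper", Voice.MENTION_ONLY, frozenset({"calendar"}))
agent.act("calendar", requested_by="dana", task="book a room")
agent.act("deploy", requested_by="dana", task="ship it")  # refused, still logged
```

The design choice worth noticing is that the log records who asked as well as what was done, so a refusal is as visible as an action.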

2. A fleet with a conductor

Abundly runs its own product development on a team of named agents working alongside human engineers, and ships a new version of its platform every day. One agent, Cursor, writes most of the code. Another, Backlogger, turns messy Slack threads into clean, well-structured tickets. A third, Releaser, runs the daily release end to end: pull requests, release notes, changelog, announcements. And Grace, the one that matters most here, carries a stakeholder request from Slack all the way to a finished pull request, coordinating the other agents as she goes.

That is the real jump from one agent to a team. Someone has to own the flow. Grace is a conductor. Without a named lead, a group of agents is a swarm rather than a team, and there is no single thread you can hold accountable when the work goes wrong. There is a caution sitting inside this example too, and it is worth saying out loud: Grace can edit code, which means she is improving the platform she herself runs on. That is powerful, and it is exactly the kind of boundary a leader should draw on purpose rather than discover by accident.
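For readers who think in code, the conductor idea reduces to one owner of the flow, one trace you can read afterwards, and a boundary drawn on purpose. The sketch below is an assumption-laden illustration and not Abundly's implementation: Conductor, workers, protected_steps, and the step names are invented for the example, and the workers are plain functions standing in for agents.

```python
from dataclasses import dataclass, field


@dataclass
class Conductor:
    """Hypothetical lead agent that owns a request from start to finish."""
    name: str
    workers: dict                      # step name -> callable(str) -> str
    protected_steps: frozenset = frozenset()
    trace: list = field(default_factory=list)

    def run(self, request: str, steps: list) -> str:
        artifact = request
        for step in steps:
            # The boundary is explicit: protected steps stop and wait for a person.
            if step in self.protected_steps:
                self.trace.append((step, "blocked: needs human approval"))
                raise PermissionError(f"{step} requires a human decision")
            artifact = self.workers[step](artifact)
            # One trace per conductor is the single accountable thread.
            self.trace.append((step, artifact))
        return artifact


lead = Conductor(
    name="lead",
    workers={
        "ticket": lambda text: f"ticket({text})",
        "code": lambda text: f"patch({text})",
        "edit_platform": lambda text: text,
    },
    protected_steps=frozenset({"edit_platform"}),
)
result = lead.run("add export button", ["ticket", "code"])
```

Here the step that would change the conductor's own platform is listed as protected, so the limit exists because someone chose it, not because something broke.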

The humans at Abundly deliberately keep the high ground. They own architecture, quality, and the call on what good looks like. The agents execute, and the people decide what is worth executing. That division is the whole game here.

3. A panel that deliberates

In "Three Real AI Agents Worth Studying" I told the story of Stanford's Virtual Lab for what it produced: real COVID nanobodies, built and measured in a wet lab, published in Nature. Here I want a different part of it, which is how the team itself was built.

A principal-investigator agent leads. It recruits specialists, each with a distinct expertise: an immunologist, a computational biologist, a machine-learning engineer. And, importantly, a scientific-critic agent sits in every meeting with a single job, which is to challenge the team's reasoning and find the holes. A human chairs the whole thing, sets the agenda, supplies the papers, and gives feedback at the seams, the way a department head would.

Two findings carry straight into your business. The first is that the specialists were valuable precisely because they disagreed. Distinct expert roles create friction, and the friction is where the quality comes from. A room of identical generalists smooths over exactly the complexity you needed surfaced. The second is that the critic was singled out as the most useful member of the team. Building dissent in on purpose did more for the output than adding another worker would have. If you take one staffing lesson from the frontier of agent research, it is that a good team disagrees with itself, and you should design it to.

The part that changes everything

Then the Virtual Lab did something you simply cannot do with people, and it is the real lesson of this paper. It ran each team meeting five separate times from an identical starting point. Because the model is non-deterministic, each of those five runs reached a somewhat different conclusion. The team treated that spread not as a malfunction to be fixed but as exploration of the possibilities, and then held a separate merge meeting to synthesize the five runs into one answer.

Sit with why that is impossible for a human team, because the reason is not the one you would reach for first. It is not really cost, though five runs of any meeting would be expensive. It is that you cannot reset a person to the starting point. Once your experts have been through the discussion, they cannot un-know where it led. Run the meeting again and they do not begin fresh; they begin from the answer they already reached, rehearsing it rather than genuinely finding a second one. An agent carries no memory from one run to the next. You can start it five times over from the identical blank state, so each run is a real and independent take on the same question, and the spread across them is true coverage of the possibilities instead of one mind repeating itself.
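The reset is easy to show in miniature. The sketch below is a toy under stated assumptions, not the Virtual Lab's code: run_meeting is a hypothetical stand-in for a whole agent meeting, a seeded random choice plays the part of the model's non-determinism, and merge only tallies the outcomes where a real merge meeting would reason over them.

```python
import copy
import random
from collections import Counter


def run_meeting(context: dict, rng: random.Random) -> str:
    """Stand-in for one meeting: reaches one conclusion non-deterministically."""
    # Whatever the meeting "learns" is written only to this run's private copy.
    context["notes"].append("discussed")
    return rng.choice(context["options"])


def explore(start: dict, n_runs: int, seed: int = 0) -> list:
    """Run the same meeting n_runs times, each from the identical blank state."""
    results = []
    for i in range(n_runs):
        fresh = copy.deepcopy(start)   # the reset a human team cannot perform
        rng = random.Random(seed + i)  # a different draw for each run
        results.append(run_meeting(fresh, rng))
    return results


def merge(results: list) -> list:
    """Toy merge step: rank the conclusions by how often they came up."""
    return Counter(results).most_common()


start = {"options": ["A", "B", "C"], "notes": []}
runs = explore(start, n_runs=5)
summary = merge(runs)
```

After the five runs, start["notes"] is still empty: no run saw what another concluded, which is what makes the spread across them independent.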

The inability to reset is one human limit we never designed around, because we could not. Notice it and the others come into view, and most of them are about the cost of human attention. We keep meetings rare and short because people's time is expensive. We keep teams small because coordination gets costly fast. We avoid redundancy because duplicated effort wastes scarce hours. We build hierarchy to economize on attention. None of these are laws of good thinking. They are accommodations to human limits, and we mistook the limits for the method.

When the team is made of agents, those limits lift at once. Because an agent starts each run fresh, one pass becomes a distribution: run the question many times and read the spread. Because a meeting costs almost nothing, you can hold a hundred of them. A small standing team becomes a large panel assembled for a single question and dissolved when it is answered. Redundancy stops being waste and becomes an instrument, in the form of the critic, the ensemble, the replication. This is, not by accident, the world that Abundly named itself after.

Where the cost actually goes

Here is the part a careful leader cannot skip, because the cost does not vanish. It moves. When you run a hundred parallel meetings, someone still has to merge them into one coherent answer, and a human still has to understand that answer, trust it, and own the decision that follows. The bottleneck shifts from doing the work to synthesizing and absorbing it. The scarce resources of the agent era are different ones: the quality of the synthesis, the coherence across all those runs, and above all your own attention at the moment you decide.

So the design problem inverts. The old question was how to economize on doing. The new question is how to economize on deciding and trusting. The craft becomes building the funnel, from wide parallel exploration down to something one person can act on without having read all of it. The synthesizer becomes the most important seat at the table. That is why the Virtual Lab put a critic in the room and a human at the head of it.

One warning rides along with this, and it is sharp. Variance is coverage only for open, divergent work. For a question that has a single right answer, running it many times and merging can launder a confident error into a consensus that looks reliable and is not. Sampling without an adversary just averages your mistakes into something that feels safe. Knowing which kind of question you are holding is the new core skill, and the critic is how you protect yourself when you get it wrong.
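A little arithmetic makes the warning concrete. Assume, purely for illustration, that every run answers a single-right-answer question independently and is right with the same probability, and that the merge is a simple majority vote over an odd number of runs. Neither assumption comes from the Virtual Lab, and the function name majority_correct is made up for this sketch.

```python
from math import comb


def majority_correct(p_correct: float, n_runs: int) -> float:
    """Probability that a majority of n_runs independent runs is right.

    Assumes an odd n_runs (no ties) and the same per-run accuracy throughout.
    """
    if n_runs % 2 == 0:
        raise ValueError("use an odd number of runs to avoid ties")
    needed = n_runs // 2 + 1
    return sum(
        comb(n_runs, k) * p_correct**k * (1 - p_correct) ** (n_runs - k)
        for k in range(needed, n_runs + 1)
    )


# Runs that are usually right: voting helps.
print(round(majority_correct(0.6, 5), 5))    # 0.68256
# Runs that are usually wrong: voting makes the answer worse, not safer.
print(round(majority_correct(0.4, 5), 5))    # 0.31744
# More runs deepen the false consensus instead of correcting it.
print(majority_correct(0.4, 101) < 0.05)     # True
```

When a single run is right 40 percent of the time, a five-run majority is right only about 32 percent of the time, and a hundred-and-one-run majority almost never. The consensus looks more reliable as the runs pile up while it becomes less so, which is why the adversarial critic matters more than the sample size.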

How to use this

You do not need a research lab to act on this. Start with whichever of the three patterns you are actually in.

If you are adding an agent to a channel, decide its voice and its mandate before you add it, not after, and make sure everything it does leaves a record. If you are running a process across several agents, name the conductor and keep a human owning architecture and quality. If you are using agents to think, give them distinct roles rather than cloning one generalist, put a critic in the room, and for anything that matters, run it more than once and synthesize instead of trusting a single pass.

And underneath all of it, the mindset shift. Stop copying your org chart onto your agents. Your org chart is an answer to a question agents no longer ask, which is how to get the most out of scarce and expensive human attention. Ask the better question instead. If meetings and headcount were free, and the only scarce thing left were your judgment about what to build and whether it is any good, how would you design the team? Then build that one.

Frequently asked questions

Do I need many agents to get value? No. Most organizations should master one agent, managed well, long before they run a fleet. The three patterns are in order for a reason, and almost everyone should still be working on the first one.

Who is accountable when a team of agents gets it wrong? A named human, every time. Coordination spreads the work; it does not spread the accountability. Name a conductor to own the flow and a person to own the outcome, and keep those roles clear before anything ships.

Is running the same task many times not just wasteful? For routine, convergent work, often yes. For open and high-stakes questions, the spread across runs is information, and synthesizing it beats a single confident answer. The judgment is in telling the two kinds of question apart.

What is the one thing to take from this? The teamwork habits you trust were shaped by the cost of human time. Agents change that cost, so the habits are worth revisiting from scratch. The job that stays firmly human is framing the question and owning the answer.