What Good Looks Like: Worked Examples

Four agents that earned their place, read through the five things that made them work.

Why this paper exists

There is a simple test for whether an AI agent will earn its place or quietly waste a quarter: can you answer five questions about it? What is its job? What context does it have? How much autonomy does it get? Which tools can it use? And who provides oversight? Those five questions are the lens of Managing AI Agents Like Teammates, and they are the lens of this paper.

What follows is four agents I have helped leadership teams put to work, all anonymized, each read through those same five questions. None of them is a demo. Each earned its place because those five things were set on purpose, and looking at them through one lens shows you what "good" actually looks like in practice.

How to read these

Each example is told the same way: what the agent does, how the five elements were set, and the one choice that made the difference. Watch how often that choice is about context or oversight, not about the model.

1. The multilingual support agent

A company with a digital product gets customer questions in many languages, around the clock. The agent answers them.

Its job is narrow on purpose: answer questions about the product, in the customer's language, and hand off anything outside that scope. Its context is the product's own documentation and help material, so its answers reflect how the product actually works rather than a general guess. Its autonomy is bounded: it answers the questions it can answer well and escalates the rest to a human instead of improvising. Its tools are read access to the knowledge it needs and nothing more. Its oversight is a person who reviews the escalations and samples the rest, plus built-in evaluation tests that assess the quality of both outputs and outcomes: was the customer's question actually answered, and what was their sentiment?
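
For readers who want to see that boundary written down, here is a minimal sketch in Python of the answer-or-escalate rule. It is an illustration under stated assumptions, not the engagement's implementation: the topic list, the confidence score, the threshold value, and the names Question and route are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical values for illustration; the paper does not specify any of them.
CONFIDENCE_FLOOR = 0.8
IN_SCOPE_TOPICS = {"billing", "login", "export"}


@dataclass
class Question:
    text: str
    language: str
    topic: str         # assumed to come from an upstream classifier
    confidence: float  # assumed score in [0, 1] for how well the docs cover it


def route(question: Question) -> str:
    """Return 'answer' when the agent should reply, 'escalate' otherwise."""
    if question.topic not in IN_SCOPE_TOPICS:
        return "escalate"  # outside the job: hand off rather than improvise
    if question.confidence < CONFIDENCE_FLOOR:
        return "escalate"  # in scope, but not answerable well enough
    return "answer"
```

The point of the sketch is that both exits to a human are explicit. Anything off-topic or poorly covered by the documentation goes to a person by rule, not by luck.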

What made the difference was the line between the agent and a human. The job was defined as much by where it stops as by what it does: it answers what it can answer well and routes everything else to a person. Draw that boundary deliberately and the agent is a genuine asset. Leave it vague in pursuit of a deflection number, and the questions that matter most are the ones it will get wrong.

2. The invoice-triage agent

A large enterprise receives a high volume of incoming invoices. The agent handles most of them and escalates the rest.

Its job is to triage and process the routine invoices and to flag anything unusual for a human. Its context is the organization's own rules for what a valid invoice looks like, including the exceptions that normally live only in the heads of the finance team. Its autonomy is matched to reversibility: it processes the clean, routine cases and stops at anything ambiguous or high-value, which goes to a person. Its tools are scoped access to the finance systems it touches, with a trail of what it did. Its oversight is the finance team, checking the escalated cases and auditing a sample of the rest.
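
A rough sketch in Python of what "stops at anything ambiguous or high-value" can look like as a rule. Everything specific here is an assumption for illustration: the invoice fields, the three checks, and the 10,000 limit are hypothetical, and a real finance team's rules would be longer and more particular.

```python
from dataclasses import dataclass, field
from decimal import Decimal

# Hypothetical limit; the real value is a decision for the finance team.
HIGH_VALUE_LIMIT = Decimal("10000")


@dataclass
class Invoice:
    invoice_id: str
    amount: Decimal
    vendor_known: bool            # assumed check against a vendor list
    matches_purchase_order: bool  # assumed check against an existing order


@dataclass
class TriageResult:
    action: str  # "process" or "escalate"
    reasons: list[str] = field(default_factory=list)  # kept for the audit trail


def triage(invoice: Invoice) -> TriageResult:
    """Process only clean, routine invoices; send everything else to a person."""
    reasons = []
    if not invoice.vendor_known:
        reasons.append("unknown vendor")
    if not invoice.matches_purchase_order:
        reasons.append("no matching purchase order")
    if invoice.amount >= HIGH_VALUE_LIMIT:
        reasons.append("high value")
    if reasons:
        return TriageResult("escalate", reasons)
    return TriageResult("process")
```

Note that the function collects every reason rather than stopping at the first. The person who receives the escalation sees why the agent stopped, which is what makes the oversight workable.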

What made the difference was treating the escalation as the design, not an afterthought. The value is not in the agent touching every invoice. It is in handling the routine majority reliably and routing every exception to the person who should see it. The autonomy was matched to reversibility, and the oversight lived in the exceptions.

3. The regulatory-watch agent

A medical company manages its regulatory requirements through a shared repository, where different stakeholders discuss and agree changes over time. The agent watches that repository and anticipates how upcoming regulatory changes will affect the product roadmap.

Its job is to monitor the regulatory discussion and surface implications early, not to make decisions. Its context is the differentiator: it is pointed directly at the repository where the real discussion happens, so it reasons over the actual source of truth rather than a summary. Its autonomy is deliberately low, because this is a regulated, high-stakes domain. It informs people, and people decide. Its tools are read access to that repository, plus messaging tools to reach the relevant internal stakeholders. Its oversight is the product and regulatory leads, who review what it flags and decide what it means for the roadmap.

What made the difference was the wiring and the restraint. Connecting the agent to the place where the requirements are genuinely debated is what makes its anticipation useful. Keeping it advisory, in a domain where a wrong call is a compliance problem, is what makes it safe. It is a foresight tool, not a decision-maker.

4. The RFP-response agent

Salespeople receive incoming requests for proposals, often long and detailed. The agent helps them respond.

Its job is to analyze an incoming RFP, compare it against the company's product and team profiles, and draft a response that lines each requirement up against a real capability. Its context is the proprietary material that makes the draft good: the product's actual features and the team's actual profiles, not generic marketing. Its autonomy is to draft, never to send. Its tools are the RFP documents and the internal product and team data. Its oversight is the salesperson, who owns the final response and the relationship behind it.

What made the difference was leaving judgment with the human. The agent does the heavy, tedious synthesis of matching dozens of requirements to capabilities, which is where hours disappear and mistakes creep in. The salesperson keeps the part that needs judgment: what to emphasize, what to promise, and how to win.

What these four have in common

None of them replaced a person. Each had a job defined as much by where it stops as by what it does. In every case the differentiator was twofold: context and tools (wiring the agent to real, proprietary information) and oversight (giving it a clear human owner and a real escalation path). The model was rarely the interesting part. And together they cover the three places agents create value: serving customers, speeding internal work, and sharpening knowledge work.

Read through the five questions and the pattern is plain. These agents earned their place because someone set their job, their context, their autonomy, their tools, and their oversight on purpose, and made the hardest call, where the agent stops and a human takes over, deliberately.

What this changes for the leadership team

Good does not look like a remarkable demo. It looks like an agent you can describe in five answers: a job you could write on a page, context drawn from your own reality, autonomy matched to the stakes, tools it actually needs, and a named human watching. If an agent impresses you but you cannot answer those five questions about it, you are looking at a science-fair project, not a working agent.

What to do this week

Take one agent you are considering and write its five elements on a single page before anyone builds anything. Pay special attention to the line where it should escalate to a human. In all four examples here, that line was where the value, and the safety, lived.
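
If your team prefers a template to a blank page, the one-pager can be sketched as a small Python structure. This is one possible shape, not a prescribed format: the names AgentSpec and missing_elements are hypothetical, and keeping the escalation line as its own field is my assumption, made so that it cannot be left out by accident. The example values are taken from the RFP-response agent above.

```python
from dataclasses import dataclass, fields


@dataclass
class AgentSpec:
    """One page, five answers, plus the line where a human takes over."""
    job: str
    context: str
    autonomy: str
    tools: str
    oversight: str
    escalation_line: str  # hypothetical extra field, separate from the five


def missing_elements(spec: AgentSpec) -> list[str]:
    """Names of the elements that are still blank."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name).strip()]


# Example filled in from the RFP-response agent described above.
rfp_agent = AgentSpec(
    job="Analyze an incoming RFP and draft a response",
    context="The product's actual features and the team's actual profiles",
    autonomy="Draft, never send",
    tools="RFP documents and internal product and team data",
    oversight="The salesperson who owns the final response",
    escalation_line="",  # not yet written down
)

print(missing_elements(rfp_agent))  # prints ['escalation_line']
```

A check this simple only tells you whether a box is empty, not whether the answer in it is any good. It is still a useful gate: nothing gets built while a box is blank.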

Frequently asked questions

Are these real? Yes. They are real engagements, described in anonymized form to protect the organizations involved. The details that matter for you are the choices, not the names.

Why do these escalate to humans instead of fully automating? Because autonomy should match the stakes. Each agent acts freely where actions are routine and reversible, and defers where a mistake would be costly or hard to undo. The escalation path is the design, not a weakness in it.

Is there a pattern across the four? One lens fits all of them: job, context, autonomy, tools, and oversight. Where these agents earned their place, those five were set on purpose, and the differentiator was usually context and oversight rather than the model.

How do I start on one of my own? Use Why / What / How to choose the right agent to build, and Managing AI Agents Like Teammates to set the five elements once you have chosen.