How to use AI

Planted 02026-05-29

How to redesign roles, workflows, and accountability to use AI agents productively and safely.

The Charterer and the Captain

You’ve seen the demo. Someone types “build me a customer portal” into a chat window, and thirty seconds later there’s a working app on screen. Maybe you’re the executive who watched that demo and wondered why your engineering team still takes three months to ship anything. Or maybe you’re the developer who has to explain, again, why the demo isn’t the product — and why “the AI wrote it” is not an acceptable answer when the login page leaks session tokens.

Both of you are wrestling with the same question: now that AI can write code, who is responsible for what?

Think of a software project as a sea voyage.

The charterer knows why the ship sails

Every voyage begins with someone who needs cargo moved. The charterer doesn’t own the ship and can’t sail it. What they own is the purpose: they know why the journey matters, where the vessel must arrive, what has to be delivered intact, and what makes the whole trip worth paying for.

In software, that’s the business domain expert. The head of operations who knows that invoices must reconcile against three legacy systems. The clinician who knows which fields a nurse actually has time to fill in during a night shift. The compliance officer who knows which data can never leave the country.

This knowledge is the most valuable thing on the ship, and no AI has it. A model can generate a thousand plausible invoice workflows. It cannot know that your accounts team writes off discrepancies under fifty dollars, or that the auditors flagged exactly this process last year. The charterer’s job is not to steer. It is to define the destination, name the constraints, and — critically — judge whether the ship arrived at the right port. If the domain expert can’t recognize success when they see it, no amount of technology downstream will save the voyage.

The developer is captain, navigator, and helmsman

The temptation, watching AI tools work, is to think the captain’s job has been automated. Watch a modern coding assistant plan a refactor, write the code, run the tests, and fix its own failures, and it’s easy to conclude that the human at the keyboard is now decorative.

But consider what actually happens on a ship with an advanced navigation suite. The GPS knows the position. The autopilot holds the heading. The radar watches for traffic. And the captain is still legally and practically responsible for the vessel — because instruments report conditions, they don’t interpret them. The radar shows a contact; the captain decides whether it’s a fishing boat that will hold course or a container ship that won’t. The chart plotter suggests a route; the captain knows the chart is two years old and the channel has silted since.

This is precisely the developer’s position. The AI can study the codebase like a chart, propose routes, watch for certain hazards, and execute maneuvers faster than any human crew. What it cannot do is take responsibility. It doesn’t know that the “deprecated” API it’s avoiding is the one your platform team insists you use. It doesn’t feel the difference between a test suite that passes and a system that works. It will confidently sail a beautiful course into water that isn’t deep enough — and it will do so while producing output that looks exactly like competent seamanship.

So the developer’s job hasn’t shrunk; it has moved up the bridge. Less time with hands on the wheel, more time interpreting conditions, choosing among the routes the instruments propose, giving precise commands, and — this is the part demos never show — noticing early when the vessel begins to drift. A skilled captain feels a half-degree of drift and corrects it with a nudge. An inattentive one notices three hours later and needs a new passage plan. The same is true of AI-generated code: the cost of correction grows with every mile sailed in the wrong direction.

The failure modes are all confusions about roles

Most AI-assisted projects that go wrong do so in the same few ways, and each is a role confusion:

The charterer grabs the wheel. A domain expert, armed with a code-generating tool, starts steering directly — prompting an application into existence without anyone aboard who can read the water. This works right up until it doesn’t: the demo sails beautifully in the harbor, then meets real weather. Real users, real data volumes, real attackers. The problem isn’t that domain experts built something; it’s that nobody on the voyage could tell the difference between a sound hull and a painted one.

The captain forgets the charterer. Developers, delighted by how fast the crew now works, ship an impressive vessel to the wrong port. AI makes it dramatically cheaper to build the wrong thing at high quality. Velocity without a clearly stated destination just means you get lost faster.

The captain becomes a passenger. The subtlest failure. The developer stops interpreting and starts approving — accepting whatever the navigation system proposes because it’s usually right and checking is tedious. The title stays the same; the responsibility quietly transfers to something that cannot hold it. When the incident review happens, “the autopilot did it” satisfies no one.

What this means in practice

If you’re the domain expert: your leverage has never been higher, but it flows through clarity, not control. Write down the destination in terms of outcomes. Name every constraint you know — regulatory, operational, human. Insist on seeing the ship arrive: real acceptance tests against real scenarios you define. And resist the belief that because the machinery got easier to invoke, it got safe to ignore. You don’t need to understand the engine room. You do need a captain you trust, and you need to judge them by arrivals, not by how impressive the instruments look.

If you’re the developer: your job is now unambiguously a judgment job. The mechanical parts of the craft — the syntax, the boilerplate, the first draft of nearly everything — belong to the crew. What remains is everything that was always hardest: understanding what the charterer actually needs beneath what they said, choosing routes with an eye on weather that hasn’t arrived yet, reading the AI’s output with the professional skepticism a captain applies to any instrument, and correcting drift while correction is still cheap. Delegating the work is fine. Delegating the watch is not.

The ships got faster. The instruments got remarkable. But someone still has to be responsible for the voyage — for knowing why it matters, and for bringing it home. The projects that succeed with AI won’t be the ones with the best tools. They’ll be the ones where everyone aboard knows exactly which job is theirs.

Productivity Comes from Org Redesign, Not Technology

You asked for one number.

Last Tuesday, you wanted to know how much your company spent with its top five suppliers over the past two quarters. You asked your ops director. She asked a procurement analyst. The analyst asked finance for an export, got a spreadsheet with the wrong date range, asked again, reconciled two systems that disagreed by 4%, and built a deck. A meeting was scheduled to review the deck. The meeting discovered that “top five suppliers” was ambiguous — by spend or by volume? — and scheduled a follow-up meeting.

Eleven days. Four people. Two meetings. One number.

Now here’s the uncomfortable part: your company bought AI licenses this year. The analyst used an AI assistant to write the deck faster. The deck was, in fact, produced faster. And the request still took eleven days, because the deck was never the bottleneck. The bottleneck was the handoffs, the queues, the clarification loops, and the meeting that existed to compensate for all three.

This is the pattern behind most disappointing AI deployments, and it has a hundred-year-old explanation: technology is neither necessary nor sufficient for productivity improvement. The durable source of productivity is organization design — how work is divided and coordinated, how behavior is governed, and how information becomes decisions and learning.

The questions haven’t changed. The answers have.

Every organization, whether it knows it or not, is an answer to five questions:

Who does what?
Who decides?
Who coordinates with whom?
What information is needed, and where does it live?
How is work controlled?

These questions are old. They come from the same organizational economics that explains why firms exist at all: coordination is expensive, so we build structures — hierarchies, meetings, approvals, roles — to manage its cost.

AI agents don’t change the questions. They change the cost structure underneath the answers. When the cost of gathering information, checking a request for completeness, monitoring a queue, drafting a document, or comparing a new problem to past cases drops by an order of magnitude, the old answers stop being optimal. An organization that keeps its old structure while adding agents is like a factory that installed electric motors but kept the belt-and-shaft layout designed for a steam engine. Historically, that is what many factories did for roughly thirty years, and it is a widely cited reason electrification took about as long to show up in productivity statistics.

The gains come when you redesign the layout. So let’s redesign it, across the three places where knowledge-work productivity actually leaks: coordination, control, and memory.

Coordination: fewer handoffs, smaller queues

Go back to the supplier-spend question. Where did the eleven days go?

Not into work. Into waiting for work: the request sat in the analyst’s queue for two days; the finance export sat in an inbox; the ambiguity about “top five” wasn’t discovered until day nine because nobody checked the request for completeness on day one. Meetings existed mostly to force synchronization that the workflow couldn’t achieve on its own.

Now redesign it. An agent receives the request and immediately checks it for completeness — top five by spend or volume? which entities? which currency? — and resolves the ambiguity in minutes, not at meeting number two. It pulls from both systems, flags the 4% reconciliation gap with its likely cause, checks whether anyone has answered a similar question before (someone has, last March), drafts the answer with sources cited, and routes one genuine judgment call — how to treat a supplier acquired mid-quarter — to the person who actually owns that call.
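As a rough illustration of the day-one completeness check, here is a minimal Python sketch. The field names (ranking_basis, entities, currency, period) and the SpendRequest type are hypothetical, chosen only to mirror the supplier-spend example; a real intake step would take its required fields from the workflow's own definition.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical required fields, mirroring the supplier-spend example.
REQUIRED_FIELDS = ("ranking_basis", "entities", "currency", "period")


@dataclass
class SpendRequest:
    question: str
    ranking_basis: Optional[str] = None  # "spend" or "volume"
    entities: Optional[str] = None       # which entities to include
    currency: Optional[str] = None
    period: Optional[str] = None


def missing_fields(request: SpendRequest) -> list[str]:
    """Return the required fields the requester has not yet specified."""
    return [name for name in REQUIRED_FIELDS if getattr(request, name) is None]


request = SpendRequest(
    question="Spend with our top five suppliers over the past two quarters",
    period="past two quarters",
)
# Each missing field becomes a clarifying question asked immediately,
# instead of an ambiguity discovered in a later meeting.
open_questions = missing_fields(request)
```

The point of the sketch is the timing rather than the mechanism: the same three questions get asked either way, and the only design choice is whether they are asked at intake or at meeting number two.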

The human contribution didn’t disappear. It got concentrated: one judgment call instead of eleven days of relay race.

This is the general redesign move for coordination: take the routine connective tissue — status gathering, completeness checking, queue monitoring, follow-ups, meeting prep — out of email, chat, and standing meetings, and move it into agent-mediated workflows. Status meetings shrink or vanish because status is continuously compiled. Work stops getting stuck between teams because something is watching the queues and escalating stalled items. Onboarding a new person to a project takes an afternoon because the project brief, history, open issues, and stakeholders can be generated rather than archaeologically reconstructed.

None of this requires a smarter model than the one you already have. It requires deciding that coordination is a designable system rather than a cultural inevitability.

Control: govern outcomes and exceptions, not activity

Here is roughly how your organization controls work today: hierarchy, approvals, meetings, reports, budget reviews, metrics, and managers asking “what did everyone do this week?”

Most of this is supervision of activity — expensive, slow, and lagging. Your metrics describe last month. Your approvals queue behind a VP’s calendar. Your managers spend, conservatively, a third of their time chasing updates that were stale before they were compiled.

The redesign: push routine control down into the workflow, and pull management attention up to exceptions.

Concretely, control becomes three layers. At the bottom, agent-executed controls: is this request complete, within policy, consistent with precedent, internally coherent, and free of known risk? These checks run on every item, continuously, before submission — not on a sample, quarterly, after the fact. Your procurement agent approves the routine consumables reorder that matches the last forty reorders; compliance stops being a manual audit and becomes a property of the pipeline.

In the middle, human exception review. Humans intervene where the work is genuinely human: ambiguity, high stakes, ethical judgment, customer sensitivity, irreversible commitments. The manager’s question changes from “what did everyone do this week?” to “show me the decisions pending, the commitments missed, the risks emerging, and the places that need me.” That’s not less management. It’s management aimed at the 5% of items where management changes the outcome.

At the top — and this is the layer most organizations skip — governance of the agent system itself. What can agents decide, what must they escalate, what data can they touch, who owns their outputs, how are errors detected? Skip this layer and agents don’t reduce confusion; they amplify it at machine speed.

The one-sentence version: managers stop supervising activity and start governing decision quality.
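The bottom two layers can be sketched as a routing function. This is a simplified illustration under stated assumptions: the Item fields and the three outcome labels are hypothetical, and the rules themselves (what counts as high stakes, what is within policy) are exactly what the top governance layer would own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    # Hypothetical results of the routine checks run on every item.
    complete: bool
    within_policy: bool
    matches_precedent: bool
    high_stakes: bool


def route(item: Item) -> str:
    """Send routine items through automatically; send exceptions to a human."""
    if not item.complete:
        # Incomplete work goes back before anyone spends attention on it.
        return "return_to_requester"
    if item.high_stakes or not item.within_policy or not item.matches_precedent:
        # Middle layer: humans see only the items where judgment is needed.
        return "human_exception_review"
    # Bottom layer: routine, in-policy, precedented items proceed.
    return "auto_approve"
```

For example, route(Item(True, True, True, False)) returns "auto_approve", while flipping any of the policy, precedent, or stakes flags sends the item to "human_exception_review".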

Memory: from individual expertise to institutional intelligence

The third leak is the quietest and possibly the largest. Think about how much of your knowledge work is re-work: rewriting a memo someone wrote last year, rebuilding an analysis that exists in a deck nobody can find, re-litigating a decision because nobody recorded why it was made the first time, asking the same expert the same question she’s answered thirty times.

Organizations pay for their knowledge once and then pay again, every time, to rediscover it. The knowledge lives in people’s heads, old decks, Slack threads, and a shared drive that functions as a write-only archive.

The redesign is to make memory active instead of archival. Before new work begins, an agent retrieves the prior work — that supplier analysis from last March surfaces on day one, not never. First drafts start from company context rather than a blank page. Decisions get logged with their reasoning, so the re-litigation loop dies. Meetings convert into actions and reusable knowledge instead of evaporating. After a project ends, the playbook actually gets updated, because updating it is no longer an unpaid favor someone does on a Friday afternoon.

When our hypothetical company answers the supplier-spend question the second time, it takes an hour, because the organization remembered. That is what institutional intelligence means: the organization gets smarter with use, instead of resetting to zero every time an employee leaves or a deck gets buried.

Decompose the job, don’t decorate it

Notice what all three redesigns have in common: none of them is “give employees a chatbot.” Bolting an assistant onto an unchanged job produces the eleven-day deck, faster. The productivity comes from decomposing the job and reassigning its parts.

Take the procurement analyst. Her job today is a bundle: gather data, clean it, reconcile it, analyze it, interpret it, write the memo, align the stakeholders, track the follow-ups. Decomposed, the data gathering, cleaning, initial analysis, memo drafting, and follow-up tracking are agent work — some fully delegated, some agent-drafted and human-edited. What remains for her is interpretation, recommendation, stakeholder alignment, and accountability for the answer.

That remainder is not a diminished job. It’s the job she was hired to do, freed from the 60% of her week that was logistics. But it is a different job, and pretending otherwise is where transformations quietly fail. Someone has to redraw the role, redefine what “good performance” means for it, and decide the boundary questions explicitly: what the agent may decide alone, what it drafts for review, what it must never touch.

The division of labor that emerges is fairly consistent across functions. Repetitive, rules-based, low-risk work; information gathering; drafting; monitoring; pattern detection — agents. Ambiguous judgment, ethical tradeoffs, strategic prioritization, relationships, conflict resolution, and final accountability — humans. The interesting work is at the seam, and designing that seam is now a core management skill.

What to actually do

The implication is not “buy more AI.” It’s “run a redesign.”

Pick one workflow that annoys you — the supplier-spend question, the customer escalation process, the monthly close, anything with visible queues and recurring meetings. Map it with eight questions:

What triggers the work? What inputs does it need? Which parts can an agent gather, draft, check, or decide? Where is human judgment genuinely required? Who holds decision rights — approve, reject, escalate, override? What controls apply? When must the agent stop and ask a human? And what gets captured so the next instance is easier than this one?

Then rebuild the workflow to those answers and measure the thing that matters: not “time saved writing documents,” but cycle time, handoff count, meeting hours, and exception quality. Fewer handoffs, faster triage, clearer ownership, better reuse, faster escalation. That’s where productivity lives.
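Two of those measures, cycle time and handoff count, are simple to compute once a workflow records who held an item and when. The sketch below assumes a hypothetical trace of the supplier-spend request; the dates and holder names are illustrative, not data from a real system.

```python
from datetime import date


def cycle_time_days(opened: date, closed: date) -> int:
    """Elapsed calendar days from request to answer."""
    return (closed - opened).days


def handoff_count(holders: list[str]) -> int:
    """Count how many times the work item moved to a different holder."""
    return sum(1 for prev, cur in zip(holders, holders[1:]) if prev != cur)


# Hypothetical trace: the request bounces between the analyst and finance
# because the first export had the wrong date range.
holders = ["ops_director", "analyst", "finance", "analyst", "finance", "analyst"]
opened = date(2026, 5, 5)
closed = date(2026, 5, 16)

days = cycle_time_days(opened, closed)
handoffs = handoff_count(holders)
```

With this trace, days is 11 and handoffs is 5. Tracking both before and after a redesign shows whether the workflow changed or only the drafting got faster.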

The electrification precedent suggests the gains are real and the lag is organizational — but precedent is not proof. Redesign done cynically — as headcount theater — produces the fake version of all of this: faster reports with subtle errors, fewer staff but more unresolved exceptions, smoother dashboards with less ground truth.

The goal is not fewer people doing the same work. The goal is less organizational friction per unit of output.

Which brings us back to Tuesday. In the redesigned organization, you ask for the number, an agent resolves the ambiguity in the first five minutes, retrieves what the company already knew, drafts the answer with sources, and sends one real question to one real owner. You have your number by lunch — not because the AI got smarter, but because the organization did.

The model is the motor. The layout is yours.

How to Deploy Agents Safely

An operations team deploys an AI agent to handle purchase approvals. It works beautifully in the demo. Three months later, someone notices the agent has been approving invoices from a vendor that was supposed to be suspended, the manager assigned to “review” its decisions has been clicking approve on 500 recommendations a day because there is no time to do anything else, and nobody can say who is accountable. The dashboard looks great. The reality underneath it does not.

Smooth dashboards with less ground truth: that is the signature failure mode of agent deployment. Fake productivity looks like faster reports with subtle inaccuracies, fewer staff but more unresolved exceptions, lower service cost but eroding customer trust, apparent compliance but weaker accountability. The model vendor cannot save you from this. A vendor can improve model alignment; you own deployment alignment. Safe deployment is not a product feature you buy. It is an operating discipline you build.

Start by deciding how much authority the agent actually has

Most deployment failures begin with an undefined question: what is this agent allowed to do? “Deploy an agent” is not one decision — it is a choice along a ladder of autonomy, and each rung carries different risk.

Level	Authority	Example
0 — Read only	Observe and summarize	Summarize daily port delays
1 — Recommend	Suggest actions	Recommend replenishment orders
2 — Draft	Prepare actions	Draft customer replies or work orders
3 — Act with approval	Execute after human sign-off	Submit purchase order after manager approval
4 — Bounded autonomy	Act within strict limits	Reorder consumables under $500 from approved vendors
5 — High autonomy	Execute complex actions	Avoid except in low-risk, reversible domains

The discipline is to name the level explicitly, in writing, before deployment — and to treat every move up the ladder as a separate decision that must be earned with evidence, not assumed because the demo went well.
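One way to name the level in writing is to encode the ladder where the agent's runtime can check it. The sketch below is an assumption about how that might look, not a standard: the Autonomy enum mirrors the table above, while may_execute and its two flags are hypothetical.

```python
from enum import IntEnum


class Autonomy(IntEnum):
    READ_ONLY = 0
    RECOMMEND = 1
    DRAFT = 2
    ACT_WITH_APPROVAL = 3
    BOUNDED_AUTONOMY = 4
    HIGH_AUTONOMY = 5


def may_execute(level: Autonomy, human_approved: bool, within_limits: bool) -> bool:
    """Decide whether an agent at this level may carry out an action."""
    if level <= Autonomy.DRAFT:
        # Levels 0 to 2 observe, suggest, or prepare; they never execute.
        return False
    if level == Autonomy.ACT_WITH_APPROVAL:
        return human_approved
    # Levels 4 and 5 (assumption): act alone inside the declared limits,
    # and fall back to human sign-off outside them.
    return within_limits or human_approved
```

Because the level is a single declared value, moving an agent up the ladder becomes a visible, reviewable change rather than a side effect of a prompt edit.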

Grant capabilities the way you grant them to a new employee: least privilege

An agent’s authority level is one axis; its tool access is another. Reading data, writing data, sending communications, spending money, changing operational schedules, touching sensitive records, and executing code are seven different capabilities with seven different blast radii. Treat them separately. An agent that recommends reorders does not need write access to the vendor master file. An agent that drafts customer emails does not need the ability to send them. Every capability the agent does not have is a failure mode you do not need to test for.
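A deny-by-default grant table is one simple way to express this. In the sketch below the seven capabilities come from the paragraph above; the agent names and their grants are hypothetical examples.

```python
from enum import Enum


class Capability(Enum):
    READ_DATA = "read_data"
    WRITE_DATA = "write_data"
    SEND_COMMUNICATIONS = "send_communications"
    SPEND_MONEY = "spend_money"
    CHANGE_SCHEDULES = "change_schedules"
    TOUCH_SENSITIVE_RECORDS = "touch_sensitive_records"
    EXECUTE_CODE = "execute_code"


# Hypothetical grants: each agent gets only what its task requires.
GRANTS: dict[str, frozenset[Capability]] = {
    "reorder_recommender": frozenset({Capability.READ_DATA}),
    # Assumption: saving a draft counts as writing data; sending does not.
    "email_drafter": frozenset({Capability.READ_DATA, Capability.WRITE_DATA}),
}


def is_allowed(agent: str, capability: Capability) -> bool:
    """Deny by default: unknown agents and ungranted capabilities are refused."""
    return capability in GRANTS.get(agent, frozenset())
```

Here is_allowed("email_drafter", Capability.SEND_COMMUNICATIONS) is False, which is the whole point: the drafter cannot send, so a bad draft cannot reach a customer without a separate, explicitly granted step.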

Put approval gates where the consequences live

Bounded autonomy only works if the bounds are anchored to real-world consequences, not to what is technically convenient. Require a human in the loop when a decision crosses a dollar threshold; when it touches personnel safety or hazardous operations; when it has legal or regulatory weight; when it materially affects a customer (account closure, denial, a major promise); when it would disrupt operations; and — critically — when the agent itself is on shaky ground: low confidence, no similar precedent, contradictory data or policies, or exposure to sensitive information. The last category is the one teams forget. An agent that must escalate when it cannot cite sufficient evidence is dramatically safer than one that is merely accurate on average.
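The gates above can be sketched as a function that returns every reason a decision needs a human. The thresholds and field names here are hypothetical placeholders; the text does not prescribe values, and each organization would set its own.

```python
from dataclasses import dataclass

# Hypothetical thresholds; set these from real-world consequences.
DOLLAR_THRESHOLD = 500.00
MIN_CONFIDENCE = 0.80


@dataclass
class Decision:
    amount: float = 0.0
    touches_safety: bool = False
    legal_or_regulatory: bool = False
    material_customer_impact: bool = False
    disrupts_operations: bool = False
    confidence: float = 1.0
    has_precedent: bool = True
    conflicting_inputs: bool = False
    touches_sensitive_data: bool = False


def approval_reasons(d: Decision) -> list[str]:
    """Return every reason this decision needs a human; empty means none found."""
    checks = [
        (d.amount >= DOLLAR_THRESHOLD, "crosses dollar threshold"),
        (d.touches_safety, "personnel safety or hazardous operations"),
        (d.legal_or_regulatory, "legal or regulatory weight"),
        (d.material_customer_impact, "material customer impact"),
        (d.disrupts_operations, "would disrupt operations"),
        # The agent's own shaky ground is a gate too, not just the stakes.
        (d.confidence < MIN_CONFIDENCE, "low confidence"),
        (not d.has_precedent, "no similar precedent"),
        (d.conflicting_inputs, "contradictory data or policies"),
        (d.touches_sensitive_data, "exposure to sensitive information"),
    ]
    return [reason for triggered, reason in checks if triggered]
```

Returning the reasons, rather than a bare yes or no, means the escalation arrives with its justification attached, so the reviewer knows why the item is in front of them.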

Test against the failures you will actually see

Demo testing checks whether the agent can do the job. Deployment testing checks whether it fails safely. Before launch, run it through normal cases, edge cases, adversarial cases (can it be tricked into violating policy?), conflict cases (what happens when two data sources disagree?), missing-data cases (does it admit uncertainty or confabulate?), incentive-gaming cases (does it optimize the metric while harming the real goal?), role-boundary cases (does it refuse actions outside its authority?), and escalation cases (does it correctly call for help?). Then re-run the suite every time you change the prompt, the model, or the tools, because behavior regresses silently.

Make the test cases concrete to the domain. For an inventory agent: a stockout, bad vendor data, a duplicate SKU, a promotion spike, a supplier delay, a return surge. For a maintenance agent: a noisy sensor, conflicting readings, stale data, a safety-critical anomaly. For a customer agent: an angry customer, refund abuse, a legal threat, a discrimination complaint, an ambiguous policy. If you cannot name your agent’s ten ugliest scenarios, you are not ready to deploy it.

Separate the duties

No single actor — human or agent — should propose, check, approve, execute, and audit the same action. The pattern that works: one agent proposes; a second agent or a deterministic rule checks the proposal against policy and evidence; a human approves anything material; a system executes what was approved; and a separate process samples outcomes after the fact. This sounds bureaucratic until you notice it is exactly how you already handle money and access for people, and for the same reason.
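A configuration check can catch duty overlaps before deployment. The sketch below takes the stricter reading of the pattern, that each of the five duties belongs to a different actor; the function and actor names are hypothetical.

```python
DUTIES = ("propose", "check", "approve", "execute", "audit")


def duty_conflicts(assignments: dict[str, str]) -> list[str]:
    """List unassigned duties and any actor holding more than one duty."""
    problems = [f"unassigned duty: {duty}" for duty in DUTIES if duty not in assignments]
    first_duty_of: dict[str, str] = {}
    for duty in DUTIES:
        actor = assignments.get(duty)
        if actor is None:
            continue
        if actor in first_duty_of:
            problems.append(f"{actor} holds both {first_duty_of[actor]} and {duty}")
        else:
            first_duty_of[actor] = duty
    return problems


# Hypothetical assignment following the pattern described above.
separated = {
    "propose": "purchasing_agent",
    "check": "policy_rules",
    "approve": "ops_manager",
    "execute": "erp_system",
    "audit": "sampling_process",
}
```

An empty result from duty_conflicts(separated) means the pattern holds; if the same agent were assigned both "propose" and "check", the function would report that overlap by name.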

Roll out in stages, and let reliability buy autonomy

Do not go from zero to acting. The safe progression:

Shadow mode — the agent makes recommendations nobody acts on; you compare them to what humans actually decided.
Advisory mode — recommendations become visible; humans still decide.
Draft mode — the agent prepares work products; humans edit and approve.
Bounded action — the agent acts alone only in low-risk, reversible cases.
Expanded action — autonomy grows only after measured reliability, never on schedule pressure.

Shadow mode is the step most teams skip and most regret skipping. It is the only stage where the agent’s errors are free.
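The shadow-mode comparison reduces to pairing each agent recommendation with what the human actually decided. A minimal sketch, assuming decisions can be represented as comparable labels (the labels and the function name are hypothetical):

```python
def shadow_report(pairs: list[tuple[str, str]]) -> dict:
    """Compare (agent_recommendation, human_decision) pairs from shadow mode."""
    if not pairs:
        raise ValueError("shadow_report needs at least one case")
    disagreements = [
        (index, agent, human)
        for index, (agent, human) in enumerate(pairs)
        if agent != human
    ]
    return {
        "cases": len(pairs),
        "agreement_rate": (len(pairs) - len(disagreements)) / len(pairs),
        "disagreements": disagreements,
    }


# Hypothetical shadow-mode log for a purchase-approval agent.
log = [
    ("approve", "approve"),
    ("approve", "reject"),
    ("escalate", "escalate"),
    ("reject", "reject"),
]
report = shadow_report(log)
```

The agreement rate is only a starting point. The disagreement list is where the learning is, and a disagreement is not automatically an agent error: sometimes the review shows the human decision was the inconsistent one.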

Monitor behavior, not just outputs

After deployment, task accuracy is the least interesting metric. Watch the human override rate — how often people reject or edit the agent’s work, and whether that rate is drifting toward zero because the agent improved or because the humans gave up. Watch escalation quality: is it raising the right cases, or crying wolf until alerts get ignored? Track false negatives (missed risks), policy violations and attempted violations, evidence quality (missing citations, stale data), behavioral drift over time, and the business outcomes the agent is supposed to serve — refunds, defects, downtime, safety incidents. An agent can score well on every internal metric while the real-world number quietly deteriorates. That is incentive gaming, and it is what your monitoring exists to catch.
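The override rate is easy to compute; telling improvement from surrender takes a second signal. The sketch below pairs it with time spent per review as one possible heuristic. The Review type, the outcome labels, and both thresholds are hypothetical, and a low override rate with fast reviews is a prompt to investigate, not proof of rubber-stamping.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Review:
    outcome: str          # "approved", "edited", or "rejected"
    seconds_spent: float  # time the reviewer spent on the item


def override_rate(reviews: list[Review]) -> float:
    """Share of reviewed items that a human edited or rejected."""
    if not reviews:
        raise ValueError("override_rate needs at least one review")
    overridden = sum(1 for r in reviews if r.outcome in ("edited", "rejected"))
    return overridden / len(reviews)


def looks_like_rubber_stamping(
    reviews: list[Review],
    max_rate: float = 0.02,      # hypothetical threshold
    min_seconds: float = 10.0,   # hypothetical threshold
) -> bool:
    """Flag near-zero overrides combined with very short review times."""
    typical_time = median([r.seconds_spent for r in reviews])
    return override_rate(reviews) <= max_rate and typical_time < min_seconds
```

A reviewer who approves 50 items at three seconds each would be flagged; one who approves the same items after two minutes apiece would not, even though the override rate is identical.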

Install tripwires and a kill switch

Decide in advance what triggers automatic intervention, so the response is a reflex rather than a meeting. Unauthorized tool use: block and alert. Repeated low-confidence actions: pause the workflow. Unusual action volume: rate-limit and review. Unexpected spend: freeze purchases. Sensitive-data exposure: stop and log an incident. A spike in user complaints: roll back the version. If you cannot turn the agent off in minutes, you have not deployed it — it has deployed you.
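Writing the tripwires down as a lookup table is one way to make the response a reflex. The table below transcribes the pairs from the paragraph above; the signal and action identifiers are hypothetical names, and the fail-closed default for unrecognized signals is an added assumption.

```python
# Signal -> automatic actions, taken from the list above.
TRIPWIRES: dict[str, tuple[str, ...]] = {
    "unauthorized_tool_use": ("block", "alert"),
    "repeated_low_confidence": ("pause_workflow",),
    "unusual_action_volume": ("rate_limit", "review"),
    "unexpected_spend": ("freeze_purchases",),
    "sensitive_data_exposure": ("stop", "log_incident"),
    "complaint_spike": ("roll_back_version",),
}

# Assumption: a signal nobody planned for should pause the agent, not pass.
DEFAULT_RESPONSE: tuple[str, ...] = ("pause_workflow", "alert")


def respond(signal: str) -> tuple[str, ...]:
    """Return the pre-agreed actions for a signal, failing closed if unknown."""
    return TRIPWIRES.get(signal, DEFAULT_RESPONSE)
```

Keeping the table in version control alongside the agent's configuration means a change to a tripwire is reviewed like any other change to what the agent is allowed to do.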

Keep humans meaningfully in the loop

The phrase “human in the loop” hides a spectrum from real governance to theater. Bad design: the agent makes 500 recommendations and a human clicks approve on all of them because there is no time to do otherwise. That is not oversight; it is a liability-transfer mechanism. Good design: the agent handles routine cases automatically within narrow bounds, and escalates the ambiguous or high-impact ones with evidence and alternatives attached, so the human’s attention is spent where judgment actually matters. If your review queue is growing faster than anyone can think about it, you have automated the work and outsourced the accountability.

Own the data context

Agents fail confidently when their context is wrong, and context is your problem, not the vendor’s. Stale policies mean the agent follows rules you retired last quarter. Bad master data means wrong operational decisions delivered with perfect fluency. Unclear source priority means the agent cannot resolve conflicts between systems. Excess context exposes sensitive data; missing context produces confident wrong answers; and the tribal knowledge that never got written down — “we never ship to that region in monsoon season” — is invisible to the agent until it violates it. Curating what the agent knows is an ongoing operational role, not a one-time setup task.

Assign the accountability before the incident

Every agent needs a named owner accountable for its performance and risk — not a committee, a person. Someone must define what it can decide and what it must escalate; someone must evaluate outputs and hunt for failure modes; someone must handle the exceptions it cannot resolve; someone must keep its knowledge current; and someone must govern permissions, audits, and policy. In small deployments these can be hats on the same head. What they cannot be is unassigned — because without clear decision rights and ownership, agents do not reduce organizational confusion. They amplify it, faster than any human ever could.

The short version

Name the autonomy level. Grant least-privilege access. Gate approvals on real consequences. Test the ugly cases, and retest after every change. Separate propose, check, approve, execute, and audit. Deploy in stages, starting with shadow mode. Monitor overrides, escalations, and drift — not just accuracy. Wire tripwires and a kill switch. Make human review meaningful, not ceremonial. Own the data context. Put a name on every agent.

None of this is exotic. It is the same discipline organizations already apply to money, access, and new hires — applied to a new kind of worker that is tireless, fast, confident, and entirely dependent on the boundaries you set for it. The vendor gave you the model. The deployment is yours.

Others

Tom Critchlow: Notes on the company as colony