There's a version of this that everyone keeps describing. You take a workflow, document the steps, connect the right tools, and it runs. The AI handles the process, the output lands where it's supposed to, and the person who used to do that job is freed up for higher-value work. Cleaner margins, less dependency on headcount, a more scalable agency.
That version is real. It's just not what most people are actually experiencing when they sit down and try to build it.
What it actually looks like
It usually starts with an SOP. You take a process someone on your team runs, document every step end-to-end, and give the AI access to the tools it needs to execute. Project management software, Google Docs, data sources, web search. You find the right MCPs (Model Context Protocol servers that expose those tools to the model), connect everything up, and watch it start pulling data and moving through the workflow.
And then you hit the first wall.
Maybe it needs to write output into a Google Doc, but the access level isn't there. So that step gets pulled out and flagged as manual. Fine. You adjust. Then it needs to pull something from the web, and every time it tries, it asks for permission. You give permission. It gets blocked. You find a different tool that works. The next time you run it, that one gets blocked too. So you write a fallback into the instructions: if the first tool fails, try the second. Then the second gets blocked. The model goes looking for a third option on its own, or it quietly skips the step entirely and keeps going.
That last part is the one that tends to catch people off guard. You get to the end of the run and something is missing. You ask why. The model tells you it tried, ran into an issue, and moved on. It didn't flag it. It just kept going.
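The silent-skip problem is fixable at the orchestration layer rather than in the prompt. A minimal sketch of the idea: wrap each step in a fallback chain that records every tool failure and marks the step as failed when nothing works, so a missing output surfaces as a flag instead of a quiet gap. The function and tool names here are hypothetical, not part of any specific framework.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class StepResult:
    name: str
    output: object = None
    failed: bool = False
    errors: list = field(default_factory=list)

def run_with_fallbacks(name: str, tools: list[Callable[[], object]]) -> StepResult:
    """Try each tool in order; record every failure instead of discarding it."""
    result = StepResult(name=name)
    for tool in tools:
        try:
            result.output = tool()
            return result  # first tool that works wins
        except Exception as exc:  # a blocked or broken tool surfaces here
            result.errors.append(f"{tool.__name__}: {exc}")
    result.failed = True  # nothing worked: flag it, don't skip it
    return result

# Hypothetical tools: the primary is blocked, the backup works.
def primary_search():
    raise RuntimeError("blocked")

def backup_search():
    return ["result A", "result B"]

res = run_with_fallbacks("web_search", [primary_search, backup_search])
assert res.output == ["result A", "result B"]
assert res.errors == ["primary_search: blocked"]
```

The point is that the failure log travels with the result, so a reviewer can see that the first tool was blocked even when a fallback saved the run.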
So you strip out the parts that can't be reliably accessed, focus it on what it can actually do, and try to get at least the data gathering and organization to run cleanly. But then you're chasing tool failures, rewriting instructions to handle edge cases, swapping MCPs for direct API calls, and testing variations to figure out which combination holds. Version after version after version.
Eventually you end up with something that took significant time to build, doesn't complete the workflow end-to-end, and isn't stable enough to run without someone watching it.
The math that doesn't always work out
The natural answer at that point is to have someone on the team manage the process. Let the AI handle what it can, have a human monitor and catch the gaps.
But that person didn't build the thing. They don't know where it tends to break or what a failure looks like versus normal output. So now there's a training problem on top of everything else. And once they're trained, you're paying a person to monitor a system, plus paying for the model usage, and using a high-quality model because you want the output to actually be good. Add it up, and you can end up spending more, not less, than you were before any of this existed.
That's not a reason to stop pursuing automation. It's just an honest picture of where things are right now.
Where this is actually going
The fundamental issue is that LLMs are probabilistic. The same instructions don't guarantee the same output every time. That's a core characteristic of the technology, not a configuration problem. Agents can use tools, connect to external systems, take actions, and move through multi-step processes. That capability is real. But reliable, unsupervised, end-to-end execution on complex workflows is still much further out than the current conversation suggests.
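Because the same prompt can produce different outputs on different runs, one common mitigation is to validate each attempt against what the workflow actually needs and retry on failure, rather than trusting any single run. A minimal sketch, with a stubbed-out model standing in for a real API call (all names here are illustrative):

```python
import json

def generate_with_validation(call_model, validate, max_attempts=3):
    """Call the model up to max_attempts times, returning the first
    output that passes validation instead of trusting a single run."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        raw = call_model(attempt)
        try:
            return validate(raw)
        except ValueError as exc:
            last_error = exc
    raise RuntimeError(f"no valid output in {max_attempts} attempts: {last_error}")

# Stub model: the first attempt returns prose, the second returns the
# JSON we actually asked for, mimicking run-to-run variance.
responses = {1: "Sure! Here are the leads you wanted...", 2: '{"leads": ["Acme Co"]}'}

def fake_model(attempt):
    return responses[attempt]

def expect_json(raw):
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        raise ValueError("not JSON")
    if "leads" not in data:
        raise ValueError("missing 'leads' key")
    return data

result = generate_with_validation(fake_model, expect_json)
assert result == {"leads": ["Acme Co"]}
```

Validation and retries narrow the variance but don't eliminate it, which is why the supervision point below still holds.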
What most agencies actually have access to right now is a capable assistant that needs supervision, not an autonomous operator that can be set loose on a process.
What to actually do with this
Not every workflow is worth automating at this stage, and trying to force automation on the wrong ones is where most of the time gets lost. The workflows worth pursuing right now tend to have a specific profile. They're high enough in volume that even partial automation creates real leverage. The steps are structured enough that the AI isn't being asked to make judgment calls. And when something goes wrong, it's visible and recoverable before it causes downstream damage.
For everything else, the better move is to identify where the review points need to be and build them in deliberately. Not as a workaround, but as part of the design. Figure out which steps carry the most risk if the output is wrong, and put a human checkpoint there. That's not a failure of the system. That's an accurate read of what the system can actually do.
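One way to make those review points part of the design rather than an afterthought is to mark high-risk steps in the workflow definition itself, so execution pauses there until a human signs off before anything flows downstream. A minimal sketch with hypothetical step names:

```python
def run_workflow(steps, approve):
    """Run steps in order; hold at any step marked high-risk until a
    human approves its output before downstream steps can run."""
    outputs = {}
    for step in steps:
        result = step["run"](outputs)
        if step.get("high_risk") and not approve(step["name"], result):
            # Held for review: downstream steps never execute.
            return {"status": "held", "at": step["name"], "outputs": outputs}
        outputs[step["name"]] = result
    return {"status": "done", "outputs": outputs}

# Illustrative three-step workflow: drafting outbound email is the
# step where a wrong output causes real damage, so it gets the gate.
steps = [
    {"name": "gather", "run": lambda o: ["lead1", "lead2"]},
    {"name": "draft_email", "run": lambda o: f"Hi, re: {o['gather'][0]}", "high_risk": True},
    {"name": "send", "run": lambda o: "sent"},
]

# Simulated reviewer who rejects the draft: the send step never runs.
held = run_workflow(steps, approve=lambda name, result: False)
assert held["status"] == "held" and held["at"] == "draft_email"
assert "send" not in held["outputs"]
```

The checkpoint only costs human attention at the steps where a bad output is expensive, which is the trade the section above is describing.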
The reliability problem will get solved. But designing your workflows around where things actually are, not where they're headed, is what keeps you from burning time chasing a version of AI that doesn't exist yet.
