I thought building AI agents would be the hardest part. I was wrong

by Vishnu K

When I started building AI agents, I thought the main challenge would be making them actually work.

Reasoning properly.
Using tools correctly.
Completing tasks end to end without breaking.
That part is still hard, but it was not the real problem.
Things changed when we scaled Agent37 and moved to multiple agents running in parallel. Research, coding, content, and background workflows all running at once.
Individually it was fine. Together, harder to manage than expected.
The question stopped being “can the agent do the task?”
It became:
What is running right now?
What is stuck?
What needs review?
What failed while I was away?
That is when it clicked.
The problem was not intelligence anymore. It was visibility and supervision.
Most of the work is now about building systems you can actually track and control, not just make smarter.
Did anyone else feel this shift from building agents to managing them?
At what point did supervision become harder than capability for you?

Vishnu K

on June 18, 2026


1

Building AI agents is only half the game; the real challenge is getting people to use and trust them consistently. Just like in cricket, having talented players isn't enough; execution wins matches. I’ve seen a similar focus on usability with Truecaller Mod APK, where user experience matters as much as the technology itself. See more details: https://trucallermodi.com/

alexandercharli

·
8 minutes ago
·
Reply
1

what does stuck actually look like for you in practice? an agent that's silently looping, one that's waiting on an external API that never responds, and one that completed but produced garbage output are three different failure modes that need three different detection strategies. curious which of those you're catching reliably right now and which ones still slip through
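A minimal sketch of how those three failure modes could be told apart, assuming each run exposes its recent actions, the time since it last made progress, and its final output. `RunSnapshot`, `detect_failure`, the thresholds, and the labels are all hypothetical names chosen for illustration, not anything from Agent37.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RunSnapshot:
    """Hypothetical view of one agent run at a point in time."""
    recent_actions: List[str]          # e.g. the last N tool calls, as strings
    seconds_since_progress: float      # time since the last new action or result
    finished: bool = False
    output: Optional[str] = None


def detect_failure(
    run: RunSnapshot,
    *,
    loop_threshold: int = 3,           # assumed: same action repeated this often = loop
    stall_timeout: float = 300.0,      # assumed: seconds of silence before "stalled"
    is_valid: Callable[[Optional[str]], bool] = lambda out: bool(out and out.strip()),
) -> Optional[str]:
    """Return 'looping', 'stalled', 'bad_output', or None if nothing looks wrong."""
    if run.finished:
        # Completed runs can still be wrong, so validate the output itself.
        return None if is_valid(run.output) else "bad_output"
    if run.recent_actions:
        # A silent loop shows up as the same action repeated in the recent window.
        _, count = Counter(run.recent_actions).most_common(1)[0]
        if count >= loop_threshold:
            return "looping"
    if run.seconds_since_progress > stall_timeout:
        # No new activity for too long, e.g. waiting on an API that never answers.
        return "stalled"
    return None
```

The default `is_valid` only rejects empty output; catching plausible-looking garbage would need a task-specific check passed in its place.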

adin_builds

·
an hour ago
·
Reply
2

Great read! Your content is clear, engaging, and valuable. Wishing you continued success with your blog.

AjayKumar

·
4 hours ago
·
Reply
1. 1
  
  Thanks for the encouragement. Much appreciated!
  
  an_engineer_log
  
  ·
  2 hours ago
  ·
  Reply
2

this matches what we're seeing too. the individual agent problem is mostly solved at this point — the hard part is knowing whether the output is actually good when you have 5 of them running. execution scales easily, quality evaluation doesn't.

Ozzie

·
5 hours ago
·
Reply
1. 1
  
  That's a great way to frame it. We found that the challenge was not whether the workflows could produce output but whether we could efficiently identify which outputs needed attention and which were ready to trust. Execution got easier as the systems improved. Evaluation and supervision didn't scale nearly as cleanly.
  
  an_engineer_log
  
  ·
  2 hours ago
  ·
  Reply
2

This is very relatable. The agent usually is not the problem anymore, keeping track of multiple workflows is. Once you have several things running at the same time, knowing what's finished, what's stuck, and what needs attention becomes a challenge on its own.

MaryamShafaqat

·
15 hours ago
·
Reply
1. 2
  
  That was the surprising part for us as well. The workflows were generally doing what they were supposed to do but maintaining visibility across everything became increasingly difficult as usage grew.
  
  an_engineer_log
  
  ·
  15 hours ago
  ·
  Reply
  1. 2
    
    This may be a dumb question: isn't the inherent idea behind agentic AI that execution will be no concern and only oversight is left?
    
    talkingbern
    
    ·
    12 hours ago
    ·
    Reply
    1. 1
      
      Not a dumb question at all. I think that's the direction many of us are aiming for. What surprised me was that even as execution improved, oversight didn't disappear; it just became a different problem.
      
      Instead of asking "can the agent do this?", we started asking "what is it doing right now?", "what needs review?" and "where should I focus my attention?". That's where the supervision challenge started to show up for us.
      
      an_engineer_log
      
      ·
      7 hours ago
      ·
      Reply
2

Completely agree that the bottleneck shifts. Getting an agent to do work is one thing, managing ten of them is another. Coordination and oversight start taking more effort than the actual task execution.

Indiehacker7802

·
15 hours ago
·
Reply
1. 1
  
  Exactly. A single workflow is usually manageable but once multiple agents are running across different projects, coordination becomes a challenge of its own. We found ourselves spending more time tracking work than actually initiating it.
  
  an_engineer_log
  
  ·
  15 hours ago
  ·
  Reply
2

Feels similar to traditional operations problems. Once the system scales, visibility and control become more important than raw capability. The technology improves but the management layer becomes the new challenge.

Buildingblock

·
16 hours ago
·
Reply
1. 1
  
  Yeah we spent a lot of time thinking about agent capabilities but once multiple workflows were running in parallel, visibility became the bigger problem. The challenge shifted from execution to supervision.
  
  an_engineer_log
  
  ·
  16 hours ago
  ·
  Reply
1

Spot on. Kind of wild

TurboLobster

·
11 hours ago
·
Reply
1. 1
  
  Exactly. That shift happened much sooner than we expected.
  
  an_engineer_log
  
  ·
  7 hours ago
  ·
  Reply
1
Are you trying to validate the idea of building a GUI for all the agents running? If so you may find the following helpful:
I have tried that before. Didn't work out for me because many of the assumptions you have are built on the model's current capabilities. When the models become stronger, two things happen:
1. Many people including myself don't need to supervise them anymore. I don't care about how many agents are doing my work behind the curtain and how they did it as long as they get the work done.
2. These models can explain themselves better and better. For example, Claude Code can now tell you exactly what each subagent is doing and supervise them. Any prebuilt GUI lacks that flexibility.
BonanKou

·
11 hours ago
·
Reply
1. 1
  
  That's a fair point, and I can definitely see that becoming true for certain types of workflows as model capabilities improve.
  The challenge we kept running into wasn't necessarily understanding every reasoning step, but maintaining visibility across multiple long-running tasks and projects at once. For us, it was less about inspecting agents and more about knowing what was in progress, what needed review, and where attention was required.
  I do agree that as models get better, the supervision layer will probably evolve as well. The interesting question is how much visibility people will still want once agents become significantly more reliable.
  
  an_engineer_log
  
  ·
  7 hours ago
  ·
  Reply
1

Yeah this hits. Same exact thing happened with distributed systems: the second you go from one process to a bunch running at once, the hard part stops being "does it work" and becomes "can I even see what's happening and jump in when something breaks."

I'm in the cloud infra world and watching agents run into this is kinda wild. It's the same supervision/observability problem infra spent like a decade figuring out, just way faster now since agents are non-deterministic on top of being parallel.

what'd you end up building for the visibility side? dashboards, logging, some human in the loop thing?

noah_digital

·
11 hours ago
·
Reply
1. 1
  
  That's a great comparison. The more we worked with multiple workflows, the more it started to feel like an observability problem rather than an agent problem. We ended up focusing on a mission control approach where we could see task status, review queues and workflow progress in one place. The goal wasn't to control every step but to make it obvious where attention was needed and when human intervention made sense.
  
  an_engineer_log
  
  ·
  7 hours ago
  ·
  Reply
1
Felt this exact shift. The moment you go from one agent to a fleet running in parallel, the bottleneck stops being "is it smart enough" and becomes "can I trust what it did while I wasn't looking."

What helped me most was flipping it from monitoring to designing for supervision up front — building the agents so the dangerous parts can't fail silently:
- every action writes an audit trail, so "what happened while I was away" is actually answerable
- anything irreversible (touches money, sends something external, deletes) sits behind an approval step instead of fire-and-forget
- actions are idempotent, so a retry after a crash doesn't double-execute
Dashboards tell you what's stuck — but that design is what stops a stuck or confused agent from quietly doing damage. Capability you can improve later; a fleet you can't see into burns you on day one.
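The three guards listed above can be sketched in one small wrapper. This is an illustration under stated assumptions, not the commenter's implementation: `SupervisedExecutor`, its `run` method, the `approve` callback, and the in-memory storage are all hypothetical, and a real system would persist the log and the idempotency keys somewhere durable.

```python
import json
import time
from typing import Any, Callable, Dict, List


class SupervisedExecutor:
    """Hypothetical wrapper: audit every action, gate irreversible ones, dedupe retries."""

    def __init__(self, approve: Callable[[str, dict], bool]):
        self.audit_log: List[str] = []        # append-only JSON lines (in memory here)
        self._results: Dict[str, Any] = {}    # idempotency key -> stored result
        self._approve = approve               # assumed human or policy approval hook

    def _audit(self, event: str, key: str, action: str) -> None:
        self.audit_log.append(
            json.dumps({"ts": time.time(), "event": event, "key": key, "action": action})
        )

    def run(self, key: str, action: str, params: dict,
            fn: Callable[..., Any], *, irreversible: bool = False) -> Any:
        # Idempotency: a retry with the same key returns the stored result.
        if key in self._results:
            self._audit("skipped_duplicate", key, action)
            return self._results[key]
        # Approval gate: irreversible actions never run fire-and-forget.
        if irreversible and not self._approve(action, params):
            self._audit("rejected", key, action)
            raise PermissionError(f"{action} was not approved")
        self._audit("started", key, action)
        try:
            result = fn(**params)
        except Exception:
            self._audit("failed", key, action)
            raise
        self._results[key] = result
        self._audit("succeeded", key, action)
        return result
```

With this shape, "what happened while I was away" becomes a read of `audit_log`, and a crash-and-retry of the same key cannot execute the action twice within one executor.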

For me the line crossed the moment agents could take real actions, not just generate text. Read-only parallelism is easy to babysit; the second they can do things, supervision becomes the whole game. Where did it tip for you — the number of agents, or the actions they were allowed to take?
syednoor

·
12 hours ago
·
Reply
1. 1
  
  I think it was a combination of both, but the number of concurrent workflows made the problem impossible to ignore. Individually, most tasks were manageable. The challenge appeared when multiple agents were running across different projects and we no longer had a clear picture of what was completed, blocked, or waiting for review. I completely agree on designing for supervision upfront. We found that visibility, review points and clear task status became just as important as the agent capabilities themselves. Once agents start taking meaningful actions, knowing when and where to intervene becomes critical.
  
  an_engineer_log
  
  ·
  7 hours ago
  ·
  Reply
1

This resonates. The shift you're describing — from "can the agent do the task" to "what's running, what's stuck, what failed while I was away" — is the moment an agent stops being a tool and becomes a team you manage. And managing always scales worse than building.

For us the inflection was that parallel runs make failure quiet. One agent failing is obvious; one of six failing silently at 2am is an ops problem, not an AI problem. What helped: treat every run like a job in a queue — explicit states (queued / running / needs-review / failed), one timeline view, and forcing each run to end in a reviewable artifact instead of just "done." Supervision got easier once "what happened" was a record, not a memory.
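A rough sketch of the run-as-a-job idea, using the four states named above. The `Run` class, the allowed transitions, and the `board` helper are assumptions made for illustration; in particular, treating needs-review and failed as terminal is a simplification.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional, Tuple


class State(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    NEEDS_REVIEW = "needs-review"
    FAILED = "failed"


# Assumed transition table: a run either ends in review or fails, nothing else.
ALLOWED = {
    State.QUEUED: {State.RUNNING},
    State.RUNNING: {State.NEEDS_REVIEW, State.FAILED},
    State.NEEDS_REVIEW: set(),
    State.FAILED: set(),
}


@dataclass
class Run:
    name: str
    state: State = State.QUEUED
    artifact: Optional[str] = None
    timeline: List[Tuple[str, str]] = field(default_factory=list)

    def move(self, new_state: State, note: str = "") -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} is not allowed")
        self.state = new_state
        self.timeline.append((new_state.value, note))  # the record of "what happened"

    def finish(self, artifact: str) -> None:
        # A run may not simply report "done"; it must hand over something reviewable.
        if not artifact:
            raise ValueError("a run must end in a reviewable artifact")
        self.move(State.NEEDS_REVIEW, "artifact ready")
        self.artifact = artifact


def board(runs: List[Run]) -> Dict[str, List[str]]:
    """Group run names by state, as a single status view across all runs."""
    view: Dict[str, List[str]] = {s.value: [] for s in State}
    for run in runs:
        view[run.state.value].append(run.name)
    return view
```

Calling `board` on all runs gives the one-glance view of what is queued, running, waiting for review, or failed, while each run's `timeline` keeps the per-run history.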

Curious what you landed on for the review step — human-in-the-loop gating per task, or let it run and triage after?

seamangonna

·
13 hours ago
·
Reply
1. 1
  
  "One of six failing silently at 2am is an ops problem, not an AI problem" is a great way to put it. That's very close to what we experienced.
  We ended up leaning toward explicit task states and a review queue rather than treating completion as the end of the workflow. Having a clear view of what was running, blocked, waiting for review or completed turned out to be far more valuable than simply knowing a task had finished.
  For review, we've generally found human approval works best at key decision points rather than on every step. Otherwise the overhead starts defeating the purpose of the automation.
  
  an_engineer_log
  
  ·
  7 hours ago
  ·
  Reply
1

We ran into something similar. Once you have several tasks running at the same time, visibility becomes a much bigger issue than execution. It's easy to underestimate how much context switching happens when you are monitoring everything manually.

farwaabbas

·
13 hours ago
·
Reply
1

I feel like this is where most teams eventually end up. The more capable the agents get, the more important monitoring becomes. Reliability matters but knowing what's happening across the system matters just as much.

buildtheory

·
14 hours ago
·
Reply
1

At what point did supervision become a bigger challenge than the actual agent performance? Was there a specific workflow or project that made this problem obvious?

shipstack2016

·
14 hours ago
·
Reply
1. 1
  
  It became obvious once we started running multiple workflows in parallel across different projects. Individually they were manageable but keeping track of progress, reviews and blocked tasks across all of them quickly became the bigger challenge.
  
  an_engineer_log
  
  ·
  14 hours ago
  ·
  Reply