We have been in the AI space long enough to be skeptical of big numbers. So when we say the AI agents market is on track to hit $47.1 billion by 2030 (MarketsandMarkets, 2024), we are not saying it to impress you. We are saying it because we are watching the spending happen in real client budgets right now.
|
$47.1B
|
44.8%
|
33%
|
78%
|
|---|---|---|---|
|
Projected AI Agents Market by 2030 (MarketsandMarkets, 2024) |
Annual Growth Rate 2024 to 2030 |
Enterprise Apps with Agentic AI by 2028 (Gartner) |
Enterprises Piloting Agentic AI in 2024 (McKinsey) |
Gartner puts it plainly: 33% of enterprise software will include agentic AI automation by 2028. That is up from less than 1% in 2024. We are not in a slow-burn adoption curve. We are in the steep part.
What most people picture when they hear ‘AI agent‘ is still a chatbot with better memory. Something that answers questions faster, or drafts an email without being asked twice. That picture is wrong, and if you build on it, your implementation will fail.
At ThirdEye Data, we have spent the last several years moving clients past that mental model. The real thing is different. These are systems that plan, make decisions, use external tools, and finish multi-step work without hand-holding. The shift is not just technical. It changes how you think about workflow design, how much you trust AI outputs, and how you keep people in the loop without turning them into rubber stamps.
This piece is about what we have actually learned doing this work. Not frameworks, not diagrams from a research paper. What happens when you put one of these systems into production.
The word ‘agentic’ gets attached to almost anything with an LLM inside it now. For practical purposes, we use four criteria. A system has to clear all four to earn the label.
|
Criterion
|
What It Means
|
Why It Matters for Engineering
|
|---|---|---|
|
Goal-Directed |
Works toward an objective across many steps, not just a single prompt |
The system can run unattended without re-prompting after each step |
|
Planning Capable |
Breaks a big objective into a sequence of smaller tasks |
Handles workflows that are too long or complex for one-shot prompting |
|
Tool-Using |
Calls APIs, queries databases, runs code, writes files |
Moves from producing text to producing real-world outcomes |
|
Feedback-Responsive |
Reads the result of what it just did and adjusts the next step |
Self-corrects without a human pointing out the error |
Hit two or three and you have something useful. Hit all four and you have a categorically different engineering problem on your hands.
The most common mistake we see: A team builds a retrieval-augmented generation pipeline, puts it inside a loop, and calls it an agent. The loop is a start, not a finish. A real agent maintains state. It can back out of a dead-end subtask. It picks different tools depending on what it finds along the way. That is not what you get from a RAG pipeline in a loop, and the failure modes that show up six months later are very different.
We have rebuilt these systems more than once. After enough iterations across different industries and model providers, we have settled on a structure that holds up. It is not exciting to look at on a slide. It works in production, which is the only thing that matters.
The most important decision in any agentic architecture is whether you separate planning from execution. We always do.
One layer does the thinking. It receives the objective, figures out what needs to happen, decides which tools to call in what order, watches for failures, and pulls the results together at the end. This layer never touches an external system directly.
Another layer does the work. Individual workers execute specific tasks. A database worker runs queries. A document worker reads files. A communication worker handles messages. Each worker is narrow by design. Narrow workers are easy to test, easy to replace, and easy to audit.
"When something breaks in production, and it will, your first question is whether the failure came from bad reasoning or bad execution. If those two things live in the same place, you will spend days figuring out which one failed. If they are separate, you know in minutes."
Sanghamitra Majumder, AI Engineer @ThirdEye Data
Agents need memory in a way that a chatbot simply does not. We work with four kinds.
The episodic memory gap causes more production failures than anything else we have seen. Five steps into a task, the agent does something. Ten steps in, the environment has changed. If the agent has no record of what it did earlier, it contradicts itself, repeats work it already did, or gets stuck in a loop. A structured event log for the current session fixes most of these problems. Build it first, not after your first production incident.
Every tool the agent can call should live in a central registry. Each entry in the registry describes what the tool does, what inputs it takes, what it returns, and what can go wrong.
If a capability is not in the registry, the agent cannot use it. Full stop. This feels like a constraint. It is actually what makes these systems safe enough to put in front of enterprise clients. You can audit the registry, lock it down by role or department, version-control it, and monitor every call against it. An agent that invents its own tools at runtime is an agent nobody can govern.
The rule we enforce
|
Component
|
What It Does in Practice
|
|---|---|
|
Orchestrator |
Takes the objective, builds the plan, delegates to workers, watches for failures, puts results together. No direct contact with external systems. |
|
Workers |
Each handles one type of task. Database queries, API calls, file operations, communications. Narrow scope, easy to swap. |
|
Tool Registry |
The master list of what the agent is allowed to call. Explicit schemas. Nothing outside the list gets used. |
|
Memory Layer |
Manages all four memory types. Most critical for coherence on long-running tasks. |
|
Validation Layer |
Checks every tool call before it goes out, checks every result before it comes back in. Catches bad parameters and fabricated outputs. |
|
Human Review Gates |
Stops the agent at configured points before any action that cannot be undone or carries real stakes. |
Every team we have worked with has hit at least two of these. Usually more. Here is what to expect and what to do about it.
Complex tasks generate a lot of intermediate content. Financial analyses, multi-day procurement workflows, anything that involves reading a large body of documents will eventually fill up the model’s context window. When that happens without a plan, the agent quietly drops the oldest content or stops cold.
We build a context management layer that tracks how full the window is, summarizes completed steps when space gets tight, and writes key state to episodic memory before anything important gets dropped. No model provider ships this out of the box. You have to build it yourself. Build it before you go live, not after the first failure.
This one surprises people the first time they see it. The agent decides it needs to call a tool. It specifies parameters that do not exist. Or it calls tools in an order that breaks dependencies. Or it invents a result for a tool call it thinks should have happened.
The dangerous part is what comes next. The agent keeps going, acting on the invented result. It is not like a text hallucination where the output is obviously wrong. The agent just proceeds, confidently, on bad data.
We put a validation layer between the orchestrator and every tool call. Input validation before the call, structural integrity check on the result before it goes back to the orchestrator. This layer catches 15 to 20 percent of tool invocations that would otherwise go through on bad data. That number surprised us the first time we measured it.
Agents that write to databases, send emails, move money, or delete files are working in a world where mistakes stick around. We sort every tool in the registry into one of three buckets.
Anything irreversible or high-stakes gets a human review gate. The agent writes out what it plans to do and why, then waits. The gate can be set by threshold (flag any financial transaction above a set amount) or by action type (all outbound communications need approval).
One client set up approval gates on every agent action. Within a week, reviewers were clicking through without reading. The control was worthless. We rebuilt the logic to only surface genuinely novel or risky actions, which cut approval volume by 80 percent. Quality of review went up because people were only seeing things that actually needed a human decision. Too many gates is as bad as none.
Field Note: Approval fatigue is real
An agent hits an obstacle and tries something else. That fails too. It tries a third approach. Without any ceiling on this, the agent can run through its entire token and API budget without finishing the task. Worse, its plan can drift so far from the original objective that the work it does complete is not actually what was asked for.
Two controls handle this. First, a hard iteration limit per subtask. Configurable, but firm. Second, a coherence check every few iterations that compares the current plan to the original goal. If drift crosses a threshold, the agent stops and asks for clarification rather than continuing on its own.
Some tasks are genuinely too big or too varied for one agent. The domain knowledge required spans too many areas. The subtasks can run in parallel. A single agent trying to hold all of it together becomes slow and fragile. That is when multi-agent systems make sense.
A manufacturing client needed to speed up their sourcing process. A single procurement request had to touch vendor databases, pricing history, inventory systems, and contract templates. We built four specialist agents working under one orchestrator.
Each specialist had its own tools and its own memory. The orchestrator handled sequencing, managed handoffs between agents, and produced the final output.
Sourcing cycle time dropped 60 percent. Analyst hours per RFQ fell 75 percent. Vendor shortlist quality scores improved 40 percent based on a post-implementation audit. None of those results were achievable with a single-agent or RPA-based setup.
What the Numbers Looked like Post-deployment
Multi-agent systems add coordination problems that single-agent systems do not have. Agents can come back with conflicting outputs. Results need to be formatted so the receiving agent can actually use them. A failure in one agent needs to stop cleanly without taking down the others. We solve this with explicit schemas for inter-agent handoffs and a shared state store that all agents read from and write to through the orchestrator.
We have watched a lot of AI projects fail after a promising demo. The gap between the two is almost always the same thing: no real evaluation infrastructure.
The 2024 Stanford AI Index found that fewer than 30 percent of enterprise AI deployments had a formal evaluation framework in place at launch. That number maps almost exactly to the failure and rollback rates we hear about from clients who come to us after something went wrong.
We will not start building an agent until the evaluation framework exists. That is not a rule we invented to be difficult. It is a rule we invented because we built agents without it, and those projects cost more to fix than they would have cost to build correctly.
|
Metrics
|
What It Tells You
|
What the Number Should Do
|
|---|---|---|
|
Task Completion Rate |
Did the agent finish what it was asked to do? |
Trend up as the system matures |
|
Subtask Accuracy |
Were the individual steps done correctly? |
Catches failures that overall completion rate misses |
|
Tool Call Precision |
How many tool calls were valid and needed? |
High waste rate points to a reasoning problem |
|
Cost and Latency per Task |
What does a completed workflow cost? |
Required for any honest business case |
|
Failure Mode Distribution |
When it fails, what kind of failure is it? |
Points engineering effort at the right layer |
|
Human Escalation Rate |
How often does the agent give up and ask for help? |
Too high means poor capability; too low means poor judgment |
Every system we put into production has a golden dataset. A set of test cases with known correct answers. Edge cases that have caused problems before. Adversarial inputs designed to find weak spots.
We run the full agent against this dataset every time we make a significant change. Not just unit tests on individual components. The whole thing. It costs more than standard testing because every run involves real model inference. It is the only way to catch failures that only appear when the model, the tools, and the memory are all working together. Nothing else finds them.
Research prototypes can ignore compliance. Production systems in financial services, healthcare, or any industry with real regulatory exposure cannot. We build governance in from the beginning. Clients who ask us to add it later spend a lot more money.
Every action the agent takes needs to be logged in enough detail to reconstruct exactly what happened and why. In finance, SEC Rule 17a-4 and FINRA requirements apply. In healthcare, HIPAA audit controls. In Europe, GDPR accountability rules. These are not suggestions.
We write audit logs as append-only event streams with cryptographic integrity checks. Retention follows the client’s data governance policies. Some industries require seven years. You want to design for that before go-live, not explain to a regulator why you did not.
We do not build directly against any single model provider’s API. We build against an abstraction layer that sits in front of whichever provider we are using.
The reason is simple. The model landscape has moved fast enough that systems built directly against one provider’s API in 2023 needed significant rework in 2024. An abstraction layer means swapping models is a configuration change, not a code change.
It also lets us route different tasks to different models. Orchestration work goes to a reasoning-strong frontier model. Classification and extraction tasks often run cheaper and faster on smaller models. That routing meaningfully reduces operating costs on high-volume systems.
Pick the least glamorous workflow you can find that meets these four criteria: people understand it well, you can measure whether it worked, it currently eats significant manual time, and a mistake can be corrected.
Build the evaluation framework and golden dataset before you write the first line of agent code. Then work through this sequence.
|
Steps
|
What to Do
|
How You Know It Is Done
|
|---|---|---|
|
1 |
Choose one well-understood workflow with measurable outcomes |
Team agrees on exactly what a successful run looks like |
|
2 |
Write the evaluation framework and golden dataset |
50 or more test cases: happy path, edge cases, adversarial inputs |
|
3 |
Build the orchestrator and tool registry with 2 to 3 tools |
Every tool has a schema; orchestrator produces structured plans |
|
4 |
Add the memory layer and test multi-step coherence |
Agent completes 5-step tasks without contradicting itself or looping |
|
5 |
Deploy to a small internal group with full human review |
Real production failures get logged and added to the golden dataset |
|
6 |
Reduce human review gradually as reliability data builds |
Escalation rate stays in the target range |
|
7 |
Add tools and complexity only after the baseline is stable |
Core metrics hold steady or improve after each change |
Every time we have seen a team skip a step in this sequence, the production incident that followed cost more to fix than the step would have cost to do. Every time. The sequence is not slow. Recovering from a skipped step is slow.
On shortcuts
Agentic AI is not software you buy and configure. It is a capability you build, and you earn reliability through architecture discipline and real evaluation infrastructure.
The organizations we see succeeding with this are not necessarily using the best models. They are the ones who built solid evaluation before they built the first agent, who treated compliance as part of the design, and who resisted the urge to scale before the foundation was stable.
The technology does real work. The failure modes have real consequences. Holding both of those facts in your head at the same time is what actually gets you to production.
Written By:
Sanghamitra Majumder
AI Engineer, At ThirdEye Data
From a specific use case to a full-scale modernization, share your requirements, and our engineers will take it from there. We typically respond within 24 hours with a transparent, detailed assessment of what's possible for your business.
333 West San Carlos Street, San Jose, CA 95110 USA
6000 Rome Blvd, Brossard, Quebec J4Y 0B6 Canada
Technopolis, Kolkata, India
CTIE, Hubli, India
We are a full-stack AI development company that helps enterprises make better decisions, reduce costs, and operate more efficiently.
333 West San Carlos Street, San Jose, CA 95110 USA
India: Kolkata, WB & Hubli, KA
Canada: Brossard, Quebec