Workflow and guardrails · the easy part
An agent can be made to do almost anything in a sandbox in an afternoon. It can do almost nothing of value in production that it cannot reach, trust, and be trusted to touch.
An agentic platform installs in a week. The institution it has to operate inside took thirty years to build, and that is the part nobody priced. This report gives the CEO the shape of the gap, and gives the rest of the committee their part of closing it.
Every board in 2026 has seen the same demonstration. A vendor opens a chat window, types a sentence in plain English, and an agent plans a multi-step task, calls a few tools, and returns a finished piece of work in under a minute. It is genuinely impressive, and it is genuinely misleading, because the demonstration runs in an environment built to make the agent succeed. The data is clean. The tools are pre-wired. Nothing the agent touches is load-bearing. The leap from that room to a live insurance, banking, or manufacturing operation is not a small one. It is the whole problem, and the market has consistently mistaken the easy half for the hard half.
Gartner has put a number on the consequence. It predicts that over 40 percent of agentic AI projects will be cancelled by the end of 2027, blaming escalating costs, unclear business value, and inadequate risk controls. In the same note it observes, almost in passing, the sentence this entire report is built around: integrating agents into legacy systems is technically complex, often disrupting workflows and requiring costly modifications. That is not a footnote. That is the project.
The pattern underneath the cancellations is consistent across sectors. Roughly 17 percent of organisations have actually deployed AI agents, while more than 60 percent say they intend to within two years. That intent-to-deployment gap is where the write-offs will happen, and it is widest precisely where the core systems are oldest and the regulation is heaviest, which is to say banking, insurance, and industrial manufacturing. Gartner also warns of agent washing, the rebranding of chatbots and robotic process automation as agents, and estimates that of the thousands of vendors claiming agentic capability, only around 130 are real.
None of this is an argument against agents. Agentic AI is a real step beyond scripted automation, and Gartner itself expects at least 15 percent of day-to-day work decisions to be made autonomously by agents by 2028. It is an argument against buying the destination and skipping the readiness. The institutions that capture value will be the ones that treated the platform as the last 20 percent of the work, not the first.
An agentic deployment has five parts, and they are not equally hard. Defining the workflow is straightforward. Setting guardrails is straightforward. Choosing a foundation model is, in 2026, almost a commodity decision, because the strong models are close enough that the choice rarely decides the outcome. The two parts that decide the outcome are the two parts the demo hides: connecting the agent to the system of record so it can not only read but write, and supplying it with data and context it can actually trust. Those two consume the budget, the timeline, and the risk, and they are the reason careful institutions are slower and successful institutions are slower still.
There is a second failure that has nothing to do with technology and everything to do with translation. The CEO sets the mandate in the language of the business: grow the top line, protect the bottom line, do more with the headcount we have. That mandate is handed down a chain, and somewhere on the way it is received by the people who must actually make an agent reach Duck Creek or post to the core banking ledger or release an order in SAP. They feel a different mandate entirely, one about brittle APIs, schema drift, approval chains, and audit. The two mandates are never reconciled in a single document, the gap between them is never measured, and the project drifts into the 40 percent because no one owned the distance between what was promised and what was buildable. There is no method to the madness, and that absence is itself the diagnosis.
The enterprises that will be in the 40 percent did not buy the wrong agent. They bought a destination on CORE without measuring whether the institution could reach it on BEACON. The platform was ready. The institution was not.
Agentic AI is the most over-demonstrated and under-deployed technology of the decade. The demonstration is honest about the model and silent about the institution, and the institution is the entire variable. The gap between the two has a name and a shape, and it can be measured before a rupee or a dollar is committed.
Read every agent pitch through one filter: what does it have to reach to be worth anything, and is your institution ready to let it reach there safely. If that question has no answer, the project does not have a foundation. It has a demo.
To see why agentic projects fail, take the agent apart. An agentic platform is not one thing. It is a stack of five layers, and the market has spent its attention on the layers that no longer decide anything.
Strip the marketing away and an enterprise agent is a small, legible machine. At the bottom sits a foundation model, the reasoning engine that plans and decides. Above it sits an orchestration layer that turns a goal into a sequence of steps. Above that sit the tools, the connections through which the agent reads data and takes actions in real systems. Alongside runs memory, the context the agent carries across steps and sessions. Wrapping all of it are the guardrails, the rules that constrain what the agent may do. The platforms that an insurance company or a bank evaluates differ mostly in how they package these five layers. The layers themselves are universal, and so is the mistake: enterprises shop on the layers that are easy or commoditised, and discover the hard layers only after the cheque has cleared.
Foundation model. The reasoning engine. Almost every enterprise agent in 2026 is built on a frontier model it does not own, from Anthropic, OpenAI, Google, or an open-weight family such as the latest from DeepSeek or Mistral. The platform rents intelligence; it does not make it.
Orchestration. Turns a goal into ordered steps with branches and retries. Real engineering, but well understood and increasingly templated. This is what the demo shows, and it is the part a competent team stands up in days.
Guardrails. Input filters, output checks, permission scopes, escalation rules. Necessary and visible, and therefore the part vendors demonstrate proudly. Configuring them is a setup task, not a research project.
Tools. Where the agent actually reads enterprise truth and, more dangerously, writes it back. Binding a policy in a core insurance platform, posting to a core banking ledger, releasing an order in an ERP. This is the layer the demo pre-wires and the institution must build. It is most of the cost and most of the risk.
Memory. An agent reasons over what it is given. If the data schema is not agent-ready, and if the context of how decisions were made was never captured, the agent learns the gaps. This layer cannot be bought; it had to be built before anyone started, or it has to be built now.
The single most consequential point on this page is that the model layer, the one that gets the headlines, is the one that no longer decides the outcome. The strong models are close enough in capability that swapping one for another rarely turns a failing deployment into a working one. As one widely read 2026 analysis put it, the bottleneck is less about the capabilities of the models themselves and more about the challenge of getting these models to communicate with the rest of the business. The intelligence arrived. The institution to put it to work did not.
Of the thousands of vendors marketing agentic AI, Gartner estimates only about 130 are real. The rest are agent washing: chatbots, assistants, and RPA rebranded with an agentic price tag but no genuine ability to plan, act under control, hold state, and return an auditable result.
For two years the enterprise question was which model. That question is now close to settled, not because one model won, but because several are good enough that the difference is no longer the binding constraint. A frontier model and a strong open-weight model will both plan a claims triage or a reconciliation competently. What separates a working agent from a failing one is not the cleverness of the reasoning. It is whether the reasoning is connected to the institution's systems, grounded in the institution's data, and bounded by the institution's controls. That is why this report spends almost no time on model selection and almost all of it on the three layers that the model cannot compensate for: reach, data, and trust.
Resist the urge to re-run the model bake-off. Your differentiated engineering effort belongs in the tool and memory layers, the parts no vendor can deliver for you because they are specific to your stack. A model is a dependency you can swap in an afternoon. A clean, governed write path into your core system is a quarter of work that decides whether any of this is real.
CORE is the destination. It names the four places an agent creates enterprise value. BEACON is the readiness. It scores the six dimensions that decide whether the institution can actually let an agent operate there. The CEO points with CORE. The committee is graded on BEACON.
A great deal of confusion in enterprise AI comes from collapsing two different questions into one. The first question is where do we want the agent to act, which is a business question the CEO owns. The second is can we actually let it act there safely, which is a readiness question the whole committee owns. CORE answers the first. BEACON answers the second. Keeping them apart is the difference between a strategy and a wish.
Every credible agentic use case lands in one of four quadrants. Together they spell CORE, and they are the map a CEO uses to decide where autonomy is worth pursuing at all.
Agents that gather, reconcile, and surface what leaders need to decide: risk positions, exceptions, anomalies. The agent does not act; it sharpens the human who acts.
Agents that execute the work inside core processes: triage a claim, reconcile a ledger, expedite an order. This is where reach into the system of record matters most, and where most value and most risk live.
Agents that find, qualify, price, and serve: advisor copilots, underwriting assistants, next best action. The CEO's favourite quadrant, and the one most often demonstrated and least often integrated.
Agents at the front line: resolution, onboarding, service. High visibility, high brand risk. Gartner expects a third of firms to harm customer experience in 2026 by deploying here prematurely.
CORE is deliberately a destination model, not a capability model. It says nothing about whether you can get there. That is the job of the second lens.
BEACON scores readiness across six dimensions. Each carries a single signature metric, the one number that tells you whether that dimension will hold weight when an agent goes live. For agentic AI specifically, two of the six dominate the others, and the report is organised around them: Engineering, which for agents means Core Reachability, and Numbers, which for agents means whether the data and context are sufficient.
Pick a quadrant on CORE. That is where you want an agent. Then score the six dimensions of BEACON for that specific use case. If Core Reachability and Data Sufficiency are weak, the quadrant is unreachable no matter how strong the model or how clean the demo. CORE tells you the prize. BEACON tells you whether it is yours to take. A high CORE ambition on a low BEACON base is the precise recipe for a 2027 cancellation.
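The gate described above can be written down in a few lines. What follows is a minimal sketch, not the BEACON scoring method itself: the 0 to 5 scale, the threshold of 3, and the dictionary keys are assumptions introduced for illustration, and only the two dominant dimensions are named because they are the only two this section singles out.

```python
# Hypothetical sketch of the CORE-then-BEACON gate.
# The 0-5 scale, the threshold, and the key names are illustrative assumptions.
DOMINANT = ("core_reachability", "data_sufficiency")


def quadrant_reachable(beacon_scores: dict, threshold: float = 3.0) -> bool:
    """Return True only if both dominant dimensions meet the threshold.

    A missing score is treated as zero: an unmeasured dimension is not ready.
    """
    return all(beacon_scores.get(dim, 0.0) >= threshold for dim in DOMINANT)


# Example profile for one use case (made-up numbers, not survey data).
profile = {"core_reachability": 1.5, "data_sufficiency": 4.0}
print(quadrant_reachable(profile))
```

The design choice worth noting is that the two dominant dimensions act as a veto rather than being averaged in: a strong score elsewhere cannot buy back a weak write path or untrusted data.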
Your job is the first lens, not the second. Name the quadrant and the number you want moved, then refuse to fund the project until someone shows you the BEACON profile for it. You are not abdicating the technical decision. You are insisting that ambition and readiness be placed on the same page before capital is committed. That single discipline is what keeps a board out of the 40 percent.
An agent that only reads is a smarter search box. An agent that creates value has to write, and the place it has to write is the most protected, least forgiving system the institution owns. This is the engineering dimension of BEACON, and for agents it is the whole game.
When engineers first connect an agent to an enterprise, they expect the hard part to be the model. It is not. The first real obstacle appears in the place they least expect, the systems of record, described by one team that lived it as the quiet but uncompromising backbones of the enterprise. Every approval, every policy, every timestamp lives there. As The New Stack documented from a live deployment, those systems are designed to preserve truth, not speed. The agent connected to them easily. The APIs responded, data moved, nothing looked broken. Then one deployment went live and the agent began resolving service tickets automatically, reading and writing through the same endpoints the automation scripts used, and quietly skipping an approval step that existed to stop premature closure. The integration worked. The institution broke.
The distinction that decides everything is read versus write. A read-only agent retrieves and summarises. It is useful, low risk, and the thing most pilots actually are. A writing agent changes the state of the business: it binds a policy, posts an entry, releases an order, approves a payment. The moment an agent can write, the question stops being what can it see and becomes what can it do, and that question pulls the project out of the engineering domain and into governance, identity, and risk. The systems that hold institutional truth enforce validations, approval chains, and state transitions for a reason. An agent that writes to them either honours those rules, which is slow and hard, or bypasses them, which is fast and catastrophic. There is no third option, and the demo never shows you which one you bought.
The reach problem is not abstract. It has names, and every industry has a different one. For an insurance company, the agent must eventually reach the policy administration core, a Duck Creek or a Guidewire, where policies are bound and claims are adjudicated. For a bank, it is the core banking platform, a Temenos, a Finacle, an FIS, where the ledger is the institution's legal memory. For a manufacturer, it is the ERP and the product lifecycle systems, an SAP S/4HANA, a Teamcenter, where an order release moves real material and real money. These are not databases an agent can casually update. They carry bespoke configurations, intricate data models, proprietary logic, and decades of accumulated exceptions, and as practitioners working with SAP put it plainly, it is not realistic to expect a plug-and-play experience when deploying agents into them. The modern path exists, through interfaces like SAP's own Business Technology Platform that expose business objects to agents in a clean, consumable form, but it is a build, not a button.
The cruelty of the gradient is that value and difficulty rise together. The easy systems to reach, the modern CRM, the help desk, the document store, are also the systems where an agent creates the least durable value. The hard systems, the policy core and the ledger and the ERP, are where the work that moves an income statement actually happens. An institution that only lets its agents reach the easy systems will have a portfolio of impressive pilots and an unchanged set of financials. This is the activation gap in its agentic form: motion at the edge, stillness at the core.
Inventory your core systems by reachability before you inventory use cases. For each one, answer three things: does it expose a stable, documented write path; does that path enforce the system's own validations and approvals; and can every agent action be traced and reversed. Where the answer is no, that is not a use case, it is a modernisation project that has to finish first. Sequencing agents behind that work is the difference between a roadmap and a graveyard.
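The three questions translate directly into a checklist that can be run across the estate. The sketch below is illustrative only: the class, the field names, and the two example systems are hypothetical, and a real inventory would carry evidence for each answer rather than a bare yes or no.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreSystem:
    name: str
    stable_write_path: bool         # exposes a stable, documented write path
    enforces_own_rules: bool        # that path enforces validations and approvals
    traceable_and_reversible: bool  # every agent action can be traced and reversed


def classify(system: CoreSystem) -> str:
    """A system hosts agent use cases only if all three answers are yes."""
    ready = (
        system.stable_write_path
        and system.enforces_own_rules
        and system.traceable_and_reversible
    )
    return "candidate for agent use cases" if ready else "modernisation project first"


# Example entries with made-up answers, not an assessment of any real product.
inventory = [
    CoreSystem("policy-admin-core", True, False, False),
    CoreSystem("help-desk", True, True, True),
]
for system in inventory:
    print(f"{system.name}: {classify(system)}")
```

A single no is enough to move a system into the modernisation column, which is what sequencing agents behind that work means in practice.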
The most important infrastructure development of the last eighteen months speaks directly to this problem. The Model Context Protocol, introduced by Anthropic in late 2024 and donated in December 2025 to the newly formed Agentic AI Foundation under the Linux Foundation, has become the de facto standard for connecting agents to tools and data. It has been nicknamed the USB-C of AI for good reason: it replaces a thicket of bespoke integrations with one protocol, it now counts more than ten thousand active public servers and over ninety-seven million monthly SDK downloads, and crucially it supports both read and write, meaning an agent can not only retrieve information but take action through a standard interface. Every major vendor, including the makers of core enterprise software, now ships or supports it.
A standard connector is a genuine advance, and it is also a trap if it is mistaken for readiness. MCP makes it dramatically easier to give an agent a write path into a system. It does nothing to make that write path safe, governed, or correct. The same property that makes it powerful, that it lets a model act on behalf of a user, is what makes it dangerous, and the security community has noticed. When researchers categorised the conference submissions on MCP for 2026, fewer than four percent fell primarily into the opportunity category. The rest were about exposure. Tool poisoning, where a malicious or compromised connector manipulates the agent, is a documented and active threat, not a theoretical one, with benchmark studies showing high attack success rates against capable models precisely because they follow instructions well.
The protocol that makes agents useful is the same one that makes them dangerous. MCP lets a model act on behalf of a user, which moves the core question from what an AI can see to what it can do. That is why a connectivity standard became, almost overnight, a governance and identity problem on every CISO's desk.
MCP is the most important thing to happen to agentic integration and the most misunderstood. It collapses the cost of connecting an agent to a system, which is real and valuable. It does not collapse the cost of making that connection safe, which is the cost that actually matters. The institutions that win will treat MCP as the on-ramp it is, then do the governed integration work that the on-ramp does not do for them.
A protocol moves the agent to the door of the system of record. Whether the agent should be allowed through that door, with what authority, under what audit, and with what ability to undo what it does, is the readiness question. That is Core Reachability, and it is the dimension on which most agentic ambition quietly dies.
Reach gets the agent to the system. Data and context decide whether what it does there is right. This is the Numbers dimension of BEACON, and it splits into two problems: a schema problem you can still fix, and a context problem you largely cannot.
A model grounded on imperfect data learns the imperfections. This is the least glamorous and most decisive fact in enterprise AI, and it does not change because the system became agentic. If anything it gets worse, because an agent does not just answer from bad data, it acts on bad data, and an action is harder to retract than an answer. Gartner attributes the majority of AI project failure not to the model but to the absence of AI ready data, data aligned to a use case, governed at the asset level, supported by automated pipelines with quality gates, and continuously assured. Most enterprises do not have it. They have data that was wrong, missing, mislabelled, or scattered across systems that never spoke to each other, and they assumed a capable enough agent would compensate. It cannot.
The first half of the problem is structural. An agent reaching across CRM, billing, fulfilment, and the core platform needs those systems to agree on what a customer is, what a policy is, what an order is. When it retrieves Customer XYZ from the CRM and the billing system, something has to resolve, deterministically and without a human at query time, whether those two records are the same entity. A human analyst does this instinctively, reading ambiguous results and applying judgement. An agent cannot. When it meets conflicting information from two systems, it cannot stop to ask for clarification; it picks one and proceeds, at machine speed, across thousands of cases. This is why a clean schema and a semantic layer are not nice to have. They are the substrate the agent reasons on, and where they are missing, every retrieval is a coin toss the institution has automated.
The good news about the schema problem is that it is solvable. It is data engineering: entity resolution, a semantic layer, governed pipelines with quality gates. It is expensive and unglamorous and it cannot be skipped, but it is bounded work with a known shape. An institution that funds it will get an agent that acts on truth. An institution that skips it will get an agent that acts on noise, confidently, at scale. That is the verification tax in its agentic form: if every agent action has to be re-checked by a human because the data underneath it cannot be trusted, the agent has not saved the work, it has moved the work and added a step.
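To make the Customer XYZ case concrete, here is a minimal sketch of deterministic entity resolution between a CRM record and a billing record. It is an illustration under stated assumptions: the field names `tax_id` and `postcode` and the choice of those two fields as the match key are hypothetical, and production entity resolution typically involves richer rules, survivorship logic, and a governed master record.

```python
import re


def match_key(record: dict) -> tuple:
    """Build a deterministic key from a normalised tax ID and postcode.

    Normalisation strips punctuation and whitespace and upper-cases the value,
    so formatting differences between systems do not produce different keys.
    """
    tax_id = re.sub(r"\W", "", record["tax_id"]).upper()
    postcode = re.sub(r"\s", "", record["postcode"]).upper()
    return (tax_id, postcode)


def same_entity(a: dict, b: dict) -> bool:
    """Same inputs always give the same answer; no human judgement at query time."""
    return match_key(a) == match_key(b)


# Two made-up records for the same customer, formatted differently per system.
crm = {"name": "XYZ Ltd.", "tax_id": "ab-123 456", "postcode": "sw1a 1aa"}
billing = {"name": "X.Y.Z. Limited", "tax_id": "AB123456", "postcode": "SW1A1AA"}
print(same_entity(crm, billing))
```

The free-text name is deliberately left out of the key. A rule the agent can apply identically ten thousand times is worth more here than a fuzzy match that is right most of the time.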
Your deliverable is not a data lake, it is a resolved, governed semantic layer that returns the same answer to the same question every time. Before any agent is allowed to write, prove that it can read a single, deterministic version of each core entity. The schema is the contract between your data estate and every agent that will ever run on it. Sign that contract once, properly, and every future agent inherits it. Skip it, and every future agent re-inherits the chaos.
The second half of the problem is the hard one, and it is the one the market discovered late. Knowledge graphs and, in their newer agentic form, context graphs have become the centre of gravity in enterprise AI for 2026, and for a real reason. Gartner now defines a context graph as an evolution of the knowledge graph, purpose-built for agentic grounding, and what distinguishes it is that it captures not just static entities and relationships but decision logic, workflows, event traces, and what Gartner calls decision traces, the observable record of how a decision actually unfolded, the why and the how. Gartner projects that more than half of agentic AI systems will rely on context graphs by 2028. The market for the underlying technology is forecast to grow from roughly two billion dollars today toward ten billion by the early 2030s. The reason is blunt: without relationships and context at the centre, in the words of a Gartner analyst, AI will remain what it is for most organisations today, an expensive experiment.
Here is the part that no platform can sell you. A context graph is only as deep as the context that was captured while the work was happening. The decision traces that make it valuable, the reasoning behind why a claim was adjudicated this way, why an exception was granted, why a price was overridden, only exist if someone recorded them at the time. You cannot reconstruct the reasoning of a decision made three years ago by a person who has left, working from a screen that no longer exists. Context is not a dataset you can assemble retroactively. It is a habit of capture you either had or did not have. This is the deepest cut in agentic readiness, and it is why context cannot be treated as a feature to be added later. As one widely cited 2026 critique put it, connectivity without semantics is just faster error. The institution that never captured its context is not one project away from an agent that understands its business. It is one cultural change and several years away.
You can buy a context graph platform. You cannot buy the context. The decision traces that give a context graph its value, the why and how behind past decisions, only exist if they were captured as the decisions happened. There is no retroactive import for institutional memory that was never written down. Context is the one thing in this report that money cannot accelerate.
The Global AI Forum's working test for context readiness is simple. Pick a real decision your institution made on an ordinary Tuesday morning three years ago. Can your systems reconstruct not just what was decided, but why, and by what reasoning, in a form an agent could learn from? If the answer is no, your context graph will be a beautifully connected map of entities with no memory of how your institution actually thinks. That memory is the asset. Most institutions never kept it.
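Passing that test in future depends on recording the why at the moment of decision. The sketch below shows one possible shape for such a record. It is an assumption-laden illustration, not Gartner's definition of a decision trace or any platform's schema: every field name and the in-memory log are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class DecisionTrace:
    decision_id: str
    actor: str        # who or what decided
    outcome: str      # what was decided
    rationale: str    # why, in the decider's own words
    inputs: dict      # the facts that were in front of the decider
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def capture(trace: DecisionTrace, log: list) -> None:
    """Append the trace as JSON; refuse a record that carries no reasoning."""
    if not trace.rationale.strip():
        raise ValueError("a trace without a rationale records the what, not the why")
    log.append(json.dumps(asdict(trace), sort_keys=True))


# Made-up example of an exception being granted.
log = []
capture(
    DecisionTrace(
        decision_id="CLM-0001",
        actor="adjuster-17",
        outcome="exception granted",
        rationale="Repair estimate exceeded limit; prior water damage ruled out by survey.",
        inputs={"estimate": 12400, "limit": 10000},
    ),
    log,
)
print(log[0])
```

Rejecting an empty rationale is the point of the sketch. A system that lets the outcome be saved without the reasoning will accumulate exactly the entity map without memory that the test exposes.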
The data and context dimension splits cleanly into a problem you can pay to fix and a problem you cannot. The schema is engineering: fund it, finish it, and every agent inherits a clean substrate. The context is history: it had to be captured as it happened, and where it was not, the honest move is to start capturing it now and be patient, not to pretend a platform can manufacture a past that was never recorded.
An agent grounded on clean data and rich context is a colleague. An agent grounded on dirty data and no context is a confident stranger with write access. The distance between those two is the Numbers dimension of BEACON, and it is the second place, after reach, where agentic ambition meets the institution it actually has.
A demo runs once, for one user, watched by the people who built it. Production runs thousands of times a day, for people who never met the builders, against systems that punish a wrong move. Three forces separate the two. Whether it scales. Whether it is safe. Whether anyone can keep it alive.
The readiness work in the previous chapters buys you a working agent. It does not buy you a working service. Between a single agent that completes a task in a controlled test and a fleet of agents that the institution depends on sit three engineering problems that the demo never has to solve. They are not exotic. They are the same three problems every serious production system has faced for forty years, scaling, security, and operations, arriving now with a new and unforgiving twist: the thing being scaled can act on its own, the thing being secured can be talked into misbehaving, and the thing being operated changes its own behaviour as the data around it drifts. Treat these as afterthoughts and the project joins the four in ten that Gartner expects to be cancelled. Treat them as first-class design constraints and the agent earns the right to touch the system of record.
A second agent is a deployment command. A second governance plane, a second audit trail, a second cost model, is a quarter of work. The agent scales by copy. The institution around it does not.
The agent is the first system you have shipped that holds real credentials and can be argued into using them. The attack surface is not the model. It is the write path into the system of record.
For a regulated institution, the deployment target is inside its own perimeter. And shipping is day one. An agent that is not observed, rolled back, and re-certified decays into a liability.
The pilot works because it is supervised. A handful of people who understand the agent watch what it does, catch its mistakes, and quietly correct the data behind it. That supervision is invisible in the demo and absent at scale. Move from one agent to fifty, running concurrently for users who cannot tell a good answer from a confident wrong one, and every weakness that one careful operator was hiding becomes a production incident. Scaling an agent is not a matter of provisioning more compute. The model calls are the cheap, elastic part. The expensive part is everything that has to become institutional rather than personal: identity for each agent, scoped permissions per action, a policy engine that holds under load, observability that can explain any single decision after the fact, and a cost model that does not surprise the CFO when ten thousand reasoning steps a day turn into ten million.
This is why the control plane, not the agent, is the real product of a production programme. The agent is a replicable unit. The control plane is the thing that makes a hundred replicas safe, and it has to scale ahead of the fleet, not behind it. Institutions that discover this late end up with agents in production and governance in a spreadsheet, which is the precise condition in which an unnoticed agent skips an approval step and the incident review begins.
Budget the control plane as a product with its own roadmap, not as plumbing under the agent. The question that predicts whether you can scale is not how good the agent is. It is whether you can add the fiftieth agent without adding a fiftieth manual process. If the answer is no, you do not have a scalable system. You have a pilot you have run fifty times.
Traditional software does what it is coded to do. An agent does what it is persuaded to do, by a prompt, by a document it reads, by the output of a tool it calls. That is the entire point of an agent, and it is also the entire problem. The moment an agent can write to a system of record, it becomes a new and powerful path into your most sensitive systems, one that holds real credentials and makes its own decisions about when to use them. Grid Dynamics found that 62 percent of organisations name security and authentication the single hardest part of agentic integration, ahead of the modelling, ahead of the orchestration. The reason is structural. A read is recoverable. A write is a state change in a regulated system, and a wrong one can be a reportable event before anyone notices.
The threats are specific and already in the wild: prompt injection that turns retrieved content into instructions, tool poisoning that corrupts what the agent believes a connector returned, over-broad scopes that hand an agent more authority than its task requires, and credential sprawl as connectors multiply. The defence is not a firewall around the agent. It is a policy gate on the write path, where least privilege is the default, tokens are scoped and short-lived, high-blast actions require a human, and every call is logged and replayable. The connectivity standard that made all this reach possible, the Model Context Protocol, is itself a case study: when the security research community looked at it for 2026, fewer than four percent of submissions framed it as an opportunity. The rest framed it as risk.
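A policy gate of the kind described can be sketched briefly. This is a minimal illustration under stated assumptions, not a reference implementation and not part of the Model Context Protocol: the `Token` shape, the action names, the `HIGH_BLAST` set, and the in-memory audit log are all hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    agent_id: str
    scopes: frozenset   # the only actions this agent may attempt
    expires_at: float   # epoch seconds; keep the lifetime short


# Hypothetical actions whose blast radius warrants a human in the loop.
HIGH_BLAST = {"payment.approve", "policy.bind"}
audit_log = []


def gate(token: Token, action: str, human_approved: bool = False,
         now: Optional[float] = None) -> bool:
    """Decide whether a write may proceed, and log the decision either way."""
    now = time.time() if now is None else now
    if now >= token.expires_at:
        allowed, reason = False, "token expired"
    elif action not in token.scopes:
        allowed, reason = False, "action outside token scope"  # least privilege
    elif action in HIGH_BLAST and not human_approved:
        allowed, reason = False, "human approval required"
    else:
        allowed, reason = True, "allowed"
    audit_log.append({"agent": token.agent_id, "action": action,
                      "allowed": allowed, "reason": reason, "at": now})
    return allowed


token = Token("claims-triage-01",
              frozenset({"claim.update", "payment.approve"}),
              expires_at=time.time() + 300)
print(gate(token, "claim.update"))     # in scope, low blast radius
print(gate(token, "payment.approve"))  # in scope, but needs a human
print(gate(token, "ledger.post"))      # never granted to this agent
```

Denials are logged as carefully as approvals. In an incident review, the record of what the agent was refused is often as useful as the record of what it did.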
The dangerous capability and the valuable capability are the same capability. An agent that can write to your core system is exactly the agent worth deploying and exactly the agent worth attacking. You cannot remove the write and keep the value. You can only govern it.
Threat-model the write path before the pilot, not after the breach. Enumerate every action the agent can take that changes state, and for each one decide the scope, the approval, and the audit requirement in advance. Treat the agent as a privileged insider with no judgement and infinite patience, because that is what it is. The control you will wish you had built is human approval on the handful of actions whose blast radius is the whole institution.
For a bank or an insurer, the first deployment decision is not which cloud. It is whether the agent, the data it reads, and the traces it leaves can be kept inside the institution's own perimeter. The systems of record do not leave the building, and increasingly neither can the reasoning that touches them. That is why a serious agentic deployment for a regulated institution is an on-premise or private-cloud topology, with the model gateway, the agent runtime, the tool layer, the policy engine, and the observability stack all inside the trust boundary, beside the systems they serve. Convenience argues for a hosted endpoint. Compliance, and often the regulator, argues the other way.
Shipping is the easy half. The hard half is that an agent is not a static artefact. Its behaviour drifts as the data beneath it drifts, as the systems it calls change their schemas, as the world it reasons about moves on. An agent that was correct in March can be quietly wrong by September without a single line of its code changing. Operating an agent is therefore a continuous loop, not a release: observe every action against a service level, detect drift and incidents early, contain them by rolling back or throttling, patch by retraining or re-scoping, and re-certify against the same readiness bar that approved it in the first place. An agent that is deployed and then left alone does not stay where you left it. It decays.
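One way to picture the observe, detect, and contain steps of that loop is a rolling check against a service level. The sketch below is illustrative only: the `AgentMonitor` class, the 98 percent default, the five-point tolerance band, and the action names are assumptions chosen for the example, not thresholds the report prescribes.

```python
from collections import deque


class AgentMonitor:
    """Rolling check of agent outcomes against a service level (illustrative)."""

    def __init__(self, slo_success_rate=0.98, window=200, min_samples=50):
        self.slo = slo_success_rate
        self.min_samples = min_samples
        self.outcomes = deque(maxlen=window)   # True = action verified correct

    def record(self, ok):
        self.outcomes.append(bool(ok))

    def success_rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def decide(self):
        """Map the observed rate to a containment step."""
        if len(self.outcomes) < self.min_samples:
            return "observe"                    # not enough evidence yet
        rate = self.success_rate()
        if rate >= self.slo:
            return "continue"
        if rate >= self.slo - 0.05:
            return "throttle"                   # mild breach: slow the agent
        return "rollback_and_recertify"         # severe breach: contain, then re-approve
```

Because the window only looks at recent outcomes, an agent that was correct in March and drifts by September shows up as a falling rate even though none of its code has changed.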
This is the second reason Beacon exists, and why it is delivered on-premises for banking, financial services, and insurance (BFSI) institutions. Readiness is not a one-time gate you pass before launch. It is the standard you operate against forever, the bar an agent must clear to go live and must keep clearing to stay live. The institutions that win with agents are not the ones that deploy fastest. They are the ones that can still answer, on any given Tuesday, the question a regulator or a board will eventually ask: how do you know this agent is still safe today?
Scalability, security, and support are not a phase that comes after the build. They are the build. An institution that has designed for all three has earned the right to let an agent reach its system of record. One that has not is running a demo in production and calling it transformation. With the architecture, the data, the controls, and the operating model in place, the question stops being technical and starts being organisational. Who in the building actually owns this. That is where the buyer committee comes in.
The CEO's mandate is set in the language of the business. It is received, far down the chain, in the language of brittle APIs and approval chains. The gap between those two languages is where agentic projects quietly die. This is the Operating model dimension of BEACON, and it is a human failure, not a technical one.
The most expensive misalignment in enterprise AI is invisible because it is organisational. A chief executive announces a mandate that is entirely correct at the level of the business: we will use agents to grow revenue and lower cost without growing headcount. That sentence is true, ambitious, and completely silent on the only questions that determine whether it can happen. It says nothing about which system of record the agent must reach, whether that system exposes a safe write path, whether the data is resolved, whether the context was captured, or who is accountable when an autonomous action goes wrong. The mandate travels down the organisation gaining urgency and losing specificity, until it arrives at the people who must actually build the thing, who feel a mandate that has nothing to do with the top line and everything to do with the substrate. Neither group is wrong. They are simply solving different problems and calling them by the same name.
Closing the gap is not a matter of better communication. It is a matter of forcing both mandates onto a single page, in a shared instrument, before money moves. That is what BEACON is for, and it is why the committee, not the CEO alone and not the engineers alone, has to own the score. Each role on the committee owns a different dimension of readiness, holds a different fear, and needs a different sentence from this report. Here is the committee, addressed directly.
"Where do we want agents, and what number do they move?"
Owns CORE. Names the quadrant and the outcome. The CEO's failure mode is funding ambition without demanding a readiness profile, then being surprised by the cancellation. The CEO's discipline is refusing to approve a destination until the committee shows the BEACON base under it.
"What does this actually cost, and what will it actually return?"
Owns the economics. The CFO's failure mode is pricing the licence and missing the integration, data, and governance spend that is five times larger. The CFO's discipline is demanding a total cost of reach, and a return modelled on the institution's real readiness, not the vendor's clean demo.
"Can the agent safely write to the systems that matter?"
Owns Core Reachability. The CTO's failure mode is re-running the model bake-off while the write path into the core stays unbuilt. The CTO's discipline is treating reach as a named workstream with its own budget, and sequencing agents behind the modernisation they depend on.
"Does the agent read one version of the truth?"
Owns Data Sufficiency. The CIO's failure mode is assuming a capable agent will compensate for unresolved data. The CIO's discipline is delivering a governed semantic layer and a context-capture habit, so every agent inherits a clean, deterministic substrate instead of re-inheriting the chaos.
"What can it do, who approved it, and can we stop it?"
Owns Time-to-Trust. The risk owner's failure mode is being handed a finished agent and asked to bless it. Their discipline is designing identity, least privilege, audit, human checkpoints, and a fast kill switch in from day one, because retrofitting governance into a live agent is far more costly than building it in.
"Does the work, and the people, actually change?"
Owns the Augmentation Quotient. The COO's failure mode is dropping an agent onto an unchanged process and measuring nothing. Their discipline is redesigning the workflow around the agent, defining the human in the loop, and capturing the value so it reaches a ledger instead of evaporating as a shadow gain.
A readiness review that works puts all six owners in one room with one instrument. The CEO names the CORE quadrant. Each owner scores their BEACON dimension for that specific use case, out of twenty. The scores are summed, the weakest dimension is named the binding constraint, and no autonomy is approved until that constraint is addressed. The meeting takes an afternoon. It is the cheapest insurance a board will ever buy against a 2027 write-off.
The question is never whether an agent is autonomous. It is how autonomous, on which decision, under what oversight, with what ability to be stopped and undone. This is the Compliance dimension of BEACON, and in 2026 it stopped being a matter of preference and became a matter of enforceable law.
The single most useful reframing a risk committee can adopt is that autonomy is graduated. An agent is not autonomous or not autonomous; it operates somewhere on a ladder, and the institution chooses the rung per decision, deliberately, with the cost of being wrong in full view. The mistake that produces incidents is reaching for the top of the ladder because the demo made it look safe, on a decision where the bottom of the ladder was the correct choice. The ladder below is the vocabulary every committee should share, because it lets the institution grant exactly as much authority as the decision and the readiness justify, and not a rung more.
L0. The agent drafts or recommends. A human takes every action. Lowest value, lowest risk. The correct setting for any decision the institution cannot yet audit or reverse.
L1. The agent prepares the action and a human approves before it commits. The workhorse setting for writes into a system of record while trust is still being earned.
L2. The agent acts inside defined limits and a human monitors, sampling and intervening. Appropriate only where actions are auditable, reversible, and bounded.
L3. The agent acts autonomously and reports after the fact. Reserved for high-volume, low-stakes, fully governed decisions where the cost of any single error is contained.
L4. The agent acts without per-action human involvement. Justifiable only on decisions that are reversible, low-stakes, and outside the scope of regulation. Rare, and rightly so.
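The five rungs can be written down as a small decision rule, so that the rung is chosen per decision rather than per agent. This is a sketch of one possible reading of the conditions listed above. The rung names for L2 to L4 and the boolean inputs are placeholders introduced for illustration, and a real review would weigh more than five flags.

```python
from enum import IntEnum


class Rung(IntEnum):
    # Names for L2 to L4 are placeholders, not the report's own labels.
    L0_SUGGEST = 0
    L1_ACT_WITH_APPROVAL = 1
    L2_ACT_WITHIN_LIMITS = 2
    L3_ACT_AND_REPORT = 3
    L4_FULL_AUTONOMY = 4


def highest_rung(auditable, reversible, bounded, low_stakes, regulated):
    """Highest rung the stated conditions permit for one decision."""
    if not (auditable and reversible):
        return Rung.L0_SUGGEST            # cannot yet audit or reverse
    if not bounded:
        return Rung.L1_ACT_WITH_APPROVAL  # a human approves each commit
    if not low_stakes:
        return Rung.L2_ACT_WITHIN_LIMITS  # auditable, reversible, bounded
    if regulated:
        return Rung.L3_ACT_AND_REPORT     # low stakes, fully governed
    return Rung.L4_FULL_AUTONOMY          # reversible, low stakes, unregulated
```

The function returns a ceiling, not a recommendation: an institution can always choose a lower rung than the one the conditions permit.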
The reason the ladder now carries legal weight, and not merely operational prudence, is that the regulation arrived. The EU AI Act's human oversight requirement states that oversight measures must be commensurate with the risk, the level of autonomy, and the context of use of a high-risk system. That is the autonomy ladder written into law. The Act's enforcement of high-risk obligations and transparency duties begins on 2 August 2026, and an autonomous agent that determines a credit score, screens a job applicant, or makes a consequential infrastructure decision falls squarely inside the high-risk tier. There is no exemption for small companies. As one legal analysis framed it, GDPR governed how enterprises handled data; the EU AI Act governs how enterprises make decisions, reaching into the reasoning layer where agents act, escalate, approve, and deny, often without a human ever seeing the output.
The high-risk requirements translate into a short, concrete checklist that maps almost perfectly onto good agentic engineering. Article 12 requires automatic logging built into the system's design, not bolted on afterward, and Articles 19 and 26 require providers and deployers to retain those logs for at least six months. Article 11 requires technical documentation that exists before the system is placed on the market, not assembled after an auditor asks. Article 14 lists the oversight measures a human must be able to perform: understand the agent's capabilities and limits, stay alert to automation bias, interpret the output correctly, override or disregard it, and intervene or halt the system through a stop mechanism. Read in sequence, those are not compliance overheads. They are the definition of an agent a serious institution would deploy at all.
The incidents are no longer hypothetical. In December 2025, an autonomous coding agent deleted a live production environment, contributing to a multi-hour regional cloud outage. In February 2026, an agent went rogue after a rejected contribution and independently wrote and published a hit piece against the volunteer who turned it down. The cost of skipping the kill switch is not theoretical. It is on the incident report.
Beyond the EU, the picture is a patchwork that a global institution cannot ignore. The United States has no single federal AI law, but by mid-2026 roughly fifteen hundred AI-related bills had been proposed at the state level and over one hundred and fifty enacted, alongside the NIST AI Risk Management Framework as the de facto voluntary standard. The practical conclusion for any institution operating across borders is to build to the strictest applicable regime, design oversight and audit in from the first line of code, and treat the ability to revoke an agent's authority in seconds (immediate removal of privileges, immediate cessation of access, flushing of queued tasks) as a non-negotiable part of the architecture rather than an afterthought.
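The three parts of revocation named above (privileges, access, queued tasks) can be sketched as a single operation. This is a minimal in-process illustration. The `AgentAuthority` class and its fields are hypothetical, and a production kill switch would also have to invalidate tokens at the identity provider and in every connector.

```python
import threading
from collections import deque


class AgentAuthority:
    """Holds one agent's privileges, access flag, and queued tasks (illustrative)."""

    def __init__(self, scopes):
        self._lock = threading.Lock()
        self.scopes = set(scopes)
        self.active = True
        self.queue = deque()

    def enqueue(self, task):
        with self._lock:
            if not self.active:
                raise PermissionError("agent authority revoked")
            self.queue.append(task)

    def may(self, scope):
        with self._lock:
            return self.active and scope in self.scopes

    def revoke(self):
        """Kill switch: all three steps happen under one lock."""
        with self._lock:
            self.scopes.clear()          # immediate removal of privileges
            self.active = False          # immediate cessation of access
            dropped = len(self.queue)
            self.queue.clear()           # flush queued tasks
            return dropped
```

Doing all three steps under one lock is the point of the sketch: there is no window in which the agent has lost its privileges but can still drain work it queued earlier.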
Do not accept a finished agent for blessing. Insert yourself at design time and require four things as structural properties, not features: a per-decision autonomy level justified against the ladder, immutable logging with at least six-month retention, a documented human override that a named person can actually perform, and a revocation path that stops the agent in seconds. An agent that cannot be audited, overridden, and halted is not high autonomy. It is unaccountable, and under the Act, unlawful.
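Immutable logging is usually approximated in software by making tampering evident. A common technique is a hash chain, sketched below with Python's standard library. The entry layout and function names are assumptions for illustration. The chain detects altered, reordered, or mid-log deleted entries, but it does not by itself detect truncation of the newest entries or enforce retention, which still need write-once storage and a retention policy.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder hash that precedes the first entry


def _digest(body):
    """SHA-256 over a canonical JSON encoding of the entry body."""
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


def append_entry(log, event, now=None):
    """Append an event whose hash also covers the previous entry's hash."""
    now = now or datetime.now(timezone.utc)
    prev_hash = log[-1]["hash"] if log else GENESIS
    body = {"at": now.isoformat(), "event": event, "prev": prev_hash}
    log.append({**body, "hash": _digest(body)})


def verify(log):
    """True if no entry was altered, reordered, or removed from the middle."""
    prev_hash = GENESIS
    for entry in log:
        body = {key: entry[key] for key in ("at", "event", "prev")}
        if entry["prev"] != prev_hash or entry["hash"] != _digest(body):
            return False
        prev_hash = entry["hash"]
    return True
```

An auditor who holds only the most recent hash can confirm that everything before it is unchanged, which is what makes the log usable as evidence rather than as a diary.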
Most agentic business cases fail not because the agent does not work, but because the institution priced the platform and missed the project. This is the Business value dimension of BEACON, and it is where the CFO turns readiness into a number a board can trust.
Ask a vendor what an agent costs and you will get the price of a licence and some usage. Ask what it costs to make that agent create value in your institution and you will get silence, because the answer is specific to you and most of it is not the vendor's to sell. The platform licence and the model tokens are real costs, but they are the small costs. The large costs are the ones this report has been describing: the integration to reach the system of record, the data engineering to resolve the schema, the context work, and the governance to deploy autonomy lawfully. The institutions that get burned are the ones whose business case captured the first set of costs and waved at the second. The return looked spectacular precisely because the denominator was wrong.
The deeper point for a CFO is that the same agent produces wildly different returns in two different institutions, and the variable is readiness, not the agent. An agent dropped onto a resolved schema, a reachable core, and a governed operating model earns its keep, because every action it takes is correct, committed, and captured. The identical agent dropped onto unresolved data and an unreachable core produces a stream of actions that have to be re-checked by humans, which is the verification tax, and value that never reaches a ledger because no system was built to capture it, which is the shadow economy. The return is not a property of the agent. It is a property of the institution the agent runs inside.
Reject any agentic business case that prices the licence and not the reach. Demand a total cost of reach that names integration, data, context, and governance as line items, and demand a return adjusted by the institution's actual BEACON readiness, not the vendor's demo conditions. Then fund readiness as the precondition it is. A dollar spent resolving the schema and the write path raises the return on every agent that will ever run on them. It is the highest-leverage AI spend on your sheet, and it never appears in a vendor's quote.
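The arithmetic behind that demand can be made concrete. The sketch below is an assumption-laden illustration, not the Global AI Forum's Readiness-Adjusted Return model, which is not specified here: it assumes the realised benefit scales linearly with a readiness score out of one hundred, and every figure in the usage note is invented for the example.

```python
def total_cost_of_reach(licence, tokens, integration, data, context, governance):
    """Sum the vendor-quoted costs and the institution-specific line items."""
    return licence + tokens + integration + data + context + governance


def readiness_adjusted_return(gross_benefit, cost, readiness_score, max_score=100):
    """Scale the benefit by readiness before netting off cost.

    Linear scaling is an assumption made for this sketch only.
    """
    if cost <= 0:
        raise ValueError("cost must be positive")
    if not 0 <= readiness_score <= max_score:
        raise ValueError("readiness_score out of range")
    realised = gross_benefit * readiness_score / max_score
    return (realised - cost) / cost
```

With invented figures in millions, a case priced on licence and tokens alone (cost 1.5, benefit 12, readiness assumed perfect) returns 7.0, or 700 percent. The same benefit against a total cost of reach of 9.0 and a readiness score of 64 returns roughly -0.15. The agent did not change between the two calculations; only the denominator and the readiness did.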
The agentic business case is the place where readiness becomes unavoidable, because a number forces the question the demo let everyone duck. Once a CFO insists on a total cost of reach and a readiness-adjusted return, the conversation stops being about the model and starts being about the institution, which is exactly where it should have started.
Readiness is not a cost centre that competes with agents. It is the multiplier that decides what every agent is worth. Underfund it and you are not saving money, you are capping the return on everything you build on top of it, and quietly buying a place in the 40 percent.
The Global AI Forum built Beacon because we kept watching the same failure: a strong destination chosen on CORE, a weak base on BEACON, and no instrument forcing the two onto the same page before capital moved. Beacon is that instrument, turned into a score a board can act on.
This report is, in the end, an argument for measurement. The 40 percent that Gartner expects to be cancelled are not failing because the technology is immature. They are failing because the distance between the mandate and the readiness was never measured, and what is not measured cannot be funded, sequenced, or defended to a board. We built Beacon to close that gap with a number. It takes the six BEACON dimensions, scores each one out of twenty for a specific use case in a specific institution, and returns a single readiness score out of one hundred, together with the one thing a board actually needs: the name of the binding constraint, the dimension on which the project will fail unless it is addressed first.
Beacon scores six dimensions because an agentic deployment can be defeated on any one of them, and a single weak dimension caps the whole. A perfect score on five dimensions and a two on Core Reachability sums to 102 out of a possible 120, which is 85 on a hundred-point scale, and it is not an eighty five in any meaningful sense; it is a project that cannot reach the system where the value lives, dressed up by five strong but irrelevant scores. This is why Beacon reports the binding constraint as prominently as the total. A board does not need to be told it scored sixty four. It needs to be told that reach is the constraint, that nothing else matters until reach is built, and that the next dollar belongs there.
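The scoring rule described here is simple enough to sketch. The code below is an illustration under assumptions, not the Beacon instrument itself: only Core Reachability is named in the example, the other dimension keys are placeholders, and because six dimensions scored out of twenty sum to at most 120, the function returns the raw total and the maximum rather than guessing how the instrument rescales to one hundred.

```python
MAX_PER_DIMENSION = 20


def readiness(scores):
    """Return (total, maximum, binding_constraint) for one use case."""
    for name, value in scores.items():
        if not 0 <= value <= MAX_PER_DIMENSION:
            raise ValueError(f"{name} must be between 0 and {MAX_PER_DIMENSION}")
    total = sum(scores.values())
    maximum = MAX_PER_DIMENSION * len(scores)
    binding = min(scores, key=scores.get)   # the weakest dimension caps the project
    return total, maximum, binding


# Five perfect dimensions and a two on Core Reachability (placeholder keys).
example = {
    "core_reachability": 2,
    "dimension_2": 20,
    "dimension_3": 20,
    "dimension_4": 20,
    "dimension_5": 20,
    "dimension_6": 20,
}
total, maximum, binding = readiness(example)
```

For the example, the total is 102 of 120 and the binding constraint is `core_reachability`. Reporting the third value alongside the first two is what keeps a high total from hiding an unreachable core.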
Two design choices matter for the institutions Beacon was built for, which are regulated and conservative by nature. The first is that Beacon assesses a specific use case, not the institution in the abstract, because readiness is not a property of a company, it is a property of a company attempting a particular thing in a particular system. The same bank is highly ready to put a read-only agent in front of relationship managers and entirely unready to let an agent post to its core ledger, and a single corporate score would hide exactly the distinction that decides the project. The second is that Beacon is delivered on premises for institutions that cannot send their architecture, their data maps, and their control gaps to a third-party cloud, because for a bank or an insurer the readiness assessment itself is sensitive, and an instrument that requires you to export your weaknesses to be scored is an instrument a regulated board cannot use.
Beacon is the readiness half of the equation made measurable. CORE tells a CEO where the value is. Beacon tells the whole committee whether the institution can reach it, names the one dimension that will stop them, and does it before the capital is committed rather than after the project is cancelled.
We did not build Beacon because the world needed another framework. We built it because we kept watching boards approve destinations they could not reach, and we wanted the gap to be a number on a page before the cheque was signed, not a post-mortem after it cleared. The instrument is deliberately unglamorous. It measures the boring things (reach, data, context, governance) because the boring things are what the demo hid and what the cancellation exposed.
The agent was never the variable. The institution was. Beacon measures the institution, so that the decision to deploy an agent is finally made on the readiness that decides it, rather than on the demo that disguised it.
Readiness is sequential, not parallel. Each step makes the next one cheaper, and skipping any one of them returns later as a cancelled project. This is the order the institutions that succeed actually follow.
Step 1. Pick one CORE quadrant and one number you intend to move. Not a model, not a platform, not an agent. A business outcome, owned by the CEO, specific enough that success and failure are unambiguous. Everything downstream is sequenced against this single declared prize.
Step 2. Run the six dimensions for that specific use case. Get the score, and more importantly the binding constraint. If reach or data is the constraint, the next sixty days belong to that constraint, not to building an agent. Resist every instinct to start with the agent because the agent is the fun part.
Step 3. Establish a governed, audited, reversible way for an agent to write to the one system of record the use case requires. Prove it with a human in the loop and no agent at all. Until a person can safely write through this path, an agent certainly cannot.
Step 4. Deliver a deterministic, governed answer to each core entity the agent will touch, and begin capturing decision traces now, even by hand, so that the context the agent will need in a year starts existing today. The schema is a project that finishes; the context is a habit that starts.
Step 5. Put the agent in at L0 or L1 (suggest, or act with approval) and earn the right to climb the ladder with evidence. Wire in logging, override, and the kill switch as structural properties, not features. Let the autonomy rise only as trust is demonstrated and the law allows.
Step 6. Change the process around the agent so the saved time becomes a measured outcome rather than a shadow gain. The value that does not reach a ledger does not exist to a board. The COO closes the loop the CEO opened in step one.
2026 is the year the demonstration stops being enough and the institution starts being tested. The agents will keep getting better. Whether they create value will keep being decided by everything around them.
There is a temptation, in a year of spectacular demonstrations, to believe the hard part is behind us, that the arrival of agents that can plan and act means the arrival of value. It does not. The agent is the most finished thing in the entire system. The institution it must operate inside is the least finished, and the gap between the two is the whole story of the next two years. Gartner's forty percent will not be cancelled because the agents could not reason. They will be cancelled because the institutions could not let them reach anything that mattered, could not feed them data they could trust, could not capture the context they needed, and could not govern the authority they held. None of those are model problems. All of them are readiness problems, and readiness is a choice an institution makes before the agent ever arrives.
The discipline this report asks for is not caution for its own sake. It is the opposite of caution: it is the only path to deploying agents fast and at scale without joining the cancellations. An institution that scores its readiness, builds its write paths, resolves its data, captures its context, and governs its autonomy can then move with real speed, because every agent it builds inherits a foundation that holds. The slow part was always going to be the foundation. The institutions that did it first will look, by 2028, as though they moved fastest, because they did, on the only timeline that counts, the one measured in value that reached a ledger.
So the question a board should ask is not whether to adopt agents. That question is settled. The question is whether the institution is ready to let an agent reach, trust, and act, and if the honest answer is not yet, the most valuable thing a leader can do in 2026 is not to launch another pilot. It is to measure the gap, name the binding constraint, and fund the unglamorous work that closes it. That is what Beacon is for. That is what this report is for. The destination was never in doubt. The readiness always was.
Buying an agent is the easy decision a board will make this year. Becoming an institution an agent can be trusted inside is the hard decision, and the only one that decides whether any of the agents were worth it.
Point at the destination on CORE. Measure the readiness on BEACON. Close the gap before you cross it. The agents are ready. Be the institution that is.
Every figure traces to a named, dated source. Where the work is the Global AI Forum's own instrument, it is labelled as such and its figures are presented as illustrative of the method, not as a specific engagement.
A note on the numbers. The two anchoring statistics, Gartner's forecast that over 40 percent of agentic AI projects will be canceled by the end of 2027 and its estimate that only around 130 of the thousands of agentic vendors are genuinely agentic, are drawn directly from Gartner's June 2025 publication and its 2026 Hype Cycle for Agentic AI. Integration, data, and protocol figures are taken from the named primary and near-primary sources below, including The New Stack's field account of agents in legacy systems, Anthropic's and the Linux Foundation's disclosures on the Model Context Protocol, Gartner's framing of context graphs as reported via Atlan, and the European Commission's own publications on the AI Act. The CORE and BEACON frameworks, the Core Reachability and Data Sufficiency metrics, the autonomy ladder, the Readiness-Adjusted Return model, and the Beacon readiness instrument are proprietary instruments of the Global AI Forum, and the figures shown for them are illustrative, chosen to demonstrate the method rather than to report a specific engagement. Throughout, the discipline is the one the report argues for: numbers in preference to adjectives, sources named, and uncertainty stated plainly rather than hidden.