How a CEO, CFO, CTO, CIO, or compliance lead actually chooses a foundation model. Every major model on the board. Seven parameters. One decision you can defend in a boardroom.
The most expensive mistake in enterprise AI is not picking the wrong vendor. It is believing one model should win every job. In 2026 the frontier is a portfolio, not a podium. The leaders trade places by the week. A model that writes the cleanest board memo may be the wrong one to run a ten-hour migration. A model that aces a reasoning benchmark may quietly leak data your regulator will ask about.
This guide does one thing well. It hands the buyer a vocabulary. Seven parameters that decide everything. A full atlas of the models that matter. Five lenses, one for each seat at the table. And a way to walk into the room with a choice you can explain in plain words and still defend under audit.
Pick by elimination, not reputation. Strike every model that fails a non-negotiable first: data residency, latency ceiling, cost cap. Then match what survives to the intelligence the task actually needs. Most teams run three to five models at once and route each job to the one that leads it.
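The elimination-first rule can be sketched in a few lines. This is a minimal illustration, not a procurement tool: the `Candidate` fields, the three placeholder models, and every number in them are hypothetical, and `task_score` stands in for whatever evaluation you run on your own workload.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    self_hostable: bool            # can it run inside your own infrastructure?
    first_token_seconds: float     # latency you measured (hypothetical here)
    usd_per_million_output: float  # list price per million output tokens
    task_score: float              # your own evaluation on the target task, 0 to 1


def shortlist(candidates, *, needs_self_host, max_first_token_seconds,
              max_usd_per_million_output):
    """Strike every model that fails a non-negotiable, then rank the survivors."""
    survivors = [
        c for c in candidates
        if (c.self_hostable or not needs_self_host)
        and c.first_token_seconds <= max_first_token_seconds
        and c.usd_per_million_output <= max_usd_per_million_output
    ]
    # Only now does capability enter: best task score first.
    return sorted(survivors, key=lambda c: c.task_score, reverse=True)


# Placeholder candidates with made-up figures.
CANDIDATES = [
    Candidate("model-a", False, 0.8, 30.0, 0.92),
    Candidate("model-b", True, 1.5, 1.10, 0.81),
    Candidate("model-c", True, 0.4, 0.28, 0.70),
]

picks = shortlist(
    CANDIDATES,
    needs_self_host=True,
    max_first_token_seconds=1.0,
    max_usd_per_million_output=5.0,
)
print([c.name for c in picks])  # ['model-c']
```

Note that the highest-scoring model is struck first here, because it fails the residency requirement. That is the point of eliminating before ranking.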
Forget the leaderboards for a moment. Every model choice in the enterprise comes down to seven dials. Learn these and you can read any spec sheet, cut through any sales deck, and ask the one question the vendor hoped you would not.
Capability is not one number. A model can sit at the frontier for coding and mid-pack for long multimodal reasoning. The honest way to read it is per task type: agentic coding, deep reasoning, writing and tone, multimodal, multilingual. The benchmarks that matter in 2026 are SWE-Bench Pro and Terminal-Bench for coding agents, GPQA Diamond and FrontierMath for reasoning, and human-preference arenas for writing.
Price is quoted per million input and output tokens, and the spread is enormous: from roughly ten cents per million on a value model to ten dollars on a top tier, a hundred-fold gap. But the sticker lies. Frontier models fan a single prompt into dozens of internal calls, so "one" request can bill like fifty. Output tokens cost several times input, roughly two to eight times across the models in this guide. Long-context requests often jump to a higher rate above a threshold.
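A rough worked example makes the gap between sticker price and bill concrete. The function below is a simplified sketch: it assumes every internal call carries the same token counts and that a long-context surcharge is a flat multiplier, neither of which a real provider guarantees. The token counts and the surcharge parameters are hypothetical.

```python
def request_cost_usd(input_tokens, output_tokens, price_in, price_out, *,
                     internal_calls=1, long_context_threshold=None,
                     long_context_multiplier=1.0):
    """Estimate the bill for one user-visible request.

    price_in and price_out are USD per million tokens.
    internal_calls models a prompt fanning out into several billed calls
    (simplification: each call is assumed to use the same token counts).
    """
    rate = 1.0
    if long_context_threshold is not None and input_tokens > long_context_threshold:
        rate = long_context_multiplier  # hypothetical flat surcharge
    per_call = (input_tokens * price_in + output_tokens * price_out) / 1_000_000
    return per_call * rate * internal_calls


# 20,000 tokens in, 2,000 tokens out, at a $5 / $30 price point.
single = request_cost_usd(20_000, 2_000, 5.0, 30.0)
fanned = request_cost_usd(20_000, 2_000, 5.0, 30.0, internal_calls=50)
value = request_cost_usd(20_000, 2_000, 0.14, 0.28)

print(f"one call:         ${single:.2f}")
print(f"fifty calls:      ${fanned:.2f}")
print(f"value-tier call:  ${value:.4f}")
```

The same prompt costs about sixteen cents as a single call and about eight dollars once it fans out fifty ways, which is why budgeting on the per-token price alone understates agentic workloads.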
The context window is how much text a model considers in a single request, measured in tokens, each roughly three-quarters of a word. In 2026 a million tokens is table stakes: most frontier models hold around 1M, some reach far higher. That lets you drop an entire codebase, a full contract set, or a research corpus into one pass. But watch the gap between the advertised window and the effective window. Providers differ sharply in how well a model actually uses the back half of a long prompt.
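The three-quarters-of-a-word rule gives a quick sizing check before anyone pastes a contract set into a prompt. This is a back-of-envelope sketch: real tokenizers vary by model and language, and `usable_fraction` is a hypothetical knob for the effective window that you would have to measure yourself.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; real tokenizers vary


def estimate_tokens(word_count):
    """Convert a word count to an approximate token count."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def fits(word_count, context_window_tokens, *, reserved_output_tokens=0,
         usable_fraction=1.0):
    """Check whether a document fits, leaving room for the model's answer.

    usable_fraction discounts the advertised window to the share you trust
    the model to use well (an assumption, not a published figure).
    """
    budget = int(context_window_tokens * usable_fraction)
    return estimate_tokens(word_count) + reserved_output_tokens <= budget


# A 600,000-word corpus is about 800,000 tokens.
print(estimate_tokens(600_000))
print(fits(600_000, 1_000_000, reserved_output_tokens=50_000))
print(fits(600_000, 1_000_000, usable_fraction=0.5))
```

The corpus fits the advertised 1M window, but not if you only trust the first half of it. In that case the job needs chunking or retrieval rather than a single pass.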
The first fork in the road is proprietary versus open weight. Proprietary models reach peak capability through an API, with no infrastructure to run, but your data leaves the building and you depend on the provider's roadmap and uptime. Open-weight models can be downloaded and run on your own hardware, giving full control, privacy, and zero per-token cost, in exchange for the burden of running them. For regulated work in healthcare, government, and financial services, self-hosting is now a legitimate path, not a capability sacrifice.
For speed, two numbers matter: time to first token, which is how responsive it feels, and output throughput, which is how fast it finishes. A customer-facing assistant lives or dies on the first; a nightly batch job on the second. The trick is that the smartest model is rarely the fastest. Reasoning models that "think" before answering pay for depth with delay. For high-volume, latency-sensitive work, a smaller distilled model often wins on experience even though it loses on paper.
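The two numbers combine into a simple estimate of how long a user waits for a complete answer. The figures below are hypothetical, chosen only to show how a short chat reply is dominated by time to first token while a long batch output is dominated by throughput.

```python
def response_seconds(time_to_first_token, output_tokens, tokens_per_second):
    """Total wait: delay before the first token, plus time to stream the rest."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return time_to_first_token + output_tokens / tokens_per_second


# Hypothetical profiles: (time to first token in seconds, tokens per second).
REASONING_MODEL = (8.0, 150.0)
DISTILLED_MODEL = (0.4, 200.0)

for label, tokens in [("chat reply", 300), ("batch report", 20_000)]:
    slow = response_seconds(REASONING_MODEL[0], tokens, REASONING_MODEL[1])
    fast = response_seconds(DISTILLED_MODEL[0], tokens, DISTILLED_MODEL[1])
    print(f"{label}: reasoning {slow:.1f}s, distilled {fast:.1f}s")
```

For the 300-token reply, most of the reasoning model's wait is the pause before it says anything. For the 20,000-token report, that pause is a rounding error and throughput decides the total.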
The defining shift of 2026 is from assistant to agent. The question is no longer "can it write the answer" but "can it run the whole job": call tools, hit your systems, chain hundreds of steps, and keep working through ambiguity without a human touching each one. The leaders now sustain multi-hour autonomous runs and hundreds of tool calls in a single chain. Persistent memory is the new differentiator: models that use notes, logs, and stored context across a task that spans days.
The least glamorous parameter quietly decides the most. Does the model retain your prompts, and for how long? The most capable tiers are starting to require 30-day data retention with no zero-retention option, even for enterprises that previously negotiated one. Do safety classifiers reroute or refuse some requests, and how often? Can you fine-tune, or are you frozen on the provider's roadmap? Per-token API pricing is a form of lock-in; open weights are insurance against a provider changing prices or deprecating a model you depend on.
The board, laid out. Eight families, five tiers, and the open-weight challengers rewriting the price floor. Figures are mid-2026 and move fast; treat them as a map, not a contract.
Tiers: Mythos-class, Frontier, Value / fast, Open weight.
Highlighted cells mark the current leader in that column. No single model owns the table. That is the whole point.
SWE-Bench hands a model real bug reports from actual open-source software and checks whether its fix makes the project's own test suite pass. A score of 80% means it correctly resolved 80 of 100 real engineering tickets, the closest thing to "can it do a junior engineer's day job." The catch: scores swing hard with the scaffold around the model. The same model scores differently inside a purpose-built coding harness than in a raw setup, so read coding numbers as directional, never absolute.
GPQA stands for "Graduate-level Google-Proof Q&A": PhD-level science questions written so you cannot simply search the answer. It measures genuine multi-step reasoning, not recall or memorized facts. The catch: the frontier now clusters in the mid-90s, which means the test is nearly saturated. When every leader scores 94 to 95, the benchmark has stopped telling you who is actually better. Treat a near-perfect GPQA as table stakes, not a tiebreaker.
Other names you will see: Terminal-Bench (can it operate a real command line), FrontierMath (the hardest unsolved math, still far from saturated), GDPval (economically valuable knowledge work), and human-preference arenas for writing and tone. No single score tells the whole story, which is exactly why the table below has more than one column.
| Model | Maker | Released | Tier | Coding (SWE-Bench Pro) | Reasoning (GPQA) | Context | Price in / out | Best at |
|---|---|---|---|---|---|---|---|---|
| GPT-5.6 Sol | OpenAI | Jun 26 2026 | Frontier | leader* | ~95% | 1.5M | $5 / $30 | Agentic coding, cyber, biology |
| GPT-5.6 Terra | OpenAI | Jun 26 2026 | Frontier | strong | high-90s | 1.5M | $2.50 / $15 | GPT-5.5 class at half the cost |
| GPT-5.6 Luna | OpenAI | Jun 26 2026 | Value | good | high | 1.5M | $1 / $6 | Fast, cheap, high-volume |
| Claude Fable 5 | Anthropic | Jun 9 2026 | Mythos | ~80% | ~94% | 1M | $10 / $50 | Long-horizon agents, hardest work |
| Claude Opus 4.8 | Anthropic | May 28 2026 | Frontier | ~69% | ~94% | 1M | $5 / $25 | Coding, high-stakes writing |
| Claude Sonnet 4.6 | Anthropic | Mar 2026 | Value | ~58% | high-80s | 1M | $3 / $15 | Near-Opus quality at value price |
| GPT-5.5 | OpenAI | Apr 23 2026 | Frontier | ~59% | ~95% | 1M | $5 / $30 | All-round knowledge work, research |
| Gemini 3.1 Pro | Google | early 2026 | Frontier | ~54% | ~94% | 1M+ | $2 / $12 | Multimodal, long context, value |
| Gemini 3.5 Flash | Google | 2026 | Value | good | high | 1M | $1.50 / $9 | Best price-per-intelligence |
| Grok 4.3 | xAI | Apr 17 2026 | Frontier | ~55% | competitive | 2M | $2 / $15 | Live data, real-time web/X search |
| DeepSeek V4-Pro | DeepSeek | Apr 24 2026 | Open | ~58% | strong | 1M | $0.27 / $1.10 | Frontier-ish quality, lowest cost |
| DeepSeek V4-Flash | DeepSeek | Apr 24 2026 | Open | good | solid | 1M | $0.14 / $0.28 | Cheapest 1M-context model |
| Llama 4 | Meta | 2025 | Open | good | solid | 10M* | self-host | Self-host, data never leaves |
| GLM-5.2 | Z.AI | Jun 16 2026 | Open | ~58% | ~91% | 200K | self-host | Open-weight reasoning leader |
| Qwen 3.7 Max | Alibaba | 2026 | Open | strong | high | 256K | $1.25 / $3.75 | Cheapest top-10 reasoner, math |
| Kimi K2.7 | Moonshot | Jun 12 2026 | Open | ~59% | solid | 256K | self-host | Long tool-call chains, agents |
| MiniMax M3 | MiniMax | 2026 | Open | ~59% | solid | 1M | $0.60 | Cheapest self-host frontier coder |
| Mistral Large 3 | Mistral | Dec 2025 | Open | good | solid | 256K | $0.50 / $1.50 | EU sovereign, Apache 2.0, on-prem |
| Command A+ | Cohere | May 20 2026 | Open | fair | strong | 256K | self-host | Enterprise RAG & search, citations |
| Amazon Nova 2 Pro | Amazon | 2026 | Value | fair | solid | 300K | low | Native to AWS Bedrock, video |
| Sarvam 105B (Indus) | Sarvam AI | Feb 2026 | Open | fair | solid | 128K | self-host | 22 Indian languages, sovereign |
*GPT-5.6 Sol leads Terminal-Bench 2.1 (command-line agentic coding) at 91.9% in Ultra mode, edging Claude Mythos 5; its SWE-Bench Pro figure was not broken out at preview. Sol, Terra, and Luna launched June 26 2026 under a US-government-coordinated limited preview, broad availability expected within weeks. Llama 4 Scout advertises up to 10M tokens. Coding figures use SWE-Bench Pro where available; scaffolding changes scores materially, so read them as directional. Pricing, benchmarks, and dates verified late June 2026 and change frequently. Confirm against provider docs before production.
A foundation model is bought by a committee that does not share a vocabulary. Here is what each seat is really asking, and the model traits that answer it.
The fastest way to a defensible choice is to start from the workflow and work backwards. A few common enterprise jobs and where the strength sits today.
| Function / workflow | What it needs most | Lead choices today | Value alternative |
|---|---|---|---|
| Software engineering & migrations | Agentic coding, long runs | GPT-5.6 Sol, Claude Fable 5 | DeepSeek V4-Pro, Sonnet 4.6 |
| Financial modeling & analysis | Step-by-step reasoning | GPT-5.6 Sol, Opus 4.8 | Gemini 3.1 Pro |
| Legal redlines & contract review | Long context, careful tone | Claude Opus 4.8, Fable 5 | Gemini 3.1 Pro (1M) |
| Customer support at scale | Low latency, low cost | Gemini 3.5 Flash, Haiku 4.5 | DeepSeek V4-Flash |
| Market & competitive research | Multi-step, live data | GPT-5.6, Grok 4.3 | Gemini 3.1 Pro + search |
| Board materials & long-form writing | Prose rhythm, subtext | Claude Opus 4.8 | GPT-5.5, Sonnet 4.6 |
| Document-heavy / multimodal ops | Vision, video, audio | Gemini 3.1 Pro | Gemini 3.5 Flash |
| High-volume first drafts | Cheap, fast, good-enough | DeepSeek V4-Flash | GPT-5.4 mini, Haiku 4.5 |
| Regulated / air-gapped workloads | Data never leaves | Llama 4, Qwen 3.5 (self-host) | GLM-5.2, DeepSeek (self-host) |
| Indian-language & sovereign service | 22 languages, local infra | Sarvam 105B (Indus) | Krutrim, BharatGen Param 2 |
A text model is one purchase. The enterprise that stops there misses three more whole markets: image, video, and voice. Each is a real buying decision with its own leaders, prices, and compliance traps, and these are the models behind the products your teams are already signing up for. Figures are verified mid-2026, and this corner moves faster than any other in this guide.
Data residency: the strongest video models from Kling, Hailuo, and Seedance process your prompts and assets on servers in China. Fine for personal creative work, a problem for client work under NDA or sensitive brand content. IP indemnification: most paid plans grant commercial rights, but only Adobe Firefly will legally cover you if an output is claimed to infringe. For brand-facing work, that distinction decides the vendor.
Most guides present a binary: rent a proprietary API, or self-host an open model. There is a third path that matters most to regulated industries, and it is new. Train a frontier-grade model on your own data. Mistral's Forge platform supports the full training lifecycle, pre-training, post-training, and reinforcement learning, on a company's internal datasets, going well beyond fine-tuning. An insurer can train a model from scratch on its own claims and contracts. Early adopters include ASML, Ericsson, and the European Space Agency. Cohere's Model Vault deploys inside your own private cloud so sensitive data never leaves the network.
Customization is a ladder with four rungs. Prompt the model and you change nothing in the model itself. RAG feeds it your documents at query time. Fine-tuning nudges a small slice of the weights toward your domain. Full custom training (Forge-style) builds the model around your data from the ground up. Cost and control rise at every rung. Most enterprises never need the top rung, but the ones with proprietary data and a hard residency mandate increasingly do, and it is no longer science fiction to reach for it.
Not a verdict, a starting shortlist you can take into the room. It eliminates on your non-negotiables first, exactly as you should.
Mythos is a class, not a model. In April 2026 Anthropic introduced a tier that sits above its Opus line, with capabilities it judged too strong to put in everyone's hands at once. The first member, Claude Mythos Preview, went out to roughly fifty vetted cyber-defenders and infrastructure providers through a program called Project Glasswing, run in collaboration with the US government. It was never offered to the public.
Then on June 9, the tier reached everyone, through a clever split. Mythos and Fable are the same underlying model. The difference is the wrapper. Mythos 5 is the raw model with safeguards lifted in some areas, still reserved for Glasswing partners and trusted defenders. Fable 5 is that same model made safe for general use: safety classifiers watch for high-risk requests in cybersecurity, biology, chemistry, and model distillation, and quietly reroute those to the safer Opus 4.8. Anthropic expects that to touch under 5% of sessions.
The name tells the strategy. A myth is the dangerous original. A fable is the version with a moral attached, the one you can hand to anyone. For a buyer, the practical fact is that Anthropic's lineup now spans five tiers, and that creates a real routing decision rather than a single default.
Mythos-class traffic carries a mandatory 30-day data retention policy, with no zero-retention option, even for enterprises that previously negotiated one. Anthropic says the data is not used for training, only to catch novel jailbreaks and reduce false positives. If you hold a zero-retention agreement for regulatory reasons, it does not apply to Fable or Mythos. Factor it into your data review before routing anything sensitive through the top tier.
There is one more wrinkle a buyer should track. Access to Mythos-class models has become entangled with export policy. The same US authority that gates advanced chips has issued directives touching this tier, and access has been adjusted in response to export-control rules. The lesson is not the specific directive. It is that the most powerful models now sit close enough to national security that geopolitics, not just price, can decide what you are allowed to run.
On June 26, 2026, OpenAI launched its next frontier family, GPT-5.6 Sol, Terra, and Luna, and shipped it the same way: a limited preview to roughly 20 government-cleared partners, while US agencies run a security review of up to 30 days under a June executive order. Sol tops the agentic-coding benchmarks, edging Claude Mythos 5, with a 1.5M-token context window. The takeaway for a buyer is structural, not about one lab: the very top tier from both leaders now ships through a government access gate first. EU, UK, India, and APAC teams could not touch GPT-5.6 on normal tiers at launch. Plan procurement around the broadly available tier (Terra, Opus, Fable) and treat frontier access as a roadmap item, not a given.
If Mythos and Fable show where the frontier is heading, four forces will decide who reaches it, and from where. The next models will not just be smarter. They will be sovereign, specialized, and shaped as much by policy as by research.
Mythos opened a class above Opus. Expect every major lab to ship a "too powerful for default release" tier, gated by safeguards, trusted-access programs, and government partnership. The frontier becomes a velvet rope, and capability is metered by trust earned, not money paid.
Open weights now deliver near-frontier quality at a fraction of the cost. A task that bills fifteen dollars on a flagship can cost cents on an open model. This does not just save money. It changes what is economically worth automating, which pulls AI into industries that could never justify the price before.
The next wave is vertical. Smaller models tuned for one domain, one language, one regulatory context, beating giant generalists on their home turf. Procurement stops asking "which model is best" and starts asking "which model is best at my Tuesday-morning job."
Nations and regions are building their own models so intelligence does not arrive only through someone else's API. Europe's answer starts with Mistral: Apache 2.0 open weights plus a Forge platform to train private models, backed by 13,800 GPUs near Paris. Alongside it, Cohere's merger with Germany's Aleph Alpha has produced a transatlantic sovereign stack. The contest is shifting from raw IQ to ecosystem control: who owns the infrastructure, who sets the default rails, whose chips you depend on. Geopolitics is now a model parameter.
The frontier is no longer one zip code in California. In February 2026 over a hundred nations signed the Bangkok Declaration committing to AI sovereignty. The reasons repeat everywhere: a global model does not natively understand local dialects, legal frameworks, or culture; sending citizen data to a foreign API raises law and security questions; and intelligence is now infrastructure, like electricity, that nations do not want to rent forever. Here is the world a buyer actually chooses from.
Still the capability peak: the Mythos and Opus tiers, GPT-5.5, Gemini 3.1. The strategy is closed, API-first, and roadmap-driven. Canada plays a quieter "Switzerland of AI" role: talent-rich and neutral, home to Cohere's enterprise-RAG stack.
The most crowded open-weight ecosystem on earth, and it reset the global price floor. DeepSeek crossed 80% on coding benchmarks with downloadable weights; GLM, Qwen, and Kimi trade the open-source lead month to month. Strong on Asian languages and cultural nuance. The catch for buyers: hosted APIs route through servers in China, so regulated work means self-hosting.
The regulation-first bloc, shaped by the EU AI Act. France's Mistral is the clear champion: Apache 2.0 open weights, a Forge platform to train private models, and 13,800 GPUs going live near Paris. Cohere's merger with Germany's Aleph Alpha created a transatlantic sovereign stack; Switzerland proved a fully in-house national model is feasible.
The Gulf is buying its way to the frontier, treating compute as the new oil. Abu Dhabi's Falcon scales to 180B with permissive licensing; Jais delivers strong bilingual Arabic-English with dialect switching; Saudi Arabia's ALLaM powers the national HUMAIN Chat assistant with deep Arabic cultural nuance.
India is not waiting for the frontier to be handed down. Sarvam open-sourced a 30B and a 105B model on government compute under the IndiaAI Mission; the flagship ships as the Indus chatbot, fluent in 22 Indian languages, and Sarvam is a founding member of NVIDIA's Nemotron coalition. The edge is not raw scale but frugal architecture plus Digital Public Infrastructure that reaches a billion people through rails that already exist.
Backed by a Korean government sovereign-AI fund, SK Telecom's A.X processes Korean a third more efficiently than Western models; Upstage's Solar Pro packs frontier performance into a compact 31B. Japan focuses on Japanese-language fluency through models like Fujitsu's Takane and lightweight on-prem options.
Launched February 2026 by Chile's CENIA with 60+ institutions across 15 countries, Latam-GPT is the region's first collaborative model, around 50B parameters trained on 8TB of Spanish, Portuguese, and regional data: Buenos Aires court rulings, Colombian textbooks, Peruvian library records. Built for $550K, it is a public good for citizen services and education, with indigenous languages planned.
Inclusion-first and built for constraint. Africa's InkubaLM is a compact 0.4B model spanning Hausa, Swahili, isiXhosa, isiZulu, and Yoruba; Kenya's UlizaLlama delivers Swahili health services. Singapore's SEA-LION covers Southeast Asian languages. These prove a model does not need to be huge to matter where no giant ever bothered to look.
For broad capability today, the North American frontier still leads and is available now. But for legal analysis in a regional language, government service automation, healthcare in vernacular tongues, or any workload that cannot sit on foreign cloud, a sovereign or regional model is no longer a compromise. The smartest 2026 architecture blends them: a global frontier model for the hardest reasoning, a regional model for language and residency, routed by the job.
The teams that win are not the ones with the single smartest model. They are the ones who route each job to the model that leads it, and cap spend with a gateway. Build that muscle first.
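A routing gateway with a spend cap can be very small at its core. The sketch below is one possible shape, not a reference design: the `Gateway` class, the monthly cap, and the downgrade-to-value-alternative policy are all assumptions made for illustration. The model names in `ROUTES` are taken from this guide's workflow table.

```python
from dataclasses import dataclass

# Job type -> (lead choice, value alternative), following the workflow table.
ROUTES = {
    "software_engineering": ("GPT-5.6 Sol", "DeepSeek V4-Pro"),
    "customer_support": ("Gemini 3.5 Flash", "DeepSeek V4-Flash"),
    "board_materials": ("Claude Opus 4.8", "Sonnet 4.6"),
}


@dataclass
class Gateway:
    routes: dict
    monthly_cap_usd: float
    spent_usd: float = 0.0

    def pick(self, job_type, estimated_cost_usd):
        """Return the lead model, or the value alternative if the cap would break."""
        if job_type not in self.routes:
            raise KeyError(f"no route configured for {job_type!r}")
        lead, value_alternative = self.routes[job_type]
        if self.spent_usd + estimated_cost_usd > self.monthly_cap_usd:
            # Hypothetical policy: downgrade rather than refuse the job.
            return value_alternative
        return lead

    def record(self, actual_cost_usd):
        """Add a completed request's cost to the running total."""
        self.spent_usd += actual_cost_usd


gateway = Gateway(routes=ROUTES, monthly_cap_usd=100.0)
print(gateway.pick("software_engineering", 5.0))  # GPT-5.6 Sol
gateway.record(98.0)
print(gateway.pick("software_engineering", 5.0))  # DeepSeek V4-Pro
```

A production gateway would also need per-team budgets, retries, and audit logs. The design point is that the routing table and the cap live in one place you control, so swapping a model is a configuration change rather than a rewrite.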
Expect more "above-the-flagship" classes gated by trusted access, and more value models that erase the quality gap on routine work. The middle gets crowded; the top gets exclusive.
National and regional models reach enterprise-grade for language and residency workloads. Procurement checklists start asking where the model was trained and whose hardware it ran on.
Export controls, retention mandates, and security gating decide access to the very top as much as capability or price. The compliance lead becomes the most important seat in the room.
Do not buy the smartest model.
Buy the one you can route, afford, and defend, and keep the freedom to change your mind.