March 2026 will probably get its own chapter in whatever book eventually gets written about this era. In the space of roughly two weeks, OpenAI, Anthropic, and Google all released new flagship models. Not incremental updates — genuine step changes in what these systems can do.
If you've been following AI casually, the headlines probably blurred together. So here's the plain-English version of what actually happened, what's different, and what it means for anyone using these tools at work.
GPT-5.4: AI That Can Use Your Computer
OpenAI dropped GPT-5.4 on 5 March. The headline feature is computer use — not in the "early preview with caveats" sense that's been floating around since late 2024, but as a native, built-in capability of the flagship model itself.
What this means practically: GPT-5.4 can now control a computer the way a person would. It can open applications, fill in forms, navigate websites, run workflows across multiple tools — not by describing what it might do, but by actually doing it. You can hand it a task like "pull last month's invoices from the supplier portal and add them to this spreadsheet" and leave it to run.
The model also supports up to one million tokens of context (roughly seven or eight average novels' worth of text), which is what lets it plan and execute long, multi-step tasks without losing track of where it is. OpenAI describes it as the most token-efficient reasoning model it has built, which translates to faster responses and lower costs for developers building on it.
Available in ChatGPT and via API. The standard version is priced at $2.50/$15 per million tokens. GPT-5.4 Pro, aimed at complex agentic tasks, runs at $30/$180 per million tokens.
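To make those rates concrete, here's a back-of-envelope cost calculator using the per-million-token prices quoted above. The model names and rates are taken from this article, not from a live pricing page, so treat the figures as illustrative rather than authoritative billing numbers:

```python
# Per-million-token rates as quoted in this article: (input $, output $).
# These are illustrative; check the provider's pricing page before budgeting.
RATES = {
    "gpt-5.4": (2.50, 15.00),
    "gpt-5.4-pro": (30.00, 180.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollar cost of one request at the quoted rates."""
    in_rate, out_rate = RATES[model]
    return (input_tokens / 1_000_000) * in_rate \
         + (output_tokens / 1_000_000) * out_rate

# A long agentic task: 400k tokens of context in, 20k tokens of output.
print(round(estimate_cost("gpt-5.4", 400_000, 20_000), 2))      # 1.3
print(round(estimate_cost("gpt-5.4-pro", 400_000, 20_000), 2))  # 15.6
```

The gap matters for agentic work specifically: long computer-use sessions consume far more input tokens than a chat exchange, so the Pro tier's 12x input premium compounds quickly.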
Claude 4.6: The Best Coder, By a Margin
Anthropic followed with Claude Opus 4.6, and the benchmark getting the most attention is software engineering. On the SWE-bench evaluation, which tests how well AI can fix real bugs in real codebases, Claude Opus 4.6 scores 80.8%. GPT-5.4 scores 77.2%. For anyone building software with AI assistance, that gap is significant.
The other headline is the context window: Claude Opus 4.6 ships with a one-million-token context window in beta. That's not just about length; it's about the kind of work it enables. You can feed it an entire codebase, a full financial report, or a year's worth of documents and have Claude reason across all of it at once.
Claude Sonnet 4.6 is the new free default on Claude.ai and is priced at $3/$15 per million tokens via API. Opus 4.6 is available to Pro users and via API.
Gemini 3.1: The All-Rounder Wins on Price
Google's Gemini 3.1 Pro is, according to the Artificial Analysis Intelligence Index as of this writing, the strongest general-purpose AI model available. It ties GPT-5.4 Pro on the top benchmark, but at roughly a third of the cost. For anyone deciding which model to build into their products, that cost difference is hard to ignore.
The Deep Think variant of Gemini 3.1 is also notable: it scores 77.1% on ARC-AGI-2, one of the harder reasoning evaluations designed specifically to resist pattern-matching. Google's approach to multimodal tasks — handling text, images, audio, and video in the same context — continues to be the most capable in the market.
Available via Google AI Studio, Vertex AI, and built into Google Workspace. Deep Think and Flash-Lite variants available for heavy reasoning and lightweight tasks respectively.
What's Actually Different This Time
Every major model release gets described as a step change. Most of them aren't. This one mostly is — and the reason is that the improvements aren't just on abstract benchmarks. They show up in the kinds of tasks these models can reliably handle.
Six months ago, asking an AI to complete a multi-step workflow that involved opening applications, navigating websites, and writing output to a document was an experiment — interesting to watch, but not something you'd trust with anything that mattered. That has changed. Computer use capabilities, now built natively into GPT-5.4 and maturing across the other major platforms, mean the gap between "AI gave me a good answer" and "AI did the work" has closed considerably.
In early 2025, the main thing AI was good for was generating text. By late 2025, it could reliably reason through complex problems. In March 2026, it can take actions in the world — control software, run workflows, execute tasks autonomously. Each step is qualitatively different from the last.
What This Means for How You Work
The practical upshot depends on what you're doing. If you use AI mainly for writing, research, and summarisation, the day-to-day difference is marginal — all three models were already excellent at those tasks, and they're now slightly better. That's nice, but not transformative.
The bigger shift is for anyone who uses AI as part of a workflow that involves multiple steps or multiple tools. Computer use and agentic capabilities — the ability to execute tasks, not just describe them — are now production-ready in a way they weren't three months ago. If you've been holding off on trusting AI with anything that involves actual doing rather than just answering, March 2026 is a reasonable time to reconsider.
The other practical note: the gap between the top models has narrowed. A year ago, there was a meaningful difference in quality between the flagship model and everything else. Now, the differences between GPT-5.4, Claude 4.6, and Gemini 3.1 on most real-world tasks are small enough that cost, integration, and personal preference are reasonable tiebreakers. That's a healthier place for the market to be.
The Honest Caveat
Computer use and autonomous agents are genuinely impressive. They are also genuinely unreliable for anything that matters — not most of the time, but some of the time, in ways that are hard to predict. The models hallucinate. They misread context. They occasionally do the right thing in the wrong order.
The useful frame is: these tools are good enough to handle tasks where a mistake is recoverable. They're not yet good enough to be trusted on tasks where a mistake is costly and you won't notice until later. That boundary is moving — but it hasn't vanished. Build your workflows accordingly.
See the full AI Landscape
A plain-English breakdown of every major AI tool — what it does, who it's for, and how the pricing works.