The frontier is a tax

Sun, 21 Jun 2026 13:49:33 +1000

TL;DR - The most powerful AI models from Anthropic (Claude) and OpenAI (GPT-5.5) are no longer clearly ahead. An open-weight challenger, Z.AI’s GLM 5.2, now matches the priciest flagships, beats Anthropic’s mainstream workhorse outright, and costs a fraction as much. At the same time the closed labs are quietly ending the all-you-can-eat deal on their heaviest models and moving to pay-per-use as they chase profit. If you pay for the top tier out of habit, now is the moment to check whether you still need to.

A quick orientation for anyone who does not live in this stuff. “Frontier” models are the biggest, most capable AI systems, the ones the headlines are about. Anthropic’s Claude Opus. OpenAI’s GPT-5.5. They are closed: you rent access, you cannot see inside them, and you cannot run them yourself.

“Open-weight” models are the opposite. The company releases the actual model so anyone can download it, run it on their own hardware or host it cheaply through a dozen competing providers. Z.AI’s GLM 5.2 is one of these.

For two years the deal with the frontier was simple. They were plainly the best, so you paid for the best, usually through an all-you-can-eat subscription. That deal is ending from both directions at once.

On capability, the open-weight challengers have caught up. GLM 5.2 is the example I will walk through. On price, the closed labs are heading the other way. When Anthropic briefly released Fable 5, it was bundled into subscriptions for a limited window and then moved to pay-per-token, billed by usage through the API or by credits. The smorgasbord is closing, at least on the heaviest models, because the companies serving them need to turn a profit and those models are expensive to run.

Which leaves an obvious question for anyone still paying frontier prices. What is the premium actually buying? Start with the numbers.

Claude Opus 4.8 costs 5.2 times more than GLM 5.2 to do the same work. For that premium you get about ten percent more measured intelligence (56 against 51) and, if you believe the humans, worse output.

The frontier is starting to look less like a destination and more like a tollgate.

Here is the whole argument in one table.

Measure	GLM 5.2 (open-weight)	Claude Opus 4.8 (closed)	Winner
Intelligence benchmark (Artificial Analysis)	51	56	Opus, just
Coding benchmark (Artificial Analysis)	69	74	Opus, just
Human preference, websites (Design Arena score)	1357	1282	GLM
Human preference, UI components (Design Arena score)	1354	1282	GLM
Price, blended per million tokens	$1.93	$10.00	GLM, 5.2x cheaper

Artificial Analysis: higher is better. Design Arena is a human-preference score, where higher means people picked it more often. Blended price weights input and output three to one, OpenRouter’s convention. Figures from the OpenRouter compare page, June 2026.

I was reading the OpenRouter comparison page for the two models, the way other people read the property listings for houses they will never buy. Z.AI’s GLM 5.2 against Anthropic’s Claude Opus 4.8. Same context window, near enough. Same tool use, same caching. The numbers underneath tell two completely different stories depending on who is doing the scoring.

The benchmarks still favour the frontier

On Artificial Analysis, the automated scores everyone quotes, Opus 4.8 wins. Intelligence 56 to 51. Coding 74 to 69. Agentic 47 to 43. Three axes, three wins.

But look at the size of the wins. GLM lands at 91 to 93 percent of Opus on every axis. This is not a gulf. It is a sliver. The kind of gap that vanishes the moment a model has a good day or a bad prompt.
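The 91 to 93 percent range is just the scores above divided through. A minimal sketch of that arithmetic, using only the Artificial Analysis figures quoted in this post (the dictionary layout and names are mine, purely for illustration):

```python
# Artificial Analysis scores as quoted in this post (June 2026).
scores = {
    "intelligence": {"glm": 51, "opus": 56},
    "coding": {"glm": 69, "opus": 74},
    "agentic": {"glm": 43, "opus": 47},
}

# GLM's score expressed as a percentage of Opus's score on each axis.
ratios = {axis: 100 * s["glm"] / s["opus"] for axis, s in scores.items()}

for axis, pct in ratios.items():
    print(f"{axis}: GLM is at {pct:.1f}% of Opus")
```

That prints 91.1, 93.2 and 91.5 percent for intelligence, coding and agentic respectively.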

If the benchmarks were the whole story you would shrug, pay the premium for the top of the table and move on. They are not the whole story.

The humans disagree

There is a second scoreboard on the same page, and it points the other way.

Design Arena rates models by human preference. Real people are shown two anonymous outputs and pick the better one, and after thousands of those head-to-head votes each model settles on a single score: the more often it gets chosen, the higher it sits. On that board GLM 5.2 beats Opus 4.8 in every single category it has been scored in.

Websites 1357 to 1282. UI components 1354 to 1282. 3D, game development, code, data visualisation, all GLM, usually by 70 to 80 points. When humans look at the actual thing the model built, they prefer the one that costs a fifth as much.
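How big is a 75-point gap in practice? I do not know exactly how Design Arena computes its scores, so treat this as a rough translation only: the sketch below assumes the scores behave like standard Elo ratings on the usual 400-point logistic scale. That is my assumption, not something the leaderboard figures in this post confirm.

```python
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected share of head-to-head votes for A under the standard Elo formula.

    Assumption: the preference scores are Elo-like on a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Scores as quoted in this post: GLM 5.2 first, Opus 4.8 second.
websites = elo_win_probability(1357, 1282)
ui = elo_win_probability(1354, 1282)

print(f"websites: {websites:.1%}, UI components: {ui:.1%}")
```

If that assumption holds, GLM would be picked in roughly six head-to-head votes out of ten in both categories.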

So we have two scoreboards in open contradiction. The machines say Opus is smarter. The humans say GLM builds the better website. Both are measuring something real. The benchmarks measure reasoning in the abstract. The humans measure whether they like what came out the other end.

For anyone shipping front-end work, the second number is the one that pays the bills.

Against the workhorse it simply wins

Opus is the headline because it sits at the top of the range. But almost nobody runs Opus all day. The Anthropic model doing the daily work is Sonnet 4.6, the cheaper workhorse that most people and most companies actually point at their real tasks. On OpenRouter it burned through 8 trillion tokens last month, nearly twice Opus’s volume. This is the model that matters.

Line GLM 5.2 up against Sonnet 4.6 and the careful “matches on quality” line falls away. GLM wins outright. Intelligence 51 to 47. Coding 69 to 63. Agentic 43 to 41. Three measures, three wins, on Anthropic’s home ground of automated scoring.

The humans agree this time too. GLM tops Sonnet on websites, UI, 3D, game development and code in the preference test, and ties it on data visualisation. No contradiction here: both scoreboards point the same way.

And it is still cheaper. Sonnet 4.6 lists at $3 in and $15 out, a blended $6 per million tokens. GLM 5.2 blends to $1.93, about a third of the price.

So against the flagship the open-weight model matches and undercuts. Against the workhorse it wins on every scoreboard and undercuts as well. That second comparison is the one that should keep Anthropic up at night. Losing a price war on your flagship is survivable. Losing the scores, the human vote and the price on the model that carries your daily volume is a different kind of problem.

What you are actually paying for

Here is the price, the part the scoreboards politely leave out. GLM 5.2 runs at $1.20 in and $4.10 out per million tokens. Opus 4.8 runs at $5 and $25. Blend that at a normal three-to-one input mix and Opus costs 5.2 times more for the same job.
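For anyone who wants to check the blend, here is a small sketch of the three-to-one weighting applied to the list prices quoted in this post. The function name and its parameters are mine; the only inputs are the per-million prices above.

```python
def blended_price(input_per_m: float, output_per_m: float, input_weight: int = 3) -> float:
    """Weighted average $ per million tokens, with input:output = input_weight:1."""
    return (input_weight * input_per_m + output_per_m) / (input_weight + 1)


glm = blended_price(1.20, 4.10)     # GLM 5.2
sonnet = blended_price(3.00, 15.00)  # Claude Sonnet 4.6
opus = blended_price(5.00, 25.00)    # Claude Opus 4.8

print(f"GLM 5.2:    ${glm:.3f} per million")
print(f"Sonnet 4.6: ${sonnet:.2f} per million")
print(f"Opus 4.8:   ${opus:.2f} per million")
print(f"Opus / GLM: {opus / glm:.1f}x")
```

The GLM blend comes out at $1.925, which rounds to the $1.93 shown in the tables. A workload that is heavier on output than three-to-one would shift every blended figure upwards, so it is worth rerunning with your own mix.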

And this is not only Anthropic’s problem. Line the open-weight challenger up against both closed flagships and divide capability by price, the same rough heuristic I used a fortnight ago when the frontier labs started quietly repricing.

Model	Type	Blended $/M	Intelligence	Capability per $
GLM 5.2	open-weight	$1.93	51	26.5
Claude Sonnet 4.6	closed workhorse	$6.00	47	7.8
Claude Opus 4.8	closed flagship	$10.00	56	5.6
GPT-5.5	closed flagship	$11.25	55	4.9

Capability per dollar is the Artificial Analysis intelligence score divided by blended price. A heuristic, not a benchmark, but it makes the trade visible. June 2026 figures.

The open-weight model returns three to five times the intelligence per dollar of every closed model here, the workhorse included. GPT-5.5, the newest and dearest, comes last. That is the squeeze the closed labs are walking into: their best models cost the most to serve, and the thing chasing them is free to download and run.
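The capability-per-dollar column is simple enough to reproduce. A minimal sketch using the figures from the table above; the one wrinkle is that the 26.5 for GLM comes from the unrounded blend of $1.925 rather than the displayed $1.93. Variable names are mine.

```python
# (blended $ per million tokens, Artificial Analysis intelligence score)
models = {
    "GLM 5.2": (1.925, 51),  # unrounded 3:1 blend of $1.20 in / $4.10 out
    "Claude Sonnet 4.6": (6.00, 47),
    "Claude Opus 4.8": (10.00, 56),
    "GPT-5.5": (11.25, 55),
}

# Heuristic only: intelligence score divided by blended price.
per_dollar = {name: score / price for name, (price, score) in models.items()}

glm = per_dollar["GLM 5.2"]
for name, value in per_dollar.items():
    print(f"{name}: {value:.1f} per $ (GLM advantage: {glm / value:.1f}x)")
```

On these numbers GLM's advantage runs from about 3.4 times against Sonnet to about 5.4 times against GPT-5.5.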

The premium used to buy a real capability gap. A frontier model could do things the cheaper ones simply could not. That was worth paying for. What does the premium buy now? A rounding error on the benchmarks and a loss on human preference.

I put GLM 5.2 through my own agent battery this week, the fixed set of routing, tool-call and extraction tasks I use to vet a model before it goes anywhere near my automation. For the work that actually matters in an agent pipeline, picking the right sub-task, emitting clean structured output, pulling an exact figure out of a long document, it was indistinguishable from models costing several times as much. The routing decisions were identical. The expensive option did not decide anything better. It just decided it slower and charged more for the privilege.

I am not the only one noticing. Nate Herk dropped GLM 5.2 into Claude Code, Anthropic’s own coding tool, and called the result “GLM 5.2 in Claude Code is Blowing My Mind”. An open-weight model, running inside the closed lab’s flagship workflow, holding its own.

The adoption numbers show the lag. Opus 4.8 is still pulling four times the token volume of GLM on OpenRouter. That is not a verdict on quality. That is inertia, procurement, and the very human assumption that the dearest option must be the safest one.

The one place the premium still earns its keep is the genuinely hard reasoning chain, the one job in ten where a single correct answer beats five cheap near-misses. For that, pay up. For the other nine, you are paying for the postcode, not the work.
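To put a number on the one-in-ten split, here is a rough sketch of what a mixed workload costs per million tokens. It leans on a simplifying assumption of mine: every job burns the same number of tokens at the same three-to-one mix, which real workloads will not. The prices are the blended figures used in this post.

```python
OPUS_BLENDED = 10.00  # $ per million tokens, Opus 4.8
GLM_BLENDED = 1.925   # $ per million tokens, GLM 5.2 (displayed as $1.93)


def mixed_cost(frontier_share: float) -> float:
    """Average $ per million tokens when frontier_share of jobs go to Opus.

    Assumes equal token usage per job, which is a simplification.
    """
    return frontier_share * OPUS_BLENDED + (1 - frontier_share) * GLM_BLENDED


all_opus = mixed_cost(1.0)
one_in_ten = mixed_cost(0.10)

print(f"All Opus:        ${all_opus:.2f} per million")
print(f"One job in ten:  ${one_in_ten:.2f} per million")
print(f"Saving:          {1 - one_in_ten / all_opus:.0%}")
```

Under that assumption, sending only the hard tenth to Opus brings the average to about $2.73 per million tokens, a saving of roughly 73 percent against running everything on the flagship.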

The frontier is still worth paying for. The trick is knowing how rarely that is.

Sources:

GLM 5.2 vs Claude Opus 4.8 comparison - OpenRouter
GLM 5.2 vs Claude Sonnet 4.6 comparison - OpenRouter
Artificial Analysis model benchmarks - Artificial Analysis
Design Arena leaderboard - human-preference model rankings
GLM 5.2 in Claude Code is Blowing My Mind - Nate Herk, YouTube
Frontier AI keeps getting pricier - subscribers are quietly winning - rrows.net

Rambling Rows
