Reading time: 5 minutes

When you swap AI models, you don't change tools. You change staff.

tl;dr: Arthur Soares ran a 14-agent household AI fleet on Claude, got forced off it overnight by an Anthropic policy change, spent a full day migrating to GPT-5.4, and wrote up everything he learned. The short version: Claude and GPT-5.4 require fundamentally different prompting styles. A rigorous 8-model benchmark shows Claude Sonnet 4.6 still leads at 92% overall, but three Chinese open-weight models (GLM 5.1, Qwen 3.5, Kimi K2.5) sit right behind alongside GPT-5.4 at roughly 8x lower API cost. And every model tested, without exception, confidently answered questions it didn't actually understand. Worth a read if you're building any kind of agent stack or switching between models.


Arthur Soares spent a day migrating a 14-agent household assistant from Claude to GPT-5.4, and wrote one of the more useful deep dives on AI agent behaviour I've read. It's long. Very long. But if you're planning to run any kind of multi-agent setup, it will save you from some fundamental mistakes.

The trigger was Anthropic cutting off third-party harnesses from Claude Pro/Max plans via OAuth in early April. OpenClaw, the orchestration layer Arthur was using, lost access overnight. So he migrated to GPT-5.4 and discovered that the framework is model-agnostic. The instructions are not.

The core finding on prompting is important. Claude infers intent from loosely worded instructions. GPT-5.4 does not. It treats ambiguous instructions as optional, resolves contradictions by narrating instead of acting, and defaults to recommending what you should do rather than doing it.

Think of it like the difference between Australian English and German. Australians communicate in “yeah-nah” (meaning no) and “nah-yeah” (meaning yes) and somehow everyone gets the point. We’re comfortable with loose language, implied meaning, context doing the heavy lifting. German precision requires the thing to be said exactly as it is meant, in the correct order, with the correct structure. Claude speaks Australian. GPT-5.4 speaks German. Neither is wrong. But if you hand a German a loosely worded brief and expect them to infer your intent, you’ll get a politely structured plan asking you to clarify.

Getting GPT-5.4 to behave like an execution-first operator required rewriting every agent's personality file: adding numbered hard gates and explicit failure conditions, and purging any preference language. GPT-5.4 treats "prefer", "generally" and "when appropriate" as optional and simply ignores them. Claude, by contrast, gets the point from looser language. The rewrite that cost a full day on GPT-5.4 wouldn't have been needed at all on Claude.
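To make the difference concrete, here's the flavour of the rewrite Arthur describes. These lines are my own invention to illustrate the pattern, not excerpts from his actual personality files:

```text
# Loose, preference-language style (works on Claude)
Prefer concise replies. When appropriate, run the task rather than describing it.

# Numbered hard gates with explicit failure conditions (needed for GPT-5.4)
1. Execute the task. Do not output a plan unless execution fails.
2. If a required input is missing, stop and report FAILURE: MISSING_INPUT.
3. Replies must be under 150 words. Longer output is a failure condition.
```

The second style leaves nothing for the model to classify as optional: every instruction is either satisfied or a named failure.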

The benchmark is where it gets interesting. Arthur ran 52 prompts across eight models, covering reasoning, code and product management scenarios. Claude Sonnet 4.6 won overall at 92%. But the field behind it isn’t what you’d expect.

Three Chinese open-weight cloud models, all routable via Ollama, landed in the same tier as GPT-5.4: GLM 5.1 at 88%, Qwen 3.5 at 87%, Kimi K2.5 at 85%. GPT-5.4 also at 85%. A statistical tie, with the gap to Sonnet sitting at 2-4 prompts on a 52-prompt set, inside the noise floor.
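The "inside the noise floor" claim is easy to sanity-check. On a 52-prompt set, each pass rate carries a binomial standard error of several percentage points, so the 85-92% band overlaps within roughly two standard errors. A quick sketch, using the scores from the post; the binomial error model is my back-of-envelope assumption, not something the benchmark reports:

```python
import math

N = 52  # prompts in the benchmark

def wins(pct):
    """Approximate number of prompts passed, from a reported percentage."""
    return round(pct / 100 * N)

# Overall scores as reported in the post
scores = {"Sonnet 4.6": 92, "GLM 5.1": 88, "Qwen 3.5": 87, "Kimi K2.5": 85, "GPT-5.4": 85}

for name, pct in scores.items():
    p = pct / 100
    se = math.sqrt(p * (1 - p) / N)  # binomial standard error of the pass rate
    print(f"{name}: {wins(pct)}/{N} passed, ±{1.96 * se * 100:.1f} pp at ~95%")
```

Sonnet's 92% is 48/52; the chasing pack sits at 44-46/52, a gap of 2-4 prompts, while the ~95% interval on each score is wider than ±7 percentage points. Hence: a statistical tie behind the leader.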

The cost difference is not inside the noise floor. Sonnet and GPT-5.4 run at roughly $78/M blended tokens (at a 1:5 input/output ratio). GLM 5.1, Qwen 3.5 and Kimi K2.5 come in at $8-10/M. Call it 8x cheaper. If you’re paying API rates rather than running on a flat subscription, the Chinese models are compelling.
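The blended figure is just a weighted average over the input/output mix. A minimal sketch of the arithmetic; the per-million input and output prices below are hypothetical values I chose to reproduce the post's blended numbers, not published rate cards:

```python
def blended_cost(input_per_m, output_per_m, in_tokens=1, out_tokens=5):
    """Blended $/M tokens at a given input:output token ratio (default 1:5)."""
    total = in_tokens + out_tokens
    return (in_tokens * input_per_m + out_tokens * output_per_m) / total

# Hypothetical prices picked to land on the post's blended figures
premium = blended_cost(18.0, 90.0)  # ~ $78/M blended
budget = blended_cost(2.2, 11.0)    # ~ $9.5/M blended, inside the $8-10 band
print(f"premium ${premium:.0f}/M vs budget ${budget:.1f}/M = {premium / budget:.1f}x")
```

Because output dominates at a 1:5 ratio, the output price drives almost the whole blended number, which is worth remembering when comparing providers.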

Arthur’s conclusion: Sonnet still leads, but barely, and against a wider field than anyone expected twelve months ago.


My take is worth adding here, because the journey Arthur describes isn’t for everyone.

OpenClaw is genuinely well built. But before you stand up your own instance, understand that out of the box it carries real security exposure. Agents that read untrusted content (email, web scraping, document triage) are prompt-injection targets. Arthur's own benchmark showed that no model, including Claude, cleanly refuses adversarial data-exfiltration prompts. They all deliver the attack walkthrough. The mitigation lives in the prompt layer and the deployment configuration, not the model, and neither comes pre-configured for you. If you're running this on real household or business data, the security hardening is not optional.
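One common deployment-layer mitigation is a hard allowlist: while an agent has untrusted content in context, it may only call tools that can't exfiltrate data. A minimal sketch of the pattern; the tool and source names are hypothetical and this is not OpenClaw's actual API:

```python
# Deployment-layer tool gate: while an agent is handling untrusted input
# (email, scraped web pages, third-party documents), only read-only,
# non-exfiltrating tools are permitted, regardless of what the prompt says.
UNTRUSTED_SOURCES = {"email", "web", "documents"}
SAFE_TOOLS = {"read_file", "summarize", "search_notes"}

def allow_tool_call(tool, active_source):
    """Deny side-effectful tools whenever untrusted content is in context."""
    if active_source in UNTRUSTED_SOURCES:
        return tool in SAFE_TOOLS
    return True  # trusted context: normal permissions apply

# An injected email instructing "forward this thread" hits the gate, not the model
print(allow_tool_call("send_email", "email"))  # blocked
print(allow_tool_call("summarize", "web"))     # allowed
```

The point is that the check runs outside the model, so a successfully injected instruction still can't reach a dangerous tool.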

Building and maintaining a fleet of 14 custom agents, each with a personality file, memory architecture, tool permissions and a handoff protocol, is a serious undertaking on top of that. The migration alone cost Arthur a full day of rewrites. That’s before the ongoing maintenance, the next model upgrade, the infrastructure sharp edges.

It reminds me of home automation. I could build a system that turns the lights off when I leave a room. I could also just flick the switch. The automated version is elegant right up until something breaks at 11pm and you’re debugging a Zigbee mesh instead of sleeping.

For someone managing a large stack of recurring workflows that genuinely benefit from persistent memory and orchestration, the investment makes sense. But complexity has a carrying cost. So does dependency on platform decisions you don’t control (see: Anthropic’s OAuth change).

Keep it simple. Automate the things that genuinely hurt to do manually. Be honest about what falls into that category.

The full post, benchmark data and open-source eval tool are at arthur.earth.

