Daniel Vaughan ran Gemma 4 as a local model inside OpenAI’s Codex CLI on both a MacBook Pro and a Dell GB10 workstation built on NVIDIA’s Grace Blackwell superchip. The results are worth your time.
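For context on what “local model inside Codex CLI” means in practice: Codex CLI can be pointed at any OpenAI-compatible endpoint through its `config.toml`. Here’s a minimal sketch assuming a llama.cpp server on localhost; the provider name, model string, and port are placeholders, not Vaughan’s settings, and the exact keys can vary by Codex CLI version, so check the docs for yours:

```toml
# ~/.codex/config.toml -- sketch, not Vaughan's actual config
model = "gemma-4"              # placeholder; use whatever name your server exposes
model_provider = "local"

[model_providers.local]
name = "llama.cpp server"
base_url = "http://localhost:8080/v1"  # any OpenAI-compatible endpoint works
wire_api = "chat"                      # chat-completions wire format
```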
The headline number is striking. Google’s Gemma 4 hit 86.4% tool-calling accuracy versus Gemma 3’s 6.6%. That’s not an incremental improvement. That’s a generational leap in what a local model can do inside an agentic coding workflow.
But the details tell a more familiar story. Getting llama.cpp configured required six specific flags. Ollama had streaming bugs and Flash Attention freezes on the Mac. vLLM wouldn’t play nice with PyTorch on the GB10. The faster model (52 tokens/sec) wrote worse code than the slower one (10 tokens/sec), needing more iterations and producing syntax errors.
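To give a flavor of what “six specific flags” looks like, here’s a hedged sketch of a llama-server launch for this kind of setup. These are real llama.cpp flags, but not necessarily Vaughan’s six (his post has the exact incantation), and the model path is a placeholder:

```sh
# Illustrative llama.cpp server launch -- not the article's exact flags.
#   -m       path to the GGUF weights (placeholder name here)
#   -c       context window; agentic coding sessions consume context quickly
#   -ngl     offload (up to) 99 layers to the GPU
#   --jinja  apply the model's chat template, which tool calling depends on
#   --port   must match the base_url Codex CLI is configured to hit
llama-server -m gemma-4.gguf -c 16384 -ngl 99 --jinja --port 8080
```

The `--jinja` flag is the non-obvious one: without the chat template applied server-side, tool-call markup tends to come back as plain text and the agent loop falls apart.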
Vaughan landed where most of us do right now: a hybrid approach. Local for privacy-sensitive work, cloud for the hard stuff.
This is where local models live today. Genuinely impressive capability wrapped in genuinely frustrating setup. Every week the gap narrows. Every week there’s one less workaround needed. We’re close. Not there, but close.
Sources:
- I Ran Gemma 4 as a Local Model in Codex CLI (medium.com)