
How to Run Qwen3-Coder 30B-A3B on RTX 5090 with Ollama

ai.rs May 22, 2026

May 2026 — notes from setting up a local coding LLM on a single consumer GPU, with the bumps left in.

The goal

A coding-focused LLM running entirely on my own hardware. Reasons in descending order of weight: no per-token costs, no rate limits, no data leaving the box for routine tasks, latency that's bounded by my own GPU rather than someone else's queue. Hardware on the desk: one RTX 5090 (32 GB VRAM, Blackwell sm_120), running Arch Linux. The question was what to put on it.

A false start: the cloned repo

I'd cloned noonghunna/club-3090 — a well-maintained recipe collection for serving LLMs on RTX 3090s. Excellent documentation, real benchmarks, honest about failure modes (their docs/CLIFFS.md is the kind of writeup most serving projects could learn from). But reading the actual launch.sh, the hardcoded model list was just two entries — Qwen3.6-27B and Gemma-4-31B — and the whole architecture is built around squeezing 27B-class models through 24 GB Ampere with vLLM nightlies and Genesis patches. Wrong card class, wrong era, wrong constraints. The "model-agnostic by design" claim in the README is aspirational at the code level: the structure scales to new models, but the launcher itself is bound to specific compose files.

Right call: read the docs, skip the runtime.

Picking the pieces

Model. Qwen3-Coder-30B-A3B-Instruct. The "30B-A3B" is a Mixture of Experts: 30B total parameters, but only ~3B are activated per token. Inference cost is roughly that of a 3B dense model; quality lands much closer to a 30B dense model thanks to expert specialization. There's a 480B-A35B sibling that's outside reach for a 32 GB card. Easy choice.

Quantization: Q5_K_M. At 21.7 GB this hits the quality/size sweet spot for 32 GB. Q4_K_M is ~18 GB but takes a 1–2% quality hit on coding tasks where token-level precision matters. Q8_0 is ~32 GB and leaves essentially no room for KV cache. Q5_K_M leaves enough headroom for a useful context window.

Serving engine: ollama. This one surprised me. The "right" answer for max throughput would be llama.cpp's llama-server directly, or vLLM. But ollama wraps the same llama.cpp engine. The TPS gap between ollama and standalone llama-server is typically 0–10% — wrapper overhead, not engine difference. What you gain by going standalone is access to flags ollama hides (KV cache quantization, dedicated bench binaries). What you give up is the operational niceness: ollama list, automatic VRAM unload after idle, painless model switching, a model store that handles versioning. For daily use against one model, ollama wins on UX without paying meaningfully in speed.

The Q5_K_M gotcha

ollama maintains a curated library of pre-packaged models. ollama pull qwen3-coder works — except the curated quants for the 30B variant are Q4_K_M, Q8_0, and FP16. No Q5_K_M. Q4_K_M is the obvious "just go" option but I wanted to actually run Q5_K_M for the quality.

The workaround: download the Q5_K_M GGUF directly from one of the public re-quanters on Hugging Face (Unsloth and bartowski both maintain full quant sets), then register it with ollama via a Modelfile:

FROM /home/arch/models/Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf
PARAMETER num_ctx 32768
PARAMETER num_gpu 99

ollama create reads the FROM file, hashes it, and stores it as a content-addressed blob. In principle, the chat template, tokenizer config, and tool-calling format are read from the GGUF's metadata automatically, so no TEMPLATE directive should be needed and tool calling should work out of the box for agent-style clients like Cline. That was my assumption at this point; it did not hold for this GGUF, as the 64K section below shows.

Disk-duplication caveat: my ollama runs as a systemd service with its model store at /var/lib/ollama/, which is a different btrfs subvolume from /home. Btrfs doesn't allow cross-subvolume hardlinks, so ollama create copies the 22 GB file into its store. You can run ollama as your user with OLLAMA_MODELS=$HOME/.ollama/models to get hardlinks and zero duplication, but for 22 GB and 346 GB free that wasn't worth the systemd-juggling. Trading disk for simplicity.
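
Before deciding whether the duplicate copy is avoidable, it can help to check whether the download directory and the model store share a device at all. A minimal Python sketch; the paths in the usage comment are this article's layout and are assumptions for any other machine. A matching st_dev is a necessary condition for a hardlink, not a guarantee, and on btrfs each subvolume reports its own st_dev.

```python
import os


def same_device(path_a: str, path_b: str) -> bool:
    """Return True if both paths report the same device id (st_dev).

    Hardlinks can only be created between paths on the same device,
    so a False here means a copy is unavoidable.
    """
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


# Usage (paths taken from this setup, adjust to yours):
# same_device("/home/arch/models", "/var/lib/ollama")
```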

The 64K context gotcha

First attempt: num_ctx 65536 in the Modelfile, ollama create, ollama run. Result:

Error: 500 Internal Server Error: memory layout cannot be allocated with num_gpu = 99

Initial instinct: ollama's memory estimator being pessimistic on MoE models. Wrong instinct. nvidia-smi showed 5.6 GB of VRAM already in use — KDE plasmashell (660 MB), Chromium GPU process and tabs (~3 GB total), Telegram (450 MB), a few smaller apps. Normal desktop session, but enough to push the budget over the line:

Q5_K_M weights:            ~22 GB
FP16 KV cache at 64K:       ~6 GB
Activations + cudagraph:    ~2 GB
                            ─────
Total needed:               ~30 GB
Free VRAM (after desktop):  26.4 GB
                            ─────
Shortfall:                  -3.6 GB

ollama wasn't pessimistic — the math was correct. Two ways out: free the 3.6 GB by closing Chromium, or shrink the KV cache. I dropped to num_ctx 32768, which cuts KV to ~3 GB. After re-creating the model:

ollama:      24.4 GB  (weights + KV + activations)
Desktop:      5.4 GB
Free:         2.2 GB

Fits cleanly with a healthy buffer.
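
The KV figures above can be sanity-checked from the model's attention shape. A rough Python sketch, assuming 48 layers, 4 KV heads, and a head dimension of 128 for Qwen3-Coder-30B-A3B. Those three values are my assumption from the published model config and are not stated elsewhere in this writeup, so check the model's config.json before relying on them.

```python
def kv_cache_gib(
    num_ctx: int,
    n_layers: int = 48,           # assumed, verify against config.json
    n_kv_heads: int = 4,          # assumed (grouped-query attention)
    head_dim: int = 128,          # assumed
    bytes_per_elem: float = 2.0,  # FP16
) -> float:
    """Estimate KV cache size in GiB for a given context length."""
    # One K vector and one V vector per layer, per KV head, per token.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return num_ctx * bytes_per_token / 2**30


for ctx in (32768, 65536):
    print(f"{ctx:>6} tokens: {kv_cache_gib(ctx):.1f} GiB FP16 KV cache")
```

Under those assumptions the estimate lands on the ~3 GB and ~6 GB figures used in the budget tables, which is some evidence that the tables are in the right ballpark.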

This is the part where local serving differs from cloud most concretely. Cloud inference has dedicated machines with 80+ GB of HBM per GPU, often 8 GPUs sharing capacity. Your local card shares with the desktop, the browser, the chat app, the screenshot tool. The first ~5 GB of VRAM is gone before the model even loads.

"Is 32K enough?"

When Claude advertises 1M context and your local model is capped at 32K, the gap looks vast. It isn't, for what coding actually needs:

  • A typical source file: 1–5K tokens
  • A file plus 3–5 related files for context: 10–20K tokens
  • A moderate-codebase summary with focused references: 25–30K tokens

32K covers all of that. The places where 1M actually pays off — read this entire 200-file repo and refactor it; ingest a 600-page document and answer questions across all of it — are where you'd be reaching for a cloud model anyway, both for context and for the qualitatively better judgment of a frontier model. The local model is for the routine 80%: "explain this function", "write a unit test", "refactor this loop", "what's wrong with this regex".
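
To check whether a given set of files fits the window before sending it, a crude estimate is usually enough. A minimal Python sketch, assuming roughly 4 characters per token (a rule of thumb, not the Qwen tokenizer's real ratio) and a hypothetical 4096-token reserve for the model's reply. Both numbers are placeholders to tune.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # heuristic assumption; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Rough token count derived from character length."""
    return len(text) // CHARS_PER_TOKEN


def fits_in_context(paths: list[str], num_ctx: int = 32768, reserve: int = 4096) -> bool:
    """True if the files, plus a reserve for the reply, fit in num_ctx."""
    total = sum(estimate_tokens(Path(p).read_text(errors="ignore")) for p in paths)
    return total + reserve <= num_ctx
```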

For when you do want more context locally, ollama exposes OLLAMA_KV_CACHE_TYPE=q8_0, which roughly halves KV memory at near-zero quality cost. That alone moves 64K from "won't fit" to "fits with room". I left that as an opt-in rather than the default since it requires editing the systemd unit.

How to think about quant vs context

A natural follow-up question after hitting the 64K wall: what if I gave up some weight precision in exchange for more context? Q4_K_M is ~18 GB on disk; that's roughly 4 GB less than Q5_K_M, which is enough KV cache for an extra ~40K tokens at FP16 (at ~6 GB per 64K tokens, 4 GB buys about 43K). So a Q4_K_M build with the same VRAM budget gets roughly double the workable context. Tempting.

But there are two things that make this less obviously good than it looks.

First, the quality cost isn't symmetric across workflows. Published coding benchmarks (HumanEval, MBPP, LiveCodeBench) show Q5_K_M → Q4_K_M drops of 1–3% absolute pass rate for 30B-class models. That's small enough to be undetectable on a single prompt: in a blind taste test, you'd struggle to tell them apart. But for agentic coding (Cline-style multi-step refactors, aider with edit-format tool calls, anything where the model is making chained decisions) those small per-step errors compound. A 2% error rate per decision over 10 chained decisions starts to look meaningfully different from the same model at Q5. So the Q5 → Q4 swap costs more in workflows where it matters most: long-running agent sessions, which are also the workflows that most want the extra context.
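
To put a number on the compounding, here is a small Python sketch. It assumes each decision fails independently with the same probability, which is a simplification (real agent errors are correlated and some are recoverable), so treat the output as an upper-level intuition rather than a measurement.

```python
def p_any_error(per_step_error: float, steps: int) -> float:
    """Probability of at least one error across independent steps."""
    return 1.0 - (1.0 - per_step_error) ** steps


for steps in (1, 10, 30):
    print(f"{steps:>2} steps: {p_any_error(0.02, steps):.1%}")
```

With a 2% per-step error rate, ten chained decisions already carry about an 18% chance of at least one wrong step under this model.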

Second, more context doesn't translate linearly to better outputs. Coding models tend to degrade on long-context retrieval beyond their effective working window — quality on "use these 50 files to find the bug" drops sharply past ~32K, even for models trained to 256K. Published needle-in-haystack benchmarks measure something narrower than what real codebase work needs. Past ~32K, you usually get better results by being selective about what you include in context than by stuffing more in.

So the binary "Q5 with 32K context vs Q4 with 64K context" turns out to be the wrong framing. The real lever is in the middle.

What actually works:

  • Q5_K_M + q8_0 KV cache keeps Q5-level weight quality and roughly halves the per-token KV cost. With near-zero quality impact, it brings 64K into easy reach and 128K close to the edge. q8_0 isn't true FP8 (it's int8 with shared FP16 block scales) but the memory savings are FP8-class.
  • Unsloth's UD-Q5_K_XL variant, at the same 21.7 GB size as Q5_K_M, selectively keeps higher precision on critical layers. Theoretically pushes quality toward Q6 territory at Q5 cost.
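
For intuition on why q8_0 costs so little quality, here is a sketch of the block scheme in Python: 32 values share one scale, and each value is stored as a signed 8-bit integer. This is an illustration of the idea, not llama.cpp's code; in particular it keeps the scale as a Python float rather than rounding it to FP16.

```python
import struct

BLOCK = 32  # values per q8_0 block


def quantize_q8_0(block: list[float]) -> tuple[float, list[int]]:
    """Quantize one block to (scale, int8 values), q8_0-style."""
    assert len(block) == BLOCK
    amax = max(abs(x) for x in block)
    scale = amax / 127.0
    if scale == 0.0:
        return 0.0, [0] * BLOCK
    return scale, [round(x / scale) for x in block]


def dequantize_q8_0(scale: float, qs: list[int]) -> list[float]:
    """Recover approximate values from the scale and int8 codes."""
    return [scale * q for q in qs]


def packed_size() -> int:
    """Bytes per packed block: one FP16 scale plus 32 int8 values."""
    return struct.calcsize("<e32b")


print(f"{packed_size()} bytes per block vs {BLOCK * 2} bytes at FP16")
```

That works out to 34 bytes instead of 64, or about 53% of the FP16 footprint, which is where the "roughly halves" figure comes from.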

The sensible progression for someone in my position: enable q8_0 KV first (a free lever — no quality tax) and live with that for a couple of weeks. If you find yourself routinely running out of context on real tasks past 128K, the workflow is asking for cloud anyway. Only consider Q4_K_M if you've actually validated that the context ceiling matters in your day-to-day, not just in theory.

Going to Q4 before trying q8_0 KV is paying the quality bill up-front for a ceiling you might never touch.

The performance surprise

I'd estimated 80–120 TPS based on the model size (30B). The first benchmark shipped that estimate to the bin:

{"eval_count": 462, "eval_duration_ms": 2001.85, "tps": 230.79}

231 tokens per second for a short coding completion. Roughly double my back-of-envelope.

The reason is the MoE architecture. My mental model was anchored on dense 30B inference, where every parameter touches every token and TPS reflects that. In a 30B-A3B MoE, each token's forward pass activates only ~3B of parameters (the chosen experts plus the shared layers). Generation speed scales with active parameters, not total. On a 5090's memory bandwidth, 3B of effectively-active weights moves fast.

The catch is that prefill (reading the prompt before generation starts) still touches all the model machinery. Its cost grows at least linearly with prompt length, and the attention portion grows quadratically. So a short interactive coding prompt feels blazing; a 20K-token "here's my codebase" prompt has a noticeable pause before the first token. The 230 TPS number is steady-state generation, not prefill-bound latency.

Either way, this is comfortably usable. At 230 TPS, a 1000-token response materializes in about 4 seconds. Interactive coding feels closer to typing-speed than to "wait for the assistant".
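
For reference, the TPS figure can be reproduced from the counters in ollama's native API response, which reports eval_count in tokens and eval_duration in nanoseconds. A minimal Python sketch; the helper names are mine, and the input values are the ones from the benchmark line above converted to nanoseconds.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from ollama's eval_count / eval_duration fields."""
    return eval_count / (eval_duration_ns / 1e9)


def seconds_for(tokens: int, tps: float) -> float:
    """Time to generate a response of the given length at a steady rate."""
    return tokens / tps


tps = tokens_per_second(462, 2_001_850_000)
print(f"{tps:.2f} tok/s, 1000 tokens in {seconds_for(1000, tps):.1f} s")
```

Note that this only covers generation. Prompt processing is reported separately (prompt_eval_count and prompt_eval_duration), so time-to-first-token on long prompts needs its own measurement.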

Going to 64K — and finding what 32K hid

The progression sketched above put OLLAMA_KV_CACHE_TYPE=q8_0 first: quantize the KV cache to int8 with FP16 block scales, halving its VRAM cost at essentially zero quality impact. I did that next.

The setup is a systemd drop-in (/etc/systemd/system/ollama.service.d/override.conf) adding two env vars to the daemon: OLLAMA_KV_CACHE_TYPE=q8_0 and OLLAMA_FLASH_ATTENTION=1. ollama only applies a quantized KV cache when flash attention is on; it is auto-enabled on Blackwell, but being explicit is cheaper than wondering later. After systemctl daemon-reload && systemctl restart ollama, I bumped num_ctx in the Modelfile from 32768 to 65536 and re-ran ollama create.
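
For completeness, this is roughly what the drop-in contains, rendered by a few lines of Python so the shape is explicit. The function name is hypothetical, and the sketch only prints the text; writing it to the override path needs root and is left to the reader.

```python
def render_dropin(env: dict[str, str]) -> str:
    """Render a systemd drop-in that sets environment variables on a service."""
    lines = ["[Service]"]
    lines += [f'Environment="{key}={value}"' for key, value in env.items()]
    return "\n".join(lines) + "\n"


print(render_dropin({
    "OLLAMA_KV_CACHE_TYPE": "q8_0",
    "OLLAMA_FLASH_ATTENTION": "1",
}), end="")
```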

The numbers confirmed it engaged. ollama process VRAM went from 24.4 GB at 32K-FP16-KV to 25.0 GB at 64K-q8_0-KV. That is the ~3 GB saving you'd expect relative to 64K at FP16: halving the per-token KV cost (6 GB FP16 → 3 GB q8_0) cancels out the doubling of the context, so the KV allocation stays near the 3 GB it occupied at 32K. TPS sat at 223, effectively unchanged from the 230 at 32K. Free desktop VRAM dropped to 1.4 GB, tight but workable. Functionally I now had 2× the context for less than 1 GB more allocation.

Then I ran a real coding prompt (an anagrams task) to validate quality. And the output went off a cliff.

The model wrote a sensible function. Then emitted <|endoftext|> as literal text. Then kept generating. It hallucinated a fake user follow-up turn ("Human: Can you modify the function to also..."). Then "answered" itself. Then repeated this loop four or five times, each iteration claiming to be the "final clean version" and contradicting the previous one. At no point did ollama stop the generation.

The diagnosis was upstream of everything I'd been doing. ollama show --modelfile qwen3-coder-q5km revealed the actual template ollama had registered for the model:

TEMPLATE {{ .Prompt }}

That's the no-template default — raw user input passed through unchanged, no ChatML wrapping, no stop tokens declared. ollama is supposed to read the chat template from the GGUF's tokenizer.chat_template metadata field. Either the Unsloth re-quant doesn't populate that field cleanly, or ollama 0.19 doesn't parse Qwen3's specific Jinja template variant correctly. Either way, ollama had silently fallen back to "no template" without warning, and I hadn't noticed because:

  1. Modern Qwen is robust enough to produce sensible output even from bare prompts. The model's first response was fine.
  2. Short prompts (like the benchmark) end naturally and don't need stop tokens to halt: the model picks a reasonable conclusion and the API returns. The benchmark (a short Sieve completion) had been measuring TPS on a workflow where the missing stop tokens never mattered.
  3. The model emitted <|endoftext|> — but as literal text, because ollama wasn't told it was a stop string.

The fix was a proper TEMPLATE block in the Modelfile (Qwen ChatML, ~15 lines) plus three explicit PARAMETER stop directives: <|im_end|>, <|endoftext|>, <|im_start|>. After ollama create re-registered with these in place, the same anagrams prompt produced one focused answer, the model emitted its turn terminator, ollama halted, and the REPL returned to the >>> prompt. The output quality was visibly higher too — internal doctest/code consistency held (in the broken run, the doctest expected output that contradicted the implementation), and the model used modern list[str] type hints rather than the older typing.List[str].

The lesson: when you go custom-GGUF-via-Modelfile instead of using ollama's curated library, you take on responsibility for the chat template and stop tokens that the curated tags configure invisibly. Going to ollama pull qwen3-coder:30b-a3b-q4_K_M would have given me the right template metadata for free. Going custom traded that for the higher quant. Worth the trade — but the silent fallback to the no-template default was a much sharper edge than I'd expected from "just create a Modelfile."
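
A cheap guard against this class of silent fallback is to lint the output of ollama show --modelfile after every ollama create. A minimal Python sketch; the function name and warning strings are hypothetical, the stop-token set is the one from this writeup, and the check treats a RENDERER directive (covered later in this article) as a valid substitute for TEMPLATE.

```python
import re

EXPECTED_STOPS = {"<|im_end|>", "<|endoftext|>", "<|im_start|>"}


def lint_modelfile(text: str) -> list[str]:
    """Return warnings for a Modelfile dump (e.g. from `ollama show --modelfile`)."""
    warnings = []
    stops = set()
    template = None
    has_renderer = False
    for raw in text.splitlines():
        line = raw.strip()
        if line.upper().startswith("TEMPLATE"):
            # Keep only the first line of the template, without quotes.
            template = line[len("TEMPLATE"):].strip().strip('"')
        if line.upper().startswith("RENDERER"):
            has_renderer = True
        match = re.match(r"(?i)PARAMETER\s+stop\s+(.+)", line)
        if match:
            stops.add(match.group(1).strip().strip('"'))
    # A bare passthrough template means prompts reach the model unwrapped.
    if not has_renderer and template in (None, "{{ .Prompt }}"):
        warnings.append("no chat template: prompts are passed through raw")
    missing = EXPECTED_STOPS - stops
    if missing:
        warnings.append("missing stop tokens: " + ", ".join(sorted(missing)))
    return warnings
```

Feeding it the dump from the broken model would have flagged both problems before the first long prompt did.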

It also retroactively changes my reading of an earlier observation. The first time I ran the anagrams test, before fixing the template, the model wrote a function whose doctest contradicted its own code — the kind of small-but-real attention drift I'd attributed in passing to Q5 quantization. With the template fixed, the same prompt produces an internally consistent answer. That drift wasn't the quant. It was the model being forced to keep generating past its natural end-of-turn, getting derailed into self-correction loops, and accumulating contradictions across the imagined revisions. The quant was never the problem.

Where this leaves things

Final stack:

  • Hardware: RTX 5090, 32 GB
  • Model: Qwen3-Coder-30B-A3B-Instruct, Q5_K_M (21.7 GB on disk)
  • Engine: ollama 0.19.0 (wraps llama.cpp), with q8_0 KV cache and flash attention enabled via systemd override
  • Context: 64K
  • VRAM at load: 25.0 GB used by ollama, 5.6 GB by desktop, 1.4 GB free
  • Speed: ~223 TPS steady-state for short prompts (essentially unchanged from 32K-FP16)
  • Endpoint: http://localhost:11434/v1, model qwen3-coder-q5km

Coding clients (aider, Continue.dev, Cline, Cursor with custom-provider mode) all connect to the OpenAI-compatible endpoint with a dummy API key. Tool calling works because the Modelfile's TEMPLATE block renders Qwen ChatML correctly, and the embedded GGUF tokenizer handles the <tool_call> framing.
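
For a quick smoke test without any client installed, the endpoint can be exercised with the Python standard library. A minimal sketch, assuming the endpoint and model name listed above; the API key is an arbitrary placeholder, and the helper names are mine.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # endpoint from this article
MODEL = "qwen3-coder-q5km"              # model name registered above


def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    body = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer dummy",  # placeholder key
        },
    )


def chat(prompt: str) -> str:
    """Send the request and return the reply text (needs a running server)."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

Calling chat("write a unit test for a function that reverses a string") against a running daemon should return a single reply; if the reply never ends, suspect the template and stop tokens before suspecting the model.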

What I'd try next:

  • The UD-Q5_K_XL variant from Unsloth at the same 21.7 GB size — uses higher precision selectively on important layers, theoretically better quality for the same VRAM cost.
  • Side-by-side against Claude on real tasks — not synthetic benchmarks, just "did the local model handle this PR review / refactor / debugging session, and where did it fall short". The interesting question for local serving isn't TPS; it's "where exactly is the quality cliff vs cloud, and what tasks fall safely below it."
  • vLLM with FP8-quantized weights to actually exploit Blackwell's FP8 tensor cores. llama.cpp doesn't use them today; running on a 5090 leaves them idle. The setup cost is real (different weight format, more moving parts) but it's the only way to find out what this card can actually do on dense models.

Reflections

A few things I'd tell past-me starting this experiment.

The exotic stuff is for niche constraints. vLLM, Genesis patches, custom quant kernels — these exist because someone has a constraint that can't be fixed any other way (24 GB Ampere, prefill cliffs on specific architectures, etc.). On a 5090 with a normal model, ollama covers 95% of the value and any of the alternatives is incremental.

Estimate VRAM by what's free, not what's installed. "I have 32 GB" is misleading. You have 32 GB minus whatever your desktop and apps are holding, and that floor moves around. Check nvidia-smi before assuming. The first failure of this experiment — 64K context refusing to fit — wasn't a misconfiguration. It was the desktop quietly holding 5.6 GB that the back-of-envelope math hadn't accounted for.

MoE inference is its own thing. Dense-model intuitions about TPS don't transfer. The 230 TPS surprise was useful: it changed what I think this hardware is good for. The per-token cost of a 30B-A3B forward pass is the routing decision, the shared (non-expert) layers, and the few experts the router selects, about 3B parameters in all; the bulk of the parameter budget sits in experts that idle on any given token.

The curated-vs-custom trade is sharper than it looks. When you ollama pull a tag from the curated library, you also pull the right chat template, stop tokens, and parameter defaults invisibly bundled with the weights. When you go custom — your own Modelfile pointing at a downloaded GGUF — you're responsible for those, and ollama's fallback when it can't read the GGUF's embedded chat template is no template at all, silently. It "works" for short prompts because Qwen is robust, and fails catastrophically for longer ones because there are no stop tokens. The first I knew was the model hallucinating fake user turns. Add explicit TEMPLATE and PARAMETER stop directives to any custom Modelfile, even if you think the GGUF "has it built in".

Quality bugs and config bugs look the same from outside the model. I almost wrote off the model's doctest/code inconsistency as a Q5_K_M quality limit — exactly the kind of "small attention drift that compounds in agentic workflows" I'd theorized about earlier. It wasn't. It was the model being forced to keep generating, drifting through invented follow-up turns, accumulating contradictions across imagined revisions. Once stop tokens worked, the same prompt produced an internally consistent answer. Worth a sanity check before blaming the weights: is the model actually finishing its turn, or is it being kept on the leash by missing config?

Local isn't a cloud replacement, it's a complement. The right framing isn't "can the 5090 run something as good as Claude". It's "for which tasks is the 5090 fast enough, private enough, and cheap enough that I'd rather use it than reach for the cloud, even at lower quality". For routine coding tasks the answer is "many of them" — once the stop tokens are working.

Postscript: trying Crush

After the writeup above, I went looking for a more polished alternative to Aider — something with the agentic UX of Claude Code but model-agnostic from the start. The obvious candidate was Crush from Charmbracelet — the team behind Bubble Tea, Lipgloss, Glamour, Glow, the terminal-UI shop. Go-based, single binary, AUR-installable with yay -S crush-bin. ~24K stars, daily commits, growing fast.

The install was clean. The TUI launch screen was genuinely beautiful — pixel-perfect spacing, considered colors, a Charm logo that's just the right amount of fun. Better than any other coding-assistant TUI I've seen. The two-tier "Large Task / Small Task" model picker is a nice ergonomic detail — configure cheap-and-fast for one slot, quality-for-hard-stuff for the other. I added Qwen3-Coder Q5KM under an ollama provider in ~/.config/crush/crush.json, similar shape to the OpenCode config. Crush picked it up; the model picker showed it as ✓ Configured. So far so good.

One nice UX detail worth noting: Crush also detected my ANTHROPIC_API_KEY (set elsewhere for Claude Code) and defaulted to Claude Sonnet 4.6 automatically, prioritizing cloud over local when both are available. Switching to Qwen3-Coder via the picker was a keystroke. Real respect for the dual-model dual-provider workflow.

Then I gave it a prompt: evaluate README.

Crush replied with "I'll evaluate the README.md file for you," and then immediately got stuck:

[Uses ls tool] [uses view tool] [uses view tool] [uses view tool] ...

Pages of it. Hundreds of lines of [uses view tool] in brackets. The model was outputting natural-language descriptions of tool calls instead of actual structured tool calls — and Crush wasn't executing anything, so the model never got file content back, so it kept "trying." Stop tokens didn't fire because none of <|im_end|> / <|endoftext|> / <|im_start|> was appearing in this hallucinated description format.

A bit of digging revealed this is a known Crush bug — #2936, filed by another user the day before my own attempt, with mitmproxy diagnostics proving the chain:

  1. Crush correctly sends tool definitions to the model.
  2. The model correctly responds with finish_reason: "tool_calls" and well-formed tool call JSON.
  3. Crush silently ignores the tool calls and never executes them.
  4. The model, getting no execution feedback, repeats — bounded only by default_max_tokens.

So our setup was right. The model was right. The protocol translation was right. Crush itself has a regression in its OpenAI-compatible-provider tool-call execution path that didn't exist in earlier versions — a January 2026 blog post by Meschbach documents the same setup working successfully four months earlier. The breakage is recent, the fix is pending, and multiple related issues going back to August 2025 (#447, still open after nine months) suggest the local-provider integration is a fundamentally rough surface area for Crush at the moment. Not bad faith from Charm — just not yet a fully-shipped feature.

This is the second empirical confirmation of the article's design-theory framing, on top of the chat-template gotcha earlier:

  • The first failure was at our config layer (silent fallback to no-template when ollama couldn't parse the GGUF's embedded Jinja). Fixable by adding explicit TEMPLATE and PARAMETER stop directives to the Modelfile.
  • The second failure is at the tool's layer (Crush's tool-call execution path is broken for OpenAI-compatible providers). Not fixable at our level — wait for Charm to ship a fix.

Both fit the same pattern: agentic-style tools have larger surface areas to break, particularly along the local-model integration path that isn't the developers' day-job priority. Aider's smaller, more deliberate surface area — user-driven dialog, explicit file context via /add, no autonomous tool exploration — avoids both failure modes by design. Not because Aider is "better" in some absolute sense, but because Aider's design rewards weaker models for what they can do (write code given context) instead of asking them to do what they're worst at (drive an agentic tool loop reliably).

The right next experiment is OpenCode — same agentic category as Crush, different codebase, possibly different bug surface. If OpenCode handles tool calls against ollama cleanly, "agentic + local model" works in some tool, just not Crush right now. If OpenCode also fails on the same task, the case for Aider's design philosophy gets stronger still: smaller surface area is just better for a workflow where every integration point is a potential bug, and the model itself is more constrained than the tooling assumes.

For now, Aider remains the working tool for actual coding work on this stack. Crush stays installed; I'll come back when #2936 lands.

The meta-lesson is the same one the rest of the writeup keeps pointing at: with a local 30B-class model, the surface area you can fail through is large, and the bugs are silent. Chat templates that quietly fall back to no-template. Stop tokens that aren't fired because the model emitted a non-canonical end marker. Tool-call responses that the client silently discards. None of these failures throw an exception. They all just produce subtly-wrong output, or no output at all, and you only notice when you actually try real work. The setup time isn't in the install — it's in discovering and fixing the silent gaps.

Postscript update: it was the Modelfile, not Crush

After writing the section above, I kept poking. The "Crush is broken with local OpenAI-compatible providers" framing felt too convenient — multiple tutorials documented the combo working in earlier Crush versions, and issue #2936 had been open for less than a day with no maintainer comments either confirming or denying. I tried one more controlled experiment: same Crush, same prompt, same project, but a different Qwen3-Coder variant.

I pulled ollama's curated qwen3-coder:30b-a3b-q4_K_M tag (instead of using my custom Q5_K_M Modelfile from HF), added it to Crush's config alongside my Q5, restarted, switched to it in the model picker, and re-ran the same evaluate README prompt.

It worked. Perfectly. The view tool executed, the README content came back, the model produced a coherent multi-paragraph evaluation. The same Crush that had hallucinated [uses view tool] brackets ten minutes earlier was now driving an agentic tool-call loop without complaint.

The bug wasn't Crush. The bug was my Modelfile.

Diffing the two ollama show --modelfile outputs side by side revealed exactly two lines that differed in any load-bearing way:

RENDERER qwen3-coder
PARSER qwen3-coder

These are model-aware ollama directives, added relatively recently to ollama. They tell ollama how to format prompts for a specific model and — critically — how to parse its output:

  • RENDERER wraps incoming chat messages in the model's expected format (for Qwen3-Coder, that's ChatML with <|im_start|> / <|im_end|> markers). Without it, ollama either uses the GGUF's embedded chat template or falls back to a stub. With it, ollama uses Qwen-specific logic.
  • PARSER translates the model's output before delivering it to clients. This is the critical one. Qwen3-Coder emits tool calls in its native XML format: <tool_call>{"name": "view", "arguments": {"file_path": "README.md"}}</tool_call>. OpenAI-compatible clients (including Crush) expect structured tool-call JSON in the tool_calls field of the response, not raw XML in the content. The PARSER qwen3-coder directive tells ollama to parse the XML and emit proper tool_calls JSON on the OpenAI-compatible API.
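
To make the translation step concrete, here is a toy version in Python of what a parser layer has to do: pull tool-call markup out of the raw text and re-emit it in the structure OpenAI-compatible clients read. This is not ollama's parser, which is model-specific and also has to cope with streaming and malformed output. The sketch assumes the JSON-in-tags shape quoted above, and the id format is made up.

```python
import json
import re
import uuid

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def extract_tool_calls(content: str) -> tuple[str, list[dict]]:
    """Split raw model output into leftover text and OpenAI-style tool_calls."""
    calls = []
    for match in TOOL_CALL_RE.finditer(content):
        call = json.loads(match.group(1))
        calls.append({
            "id": f"call_{uuid.uuid4().hex[:8]}",  # hypothetical id scheme
            "type": "function",
            "function": {
                "name": call["name"],
                # OpenAI clients expect arguments as a JSON-encoded string.
                "arguments": json.dumps(call.get("arguments", {})),
            },
        })
    text = TOOL_CALL_RE.sub("", content).strip()
    return text, calls
```

Without a step like this, the markup stays in the content field and a client has no structured call to execute.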

My hand-rolled Modelfile had TEMPLATE (a 15-line Jinja-ish ChatML wrapper I wrote based on what Qwen needs) and three PARAMETER stop directives. It did not have RENDERER or PARSER. So when the model emitted perfectly valid Qwen tool-call XML, ollama forwarded it raw to Crush, which saw plain text and ignored it. The model, getting no execution feedback, looped on its own attempts to invoke tools — which is the hallucinated [uses view tool] pattern.

This also retroactively explains issue #2936's diagnostic. The reporter saw finish_reason: "tool_calls" with correct tool-call data via mitmproxy, but Crush silently discarded it. Of course Crush discarded it — Crush was looking for structured JSON, ollama delivered raw Qwen XML. The bug isn't in Crush at all. The bug is that hand-rolled Qwen Modelfiles need to know about RENDERER and PARSER, and that knowledge isn't surfaced anywhere obvious in ollama's docs or in the Crush + Qwen tutorials floating around. The curated qwen3-coder tags have it; rolling your own from a Hugging Face GGUF, you don't unless you know to copy it.

I added the two lines to my Q5 Modelfile (FROM pointing at ollama's existing blob, no re-download needed) and re-ran ollama create. Then in Crush: switched to my Q5 model, ran the same prompt. Same clean tool-call execution. Three data points: Q5 broken, Q4 curated works, Q5 with the fix works. Diagnosis confirmed end to end.

The next stop on this winding path was wanting more context. I bumped Crush's context_window config from 65536 to 131072, restarted Crush, and re-ran a prompt. The model produced an awkward <function=view> <parameter=file_path> mangled output and didn't actually execute the tool — looked like another regression. But curl /api/ps told the real story: ollama had loaded the model at "context_length": 32768. Crush's context_window config field is UI-only. The OpenAI-compatible API path doesn't have a clean way to pass num_ctx to ollama, so Crush's config just affects the picker label. To actually get larger context, num_ctx has to be set in the Modelfile.

(The mangled <function=...> output was a separate but related issue: at the default temperature: 0.7 the curated Q4 tag uses, tool-call format adherence is probabilistic — the model occasionally improvises Anthropic-style XML when it should be using Qwen-style JSON. Dropping temperature to 0.2 makes format adherence essentially deterministic for tool use without hurting coding quality.)
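
The /api/ps check is worth automating, since the client's own label cannot be trusted. A minimal Python sketch that compares the context ollama actually loaded with what you expected. It assumes the response shape seen in this writeup (a models list whose entries carry name and context_length); the helper names are mine, and fetch_ps is defined but not called here because it needs a running daemon.

```python
import json
import urllib.request


def loaded_contexts(ps_payload: dict) -> dict[str, int]:
    """Map each loaded model name to the context length reported by /api/ps."""
    return {
        m["name"]: m["context_length"]
        for m in ps_payload.get("models", [])
        if "context_length" in m
    }


def fetch_ps(base_url: str = "http://localhost:11434") -> dict:
    """Query a running ollama daemon for its loaded models."""
    with urllib.request.urlopen(f"{base_url}/api/ps") as resp:
        return json.load(resp)


def check_context(ps_payload: dict, model: str, expected: int) -> bool:
    """True only if the model is loaded with exactly the expected context."""
    return loaded_contexts(ps_payload).get(model) == expected
```

Running check_context(fetch_ps(), "your-model:latest", 131072) right after the first request would have caught the 32768 mismatch immediately.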

So the canonical Modelfile that actually works ended up being a custom one built on top of the curated Q4 blob with four added overrides:

FROM /var/lib/ollama/.ollama/models/blobs/sha256-1194192cf2…
# required for ChatML prompt formatting
RENDERER qwen3-coder
# required for tool-call XML→JSON translation
PARSER qwen3-coder
# required because Crush can't propagate num_ctx via OpenAI-compat
PARAMETER num_ctx 131072
# required for reliable tool-call format adherence
PARAMETER temperature 0.2
# (plus the same stop tokens and other sampler params as the curated tag)

After registering this as qwen3-coder-q4-128k and pointing Crush at it, the agentic loop ran cleanly at 128K context with deterministic tool calls. End of investigation.

The real takeaway

This experiment ran over many hours with multiple false stops. The final working setup is a one-page Modelfile and a one-page Crush config. But the path from "model downloaded" to "agentic Crush session running cleanly at 128K context" required understanding four separate gotchas:

  1. Custom Modelfiles need RENDERER and PARSER directives for tool-call translation. Curated ollama tags have them; hand-rolled ones from HF GGUFs don't.
  2. Crush's context_window config is UI-only: num_ctx must be set in the Modelfile, not the client config.
  3. Default temperature 0.7 makes tool-call format probabilistic. For agentic workflows, drop to 0.2.
  4. Stop tokens and chat templates still need to be right, even with RENDERER doing the work — though RENDERER makes a hand-rolled TEMPLATE block unnecessary.

None of these gotchas threw an exception or produced an error message. Each produced "looks plausible, doesn't quite work" output. The cost of running a local agentic stack isn't the disk or the install or the VRAM — it's the slow accumulation of empirical knowledge about which silent failure modes you're currently hitting, and which directive in which configuration file fixes which one.

The Aider design philosophy still holds — its smaller surface area genuinely is less likely to break on these silent gotchas. But once you've climbed the configuration learning curve, agentic-style tools (Crush, OpenCode) can be reliable too. The difference is that Aider rewards low-effort setup with reliable behavior; agentic tools demand high-effort setup but give you a richer working surface in return. Either is a valid choice. Just don't believe the install instructions when they say it's two commands. It's two commands plus a half-day of debugging Modelfile silent fallbacks.
