Google released Gemma 4 on April 1, 2026 — a family of models including the 26B-A4B Mixture of Experts variant that activates only 3.8B of its 25.2B parameters per token. Apache 2.0 licensed, 256K context, 140+ languages, native vision support. On paper, it's a direct competitor to Qwen 3.5's MoE lineup.
We spent two days trying to QLoRA fine-tune the MoE variant on an RTX 5090 (32 GB VRAM). It doesn't work — yet. Not because of a bug, but because of an architectural decision that the tooling ecosystem hasn't caught up with. Important caveat: the dense Gemma 4 models (E2B, E4B, 31B) fine-tune just fine with standard QLoRA. This article is specifically about the MoE 26B-A4B variant.
## Gemma 4 vs Qwen 3.5: The Specs
Both models use Mixture of Experts to deliver big-model knowledge at small-model speed. Here's how they compare:
| Spec | Gemma 4 26B-A4B | Qwen 3.5 35B-A3B |
|---|---|---|
| Total Parameters | 25.2B | 35B |
| Active Parameters | 3.8B | ~3B |
| Experts | 128 + 1 shared, 8 active | 256 + 1 shared, 8 routed |
| Layers | 30 | 40 |
| Native Context | 256K | 262K (up to 1M with RoPE scaling) |
| Modalities | Text + Image | Text + Image + Video |
| Languages | 140+ | 201 |
| License | Apache 2.0 | Apache 2.0 |
### Benchmarks
| Benchmark | Gemma 4 | Qwen 3.5 | Winner |
|---|---|---|---|
| MMLU-Pro | 82.6 | 85.3 | Qwen 3.5 (+2.7) |
| GPQA Diamond | 82.3 | 84.2 | Qwen 3.5 (+1.9) |
| LiveCodeBench v6 | 77.1 | 74.6 | Gemma 4 (+2.5) |
| Codeforces ELO | 1718 | 2028 | Qwen 3.5 (+310) |
| MATH-Vision | 82.4 | 83.9 | Qwen 3.5 (+1.5) |
| MMMU Pro (Vision) | 73.8 | 75.1 | Qwen 3.5 (+1.3) |
Qwen 3.5 leads across most reasoning and knowledge benchmarks. Gemma 4 has a slight edge on LiveCodeBench v6, but loses decisively on competitive programming (Codeforces). For most practical use cases — customer support, content generation, product recommendations — Qwen 3.5 is the stronger model.
## The 3D Tensor Problem
Here's where things fall apart for local fine-tuning.
QLoRA (Quantized Low-Rank Adaptation) works by loading the base model in 4-bit precision and training small adapter layers on top. This is the standard approach for fine-tuning large models on consumer GPUs. With Qwen 3.5 35B-A3B, it works perfectly — we validated this both on an RTX 5090 locally and on an NVIDIA B200 (178 GB VRAM), where Unsloth loads the model at ~17.5 GB in 4-bit with plenty of room for training.
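The adapter math underneath (Q)LoRA is simple enough to sketch in a few lines. Here is a toy, framework-free illustration (tiny dimensions, our own function names; a real setup would use PEFT's `LoraConfig` on a 4-bit base model):

```python
# Toy sketch of the LoRA update: W_eff = W + (alpha / r) * (B @ A).
# Pure Python with tiny matrices -- illustrative only, not a real trainer.

def matmul(X, Y):
    """Naive matrix multiply for small nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_effective_weight(W, A, B, alpha, r):
    """Frozen base weight W plus the scaled low-rank update B @ A."""
    delta = matmul(B, A)              # (out, r) @ (r, in) -> (out, in)
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

# 2x2 frozen base weight, rank-1 adapter: only A and B are trained.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]                      # (r=1, in=2)
B = [[2.0], [0.0]]                    # (out=2, r=1)
W_eff = lora_effective_weight(W, A, B, alpha=1.0, r=1)
print(W_eff)                          # [[2.0, 1.0], [0.0, 1.0]]
```

The base weight `W` stays frozen (and, in QLoRA, 4-bit quantized); only the small `A` and `B` matrices receive gradients, which is why the approach fits on consumer GPUs at all.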
Gemma 4 breaks this workflow because of how it stores expert weights.
Qwen 3.5 stores each expert as separate 2D linear layers — standard nn.Linear modules that bitsandbytes knows how to quantize:
```
# Qwen 3.5: separate 2D tensors per expert — bnb quantizes these fine
model.layers.{i}.mlp.experts.{j}.gate_proj: [1024, 2048] ← nn.Linear ✓
model.layers.{i}.mlp.experts.{j}.up_proj:   [1024, 2048] ← nn.Linear ✓
model.layers.{i}.mlp.experts.{j}.down_proj: [2048, 512]  ← nn.Linear ✓
```
Gemma 4 fuses all 128 experts into single 3D tensors:
```
# Gemma 4: fused 3D tensors — bnb CANNOT quantize these
model.layers.{i}.experts.gate_up_proj: [128, 1408, 2816] ← 3D tensor ✗
model.layers.{i}.experts.down_proj:    [128, 2816, 1408] ← 3D tensor ✗
```
bitsandbytes only quantizes 2D nn.Linear layers. It ignores everything else. The result:
| Component | Size | Quantized? |
|---|---|---|
| 3D expert tensors (30 layers) | 42.5 GB (bf16) | No |
| 2D layers (attention, embeddings) | 4.5 GB → 1.1 GB (4-bit) | Yes |
| Total with "4-bit" loading | ~43.7 GB | — |
The "4-bit" model is actually 43.7 GB because 90% of the weights can't be quantized. That's 12 GB over our RTX 5090's budget — before we even account for training overhead.
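The selection logic is easy to mimic: the 4-bit replacement pass walks the module tree and swaps only 2D linear weights for quantized equivalents, so fused 3D tensors sail through untouched. A dependency-free sketch of that filter (our own names and toy shapes; the real logic lives in bitsandbytes/transformers):

```python
# Sketch of why fused 3D expert tensors escape 4-bit quantization:
# the replacement pass only targets 2D weight matrices.

def quantizable(shape):
    """Mimics the bnb rule: only 2D linear weights get swapped to 4-bit."""
    return len(shape) == 2

# Parameter shapes loosely modeled on the layouts shown above.
params = {
    "layers.0.self_attn.q_proj.weight": (2048, 2048),       # 2D -> quantized
    "layers.0.experts.gate_up_proj":    (128, 1408, 2816),  # 3D -> skipped
    "layers.0.experts.down_proj":       (128, 2816, 1408),  # 3D -> skipped
}

skipped = [name for name, shape in params.items() if not quantizable(shape)]
print(skipped)  # both fused expert tensors stay in bf16
```

Multiply that skipped set across 30 layers and you get the 42.5 GB of unquantizable weight in the table above.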
## What We Tried
Five different loading strategies, all dead ends:

1. **`Gemma4ForCausalLM` from the multimodal checkpoint** — key name mismatch. The checkpoint stores text weights as `model.language_model.*` but the text-only class expects `model.*`. All weights loaded as "unexpected", and the fresh initialization OOM'd.
2. **`Gemma4ForConditionalGeneration` on a single GPU** — OOM at 37% of loading. The full multimodal model is ~48 GB in bf16.
3. **`Gemma4ForConditionalGeneration` with CPU offloading** — bitsandbytes 4-bit mode rejects any CPU offloading. Non-starter.
4. **Extracted text-only weights** — we wrote a script to extract and remap the 657 text-only keys. Loading works, but the 3D tensor problem remains: 43.7 GB estimated, still OOM.
5. **Various monkey-patches** — a `caching_allocator_warmup` bypass and a `Params4bit` compatibility fix. These solved earlier errors but can't fix the fundamental 3D tensor issue.
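The remapping step in strategy 4 is conceptually just a prefix rewrite over the checkpoint's key names. A minimal sketch (pure Python over a plain dict; the real script would operate on safetensors shards, and `extract_text_keys` is our own name):

```python
# Sketch of remapping multimodal checkpoint keys to the text-only layout:
# "model.language_model.*" -> "model.*", dropping vision-tower weights.

TEXT_PREFIX = "model.language_model."

def extract_text_keys(state_dict):
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith(TEXT_PREFIX):
            remapped["model." + key[len(TEXT_PREFIX):]] = value
        # vision tower / projector keys are simply dropped
    return remapped

ckpt = {
    "model.language_model.layers.0.experts.down_proj": "tensor-a",
    "model.vision_tower.patch_embed.weight": "tensor-b",
}
print(extract_text_keys(ckpt))
# {'model.layers.0.experts.down_proj': 'tensor-a'}
```

The rewrite itself is trivial; the point is that even after it succeeds, the remapped expert tensors are still 3D, so quantization still refuses them.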
## The Ecosystem Gap
Gemma 4 dropped on April 1, 2026 — it's brand new. The quantization ecosystem hasn't adapted yet:
| Format | Available? | Fine-tuning? |
|---|---|---|
| GGUF (Q4_K_M, ~17 GB) | Yes | Inference only |
| AWQ 4-bit | Yes | Inference only |
| GPTQ | Not yet | — |
| Unsloth bnb-4bit | Skipped for MoE variant | — |
Unsloth — which has custom MoE quantization that handles Qwen 3.5's fused tensors — deliberately skipped the Gemma 4 26B-A4B for their bnb-4bit releases. They published quantized versions for the dense Gemma 4 models (E2B, E4B, 31B) but not the MoE one. That confirms this isn't just a "we haven't gotten to it" situation — the 3D tensor layout is genuinely harder to handle.
## What to Use Instead
If you're choosing a model for local QLoRA fine-tuning on consumer hardware (24-32 GB VRAM), here's the practical decision:
### For Fine-Tuning (QLoRA)
| Model | 4-bit Size | Fits 32 GB? | Status |
|---|---|---|---|
| Qwen 3.5-35B-A3B (via Unsloth) | ~17.5 GB | Yes | Working |
| Qwen 3.5-27B dense | ~14 GB | Yes | Working |
| Qwen 3.5-9B dense | ~5 GB | Yes, comfortably | Working |
| Gemma 4 31B dense | ~18-20 GB | Tight but feasible | Working |
| Gemma 4 26B-A4B (MoE) | ~43.7 GB | No | Blocked |
### For Inference Only
Gemma 4 26B-A4B works fine for inference via GGUF (Ollama, llama.cpp) at Q4_K_M (~17 GB). If you just need to run the model — not train it — it's a solid option.
## What About Cloud GPUs?
On an NVIDIA B200 (178 GB VRAM), the picture changes completely. Gemma 4's text-only model is ~47 GB in bf16 — you can skip quantization entirely and train with standard LoRA (not QLoRA). No 3D tensor problem, no bitsandbytes dependency. Load in bf16, attach LoRA adapters, train.
We already validated this workflow for Qwen 3.5 35B-A3B on a B200 via Unsloth, where it loads at ~17.5 GB in 4-bit and trains comfortably. Gemma 4 in bf16 at ~47 GB would also fit with ~130 GB to spare for optimizer states, gradients, and large batch sizes.
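The headroom claim is back-of-the-envelope arithmetic worth making explicit (our own rough numbers; it assumes LoRA adapters are a tiny fraction of base weights, so their optimizer states barely register):

```python
# Rough VRAM budget for bf16 LoRA on a B200 -- illustrative arithmetic only.

B200_VRAM_GB = 178
MODEL_BF16_GB = 47   # Gemma 4 text-only weights in bf16

# With LoRA, only the adapters are trained, so optimizer states are small;
# the remaining big consumers are activations and batch size.
headroom_gb = B200_VRAM_GB - MODEL_BF16_GB
print(headroom_gb)   # 131 -> the "~130 GB to spare" quoted above
```

Contrast that with full fine-tuning, where AdamW alone would add roughly two fp32 state tensors per trained parameter and the budget collapses immediately.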
The trade-off is cost. A cloud B200 instance runs ~$3-5/hour. For a quick LoRA fine-tune (a few hundred steps), that's $5-15. For serious training runs, it adds up. The appeal of consumer GPU training is that it's free after the hardware purchase.
## Why MoE Models Are Harder to Fine-Tune
The 3D tensor issue is just the most visible problem. MoE architectures create several fine-tuning headaches that dense models don't have:
**Expert routing instability.** During fine-tuning, the router learns which experts to activate for which tokens. Small datasets can destabilize this routing — a few hundred patent-writing examples might cause the router to over-rely on 2-3 experts while the other 125 go dormant. Dense models don't have this problem because every parameter participates in every forward pass.
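For intuition, the router is just a softmax over per-expert logits followed by a top-k pick; when the fine-tuning data is narrow, the same few experts keep winning. A toy sketch (pure Python, 8 experts with top-2 for readability, whereas the real model routes over 128 experts with 8 active):

```python
import math

def route(logits, k=2):
    """Toy MoE router: softmax over expert logits, keep the top-k indices."""
    shifted = [l - max(logits) for l in logits]       # numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    probs = [e / total for e in exps]
    topk = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return topk, probs

# A skewed router, as can happen after fine-tuning on a narrow domain:
# experts 0 and 1 dominate, the rest rarely fire.
skewed_logits = [5.0, 4.5, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1]
chosen, probs = route(skewed_logits)
print(chosen)  # [0, 1] -- the same two experts win for every such token
```

If most tokens in a domain-specific dataset produce logits shaped like this, the remaining experts receive no gradient signal through the routed path and effectively go dormant.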
**Load balancing.** MoE models are trained with auxiliary losses that encourage balanced expert utilization. Fine-tuning with LoRA typically freezes the router weights, which helps stability but means you can't adapt the routing to your domain. If your use case (say, patent writing) doesn't naturally distribute across many experts, you're leaving capacity on the table.
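The auxiliary loss referenced here is typically a Switch-Transformer-style balance term, N · Σᵢ fᵢPᵢ, where fᵢ is the fraction of tokens routed to expert i and Pᵢ is its mean router probability; it is minimized when usage is uniform. A toy check of that property (our own simplified form):

```python
def load_balance_loss(fractions, probs):
    """Switch-style aux loss: N * sum(f_i * P_i); minimized when balanced."""
    n = len(fractions)
    return n * sum(f * p for f, p in zip(fractions, probs))

# Perfectly balanced routing over 4 experts -> loss = 1.0 (the minimum).
balanced = load_balance_loss([0.25] * 4, [0.25] * 4)
# Collapsed routing (all tokens to one expert) -> loss = 4.0.
collapsed = load_balance_loss([1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0])
print(balanced, collapsed)  # 1.0 4.0
```

During pretraining this term pushes the router toward uniform usage; once the router is frozen for LoRA fine-tuning, nothing counteracts a skew your domain data might prefer.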
**Memory unpredictability.** Even when quantization works, MoE memory usage is harder to predict. All expert weights must be resident in VRAM even though only 8 of 128 fire per token. Gradient checkpointing interacts differently with MoE layers, and batch size effects are less intuitive because the active parameter count varies per token.
**Tooling maturity.** The PyTorch ecosystem — bitsandbytes, PEFT, DeepSpeed, FSDP — was built for dense transformers. MoE support is bolted on and varies wildly by implementation. Qwen's 2D expert layout works because it looks like standard linear layers. Gemma's 3D fused layout is more efficient but breaks assumptions baked into every tool in the chain.
None of this means MoE models can't be fine-tuned. It means the gap between "works in a paper" and "works on your GPU" is wider than with dense models. For most practitioners doing domain-specific fine-tuning — patent writing, customer support, product descriptions — a dense model at the same active parameter count will be easier to train and more predictable to debug.
## The Bigger Picture
This episode highlights a real tension in the MoE design space. Fusing experts into 3D tensors is faster for inference (single batched matrix multiply instead of 128 separate calls) and Google's engineering team made a reasonable optimization choice. But it breaks the most popular fine-tuning workflow on consumer hardware.
Qwen's approach — separate 2D expert layers — is less optimal for raw inference throughput but plays nicely with the entire PyTorch/bitsandbytes/PEFT ecosystem. For the open-source community that wants to fine-tune models locally, that compatibility matters more than a few percent of inference speed.
The fix will come. Either bitsandbytes will add 3D tensor quantization, or Unsloth will build a custom path (they did it for Qwen's fused tensors), or Google will publish a checkpoint variant with separate expert weights. Until then, Qwen 3.5 35B-A3B is the MoE model to fine-tune locally — it has better benchmarks, a working training pipeline, and fits comfortably on an RTX 5090.
To be clear: Gemma 4 is not broken for fine-tuning. The dense models — Gemma 4 E2B, E4B, and 31B — all work with standard QLoRA via bitsandbytes or Unsloth. The 31B dense model at 4-bit (~18-20 GB) fits on an RTX 5090 and trains normally. It's only the MoE 26B-A4B that's blocked, and only on consumer GPUs where quantization is required.
## What Will Fix This
The MoE fine-tuning gap is temporary. Here's what's likely to happen, roughly in order of probability:
1. **Unsloth adds a custom Gemma 4 MoE path** — most likely and soonest. Unsloth already handles Qwen 3.5's fused MoE tensors with custom quantization, and it has both the architecture expertise and the motivation (Gemma 4 is a high-demand model). Timeline: weeks, not months.
2. **bitsandbytes adds 3D tensor quantization** — this would fix it for everyone, not just Unsloth users. The change is non-trivial (the NF4 quantization kernel assumes 2D weight matrices), but it's a known limitation. Timeline: 1-3 months.
3. **Google releases an unfused checkpoint** — Google could publish a variant with separate 2D expert weights instead of fused 3D tensors. This is the easiest fix from a tooling perspective, but it requires Google to act. Timeline: uncertain; depends on community pressure.
Our bet: Unsloth will have it working within weeks. If you need Gemma 4 MoE fine-tuning before then, use a B200 or similar cloud GPU where you can skip quantization entirely and train in bf16.
Tested on: RTX 5090 (32 GB), transformers 5.5.0, bitsandbytes 0.49.2, PEFT, April 2026.