AI Workstation Comparison: RTX 5090 vs GB10 (HP ZGX)

ai.rs Jun 16, 2026

The RTX 5090 and the GB10 are the two machines every local-AI builder is weighing right now — and they are almost perfect opposites. The 5090 is a 32 GB bandwidth monster. The GB10 — the Grace Blackwell superchip inside both NVIDIA's DGX Spark and HP's ZGX — is a 128 GB unified-memory box on a comparatively narrow memory bus. One runs small models blisteringly fast; the other runs models the 5090 cannot even load. This is the comparison that actually decides a local AI workstation, across the five things that matter: memory, speed, MoE, long context, and power.

(The GB10 is the same silicon in the HP ZGX and the NVIDIA DGX Spark, so published DGX Spark figures apply to the ZGX. Benchmark numbers below are from published RTX 5090 and DGX Spark/GB10 testing.)

The two machines at a glance

| Spec | RTX 5090 | GB10 (HP ZGX / DGX Spark) |
| --- | --- | --- |
| Memory | 32 GB GDDR7 | 128 GB LPDDR5x (unified CPU+GPU) |
| Memory bandwidth | ~1,792 GB/s (~1.8 TB/s) | ~273 GB/s |
| FP4 compute | ~3,350 TOPS | ~1,000 TOPS (1 PFLOP FP4) |
| CPU | your host PC | 20-core Arm on-package (10 Cortex-X925 + 10 Cortex-A725) |
| Power | 575 W (GPU) | 140 W SoC / 240 W full system |
| Price | ~$2,000 (card only) | ~$3,000–4,000 (whole box) |

Same ballpark money, opposite philosophies: the 5090 maximizes bandwidth per dollar; the GB10 maximizes memory per watt.

Memory capacity: what each can actually hold

This is the first fork in the road. The 5090's 32 GB comfortably runs dense models up to ~32B at 4-bit, plus larger MoE models if they fit — but a 70B needs ~40 GB+ at 4-bit and simply will not load. The GB10's 128 GB of unified memory runs 70B models without brutal quantization, 120B-class MoE at 4-bit, and — paired over its ConnectX networking — even Qwen3-235B across two units. Capacity is the entire reason the GB10 exists. (One caveat: unified memory is shared with the OS and CPU, so usable headroom is a bit under 128 GB.)
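As a rough sanity check on these capacity claims, the weight footprint can be estimated from parameter count and bits per weight. This is a minimal sketch, not a loader-accurate calculation: the 4.5 bits per weight (4-bit values plus quantization scales) and the 4 GB allowance for KV cache and runtime are assumptions chosen for illustration, and the function names are hypothetical.

```python
def weight_footprint_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float,
         capacity_gb: float, overhead_gb: float = 4.0) -> bool:
    """True if weights plus a fixed allowance fit in capacity_gb.

    overhead_gb is an assumed placeholder for KV cache and runtime
    buffers, not a measured value.
    """
    needed = weight_footprint_gb(params_billion, bits_per_weight) + overhead_gb
    return needed <= capacity_gb


# Memory capacities quoted in the comparison table.
MACHINES_GB = {"RTX 5090": 32, "GB10": 128}

# Assumed effective bits per weight for a typical 4-bit quantization.
BITS = 4.5

for name, params_b in [("32B dense", 32), ("70B dense", 70), ("120B MoE", 120)]:
    size = weight_footprint_gb(params_b, BITS)
    verdict = {m: fits(params_b, BITS, cap) for m, cap in MACHINES_GB.items()}
    print(f"{name}: ~{size:.0f} GB of weights -> {verdict}")
```

Under these assumptions a 70B model needs about 39 GB for weights alone, which is why it misses the 5090's 32 GB but leaves plenty of headroom on the GB10. Note that for MoE models the total parameter count, not the active count, determines whether the model fits.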

Inference speed: read two numbers, not one (PP and TG)

Every LLM has two speeds, and they are the heart of this comparison. Prompt Processing (PP, "prefill") is the model reading your prompt — it is compute-bound and runs in the thousands of tok/s. Token Generation (TG, "decode") is the model writing the answer one token at a time — it is memory-bandwidth-bound and runs in the tens to low hundreds. If you look at only one of them, you will misjudge both machines. (New to this split? Start with Prompt Processing vs Token Generation.)

Because TG is bandwidth-bound, the 5090's ~1.8 TB/s versus the GB10's ~273 GB/s (≈6.5×) is decisive — for models that fit in 32 GB:

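| Phase / model | RTX 5090 | GB10 (ZGX) |
| --- | --- | --- |
| PP (prefill) | ~12,800 tok/s | ~2,050 tok/s |
| TG: small model (8–20B, 4-bit) | ~120–190 tok/s | ~50 tok/s |
| TG: 70B dense (4-bit) | can't load | at most ~7 tok/s (bandwidth ceiling: ~273 GB/s ÷ ~40 GB of weights) |
| TG: 120B MoE (MXFP4) | can't load | ~40–55 tok/s |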

Read it this way: when a model fits the 5090, it wins PP by roughly 6× and TG by roughly 2.5–4×. The instant a model does not fit in 32 GB, the 5090's speed is moot: it cannot run the model at all, and the GB10 is the only box that finishes the job.
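Because decode is bandwidth-bound, a simple upper bound on TG is memory bandwidth divided by the bytes read per generated token (for a dense model, roughly the full weight footprint). The sketch below applies that bound to an 8B-class model; the 5 GB footprint is an assumption (about 8B parameters at 4 to 5 bits each), and the bound ignores KV-cache reads and kernel overhead, so real numbers land below it.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, gb_read_per_token: float) -> float:
    """Upper bound on decode speed when every token must stream
    gb_read_per_token of weights from memory."""
    return bandwidth_gb_s / gb_read_per_token


# Memory bandwidth figures quoted in the comparison table.
BANDWIDTH_GB_S = {"RTX 5090": 1792, "GB10": 273}

# Assumed weight footprint of an 8B-class model at roughly 4-5 bits per weight.
SMALL_MODEL_GB = 5.0

for machine, bandwidth in BANDWIDTH_GB_S.items():
    ceiling = decode_ceiling_tok_s(bandwidth, SMALL_MODEL_GB)
    print(f"{machine}: at most ~{ceiling:.0f} tok/s on a {SMALL_MODEL_GB:.0f} GB model")
```

With these inputs the GB10's ceiling is about 55 tok/s, close to its measured ~50 tok/s, which suggests it is running near its bandwidth limit. The 5090's ceiling is about 358 tok/s, well above its measured 120–190 tok/s, so something other than bandwidth eats part of its theoretical 6.5× advantage.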

MoE models: the great equalizer

Mixture-of-Experts models (Kimi K2.6, Qwen3 MoE, gpt-oss) activate only a few billion parameters per token, so they read far less memory per token than a dense model of the same size. That has two consequences here:

  • The 5090 loves small MoE. It posts ~234 tok/s on a 30B-parameter MoE — faster than its own dense 8B — because each token only touches the active experts.
  • The GB10 was built for big MoE. NVIDIA explicitly tuned GB10 for "Blackwell 4-bit MoE inference." Its 128 GB holds giant MoE models the 5090 can't (gpt-oss-120B at ~40–55 tok/s; Qwen3-235B across two units), and because MoE reads only the active experts per token, the GB10's bandwidth penalty hurts less than it would on a dense model of the same total size.

Net: small MoE → the 5090 (raw speed); giant 4-bit MoE that won't fit 32 GB → the GB10 (the only option, at genuinely usable speeds).
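The reason MoE softens the GB10's bandwidth penalty can be sketched with the same bandwidth-bound reasoning: per token, only the active parameters are read. The figures below are illustrative assumptions, not measurements: a hypothetical 120B-total model with about 5B active parameters per token at 4 bits per weight, compared against a hypothetical dense model of the same total size.

```python
def gb_read_per_token(active_params_billion: float, bits_per_weight: float) -> float:
    """Weights streamed from memory for one generated token, in decimal GB."""
    return active_params_billion * bits_per_weight / 8


def decode_ceiling_tok_s(bandwidth_gb_s: float, active_params_billion: float,
                         bits_per_weight: float = 4.0) -> float:
    """Bandwidth-bound upper limit on decode speed (ignores KV-cache reads)."""
    return bandwidth_gb_s / gb_read_per_token(active_params_billion, bits_per_weight)


GB10_BANDWIDTH_GB_S = 273  # from the comparison table

# Dense: every one of the 120B weights is read for each token.
dense_ceiling = decode_ceiling_tok_s(GB10_BANDWIDTH_GB_S, active_params_billion=120)

# MoE: assume ~5B active parameters per token (illustrative value).
moe_ceiling = decode_ceiling_tok_s(GB10_BANDWIDTH_GB_S, active_params_billion=5)

print(f"dense 120B ceiling: ~{dense_ceiling:.1f} tok/s")
print(f"MoE 120B (5B active) ceiling: ~{moe_ceiling:.1f} tok/s")
```

Under these assumptions the MoE ceiling is roughly 24 times higher than the dense one, which is consistent with a 120B MoE reaching tens of tokens per second on a 273 GB/s bus.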

Long-context testing: capacity versus throughput

Long context is where the two phases and the two machines collide. The KV cache grows linearly with every token of context, so a long session needs both room to store the cache and bandwidth to re-read it on every decode step:

  • Capacity (GB10 wins): 128 GB can hold a large model and a long-context KV cache the 5090 has no room for. On the 5090 you hit an out-of-memory wall well before the GB10 does.
  • Ingest throughput (5090 wins): feeding a 100K-token prompt is PP — compute-bound — and the 5090 ingests it ~6× faster. But then it has to keep the model and that cache resident, which on big models it can't.
  • Streaming under long context (mixed): as context grows, TG slows on both, but the GB10's narrow bus feels it more.

Practical verdict: for long-context work on a large model, the GB10 is often the only machine that completes the run; the 5090 is faster only inside its 32 GB ceiling.
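To see how quickly the KV cache eats memory, it helps to compute its size directly: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model shape below is an assumption resembling a 70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128, 16-bit cache values); it is not taken from the benchmarks in this article.

```python
def kv_cache_gb(n_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """KV cache size in decimal GB for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return bytes_per_token * n_tokens / 1e9


# Assumed 70B-class shape with grouped-query attention (illustrative only).
SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for context_tokens in (8_000, 32_000, 100_000):
    size = kv_cache_gb(context_tokens, **SHAPE)
    print(f"{context_tokens:>7,} tokens -> ~{size:.1f} GB of KV cache")
```

With this shape a 100K-token context needs about 33 GB of cache on its own, more than the 5090's entire 32 GB before any weights are loaded. Quantizing the cache to 8 bits (`bytes_per_value=1`) would halve that figure.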

Power consumption: not close

The 5090's GPU alone draws 575 W, and a full 5090 rig (CPU, board, fans) pulls 700–900 W from the wall, needs a beefy PSU, and dumps a lot of heat. The entire GB10 system peaks at 240 W (the GB10 SoC itself is 140 W), runs off a small power brick, and sits near-silent on a desk.

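So for the big models only the GB10 can run, it does the job at a fraction of the power. For models that fit the 5090, the answer depends on the phase. On the figures above, the 5090's ~6× prefill lead outruns its roughly 3–4× higher wall draw, so it still wins tokens-per-watt on prompt processing. On decode, its ~2.5–4× lead is about the same size as the power gap, so the two land close to parity. Either way, the 5090 gets there by burning a lot more watts.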
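A rough way to check tokens-per-watt is to divide the throughput figures quoted earlier by system power, which gives tokens per joule. The inputs below are assumptions layered on the article's numbers: 800 W is the midpoint of the 700–900 W wall-draw range, 155 tok/s is the midpoint of the 5090's small-model TG range, and peak power is used in place of measured draw under load, so treat the result as indicative only.

```python
def tokens_per_joule(tok_per_s: float, watts: float) -> float:
    """Tokens produced per joule of energy (tok/s divided by W)."""
    return tok_per_s / watts


# Throughput from the speed table; power from the power section.
# watts and tg for the 5090 are midpoints of quoted ranges (assumption).
RIGS = {
    "RTX 5090 rig": {"watts": 800, "pp": 12_800, "tg": 155},
    "GB10 system": {"watts": 240, "pp": 2_050, "tg": 50},
}

for name, rig in RIGS.items():
    pp_eff = tokens_per_joule(rig["pp"], rig["watts"])
    tg_eff = tokens_per_joule(rig["tg"], rig["watts"])
    print(f"{name}: prefill {pp_eff:.2f} tok/J, decode {tg_eff:.3f} tok/J")
```

With these inputs prefill comes out at about 16 tok/J for the 5090 rig versus about 8.5 for the GB10, while decode comes out at about 0.19 versus 0.21 tok/J. Measuring actual draw during inference would be needed to settle the decode comparison.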

Which one should you buy?

| If you... | Buy |
| --- | --- |
| Run models ≤ ~32B (especially small MoE) and want maximum speed | RTX 5090 |
| Need 70B to 120B-class models, big 4-bit MoE, or long context on one box (235B-class across two) | GB10 (HP ZGX / DGX Spark) |
| Care about performance-per-watt, silence, and desk footprint | GB10 |
| Already own a gaming PC and want the cheapest fast inference | RTX 5090 |
| Want a turnkey CUDA dev box that mirrors datacenter Blackwell | GB10 |

The honest take: these aren't really competitors — they're complements. The 5090 is a drag racer; the GB10 is a cargo van. The dream local setup, if you can swing it, is a 5090 for fast iteration on small models plus a GB10 for the big-model and long-context jobs the 5090 can't touch (and you can pair two GB10s for 235B-class models).

Bottom line

  • Memory: 32 GB vs 128 GB — the GB10 runs what the 5090 can't.
  • Speed: the 5090 is ~6× faster on PP and ~2.5–4× faster on TG, on models that fit its 32 GB.
  • MoE: 5090 for small MoE; GB10 for the giant 4-bit MoE it was built for.
  • Long context: ingest throughput (5090) vs the capacity to finish at all (GB10) — pick your bottleneck.
  • Power: the GB10 sips ~240 W system vs the 5090's 575 W GPU.
