37 Models, One 4090: Local LLM Benchmarks on Consumer Hardware
I run all my LLMs locally. No API keys, no rate limits, no data leaving my network. Over the past few months I've tested every model worth running on my desktop rig, and I finally sat down to benchmark all 37 of them properly. Here's the full rundown - what's fast, what's usable, what's not worth the VRAM, and what I actually keep loaded day-to-day.
The Hardware
| Component | Spec |
|---|---|
| GPU 1 | NVIDIA RTX 4090 24GB - PCIe Gen4 x16 |
| GPU 2 | NVIDIA RTX 4070 12GB - PCIe Gen4 x4 (via K43SG riser) |
| RAM | 64GB DDR5-5600 |
| Driver | NVIDIA 595.71 / CUDA 13.2 |
| OS | Fedora 41, kernel 6.12 |
The second GPU is a 4070 in a K43SG external adapter running at x4 bandwidth. It's not going to win any PCIe throughput contests, but it gives me an extra 12GB of VRAM for offloading layers when running 70B models. More on that later.
Ollama Configuration
Before we get into numbers, the Ollama config matters a lot. These two flags alone were worth more than the K43SG hardware upgrade:
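In environment-variable form (OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE are Ollama's standard setting names; where you put them depends on whether the server runs under systemd, Docker, or a plain shell):

```bash
# Enable flash attention and quantize the KV cache from fp16 to 8-bit.
# Under systemd, set these as Environment= lines via
# `systemctl edit ollama.service` instead of exporting them.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
```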
Flash attention reduces memory bandwidth pressure during inference. q8_0 KV cache quantizes the key-value cache from fp16 to int8, which cuts KV memory by half with negligible quality loss. Together they let me fit larger context windows and run bigger models without spilling to CPU. If you're running Ollama and haven't set these, stop reading and go do it now.
Benchmark Method
Nothing fancy. I wanted real-world inference speed, not synthetic nonsense.
I chose "Count from 1 to 20" because it produces a predictable, short output. The model doesn't need to think hard - we're measuring raw token throughput, not reasoning quality. Three runs, best result. Let's go.
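A minimal sketch of what that harness can look like, assuming Ollama's CLI (the model list here is illustrative; `--verbose` prints an `eval rate` line with generation throughput):

```bash
#!/usr/bin/env bash
# Best-of-three generation throughput per model. We grab the
# "eval rate" line (generation speed) and skip "prompt eval rate"
# (prompt processing), which starts with "prompt".
PROMPT="Count from 1 to 20"
for model in qwen3-coder:30b phi4:14b deepseek-r1:32b; do
  best=0
  for run in 1 2 3; do
    rate=$(ollama run "$model" "$PROMPT" --verbose 2>&1 \
             | awk '/^eval rate/ {print $3}')
    best=$(awk -v a="$best" -v b="$rate" \
             'BEGIN { if (b+0 > a+0) print b; else print a }')
  done
  printf '%-20s %s tok/s\n' "$model" "$best"
done
```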
The Full Results
S Tier: Over 100 tok/s
These models are absurdly fast. Responses feel instant - the bottleneck shifts to how fast you can read.
| Model | tok/s | VRAM | Notes |
|---|---|---|---|
| glm-ocr | 667 | ~4 GB | OCR specialist. Not a general LLM. |
| qwen3-coder:30b | 181 | ~18 GB | MoE arch. 30B params, only ~8B active. |
| qwen3-vl:30b | 154 | ~18 GB | Vision + text. Same MoE trick. |
| nemotron-3-nano | 154 | ~8 GB | NVIDIA's agent model. Screamer. |
| qwen3.5:35b | 109 | ~20 GB | Best general-purpose at this speed. |
glm-ocr at 667 tok/s is not a typo. It's a tiny model laser-focused on OCR - extracting text from images. It spits out tokens faster than I thought was possible on consumer hardware. If you're running Paperless-ngx or any document pipeline, this model is a cheat code.
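As a usage sketch (the path is a placeholder - Ollama's CLI forwards an image to a multimodal model when you include the file path in the prompt):

```bash
# One-shot OCR pass over a scanned page.
ollama run glm-ocr "Extract all text from this scan: ./inbox/scan-001.png"
```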
The qwen3-coder and qwen3-vl numbers look suspicious for 30B models until you realize they're Mixture-of-Experts architectures. 30B total parameters, but only ~8B are active per inference. You get the quality of a 30B model with the speed of an 8B. This is where the industry is heading, and for local inference that's great news.
A Tier: 50-100 tok/s
The sweet spot. Fast enough for interactive use, big enough to be genuinely smart.
| Model | tok/s | VRAM | Notes |
|---|---|---|---|
| phi4:14b | 95 | ~9 GB | Microsoft's best small model. |
| deepcoder:14b | 91 | ~9 GB | Code-specialized. Surprisingly good. |
| lfm2:24b | 78 | ~15 GB | Liquid Foundation. Smooth outputs. |
phi4:14b at 95 tok/s is remarkable. Microsoft keeps shipping small models that punch well above their weight class. If you need a fast general assistant and the 30B MoE models won't fit in your VRAM, this is your pick.
B Tier: 20-50 tok/s
Workable for real tasks. You'll notice the latency but it won't drive you crazy.
| Model | tok/s | VRAM | Notes |
|---|---|---|---|
| command-r:35b | 48 | ~20 GB | Best local RAG model, period. |
| devstral-small-2 | 44 | ~14 GB | Mistral's code agent. Tool use. |
| deepseek-r1:32b | 43 | ~20 GB | Reasoning king. Worth every second. |
deepseek-r1:32b at 43 tok/s is the model I reach for when I need something actually figured out. Yes, it's slower. But when it drops a chain-of-thought response that nails a complex problem, you forget about the latency. Quality per token matters more than tokens per second.
C Tier: Dual-GPU Split (~25 tok/s)
Here's where the second GPU earns its keep. These 70B models won't fit in 24GB even at Q2, so layers get split across both cards.
| Model | tok/s | VRAM Split | Notes |
|---|---|---|---|
| llama3.3:70b Q2 | 25 | ~20 + ~6 GB | Meta's best open model. |
| deepseek-r1-abliterated:70b Q2 | 25 | ~20 + ~6 GB | Uncensored reasoning. Same speed. |
25 tok/s for a 70B model is totally usable. Not snappy, but you can have a real conversation. The x4 PCIe link on the K43SG riser is the bottleneck - with a proper x16 slot, these would likely hit 30-35 tok/s. But for $40 in riser hardware, I'll take it.
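If you want to see where a split model actually landed, Ollama and nvidia-smi will both tell you (the model tag below is shorthand - use whichever Q2 build you pulled):

```bash
# Load the model, then inspect placement.
ollama run llama3.3:70b "hi" >/dev/null
ollama ps    # PROCESSOR column shows "100% GPU" or a CPU/GPU split
nvidia-smi --query-gpu=index,name,memory.used --format=csv
```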
D Tier: CPU Spill (<10 tok/s)
Models too large for 36GB total VRAM. Layers spill to system RAM and performance craters.
| Model | tok/s | Notes |
|---|---|---|
| llama4:scout | 9.7 | MoE but still too big. Painful. |
| qwen3-coder-next | 8.1 | Next-gen coder. Needs more VRAM. |
Once you start hitting system RAM, it's a different world. 9.7 tok/s sounds okay until you're staring at a reasoning model slowly generating a 2000-token chain of thought. I only reach for these when I absolutely need the parameter count and I'm willing to go make coffee while it thinks.
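When a model only barely spills, you don't have to accept the scheduler's split: Ollama's `num_gpu` parameter pins how many layers go to the GPUs. A sketch via a derived model (the layer count of 40 is an assumption you'd tune per model):

```bash
# Build a variant that offloads exactly 40 layers to GPU.
cat > Modelfile <<'EOF'
FROM qwen3-coder-next
PARAMETER num_gpu 40
EOF
ollama create qwen3-coder-next-40l -f Modelfile
```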
Key Discoveries
- Three distinct performance tiers exist and they're defined by hardware, not model architecture. Models that fit on the 4090 alone cluster from the mid-40s up to 181 tok/s (glm-ocr's 667 is an outlier). Dual-GPU split lands at ~25 tok/s. CPU spill drops to 2-10 tok/s. The cliffs are steep and there's almost nothing in between.
- The sweet spot is 14-24GB models on the 4090 alone. As long as you stay inside a single GPU's VRAM, you get full memory bandwidth and the numbers are great. The moment you cross that boundary - even to a second GPU - you pay a 3-4x speed penalty.
- 70B Q2 is smarter than any 32B Q4. This surprised me. Conventional wisdom says aggressive quantization destroys quality, but a 70B model at Q2 consistently outperforms 32B models at the gentler Q4. More parameters > more precision, at least down to Q2. I wouldn't go lower.
- Flash attention + q8_0 KV cache are the biggest multipliers. Bigger impact than adding the second GPU. The K43SG riser gave me ~15% more throughput on dual-GPU models. The Ollama flags gave me 25-40% across the board. Software eats hardware for breakfast.
- Abliteration has zero performance cost. deepseek-r1-abliterated:70b runs at exactly the same speed as the censored version. The abliteration process modifies a handful of weight directions associated with refusal behavior - it doesn't change the model's computational graph. If you want uncensored responses, there's no performance reason not to use the abliterated variant.
- Vision models are slower per-token than text-only equivalents. qwen3-vl:30b at 154 tok/s vs qwen3-coder:30b at 181 tok/s - same parameter count, same MoE architecture, but the vision variant runs measurably slower even on text-only prompts. Something to consider if you don't need vision capabilities.
What I Actually Use
Benchmarks are fun but this is what matters. Here's my daily driver lineup:
| Use Case | Model | tok/s | Why This One |
|---|---|---|---|
| Coding | qwen3-coder:30b | 181 | MoE magic. Fast enough for inline completion. |
| General chat | qwen3.5:35b | 109 | Best all-rounder at usable speed. |
| Reasoning | deepseek-r1:32b | 43 | When I need it to actually think. Worth the wait. |
| Agent / tool-use | nemotron-3-nano | 154 | NVIDIA trained it for function calling. It shows. |
| Vision | qwen3-vl:30b | 154 | Only vision model I need. Fast, accurate. |
| OCR | glm-ocr | 667 | Feeds my Paperless-ngx pipeline. |
| RAG | command-r:35b | 48 | Cohere's model was built for retrieval. It respects citations. |
| Heavy reasoning | llama3.3:70b Q2 | 25 | When 32B isn't enough. Coffee break model. |
Most of the time I'm bouncing between qwen3-coder for code, qwen3.5 for general questions, and deepseek-r1 when I need something properly reasoned through. The others come out for specialized tasks.
I keep OLLAMA_MAX_LOADED_MODELS=2 set, which means two models stay warm in VRAM simultaneously. My default pair is qwen3.5:35b + qwen3-coder:30b - both MoE, both fast, and they coexist in 24GB. When I need reasoning, I swap in deepseek-r1 and accept the cold-load delay. It's a 10-second penalty, not a big deal.
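In the same environment-variable style as the earlier flags (OLLAMA_MAX_LOADED_MODELS and OLLAMA_KEEP_ALIVE are standard Ollama settings; the 24h keep-alive is an extra worth considering, since Ollama evicts idle models after 5 minutes by default):

```bash
# Keep two models resident, and don't evict them when they go idle.
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_KEEP_ALIVE=24h
```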
Advice for Your Own Setup
If you have a 4090: You're in the best position possible for local LLMs right now. Set flash attention and q8_0 KV cache, then go straight to the Qwen3 MoE models. You'll be shocked at the quality-per-dollar.
If you have a 16GB card (e.g. a 4080): Focus on 14B models - phi4, deepcoder, nemotron-3-nano. They'll fit entirely in VRAM and run at 60-90 tok/s.
If you're considering a second GPU: Do the Ollama config optimizations first. Seriously. I got more speed from two environment variables than from a $300 GPU + riser setup. If you still want dual-GPU after that, get the cheapest card with the most VRAM you can find - capacity matters far more than PCIe bandwidth, since layer-split inference only passes small activation tensors between cards.
If you want 70B models: Q2 quantization is the move. Yes, Q2. The quality loss from 70B Q4 to 70B Q2 is smaller than the quality gap between 70B Q2 and 32B Q4. More neurons thinking imprecisely beats fewer neurons thinking precisely. I've done blind comparisons and the 70B Q2 wins on reasoning tasks almost every time.
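Getting a Q2 build is just a matter of pulling the right tag (the exact tag below is an assumption - check the model's library page or `ollama show` for what's actually published):

```bash
# Pull an explicit Q2 quantization instead of the default tag.
ollama pull llama3.3:70b-instruct-q2_K
ollama run llama3.3:70b-instruct-q2_K "Quick sanity check: who are you?"
```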
What's Next
I'm watching three things: Llama 4 (the full model, not Scout), Qwen3.5's rumored 72B MoE variant, and whatever DeepSeek ships next. The MoE trend is the real story here - it's making parameter count partially irrelevant by only activating what's needed. If Qwen ships a 72B MoE with 14B active parameters, it'll run at Tier A speeds with Tier C quality. That's the dream.
For now, the 4090 remains the best local inference card you can buy. The 5090's extra VRAM would be nice, but at twice the price, the performance-per-dollar isn't there yet. I'll upgrade when 70B Q4 fits in a single card. Until then, Q2 and coffee.