37 Models, One 4090: Local LLM Benchmarks on Consumer Hardware
I run all my LLMs locally. No API keys, no rate limits, no data leaving my network. Over the past few months I've tested every model worth running on my desktop rig, and I finally sat down to benchmark all 37 of them properly. Here's the full rundown - what's fast, what's usable, what's not worth the VRAM, and what I actually keep loaded day-to-day.
The Hardware
| Component | Spec |
|---|---|
| GPU 1 | NVIDIA RTX 4090 24GB - PCIe Gen4 x16 |
| GPU 2 | NVIDIA RTX 4070 12GB - PCIe Gen4 x4 (via K43SG riser) |
| RAM | 64GB DDR5-5600 |
| Driver | NVIDIA 595.71 / CUDA 13.2 |
| OS | Fedora 41, kernel 6.12 |
The second GPU is a 4070 in a K43SG external adapter running at x4 bandwidth. It's not going to win any PCIe throughput contests, but it gives me an extra 12GB of VRAM for offloading layers when running 70B models. More on that later.
Ollama Configuration
Before we get into numbers, the Ollama config matters a lot. These two flags alone were worth more than the K43SG hardware upgrade:
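In environment-variable form (OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE are Ollama's standard setting names; where you put them depends on whether the server runs under systemd, Docker, or a plain shell):

```bash
# Enable flash attention and quantize the KV cache from fp16 to 8-bit.
# Under systemd, set these as Environment= lines via
# `systemctl edit ollama.service` instead of exporting them.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
```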
Flash attention reduces memory bandwidth pressure during inference. q8_0 KV cache quantizes the key-value cache from fp16 to int8, which cuts KV memory by half with negligible quality loss. Together they let me fit larger context windows and run bigger models without spilling to CPU. If you're running Ollama and haven't set these, stop reading and go do it now.
Benchmark Method
Nothing fancy. I wanted real-world inference speed, not synthetic nonsense.
I chose "Count from 1 to 20" because it produces a predictable, short output. The model doesn't need to think hard - we're measuring raw token throughput, not reasoning quality. Three runs, best result. Let's go.
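A minimal sketch of what that harness can look like, assuming Ollama's CLI (the model list here is illustrative; `--verbose` prints an `eval rate` line with generation throughput):

```bash
#!/usr/bin/env bash
# Best-of-three generation throughput per model. We grab the
# "eval rate" line (generation speed) and skip "prompt eval rate"
# (prompt processing), which starts with "prompt".
PROMPT="Count from 1 to 20"
for model in qwen3-coder:30b phi4:14b deepseek-r1:32b; do
  best=0
  for run in 1 2 3; do
    rate=$(ollama run "$model" "$PROMPT" --verbose 2>&1 \
             | awk '/^eval rate/ {print $3}')
    best=$(awk -v a="$best" -v b="$rate" \
             'BEGIN { if (b+0 > a+0) print b; else print a }')
  done
  printf '%-20s %s tok/s\n' "$model" "$best"
done
```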
The Full Results
S Tier: Over 100 tok/s
These models are absurdly fast. Responses feel instant - the bottleneck shifts to how fast you can read.
| Model | tok/s | VRAM | Notes |
|---|---|---|---|
| glm-ocr | 667 | ~4 GB | OCR specialist. Not a general LLM. |
| qwen3-coder:30b | 181 | ~18 GB | MoE arch. 30B params, only ~8B active. |
| qwen3-vl:30b | 154 | ~18 GB | Vision + text. Same MoE trick. |
| nemotron-3-nano | 154 | ~8 GB | NVIDIA's agent model. Screamer. |
| qwen3.5:35b | 109 | ~20 GB | Best general-purpose at this speed. |
glm-ocr at 667 tok/s is not a typo. It's a tiny model laser-focused on OCR - extracting text from images. It spits out tokens faster than I thought was possible on consumer hardware. If you're running Paperless-ngx or any document pipeline, this model is a cheat code.
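As a usage sketch (the path is a placeholder - Ollama's CLI forwards an image to a multimodal model when you include the file path in the prompt):

```bash
# One-shot OCR pass over a scanned page.
ollama run glm-ocr "Extract all text from this scan: ./inbox/scan-001.png"
```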
The qwen3-coder and qwen3-vl numbers look suspicious for 30B models until you realize they're Mixture-of-Experts architectures. 30B total parameters, but only ~8B are active per inference. You get the quality of a 30B model with the speed of an 8B. This is where the industry is heading, and for local inference that's great news.
A Tier: 50-100 tok/s
The sweet spot. Fast enough for interactive use, big enough to be genuinely smart.
| Model | tok/s | VRAM | Notes |
|---|---|---|---|
| phi4:14b | 95 | ~9 GB | Microsoft's best small model. |
| deepcoder:14b | 91 | ~9 GB | Code-specialized. Surprisingly good. |
| lfm2:24b | 78 | ~15 GB | Liquid Foundation. Smooth outputs. |
phi4:14b at 95 tok/s is remarkable. Microsoft keeps shipping small models that punch well above their weight class. If you need a fast general assistant and the 30B MoE models won't fit in your VRAM, this is your pick.
B Tier: 20-50 tok/s
Workable for real tasks. You'll notice the latency but it won't drive you crazy.
| Model | tok/s | VRAM | Notes |
|---|---|---|---|
| command-r:35b | 48 | ~20 GB | Best local RAG model, period. |
| devstral-small-2 | 44 | ~14 GB | Mistral's code agent. Tool use. |
| deepseek-r1:32b | 43 | ~20 GB | Reasoning king. Worth every second. |
deepseek-r1:32b at 43 tok/s is the model I reach for when I need something actually figured out. Yes, it's slower. But when it drops a chain-of-thought response that nails a complex problem, you forget about the latency. Quality per token matters more than tokens per second.
C Tier: Dual-GPU Split (~25 tok/s)
Here's where the second GPU earns its keep. These 70B models won't fit in 24GB even at Q2, so layers get split across both cards.
| Model | tok/s | VRAM Split | Notes |
|---|---|---|---|
| llama3.3:70b Q2 | 25 | ~20 + ~6 GB | Meta's best open model. |
| deepseek-r1-abliterated:70b Q2 | 25 | ~20 + ~6 GB | Uncensored reasoning. Same speed. |
25 tok/s for a 70B model is totally usable. Not snappy, but you can have a real conversation. The x4 PCIe link on the K43SG riser is the bottleneck - with a proper x16 slot, these would likely hit 30-35 tok/s. But for $40 in riser hardware, I'll take it.
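If you want to see where a split model actually landed, Ollama and nvidia-smi will both tell you (the model tag below is shorthand - use whichever Q2 build you pulled):

```bash
# Load the model, then inspect placement.
ollama run llama3.3:70b "hi" >/dev/null
ollama ps    # PROCESSOR column shows "100% GPU" or a CPU/GPU split
nvidia-smi --query-gpu=index,name,memory.used --format=csv
```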
D Tier: CPU Spill (<10 tok/s)
Models too large for 36GB total VRAM. Layers spill to system RAM and performance craters.
| Model | tok/s | Notes |
|---|---|---|
| llama4:scout | 9.7 | MoE but still too big. Painful. |
| qwen3-coder-next | 8.1 | Next-gen coder. Needs more VRAM. |
Once you start hitting system RAM, it's a different world. 9.7 tok/s sounds okay until you're staring at a reasoning model slowly generating a 2000-token chain of thought. I only reach for these when I absolutely need the parameter count and I'm willing to go make coffee while it thinks.
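When a model only barely spills, you don't have to accept the scheduler's split: Ollama's `num_gpu` parameter pins how many layers go to the GPUs. A sketch via a derived model (the layer count of 40 is an assumption you'd tune per model):

```bash
# Build a variant that offloads exactly 40 layers to GPU.
cat > Modelfile <<'EOF'
FROM qwen3-coder-next
PARAMETER num_gpu 40
EOF
ollama create qwen3-coder-next-40l -f Modelfile
```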
Key Discoveries
- Three distinct performance tiers exist and they're defined by hardware, not model architecture. Models that fit on the 4090 alone cluster from the mid-40s up to 181 tok/s (glm-ocr's 667 is an outlier). Dual-GPU split lands at ~25 tok/s. CPU spill drops to 2-10 tok/s. The cliffs are steep and there's almost nothing in between.
- The sweet spot is 14-24GB models on the 4090 alone. As long as you stay inside a single GPU's VRAM, you get full memory bandwidth and the numbers are great. The moment you cross that boundary - even to a second GPU - you pay a 3-4x speed penalty.
- 70B Q2 is smarter than any 32B Q4. This surprised me. Conventional wisdom says aggressive quantization destroys quality, but a 70B model at Q2 consistently outperforms 32B models at the gentler Q4. More parameters > more precision, at least down to Q2. I wouldn't go lower.
- Flash attention + q8_0 KV cache are the biggest multipliers. Bigger impact than adding the second GPU. The K43SG riser gave me ~15% more throughput on dual-GPU models. The Ollama flags gave me 25-40% across the board. Software eats hardware for breakfast.
- Abliteration has zero performance cost. deepseek-r1-abliterated:70b runs at exactly the same speed as the censored version. The abliteration process modifies a handful of weight directions associated with refusal behavior - it doesn't change the model's computational graph. If you want uncensored responses, there's no performance reason not to use the abliterated variant.
- Vision models are slower per-token than text-only equivalents. qwen3-vl:30b at 154 tok/s vs qwen3-coder:30b at 181 tok/s - same parameter count, same MoE architecture, but the vision variant runs measurably slower even on text-only prompts. Something to consider if you don't need vision capabilities.
What I Actually Use
Benchmarks are fun but this is what matters. Here's my daily driver lineup:
| Use Case | Model | tok/s | Why This One |
|---|---|---|---|
| Coding | qwen3-coder:30b | 181 | MoE magic. Fast enough for inline completion. |
| General chat | qwen3.5:35b | 109 | Best all-rounder at usable speed. |
| Reasoning | deepseek-r1:32b | 43 | When I need it to actually think. Worth the wait. |
| Agent / tool-use | nemotron-3-nano | 154 | NVIDIA trained it for function calling. It shows. |
| Vision | qwen3-vl:30b | 154 | Only vision model I need. Fast, accurate. |
| OCR | glm-ocr | 667 | Feeds my Paperless-ngx pipeline. |
| RAG | command-r:35b | 48 | Cohere's model was built for retrieval. It respects citations. |
| Heavy reasoning | llama3.3:70b Q2 | 25 | When 32B isn't enough. Coffee break model. |
Most of the time I'm bouncing between qwen3-coder for code, qwen3.5 for general questions, and deepseek-r1 when I need something properly reasoned through. The others come out for specialized tasks.
I keep OLLAMA_MAX_LOADED_MODELS=2 set, which means two models stay warm in VRAM simultaneously. My default pair is qwen3.5:35b + qwen3-coder:30b - both MoE, both fast, and they coexist in 24GB. When I need reasoning, I swap in deepseek-r1 and accept the cold-load delay. It's a 10-second penalty, not a big deal.
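In the same environment-variable style as the earlier flags (OLLAMA_MAX_LOADED_MODELS and OLLAMA_KEEP_ALIVE are standard Ollama settings; the 24h keep-alive is an extra worth considering, since Ollama evicts idle models after 5 minutes by default):

```bash
# Keep two models resident, and don't evict them when they go idle.
export OLLAMA_MAX_LOADED_MODELS=2
export OLLAMA_KEEP_ALIVE=24h
```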
Advice for Your Own Setup
If you have a 4090: You're in the best position possible for local LLMs right now. Set flash attention and q8_0 KV cache, then go straight to the Qwen3 MoE models. You'll be shocked at the quality-per-dollar.
If you have a 16GB card (e.g. a 4080): Focus on 14B models - phi4, deepcoder, nemotron-3-nano. They'll fit entirely in VRAM and run at 60-90 tok/s.
If you're considering a second GPU: Do the Ollama config optimizations first. Seriously. I got more speed from two environment variables than from a $300 GPU + riser setup. If you still want dual-GPU after that, get the cheapest card with the most VRAM you can find - capacity matters far more than PCIe bandwidth, since layer-split inference only passes small activation tensors between cards.
If you want 70B models: Q2 quantization is the move. Yes, Q2. The quality loss from 70B Q4 to 70B Q2 is smaller than the quality gap between 70B Q2 and 32B Q4. More neurons thinking imprecisely beats fewer neurons thinking precisely. I've done blind comparisons and the 70B Q2 wins on reasoning tasks almost every time.
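Getting a Q2 build is just a matter of pulling the right tag (the exact tag below is an assumption - check the model's library page or `ollama show` for what's actually published):

```bash
# Pull an explicit Q2 quantization instead of the default tag.
ollama pull llama3.3:70b-instruct-q2_K
ollama run llama3.3:70b-instruct-q2_K "Quick sanity check: who are you?"
```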
What's Next
I'm watching three things: Llama 4 (the full model, not Scout), Qwen3.5's rumored 72B MoE variant, and whatever DeepSeek ships next. The MoE trend is the real story here - it's making parameter count partially irrelevant by only activating what's needed. If Qwen ships a 72B MoE with 14B active parameters, it'll run at Tier A speeds with Tier C quality. That's the dream.
For now, the 4090 remains the best local inference card you can buy. The 5090's extra VRAM would be nice, but at twice the price, the performance-per-dollar isn't there yet. I'll upgrade when 70B Q4 fits in a single card. Until then, Q2 and coffee.