Three Nodes, One Chat UI: Building a Multi-Node Ollama Cluster
I run local LLMs. A lot of them. Over the past year my Ollama setup grew from a single instance on my Unraid server into a three-node cluster spanning a NAS, a workstation, and a laptop. The whole thing feeds into one Open WebUI instance, and the experience is surprisingly smooth for something held together with environment variables and hope.
This post covers the architecture, the trade-offs, and the practical reality of running 30+ models across machines that aren't always on.
The Problem
Running LLMs locally means making choices. Small models are fast but dumb. Smart models need serious hardware. And the hardware you want to run them on isn't always available - my GPU workstation sleeps to save power, and my laptop comes and goes from the network.
I wanted a single chat interface where I pick a model and it just works, regardless of which machine is actually running inference. No SSH-ing into boxes, no remembering ports, no juggling browser tabs.
The Nodes
- Node 1: Unraid server (UGREEN DXP4800+). CPU-only, always on. The Pentium Gold is too slow for chat models, but it runs bge-m3 for AnythingLLM's RAG pipeline. Also hosts Open WebUI, SearXNG, and the rest of the supporting stack.
- Node 2: 14900K workstation. RTX 4090 (24GB) plus RTX 4070 (12GB) on a riser - 36GB of VRAM and 26 models. This is where the real inference happens. Sleeps most of the time to save power.
- Node 3: MacBook Air M3. Roaming node for small models: qwen3.5:9b, gemma3:4b, glm-ocr. Apple silicon handles small models well. Ollama listens on all interfaces via OLLAMA_HOST=0.0.0.0 with a LaunchAgent for persistence.
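That LaunchAgent might look something like the sketch below. The label and the ollama binary path are assumptions (shown here for a Homebrew install); the launchd keys themselves are standard:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- label and binary path are assumptions; adjust for your install -->
  <key>Label</key>
  <string>com.ollama.serve</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/ollama</string>
    <string>serve</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>OLLAMA_HOST</key>
    <string>0.0.0.0</string>
  </dict>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```

Save it under ~/Library/LaunchAgents/ and load it with launchctl, and Ollama survives reboots without a terminal window open.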
The Glue: Open WebUI
The entire cluster is stitched together by a single Open WebUI instance running on Unraid. The magic is one environment variable:
OLLAMA_BASE_URLS=http://172.17.0.1:11434;http://192.168.50.117:11434;http://192.168.50.247:11434
That semicolon-separated list tells Open WebUI to query all three Ollama backends. It merges the model lists from every reachable node into one unified dropdown. If a node is down, its models simply don't appear. No errors, no config changes, no restarts needed. When the 14900K wakes up, its 26 models just show up in the UI.
The first URL uses 172.17.0.1 (Docker gateway) instead of
localhost because Open WebUI itself runs in a Docker container on the
same Unraid box. The other two are LAN IPs.
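For reference, the wiring is a single flag on the container. A sketch assuming a plain docker run deployment (the image tag, published port mapping, and volume name are illustrative):

```shell
docker run -d --name open-webui \
  -p 6031:8080 \
  -e OLLAMA_BASE_URLS="http://172.17.0.1:11434;http://192.168.50.117:11434;http://192.168.50.247:11434" \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

Adding a fourth node is literally one more semicolon-delimited URL in that line.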
Architecture Overview
Clients (browser)
        |
        v
+------------------+
| Open WebUI :6031 |       +------------------+
| (model selector) |-------| SearXNG :6030    |
+---+-----+-----+--+       | (web search)     |
    |     |     |          +------------------+
    v     v     v
[node 1] [node 2] [node 3]
 Unraid   14900K   MBA M3
 :11434   :11434   :11434
CPU-only 4090+4070 Apple M3
always-on  sleeps  roaming
|
v
+--------------------+   +--------------------+
| AnythingLLM :11438 |   | Crawl4AI :11235    |
| RAG frontend       |   | headless crawler   |
| OpenAI primary     |   | gpt-5-mini default |
| qwen2.5:14b sec.   |   | Ollama per-request |
+--------------------+   +--------------------+
          |                        |
          v                        v
+--------------------+   +--------------------+
| Qdrant :6333       |   | OpenAI API         |
| vector DB          |   | (extraction LLM)   |
| daily snapshots    |   +--------------------+
+--------------------+

How It Evolved
This wasn't planned. It grew organically, which I think is the honest story behind most homelab architectures.
Phase 1 was just Ollama on Unraid. I installed it because I could. Ran a few small models, realized the Pentium Gold was painfully slow for anything beyond embeddings, and mostly forgot about it.
Phase 2 was adding the 14900K. I built a workstation for
other reasons (gaming, video work) and threw an RTX 4090 in it. Running
Ollama on that machine was a huge improvement - suddenly models like
qwen3.5:35b ran at 93 tok/s. But I had two Ollama instances
and was constantly SSH-ing between them.
Phase 3 was Open WebUI's OLLAMA_BASE_URLS
feature. One env var, three backends, unified model list. That's when it
clicked. I added the MacBook as a third node just because I could. Set
OLLAMA_HOST=0.0.0.0, created a LaunchAgent to persist the
config, and it was in the pool.
The 14900K Model Tiers
The workstation is where the real inference happens. I benchmarked everything and organized models into tiers based on throughput:
| Tier | tok/s | Models | Notes |
|---|---|---|---|
| S | 133-363 | glm-ocr, granite3.2-vision, reader-lm:1.5b, dolphin3, qwen2.5-vl-abl:7b, gemma3n:e4b | Purpose-built + small fast models |
| A | 72-97 | phi4:14b, qwen3.5-abliterated:35b, qwen3.5:35b, qwen3.5:9b, lfm2:24b | Best general-purpose tier |
| B | 33-48 | command-r:35b, qwen2.5-coder:32b, deepseek-r1:32b, qwq, devstral | Heavy hitters, all fit in 4090 |
| C/D | 2.5-18 | qwen3-coder-next (51.7GB), llama4:scout (67GB), gemma3-abl:27b | CPU spill, batch only |
The critical threshold is 24GB of VRAM. Models that fit entirely in the 4090 run at full speed. Models that split across both GPUs via the Gen4 x4 K43SG riser are slower but usable - around 25 tok/s on 70B models. Anything that spills to system RAM drops to single digits.
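The tiering logic above reduces to a simple size check. A sketch (the cutoffs come from the hardware described in this post, 24GB for the 4090 alone and 36GB for the two-GPU pool; the tier names are my own):

```shell
# Rough placement check for a model of a given on-disk size in GB.
# 24 = fits the 4090 alone; 36 = fits the 4090 + 4070 pool; beyond that, system RAM.
fit_tier() {
  local size_gb=$1
  if   [ "$size_gb" -le 24 ]; then echo "fits-4090"      # full speed
  elif [ "$size_gb" -le 36 ]; then echo "splits-GPUs"    # ~25 tok/s on 70B quants
  else                             echo "spills-to-RAM"  # single-digit tok/s
  fi
}

fit_tier 14   # well under 24GB -> fits-4090
fit_tier 52   # qwen3-coder-next (51.7GB) -> spills-to-RAM
```

The same check is why the C/D tier in the table is batch-only: once a model spills past 36GB, no scheduling cleverness gets it back above a few tokens per second.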
The Data Flow
What happens when you send a message in Open WebUI:
1. You pick a model, say qwen3.5:35b
2. Open WebUI knows which backend hosts it
(discovered during model list merge)
3. Request routes to 192.168.50.117:11434
(the 14900K)
4. If web search is toggled:
query --> SearXNG :6030 --> results injected into context
5. Response streams back through Open WebUI to your browser
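You can approximate the model-list merge from step 2 at the command line. A sketch using Ollama's /api/tags endpoint - unreachable nodes simply contribute nothing, mirroring Open WebUI's silent-drop behavior (jq is assumed to be installed):

```shell
# The same semicolon-separated list Open WebUI consumes
BACKENDS="http://172.17.0.1:11434;http://192.168.50.117:11434;http://192.168.50.247:11434"

# Query /api/tags on every node; nodes that time out are skipped silently
OLD_IFS=$IFS; IFS=';'
for node in $BACKENDS; do
  curl -s --max-time 2 "$node/api/tags" | jq -r '.models[].name' 2>/dev/null
done | sort -u
IFS=$OLD_IFS
```

Run it with the 14900K asleep and you see exactly what the dropdown sees: the small always-on models and nothing else.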
For RAG (via AnythingLLM):
1. Document uploaded to AnythingLLM :11438
2. Text chunked and embedded via bge-m3
on Unraid Ollama (CPU, but embeddings are fast)
3. Vectors stored in Qdrant :6333
4. Query hits OpenAI gpt-5.2 (primary)
or Ollama qwen2.5:14b (secondary)
5. n8n runs daily Qdrant snapshot backups at 2 AM

The Supporting Cast
SearXNG (port 6030)
A privacy-respecting metasearch engine. Open WebUI connects to it for
web search in chat - toggle "Web Search" in the UI and your prompts
get augmented with live search results. It aggregates from multiple
search engines without tracking anything. Runs on Unraid, accessible
at http://172.17.0.1:6030 from Docker containers.
Crawl4AI (port 11235)
A headless Chromium crawler with LLM-powered extraction. The crawler
itself doesn't use an LLM - that only kicks in when you use
LLMExtractionStrategy to structure the crawled content.
Defaults to OpenAI gpt-5-mini for extraction (cheap, fast,
always available), but you can override per-request to use an Ollama
model on any node.
AnythingLLM (port 11438)
The RAG frontend. Uses OpenAI gpt-5.2 as its primary LLM
and falls back to qwen2.5:14b on the 14900K's Ollama.
Embeddings go through bge-m3 on Unraid's always-on Ollama.
I considered switching the Ollama model to qwen3.5:35b (93
tok/s, much smarter, fits in VRAM) but since OpenAI handles the primary
workload, the current config is fine.
Qdrant (port 6333)
Vector database for the RAG pipeline. AnythingLLM stores its embeddings here. An n8n workflow takes daily snapshots at 2 AM and sends a notification via ntfy on success or failure. Straightforward, boring infrastructure. The best kind.
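The snapshot step the n8n workflow performs maps to a single call against Qdrant's snapshot API. A sketch - the collection name here is hypothetical, and the host assumes Qdrant is reachable at the Docker gateway like the other Unraid services:

```shell
QDRANT="http://172.17.0.1:6333"   # assumption: Qdrant via the Unraid Docker gateway
COLLECTION="anythingllm"          # hypothetical collection name
SNAP_URL="$QDRANT/collections/$COLLECTION/snapshots"

# POST creates a snapshot; GET on the same URL lists existing ones
curl -s -X POST "$SNAP_URL" --max-time 5 || true
```

n8n wraps that call with a schedule trigger and an ntfy notification on either outcome, but the heavy lifting is one HTTP request.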
The "Sleeping GPU" Problem
The 14900K is the best node in the cluster by a wide margin. It's also off most of the time.
When it sleeps, its 26 models vanish from Open WebUI. You open the model dropdown and see a handful of small models from Unraid and maybe the MacBook. The heavy hitters are gone.
I considered several solutions:
- Wake-on-LAN automation - have Open WebUI trigger a WOL packet when you select a model from a sleeping node. Technically possible, but complex to implement and you'd still wait 30-60 seconds for the machine to boot.
- Keep it running 24/7 - the RTX 4090 idles at ~30W. The whole system is maybe 80W idle. Not terrible, but it adds up and the fans aren't silent.
- Just wake it manually - this is what I do. If I need a big model, I wake the machine. It takes a minute. The models appear in the dropdown. Done.
Sometimes the simplest solution is the right one. I don't need the 4090 at 3 AM. When I do need it, I know where the power button is (or I send a WOL packet from my phone).
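The WOL packet itself is trivial to script, which is why the phone trick works. A sketch - the MAC address is a placeholder, and the actual send is best-effort since netcat flags vary between variants:

```shell
MAC="aa:bb:cc:dd:ee:ff"                 # placeholder: the 14900K NIC's MAC
HEX=$(printf '%s' "$MAC" | tr -d ':')   # -> aabbccddeeff

# Magic packet = 6 bytes of 0xFF, then the MAC repeated 16 times (102 bytes)
PACKET="ffffffffffff"
i=0
while [ "$i" -lt 16 ]; do
  PACKET="$PACKET$HEX"
  i=$((i + 1))
done

# Broadcast to UDP port 9 if the tools are present; nc flags differ per variant
if command -v xxd >/dev/null && command -v nc >/dev/null; then
  printf '%s' "$PACKET" | xxd -r -p | nc -u -w1 255.255.255.255 9 || true
fi
```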
Trade-offs and Lessons
Cloud APIs are still part of the stack. Local LLMs are great for privacy and tinkering, but for production workloads (Crawl4AI extraction, AnythingLLM primary), I use OpenAI. The quality gap is real, especially for structured extraction. Local models are the secondary path.
Embeddings are the perfect local workload. They're
compute-light, latency-insensitive, and privacy-relevant (your documents
never leave your network). Running bge-m3 on a Pentium Gold
is slow but totally fine for batch embedding.
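A minimal embedding call against the Unraid node looks like this - a sketch assuming Ollama's /api/embed endpoint, which takes a model name and an input string:

```shell
# Request body for Ollama's embedding endpoint; bge-m3 lives on the Unraid node
PAYLOAD='{"model": "bge-m3", "input": "the quick brown fox"}'

# Returns arrays of floats; best-effort here in case the node is unreachable
curl -s http://172.17.0.1:11434/api/embed -d "$PAYLOAD" --max-time 5 || true
```

AnythingLLM fires a batch of these per uploaded document, and even on the Pentium Gold the whole batch finishes before you'd notice.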
VRAM is the only metric that matters for GPU inference. The 4070's 12GB extends the pool to 36GB total via a Gen4 x4 K43SG riser. Models that fit in the 4090's 24GB fly. Models that split across both GPUs are slower but usable - 25 tok/s on 70B Q2 models. Anything that spills to system RAM crawls.
Graceful degradation beats complex orchestration. Open WebUI's behavior of silently dropping unreachable backends is exactly right. No health checks to configure, no failover logic to debug. The model list is always accurate: if you can see it, you can use it.
The best distributed system is one where nodes can disappear and nobody notices except by the absence of options.
What's Next
The setup works well enough that I don't think about it most days. A few things on the list:
- Adding an n8n workflow to check 14900K availability every 15 minutes and log the pattern. Might inform whether keeping it on 24/7 is worth the power cost. (Update: this is running now.)
- Exploring qwen3.5:35b as the AnythingLLM secondary model - it benchmarks at 93 tok/s and is much smarter than qwen2.5:14b.
- Better monitoring. Uptime Kuma tracks the core services but doesn't know about model availability per node. Would be nice to have a dashboard showing which models are currently reachable.
The fundamental architecture - multiple Ollama instances feeding one UI - scales well. Adding a fourth node would be another semicolon in an env var. The hard part isn't the software; it's having machines with enough VRAM to be useful.
This is part of an ongoing series about my Unraid homelab. The server is a UGREEN DXP4800+ running Unraid 7.2.4 with 37 Docker containers. It does too much and I regret nothing.