Three Nodes, One Chat UI: Building a Multi-Node Ollama Cluster
I run local LLMs. A lot of them. Over the past year my Ollama setup grew from a single instance on my Unraid server into a three-node cluster spanning a NAS, a workstation, and a laptop. The whole thing feeds into one Open WebUI instance, and the experience is surprisingly smooth for something held together with environment variables and hope.
This post covers the architecture, the trade-offs, and the practical reality of running 30+ models across machines that aren't always on.
The Problem
Running LLMs locally means making choices. Small models are fast but dumb. Smart models need serious hardware. And the hardware you want to run them on isn't always available - my GPU workstation sleeps to save power, and my laptop comes and goes from the network.
I wanted a single chat interface where I pick a model and it just works, regardless of which machine is actually running inference. No SSH-ing into boxes, no remembering ports, no juggling browser tabs.
The Nodes
- Node 1: Unraid server (UGREEN DXP4800+). CPU-only, always on. The Pentium Gold is too slow for chat models, but it runs bge-m3 for AnythingLLM's RAG pipeline. Also hosts Open WebUI, SearXNG, and the rest of the supporting stack.
- Node 2: 14900K workstation. RTX 4090 (24GB) plus RTX 4070 (12GB) on a riser - 36GB of VRAM and 26 models. This is where the real inference happens. Sleeps most of the time to save power.
- Node 3: MacBook Air M3. Roaming node for small models: qwen3.5:9b, gemma3:4b, glm-ocr. Apple silicon handles small models well. Ollama listens on all interfaces via OLLAMA_HOST=0.0.0.0 with a LaunchAgent for persistence.
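That LaunchAgent might look something like the sketch below. The label and the ollama binary path are assumptions (shown here for a Homebrew install); the launchd keys themselves are standard:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <!-- label and binary path are assumptions; adjust for your install -->
  <key>Label</key>
  <string>com.ollama.serve</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/ollama</string>
    <string>serve</string>
  </array>
  <key>EnvironmentVariables</key>
  <dict>
    <key>OLLAMA_HOST</key>
    <string>0.0.0.0</string>
  </dict>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
```

Save it under ~/Library/LaunchAgents/ and load it with launchctl, and Ollama survives reboots without a terminal window open.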
The Glue: Open WebUI
The entire cluster is stitched together by a single Open WebUI instance running on Unraid. The magic is one environment variable:
OLLAMA_BASE_URLS=http://172.17.0.1:11434;http://192.168.50.117:11434;http://192.168.50.247:11434
That semicolon-separated list tells Open WebUI to query all three Ollama backends. It merges the model lists from every reachable node into one unified dropdown. If a node is down, its models simply don't appear. No errors, no config changes, no restarts needed. When the 14900K wakes up, its 26 models just show up in the UI.
The first URL uses 172.17.0.1 (Docker gateway) instead of
localhost because Open WebUI itself runs in a Docker container on the
same Unraid box. The other two are LAN IPs.
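For reference, the wiring is a single flag on the container. A sketch assuming a plain docker run deployment (the image tag, published port mapping, and volume name are illustrative):

```shell
docker run -d --name open-webui \
  -p 6031:8080 \
  -e OLLAMA_BASE_URLS="http://172.17.0.1:11434;http://192.168.50.117:11434;http://192.168.50.247:11434" \
  -v open-webui:/app/backend/data \
  ghcr.io/open-webui/open-webui:main
```

Adding a fourth node is literally one more semicolon-delimited URL in that line.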
Architecture Overview
Clients (browser)
        |
        v
+------------------+
| Open WebUI :6031 |       +------------------+
| (model selector) |-------| SearXNG :6030    |
+---+-----+-----+--+       | (web search)     |
    |     |     |          +------------------+
    v     v     v
[node 1] [node 2] [node 3]
 Unraid   14900K   MBA M3
 :11434   :11434   :11434
CPU-only 4090+4070 Apple M3
always-on  sleeps  roaming
|
v
+--------------------+   +--------------------+
| AnythingLLM :11438 |   | Crawl4AI :11235    |
| RAG frontend       |   | headless crawler   |
| OpenAI primary     |   | gpt-5-mini default |
| qwen2.5:14b sec.   |   | Ollama per-request |
+--------------------+   +--------------------+
          |                        |
          v                        v
+--------------------+   +--------------------+
| Qdrant :6333       |   | OpenAI API         |
| vector DB          |   | (extraction LLM)   |
| daily snapshots    |   +--------------------+
+--------------------+

How It Evolved
This wasn't planned. It grew organically, which I think is the honest story behind most homelab architectures.
Phase 1 was just Ollama on Unraid. I installed it because I could. Ran a few small models, realized the Pentium Gold was painfully slow for anything beyond embeddings, and mostly forgot about it.
Phase 2 was adding the 14900K. I built a workstation for
other reasons (gaming, video work) and threw an RTX 4090 in it. Running
Ollama on that machine was a huge improvement - suddenly models like
qwen3.5:35b ran at 93 tok/s. But I had two Ollama instances
and was constantly SSH-ing between them.
Phase 3 was Open WebUI's OLLAMA_BASE_URLS
feature. One env var, three backends, unified model list. That's when it
clicked. I added the MacBook as a third node just because I could. Set
OLLAMA_HOST=0.0.0.0, created a LaunchAgent to persist the
config, and it was in the pool.
The 14900K Model Tiers
The workstation is where the real inference happens. I benchmarked everything and organized models into tiers based on throughput:
| Tier | tok/s | Models | Notes |
|---|---|---|---|
| S | 133-363 | glm-ocr, granite3.2-vision, reader-lm:1.5b, dolphin3, qwen2.5-vl-abl:7b, gemma3n:e4b | Purpose-built + small fast models |
| A | 72-97 | phi4:14b, qwen3.5-abliterated:35b, qwen3.5:35b, qwen3.5:9b, lfm2:24b | Best general-purpose tier |
| B | 33-48 | command-r:35b, qwen2.5-coder:32b, deepseek-r1:32b, qwq, devstral | Heavy hitters, all fit in 4090 |
| C/D | 2.5-18 | qwen3-coder-next (51.7GB), llama4:scout (67GB), gemma3-abl:27b | CPU spill, batch only |
The critical threshold is 24GB of VRAM. Models that fit entirely in the 4090 run at full speed. Models that split across both GPUs via the Gen4 x4 K43SG riser are slower but usable - around 25 tok/s on 70B models. Anything that spills to system RAM drops to single digits.
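The tiering logic above reduces to a simple size check. A sketch (the cutoffs come from the hardware described in this post, 24GB for the 4090 alone and 36GB for the two-GPU pool; the tier names are my own):

```shell
# Rough placement check for a model of a given on-disk size in GB.
# 24 = fits the 4090 alone; 36 = fits the 4090 + 4070 pool; beyond that, system RAM.
fit_tier() {
  local size_gb=$1
  if   [ "$size_gb" -le 24 ]; then echo "fits-4090"      # full speed
  elif [ "$size_gb" -le 36 ]; then echo "splits-GPUs"    # ~25 tok/s on 70B quants
  else                             echo "spills-to-RAM"  # single-digit tok/s
  fi
}

fit_tier 14   # well under 24GB -> fits-4090
fit_tier 52   # qwen3-coder-next (51.7GB) -> spills-to-RAM
```

The same check is why the C/D tier in the table is batch-only: once a model spills past 36GB, no scheduling cleverness gets it back above a few tokens per second.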
The Data Flow
What happens when you send a message in Open WebUI:
1. You pick a model, say qwen3.5:35b
2. Open WebUI knows which backend hosts it
(discovered during model list merge)
3. Request routes to 192.168.50.117:11434
(the 14900K)
4. If web search is toggled:
query --> SearXNG :6030 --> results injected into context
5. Response streams back through Open WebUI to your browser
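You can approximate the model-list merge from step 2 at the command line. A sketch using Ollama's /api/tags endpoint - unreachable nodes simply contribute nothing, mirroring Open WebUI's silent-drop behavior (jq is assumed to be installed):

```shell
# The same semicolon-separated list Open WebUI consumes
BACKENDS="http://172.17.0.1:11434;http://192.168.50.117:11434;http://192.168.50.247:11434"

# Query /api/tags on every node; nodes that time out are skipped silently
OLD_IFS=$IFS; IFS=';'
for node in $BACKENDS; do
  curl -s --max-time 2 "$node/api/tags" | jq -r '.models[].name' 2>/dev/null
done | sort -u
IFS=$OLD_IFS
```

Run it with the 14900K asleep and you see exactly what the dropdown sees: the small always-on models and nothing else.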
For RAG (via AnythingLLM):
1. Document uploaded to AnythingLLM :11438
2. Text chunked and embedded via bge-m3
on Unraid Ollama (CPU, but embeddings are fast)
3. Vectors stored in Qdrant :6333
4. Query hits OpenAI gpt-5.2 (primary)
or Ollama qwen2.5:14b (secondary)
5. n8n runs daily Qdrant snapshot backups at 2 AM

The Supporting Cast
SearXNG (port 6030)
A privacy-respecting metasearch engine. Open WebUI connects to it for
web search in chat - toggle "Web Search" in the UI and your prompts
get augmented with live search results. It aggregates from multiple
search engines without tracking anything. Runs on Unraid, accessible
at http://172.17.0.1:6030 from Docker containers.
Crawl4AI (port 11235)
A headless Chromium crawler with LLM-powered extraction. The crawler
itself doesn't use an LLM - that only kicks in when you use
LLMExtractionStrategy to structure the crawled content.
Defaults to OpenAI gpt-5-mini for extraction (cheap, fast,
always available), but you can override per-request to use an Ollama
model on any node.
AnythingLLM (port 11438)
The RAG frontend. Uses OpenAI gpt-5.2 as its primary LLM
and falls back to qwen2.5:14b on the 14900K's Ollama.
Embeddings go through bge-m3 on Unraid's always-on Ollama.
I considered switching the Ollama model to qwen3.5:35b (93
tok/s, much smarter, fits in VRAM) but since OpenAI handles the primary
workload, the current config is fine.
Qdrant (port 6333)
Vector database for the RAG pipeline. AnythingLLM stores its embeddings here. An n8n workflow takes daily snapshots at 2 AM and sends a notification via ntfy on success or failure. Straightforward, boring infrastructure. The best kind.
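The snapshot step the n8n workflow performs maps to a single call against Qdrant's snapshot API. A sketch - the collection name here is hypothetical, and the host assumes Qdrant is reachable at the Docker gateway like the other Unraid services:

```shell
QDRANT="http://172.17.0.1:6333"   # assumption: Qdrant via the Unraid Docker gateway
COLLECTION="anythingllm"          # hypothetical collection name
SNAP_URL="$QDRANT/collections/$COLLECTION/snapshots"

# POST creates a snapshot; GET on the same URL lists existing ones
curl -s -X POST "$SNAP_URL" --max-time 5 || true
```

n8n wraps that call with a schedule trigger and an ntfy notification on either outcome, but the heavy lifting is one HTTP request.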
The "Sleeping GPU" Problem
The 14900K is the best node in the cluster by a wide margin. It's also off most of the time.
When it sleeps, its 26 models vanish from Open WebUI. You open the model dropdown and see a handful of small models from Unraid and maybe the MacBook. The heavy hitters are gone.
I considered several solutions:
- Wake-on-LAN automation - have Open WebUI trigger a WOL packet when you select a model from a sleeping node. Technically possible, but complex to implement and you'd still wait 30-60 seconds for the machine to boot.
- Keep it running 24/7 - the RTX 4090 idles at ~30W. The whole system is maybe 80W idle. Not terrible, but it adds up and the fans aren't silent.
- Just wake it manually - this is what I do. If I need a big model, I wake the machine. It takes a minute. The models appear in the dropdown. Done.
Sometimes the simplest solution is the right one. I don't need the 4090 at 3 AM. When I do need it, I know where the power button is (or I send a WOL packet from my phone).
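The WOL packet itself is trivial to script, which is why the phone trick works. A sketch - the MAC address is a placeholder, and the actual send is best-effort since netcat flags vary between variants:

```shell
MAC="aa:bb:cc:dd:ee:ff"                 # placeholder: the 14900K NIC's MAC
HEX=$(printf '%s' "$MAC" | tr -d ':')   # -> aabbccddeeff

# Magic packet = 6 bytes of 0xFF, then the MAC repeated 16 times (102 bytes)
PACKET="ffffffffffff"
i=0
while [ "$i" -lt 16 ]; do
  PACKET="$PACKET$HEX"
  i=$((i + 1))
done

# Broadcast to UDP port 9 if the tools are present; nc flags differ per variant
if command -v xxd >/dev/null && command -v nc >/dev/null; then
  printf '%s' "$PACKET" | xxd -r -p | nc -u -w1 255.255.255.255 9 || true
fi
```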
Trade-offs and Lessons
Cloud APIs are still part of the stack. Local LLMs are great for privacy and tinkering, but for production workloads (Crawl4AI extraction, AnythingLLM primary), I use OpenAI. The quality gap is real, especially for structured extraction. Local models are the secondary path.
Embeddings are the perfect local workload. They're
compute-light, latency-insensitive, and privacy-relevant (your documents
never leave your network). Running bge-m3 on a Pentium Gold
is slow but totally fine for batch embedding.
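A minimal embedding call against the Unraid node looks like this - a sketch assuming Ollama's /api/embed endpoint, which takes a model name and an input string:

```shell
# Request body for Ollama's embedding endpoint; bge-m3 lives on the Unraid node
PAYLOAD='{"model": "bge-m3", "input": "the quick brown fox"}'

# Returns arrays of floats; best-effort here in case the node is unreachable
curl -s http://172.17.0.1:11434/api/embed -d "$PAYLOAD" --max-time 5 || true
```

AnythingLLM fires a batch of these per uploaded document, and even on the Pentium Gold the whole batch finishes before you'd notice.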
VRAM is the only metric that matters for GPU inference. The 4070's 12GB extends the pool to 36GB total via a Gen4 x4 K43SG riser. Models that fit in the 4090's 24GB fly. Models that split across both GPUs are slower but usable - 25 tok/s on 70B Q2 models. Anything that spills to system RAM crawls.
Graceful degradation beats complex orchestration. Open WebUI's behavior of silently dropping unreachable backends is exactly right. No health checks to configure, no failover logic to debug. The model list is always accurate: if you can see it, you can use it.
The best distributed system is one where nodes can disappear and nobody notices except by the absence of options.
What's Next
The setup works well enough that I don't think about it most days. A few things on the list:
- Adding an n8n workflow to check 14900K availability every 15 minutes and log the pattern. Might inform whether keeping it on 24/7 is worth the power cost. (Update: this is running now.)
- Exploring qwen3.5:35b as the AnythingLLM secondary model - it benchmarks at 93 tok/s and is much smarter than qwen2.5:14b.
- Better monitoring. Uptime Kuma tracks the core services but doesn't know about model availability per node. Would be nice to have a dashboard showing which models are currently reachable.
The fundamental architecture - multiple Ollama instances feeding one UI - scales well. Adding a fourth node would be another semicolon in an env var. The hard part isn't the software; it's having machines with enough VRAM to be useful.
This is part of an ongoing series about my Unraid homelab. The server is a UGREEN DXP4800+ running Unraid 7.2.4 with 37 Docker containers. It does too much and I regret nothing.