Local LLM inference
Qwen3-30B-A3B on llama.cpp. Attention on a 4080, MoE experts mlock-pinned in RAM. About 42 tok/s decode.
- Model: Qwen3-30B
- Decode: ~42 tok/s
- Prefill: ~110 tok/s
- Context: 64K
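As a rough guide to what those two rates mean for a single request, the sketch below adds prefill time and decode time. It assumes both rates stay constant regardless of prompt length, which is a simplification; the function name and the example token counts are illustrative only.

```python
# Rates taken from the figures above; real throughput varies with context length.
PREFILL_TOK_S = 110.0
DECODE_TOK_S = 42.0


def estimate_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate wall-clock time for one request: prefill, then decode."""
    prefill = prompt_tokens / PREFILL_TOK_S
    decode = output_tokens / DECODE_TOK_S
    return prefill + decode


# Hypothetical request: a 1000-token prompt answered with 200 tokens.
print(f"{estimate_seconds(1000, 200):.1f} s")
```

With these numbers, long prompts are dominated by prefill rather than decode.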
The split
The model is a 30-billion-parameter mixture-of-experts. The trick to running it fast on a single 16 GB consumer GPU is being opinionated about which tensors live on which device:
- Attention layers, router, embeddings and KV cache live on the RTX 4080, about 4.7 GiB of VRAM, leaving plenty of headroom.
- All 48 layers’ MoE expert weights stay in container RAM, pinned with mlock so the kernel can never page them out. That takes ~17 GiB and never moves.
- Decode lands around 42 tokens/second with q8_0 KV cache and flash-attention on.
VRAM:
- attention + KV cache: 4.7 GiB
- xtts voice (CT 124): 2.4 GiB
- free: 8.9 GiB
RAM:
- MoE experts (mlock): 17.3 GiB
- ZFS ARC (capped): 8 GiB
- other containers + VMs: 22 GiB
- free: 16.7 GiB
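Most of the 4.7 GiB on the GPU is plausibly the KV cache, and a back-of-the-envelope estimate shows why. The sketch below assumes the model uses 4 KV heads with a head dimension of 128 (taken from the published Qwen3-30B-A3B configuration, not from this page) and that q8_0 stores each block of 32 values in 34 bytes. Treat the result as an estimate, not a measurement.

```python
# Assumed model shape: 48 layers is stated above; the KV head count and
# head dimension are assumptions based on the published model config.
N_LAYERS = 48
N_KV_HEADS = 4
HEAD_DIM = 128
CONTEXT = 65536

# q8_0 packs 32 int8 values plus one fp16 scale into a 34-byte block.
Q8_0_BYTES_PER_VALUE = 34 / 32


def kv_cache_gib(context_tokens: int) -> float:
    """Estimate the size of a q8_0 KV cache (keys and values) in GiB."""
    values_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM  # K and V
    total_bytes = values_per_token * Q8_0_BYTES_PER_VALUE * context_tokens
    return total_bytes / 2**30


print(f"{kv_cache_gib(CONTEXT):.2f} GiB")
```

Under these assumptions the cache comes to a little over 3 GiB at full context, which leaves the remainder of the 4.7 GiB for attention weights, embeddings and compute buffers.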
The invocation
# inside CT 108 (rootfs has llama.cpp built against CUDA 12)
# systemd unit: /etc/systemd/system/llama-server.service
# LimitMEMLOCK=infinity <- required for mlock past the 8 MiB default
llama-server \
  --model /models/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf \
  --host 0.0.0.0 --port 11434 \
  --n-cpu-moe 48 \
  --no-mmap --mlock \
  --flash-attn on \
  -ctk q8_0 -ctv q8_0 \
  -c 65536 \
  --metrics
The LXC config gets lxc.prlimit.memlock: unlimited so the in-container LimitMEMLOCK isn’t silently clipped by the host’s 8 MiB default.
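One way to check that the memlock limit actually took effect is to read it from inside the container and compare it with how much memory the server process has locked. This is a minimal sketch using Python's standard `resource` module and the `VmLck` field of `/proc/<pid>/status`. The helper names are made up for illustration. Note that `getrlimit` reports the limit of the process running the script, so it reflects the container's limit only if run from a context that inherits it.

```python
import resource


def memlock_is_unlimited() -> bool:
    """True if this process may lock an unlimited amount of memory."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return soft == resource.RLIM_INFINITY and hard == resource.RLIM_INFINITY


def parse_vmlck_kib(status_text: str) -> int:
    """Extract the locked-memory figure (in kB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmLck:"):
            return int(line.split()[1])
    return 0


def locked_gib(pid: int) -> float:
    """Locked memory of a running process, in GiB (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vmlck_kib(fh.read()) / 2**20
```

If the limit had been clipped, `locked_gib` for the llama-server PID would come out far below the ~17 GiB of expert weights.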
Coexistence on one GPU
Another container on the same host runs xtts for voice synthesis, which sits at roughly 2.4 GiB of VRAM. Sizing the LLM’s GPU residency conservatively keeps total VRAM under 8 GiB, so the two coexist on the same 4080 without any swapping or eviction games.
The ZFS ARC was also capped at 8 GiB on the host; its default of half of RAM would have fought with the mlock-pinned expert weights. Photo browsing on the NAS pool is now slightly slower on first access; everything else benefits from the freed RAM.
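The arithmetic behind both decisions can be sketched in a few lines. The 64 GiB total is an assumption inferred by summing the RAM figures listed earlier; the other numbers come straight from the text. In practice the ARC shrinks under memory pressure, so a negative result signals contention rather than a guaranteed out-of-memory condition.

```python
TOTAL_RAM_GIB = 64.0   # assumption: sum of the RAM figures listed earlier
EXPERTS_GIB = 17.3     # mlock-pinned MoE experts, cannot be reclaimed
OTHER_GIB = 22.0       # other containers + VMs


def free_ram_gib(arc_gib: float) -> float:
    """RAM left over after experts, other workloads and a full ARC."""
    return TOTAL_RAM_GIB - EXPERTS_GIB - OTHER_GIB - arc_gib


capped = free_ram_gib(8.0)                      # ARC capped at 8 GiB
uncapped = free_ram_gib(TOTAL_RAM_GIB / 2)      # ARC at its default, half of RAM

# VRAM side: LLM residency plus xtts must stay under the 8 GiB target.
vram_used_gib = 4.7 + 2.4

print(f"free RAM with capped ARC:  {capped:.1f} GiB")
print(f"free RAM with default ARC: {uncapped:.1f} GiB")
print(f"VRAM in use: {vram_used_gib:.1f} GiB")
```

With the default ARC size the budget goes negative, which is the contention with the pinned expert weights that the cap avoids.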
API
llama.cpp’s server speaks both its native completion API and an OpenAI-compatible chat endpoint, so existing OpenAI client libraries point at it with nothing more than a base-URL change:
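from openai import OpenAI

# api_key is a placeholder; the client library requires some value.
client = OpenAI(
    base_url="http://192.168.1.34:11434/v1",
    api_key="not-used",
)
resp = client.chat.completions.create(
    model="qwen3-30b",
    messages=[{"role": "user", "content": "hello, llama"}],
    stream=True,
)
# Print each streamed delta as it arrives; skip chunks that carry no choices.
for chunk in resp:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)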
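The native completion API can be reached without any client library. The sketch below posts to the server's `/completion` endpoint using only the standard library and reads the decode rate back from the response. It assumes the response carries a `timings` object with a `predicted_per_second` field, which is how recent llama.cpp builds report it; check that against the build in use. The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://192.168.1.34:11434"  # same host and port as the client above


def build_completion_request(prompt: str, n_predict: int = 64) -> urllib.request.Request:
    """Build a POST request for llama.cpp's native /completion endpoint."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def decode_rate(response: dict):
    """Return tokens/second for generation, or None if timings are absent."""
    timings = response.get("timings") or {}
    return timings.get("predicted_per_second")


def complete(prompt: str) -> dict:
    """Send the request and return the parsed JSON response."""
    with urllib.request.urlopen(build_completion_request(prompt), timeout=120) as r:
        return json.loads(r.read())
```

Calling `decode_rate(complete("hello, llama"))` against the running server gives a quick way to compare a live request with the ~42 tok/s figure.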