AI / ML

Local LLM inference

Qwen3-30B-A3B on llama.cpp. Attention on a 4080, MoE experts mlock-pinned in RAM. About 42 tok/s decode.

llama.cpp · Qwen3-30B-A3B · CUDA · LXC bind-mount
Model: Qwen3-30B
Decode: ~42 tok/s
Prefill: ~110 tok/s
Context: 64K

Live sample

A faux stream that approximates what a real chat completion from the server looks like, emitted token by token at the measured decode rate.

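
A minimal sketch of how such a faux stream could be paced, assuming a fixed rate of 42 tok/s and a hand-written token list. The names `token_schedule` and `play` are hypothetical and not taken from this page's actual widget:

```python
import time
from typing import Iterable, Iterator, Tuple


def token_schedule(tokens: Iterable[str], tok_per_s: float) -> Iterator[Tuple[float, str]]:
    """Yield (offset_seconds, token) pairs spaced at a fixed decode rate."""
    interval = 1.0 / tok_per_s
    for i, tok in enumerate(tokens):
        yield i * interval, tok


def play(tokens: Iterable[str], tok_per_s: float = 42.0) -> None:
    """Print tokens at their scheduled offsets, measured from one start time."""
    start = time.monotonic()
    for offset, tok in token_schedule(tokens, tok_per_s):
        # Sleep until the absolute deadline so small delays do not accumulate.
        delay = start + offset - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        print(tok, end="", flush=True)
    print()


if __name__ == "__main__":
    # Illustrative tokens only; a real stream would come from the server.
    play(["Attention", " on", " the", " GPU", ",", " experts", " in", " RAM", "."])
```

Scheduling against a single start time, rather than sleeping a fixed interval after each token, keeps the average rate close to the target even when individual prints are slow.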

The split

The model is a 30-billion-parameter mixture-of-experts that activates only about 3 billion parameters per token (the "A3B" in its name). The trick to running it fast on a single 16 GiB consumer GPU is being opinionated about which tensors live on which device:

  • Attention layers, router, embeddings and KV cache live on the RTX 4080, about 4.7 GiB of VRAM, leaving plenty of headroom.
  • All 48 layers’ MoE expert weights stay in container RAM, pinned with mlock so the kernel can never page them out. That takes ~17 GiB and never moves.
  • Decode lands around 42 tokens/second with q8_0 KV cache and flash-attention on.

RTX 4080: 16 GiB VRAM
  • attention + KV cache: 4.7 GiB
  • xtts voice (CT 124): 2.4 GiB
  • free: 8.9 GiB
Host RAM: 64 GiB
  • MoE experts (mlock): 17.3 GiB
  • ZFS ARC (capped): 8 GiB
  • other containers + VMs: 22 GiB
  • free: 16.7 GiB
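
The budget above is simple arithmetic, and it can be sanity-checked with a few lines. This is a sketch that just re-adds the figures listed; the dictionary and function names are made up for illustration:

```python
VRAM_TOTAL_GIB = 16.0
RAM_TOTAL_GIB = 64.0

# Figures copied from the budget above, in GiB.
vram_used = {
    "attention + KV cache": 4.7,
    "xtts voice (CT 124)": 2.4,
}
ram_used = {
    "MoE experts (mlock)": 17.3,
    "ZFS ARC (capped)": 8.0,
    "other containers + VMs": 22.0,
}


def free_gib(total: float, used: dict) -> float:
    """Return what is left after subtracting every listed consumer."""
    return round(total - sum(used.values()), 1)


print(f"VRAM free: {free_gib(VRAM_TOTAL_GIB, vram_used)} GiB")  # VRAM free: 8.9 GiB
print(f"RAM free:  {free_gib(RAM_TOTAL_GIB, ram_used)} GiB")    # RAM free:  16.7 GiB
```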

The invocation

shell
# inside CT 108 (rootfs has llama.cpp built against CUDA 12)
# systemd unit: /etc/systemd/system/llama-server.service
# LimitMEMLOCK=infinity   <- required for mlock past the 8 MiB default

llama-server \
  --model /models/Qwen3-30B-A3B-Instruct-2507-Q4_K_M.gguf \
  --host 0.0.0.0 --port 11434 \
  --n-cpu-moe 48 \
  --no-mmap --mlock \
  --flash-attn on \
  -ctk q8_0 -ctv q8_0 \
  -c 65536 \
  --metrics

The LXC config gets lxc.prlimit.memlock: unlimited so the in-container LimitMEMLOCK isn’t silently clipped by the host’s 8 MiB default.
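
A quick way to see whether the limit actually arrived is to read `RLIMIT_MEMLOCK` from inside the container. The sketch below uses Python's standard `resource` module; the helper names are hypothetical. It reports the limits of the process that runs it, so to check the service itself it would need to run under the same unit (or the running server's `/proc/<pid>/limits` can be read instead):

```python
import resource


def describe_limit(value: int) -> str:
    """Render an rlimit value as 'unlimited' or a size in MiB."""
    if value == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{value / 2**20:.1f} MiB"


def memlock_limits() -> tuple[str, str]:
    """Return the (soft, hard) locked-memory limits of the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return describe_limit(soft), describe_limit(hard)


if __name__ == "__main__":
    soft, hard = memlock_limits()
    print(f"RLIMIT_MEMLOCK soft={soft} hard={hard}")
```

If this prints a small finite value rather than `unlimited`, pinning ~17 GiB of expert weights with `--mlock` would not be expected to succeed.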

Coexistence on one GPU

Another container on the same host runs xtts for voice synthesis, which sits at roughly 2.4 GiB of VRAM. Sizing the LLM’s GPU residency conservatively keeps total VRAM use at about 7.1 GiB (4.7 + 2.4), well under half the card, so the two coexist on the same 4080 without any swapping or eviction games.
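
How much of the LLM's GPU residency is KV cache can be estimated from the context length and cache type. The sketch below is a rough estimate, not a measurement: the layer count matches the 48 layers mentioned above, but the KV head count (4) and head dimension (128) are assumptions taken from the model's published configuration, and `kv_cache_gib` is a hypothetical helper. The q8_0 format stores 32 values in 34 bytes (32 int8 values plus one fp16 scale):

```python
N_LAYERS = 48      # the 48 layers mentioned above
N_KV_HEADS = 4     # assumption: from the model's published config
HEAD_DIM = 128     # assumption: from the model's published config

Q8_0_BYTES_PER_VALUE = 34 / 32  # 32 int8 values + one fp16 scale per block
F16_BYTES_PER_VALUE = 2.0


def kv_cache_gib(n_ctx: int, bytes_per_value: float = Q8_0_BYTES_PER_VALUE) -> float:
    """Estimate KV cache size in GiB for a full context of n_ctx tokens."""
    # One K vector and one V vector per KV head, per layer, per token.
    values_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM
    return n_ctx * values_per_token * bytes_per_value / 2**30


print(f"q8_0 @ 64K: {kv_cache_gib(65536):.2f} GiB")                       # q8_0 @ 64K: 3.19 GiB
print(f"f16  @ 64K: {kv_cache_gib(65536, F16_BYTES_PER_VALUE):.2f} GiB")  # f16  @ 64K: 6.00 GiB
```

Under these assumptions the quantized cache accounts for roughly 3.2 GiB of the 4.7 GiB, and an unquantized f16 cache at the same context would need nearly twice that.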

The ZFS ARC was also capped at 8 GiB on the host; its default of half of RAM would have fought with the mlock-pinned expert weights. Photo browsing on the NAS pool is now slightly slower on first access; everything else benefits from the freed RAM.
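
This page does not say how the cap was applied. One common approach with OpenZFS on Linux is the `zfs_arc_max` module parameter, which takes a size in bytes and is often set in `/etc/modprobe.d/zfs.conf`. The helper below is a hypothetical sketch that only computes the value and renders that line:

```python
def arc_max_modprobe_line(cap_gib: int) -> str:
    """Render a modprobe options line capping the ZFS ARC at cap_gib GiB."""
    cap_bytes = cap_gib * 2**30  # zfs_arc_max is expressed in bytes
    return f"options zfs zfs_arc_max={cap_bytes}"


print(arc_max_modprobe_line(8))  # options zfs zfs_arc_max=8589934592
```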

API

llama.cpp’s server speaks both its native completion API and an OpenAI-compatible chat endpoint, so existing OpenAI client libraries point at it with nothing more than a base-URL change:

python
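from openai import OpenAI

# Point the stock OpenAI client at the llama.cpp server via its base URL.
client = OpenAI(
    base_url="http://192.168.1.34:11434/v1",
    api_key="not-used",  # the client requires a value; the server above runs without --api-key
)

resp = client.chat.completions.create(
    model="qwen3-30b",
    messages=[{"role": "user", "content": "hello, llama"}],
    stream=True,
)
# Print each content delta as it arrives; skip chunks that carry no choices.
for chunk in resp:
    if chunk.choices:
        print(chunk.choices[0].delta.content or "", end="", flush=True)
print()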
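
Because the server is started with `--metrics`, it also exposes a Prometheus-style text endpoint at `/metrics`, which is one way to read throughput without timing the stream by hand. The sketch below parses that text format using only the standard library. The function names are hypothetical, and the metric name in the sample is an assumption about what the server emits, so check the actual output before relying on it:

```python
import urllib.request


def parse_prometheus(text: str) -> dict[str, float]:
    """Parse simple `name value` samples from Prometheus text exposition format."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        parts = line.split()
        if len(parts) < 2:
            continue
        try:
            samples[parts[0]] = float(parts[1])
        except ValueError:
            continue  # not a numeric sample; ignore it
    return samples


def fetch_metrics(base_url: str) -> dict[str, float]:
    """Fetch and parse /metrics, e.g. fetch_metrics("http://192.168.1.34:11434")."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return parse_prometheus(resp.read().decode("utf-8"))


# Offline demo with a made-up sample; the metric name is an assumption.
SAMPLE = """\
# HELP llamacpp:predicted_tokens_seconds Average generation throughput in tokens/s.
# TYPE llamacpp:predicted_tokens_seconds gauge
llamacpp:predicted_tokens_seconds 42.1
"""
print(parse_prometheus(SAMPLE))
```

The parser deliberately handles only unlabelled samples; label values containing spaces would need a fuller parser.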