AI / ML

Local LLM inference

Qwen3.6-35B-A3B on llama.cpp, multimodal: vision for Home Assistant and Frigate alongside chat. MoE experts mlock-pinned in RAM, ~43 tok/s decode at a 256K context.

llama.cpp · Qwen3.6-35B-A3B · CUDA · Frigate · Home Assistant

Model: Qwen3.6 35B-A3B
Decode: ~43 tok/s
Context: 256K
Vision: Frigate + HA

Vision: Frigate and Home Assistant

Qwen3.6 is multimodal, so the same OpenAI-compatible endpoint answers questions about images, not just text. On the LAN, Frigate and Home Assistant both call it:

Frigate posts a snapshot of a detected object and gets back a short natural-language description, stored alongside the event so a later search like “a person carrying a box” resolves against what the model saw.
Home Assistant calls the same endpoint for scene understanding inside automations, turning a snapshot into a sentence it can act on. A frame of two animals in the yard becomes “two cats, one crouched and ears back,” and the automation decides whether that is worth a notification.

Both run entirely on the cluster. No camera frame leaves the LAN, and there is no per-image cloud-vision bill.

The split

The model is a 35-billion-parameter mixture of experts with about 3 billion active per token, plus a vision encoder. The trick to running it fast on a single 16 GiB consumer GPU is being opinionated about which tensors live on which device:

Attention layers, router, embeddings, the vision encoder and the KV cache live on the RTX 4080, about 7.2 GiB of VRAM at a 256K context window.
All 48 layers’ MoE expert weights stay in container RAM, pinned with mlock so the kernel can never page them out. That takes ~23 GiB and never moves.
Decode lands around 43 tokens/second with q8_0 KV cache and flash-attention on; prefill runs near 280 tokens/second.
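The two throughput figures above translate into a rough per-request latency. The sketch below is a back-of-the-envelope estimate only: it assumes the quoted rates hold for the whole request, and it ignores vision-encoder time, prompt caching and any slowdown at deep context. The function name is made up for illustration.

```python
# Rates quoted in the text above; treated here as constants (an assumption).
PREFILL_TOK_PER_S = 280.0
DECODE_TOK_PER_S = 43.0


def estimate_latency_s(prompt_tokens: int, output_tokens: int,
                       prefill_rate: float = PREFILL_TOK_PER_S,
                       decode_rate: float = DECODE_TOK_PER_S) -> float:
    """Rough wall-clock seconds: time to ingest the prompt plus time to generate."""
    if prompt_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    return prompt_tokens / prefill_rate + output_tokens / decode_rate


if __name__ == "__main__":
    # A 2,800-token prompt with a 430-token reply: 10 s prefill + 10 s decode.
    print(f"{estimate_latency_s(2800, 430):.1f} s")
```

For short automation prompts the decode term dominates; for long pasted documents the prefill term does.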

RTX 4080: 16 GiB VRAM

LLM + vision + KV cache: 7.2 GiB
XTTS voice (CT 124): 2.8 GiB
Whisper STT: 2.1 GiB
free: 3.8 GiB

Host RAM: 64 GiB

MoE experts (mlock): 22.8 GiB
ZFS ARC (capped): 8 GiB
other containers + VMs: 24 GiB
free: 8 GiB

The invocation

shell

# inside CT 108 on the 4080 node (llama.cpp built against CUDA 12)
# systemd unit: /etc/systemd/system/llama-server.service
# LimitMEMLOCK=infinity   <- required for mlock past the 8 MiB default
# LD_PRELOAD=libjemalloc.so.2  <- keeps RSS steady with 23 GiB pinned

llama-server \
  --model /models/Qwen3.6-35B-A3B-UD-Q5_K_XL.gguf \
  --mmproj /models/mmproj-Qwen3.6-35B-A3B-F16.gguf \
  --host 0.0.0.0 --port 11434 \
  --n-gpu-layers 99 \
  --n-cpu-moe 99 \
  --no-mmap --mlock \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --ctx-size 262144 \
  --parallel 1 --kv-unified \
  --threads 12 --threads-batch 16 \
  --jinja \
  --reasoning off \
  --temp 0.7 --top-p 0.8 --top-k 20 --presence-penalty 1.5 \
  --metrics

The --mmproj flag loads the vision projector, so the one server handles text, images and short video. A single request slot owns the whole unified 256K-token KV pool (--parallel 1 --kv-unified). The LXC config gets lxc.prlimit.memlock: unlimited so the in-container LimitMEMLOCK isn’t silently clipped by the host’s 8 MiB default.
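A quick way to confirm the memlock limit actually reached the process is to read it from inside the container. This is a minimal sketch using Python's standard resource module (Unix only). The 23 GiB figure comes from the text above; the helper names are invented for illustration.

```python
import resource

GIB = 1024 ** 3
EXPERT_BYTES = 23 * GIB  # "~23 GiB" of pinned expert weights, per the text


def memlock_limits() -> tuple[int, int]:
    """Return the (soft, hard) RLIMIT_MEMLOCK of the current process, in bytes."""
    return resource.getrlimit(resource.RLIMIT_MEMLOCK)


def can_pin(n_bytes: int, soft_limit: int) -> bool:
    """True if an unprivileged process with this soft limit may mlock n_bytes."""
    # RLIM_INFINITY is a sentinel, so it must be checked before comparing sizes.
    return soft_limit == resource.RLIM_INFINITY or soft_limit >= n_bytes


if __name__ == "__main__":
    soft, hard = memlock_limits()
    shown = "unlimited" if soft == resource.RLIM_INFINITY else f"{soft} bytes"
    print(f"soft memlock limit: {shown}")
    print("enough for the experts:", can_pin(EXPERT_BYTES, soft))
```

The limit is per process, so a login shell may report a different value than the systemd service does; running the check under the same unit settings gives the relevant answer.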

Coexistence on one GPU

The 4080 does more than serve the LLM. Two other containers share it: an XTTS voice container at roughly 2.8 GiB and a Whisper speech-to-text container at about 2.1 GiB, both feeding the local voice stack. With the model and its vision encoder sitting near 7.2 GiB, that packs a full local AI stack (chat, vision, TTS and STT) onto one 16 GiB consumer GPU with about 3.8 GiB to spare, no swapping or eviction games.
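The VRAM arithmetic can be checked in a few lines. This sketch only re-adds the rounded figures quoted above against the card's nominal 16 GiB, so the result can differ from the quoted headroom by a rounding step; the names are illustrative.

```python
VRAM_TOTAL_GIB = 16.0  # nominal capacity of the RTX 4080

# Per-consumer VRAM use, as quoted in the text (rounded values).
VRAM_USE_GIB = {
    "LLM + vision + KV cache": 7.2,
    "XTTS voice": 2.8,
    "Whisper STT": 2.1,
}


def headroom_gib(total: float, used: dict[str, float]) -> float:
    """Capacity left after summing every listed consumer, to one decimal place."""
    return round(total - sum(used.values()), 1)


if __name__ == "__main__":
    print(f"used: {sum(VRAM_USE_GIB.values()):.1f} GiB")
    print(f"free: {headroom_gib(VRAM_TOTAL_GIB, VRAM_USE_GIB):.1f} GiB")
```

The three consumers add up to 12.1 GiB, which leaves a little under 4 GiB and is consistent with the "about 3.8 GiB" quoted above once rounding and the card's usable capacity are taken into account.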

The ZFS ARC was also capped at 8 GiB on the host; its default of half of RAM would have fought with the mlock-pinned expert weights. Photo browsing on the NAS pool is now slightly slower on first access; everything else benefits from the freed RAM.
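The text does not say how the cap was applied. On OpenZFS for Linux the usual knob is the zfs_arc_max module parameter, which takes a value in bytes. The sketch below just does the unit conversion and formats the line that would typically go in a modprobe options file; the helper name, and the choice of a modprobe file over a runtime write to /sys/module/zfs/parameters/zfs_arc_max, are assumptions.

```python
GIB = 1024 ** 3


def arc_max_modprobe_line(cap_gib: int) -> str:
    """Format a modprobe options line capping the ZFS ARC at cap_gib GiB."""
    if cap_gib <= 0:
        raise ValueError("cap must be a positive number of GiB")
    # zfs_arc_max is expressed in bytes, not GiB.
    return f"options zfs zfs_arc_max={cap_gib * GIB}"


if __name__ == "__main__":
    # The 8 GiB cap described above.
    print(arc_max_modprobe_line(8))
```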

API

llama.cpp’s server speaks both its native completion API and an OpenAI-compatible chat endpoint, so existing OpenAI client libraries point at it with nothing more than a base-URL change. With the projector loaded, that same endpoint takes images. This is the shape of the call Frigate and Home Assistant make:

python

from openai import OpenAI

client = OpenAI(
    base_url="http://192.168.1.34:11434/v1",
    api_key="not-used",
)

resp = client.chat.completions.create(
    model="qwen3.6",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is happening in this frame."},
            {"type": "image_url",
             "image_url": {"url": "data:image/jpeg;base64,..."}},
        ],
    }],
)
print(resp.choices[0].message.content)
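The base64 payload is elided in the call above. As a hedged sketch of how a caller might build it, the helpers below turn raw JPEG bytes into a data URL and wrap it in the same message shape. The function names are hypothetical, and how Frigate and Home Assistant actually construct their requests is not shown in this write-up.

```python
import base64


def jpeg_data_url(jpeg_bytes: bytes) -> str:
    """Encode raw JPEG bytes as a data: URL suitable for an image_url part."""
    encoded = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{encoded}"


def vision_messages(jpeg_bytes: bytes, prompt: str) -> list[dict]:
    """Build a single-turn message list: one text part plus one image part."""
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": jpeg_data_url(jpeg_bytes)}},
        ],
    }]
```

With the client from the example above, the result would be passed as messages=vision_messages(snapshot_bytes, "Describe what is happening in this frame."), where snapshot_bytes holds the contents of a JPEG file read in binary mode.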