Infrastructure

Proxmox cluster

A four-node home cluster plus a Raspberry Pi QDevice that keeps quorum alive even with three of the four nodes down.

Proxmox VE 9 · ZFS · Corosync · PBS · Mellanox 10G
Nodes: 4
QDevice: RPi
Quorum: 4 / 7
Uplink: 10 G

Topology

Four nodes meshed via corosync over the LAN, plus a Raspberry Pi off to the side that doubles as the backup server and a corosync QDevice.

[Topology diagram: msi, laptop, wyse, hp, pi]

The hardware

node | role | cpu | ram | notes
--- | --- | --- | --- | ---
pvedesktopmsi | primary · GPU · NAS | i9-13900K · 24c/32t | 64 GB | RTX 4080 · 2× ZFS mirror naspool (5.4 TB) · 10G SFP+ uplink
pvelaptop | haos host · battery UPS | i5-5200U · 2c/4t | 16 GB | internal battery keeps HA alive past the rack UPS
pvewyse5070 | services workhorse | Celeron J4105 · 4c/4t | 32 GB | 9 containers including production mapping backend · CyberPower UPS over USB
pvehp | general-purpose 4th node | i7-6700T · 4c/8t | 32 GB | HP EliteDesk Mini · USB-2.5G + onboard 1G bonded (active-backup) · this site lives here
pi (192.168.1.223) | PBS + QDevice | BCM2711 · 4c | 8 GB | 8 TB IronWolf via USB-SATA · vote weight 3 under LMS algorithm

Why the laptop is one of the nodes

The whole rack is on a UPS, but the laptop’s internal battery buys the cluster a much longer tail in an extended outage. The HAOS virtual machine (Home Assistant’s control plane for the rest of the house) lives on that node so the orchestration brain is the last thing to lose power.

A 5-minute systemd timer pushes the battery’s capacity, voltage, cycle count and AC-online state into InfluxDB; Grafana plots the long-term health curve so the day the battery falls below 80 % of design capacity doesn’t sneak up.

bash
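# /usr/local/bin/push-battery-metrics.sh (excerpt)
# Print one sysfs battery attribute without its trailing newline.
# stderr is redirected first so a missing attribute stays quiet.
read_int() { tr -d '\n' 2>/dev/null < "/sys/class/power_supply/BAT1/$1"; }
charge_now=$(read_int charge_now)
charge_full=$(read_int charge_full)
charge_full_design=$(read_int charge_full_design)
# Health = current full-charge capacity as a percentage of design capacity.
health_pct=$(awk -v a="$charge_full" -v b="$charge_full_design" 'BEGIN{printf "%.2f", 100*a/b}')
curl -sS -u proxmox_writer:****** -XPOST \
  "http://192.168.1.118:8086/write?db=proxmox" \
  --data-binary "battery,host=pvelaptop,model=PA5185U-1BRS health_pct=$health_pct"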
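
The 80 % threshold can be checked from the same two sysfs values. What follows is a minimal sketch and not part of the original script: the function names `health_pct` and `below_threshold` are hypothetical, and the sample numbers are invented for illustration rather than read from a real battery.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names and sample values are illustrative assumptions.

# health_pct FULL DESIGN -> prints FULL as a percentage of DESIGN (two decimals).
# Prints nothing and returns 1 if DESIGN is missing, non-numeric or zero.
health_pct() {
  local full=$1 design=$2
  [ -n "$design" ] && [ "$design" -gt 0 ] 2>/dev/null || return 1
  awk -v a="$full" -v b="$design" 'BEGIN { printf "%.2f", 100 * a / b }'
}

# below_threshold PCT LIMIT -> exit status 0 when PCT is strictly below LIMIT.
below_threshold() {
  awk -v p="$1" -v l="$2" 'BEGIN { exit !(p < l) }'
}

pct=$(health_pct 3120000 4000000)   # sample values, not real readings
if below_threshold "$pct" 80; then
  echo "battery health ${pct}% is below 80% of design capacity"
fi
```

Guarding against a zero or empty design capacity keeps awk from dividing by zero when the sysfs attribute cannot be read.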

The 10G networking story

The desktop host used to ride two onboard I226-V 2.5 Gbps interfaces. A known firmware bug on that NIC produced a hard link flap every few minutes. Each flap broke long-running TCP streams, most painfully PBS chunk uploads, whose connections the kernel held open well past the HTTP/2 idle timeout.

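A Mellanox ConnectX-3 EN now collapses everything onto a single 10 G SFP+ uplink to the aggregation switch. A 30-second iperf3 run against the Pi shows zero retransmits; throughput tops out at ~942 Mb/s because the Pi side is the bottleneck, not the 10 G link: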

shell
$ iperf3 -c 192.168.1.223 -t 30 -P 4
[SUM]   0.00-30.00  sec  3.29 GBytes  942 Mbits/sec  0    sender
[SUM]   0.00-30.04  sec  3.29 GBytes  942 Mbits/sec       receiver
# (capped at ~942 Mb/s by the Pi side, not the link)

The trade-off is a single point of failure, which is the right one to take for a home lab: the two onboard ports remain cabled in, just admin-down, ready to bond back in if the Mellanox ever dies.

Quorum math

Four cluster nodes (1 vote each) plus a Pi-hosted QDevice (3 votes under the LMS algorithm) gives 7 total votes, so quorum is floor(7 / 2) + 1 = 4. One surviving node plus the Pi holds 1 + 3 = 4 votes, which clears that bar, so three of the four nodes can drop and the cluster keeps running.

text
$ sudo pvecm status
Quorum information
  Date:             Fri May 22 21:47:50 2026
  Nodes:            4
  Quorate:          Yes

Votequorum information
  Expected votes:   7
  Highest expected: 7
  Total votes:      7
  Quorum:           4
  Flags:            Quorate Qdevice

Membership information
  Nodeid      Votes    Qdevice Name
  0x00000001       1   A,V,NMW pvedesktopmsi
  0x00000002       1   A,V,NMW pvelaptop
  0x00000003       1   A,V,NMW pvewyse5070
  0x00000004       1   A,V,NMW pvehp
  0x00000000       3            Qdevice
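
The vote arithmetic behind those numbers can be sketched in a few lines of shell. This is an illustration only: `quorum_for` and `is_quorate` are hypothetical helper names, and the script just does the majority calculation without talking to corosync.

```shell
#!/usr/bin/env bash
# Vote arithmetic only; this does not query or configure corosync.

# quorum_for TOTAL -> prints the majority threshold floor(TOTAL / 2) + 1.
quorum_for() {
  echo $(( $1 / 2 + 1 ))
}

# is_quorate LIVE_NODES QDEVICE_VOTES TOTAL -> exit status 0 when the
# surviving partition holds at least quorum_for(TOTAL) votes.
is_quorate() {
  local live=$1 qdevice=$2 total=$3
  [ $(( live + qdevice )) -ge "$(quorum_for "$total")" ]
}

node_votes=4      # four nodes, 1 vote each
qdevice_votes=3   # Pi-hosted QDevice
total=$(( node_votes + qdevice_votes ))

echo "total=$total quorum=$(quorum_for "$total")"
for live in 4 3 2 1; do
  if is_quorate "$live" "$qdevice_votes" "$total"; then
    state="quorate"
  else
    state="not quorate"
  fi
  echo "live nodes=$live plus QDevice: $state"
done
```

Passing 0 as the QDevice votes shows the flip side of the weighting: by the same arithmetic, with the Pi unreachable all four nodes are needed to reach 4 votes.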

Backups

Proxmox Backup Server runs on the same Pi. The schedule is staggered per-node (21:00, 22:00, 23:00, 00:00) because the Pi cannot drain all four nodes’ dirty bitmaps simultaneously without stalling, and a stalled qemu I/O path has cost me an ext4 corruption inside HAOS more than once. The big container (a ~2 TB photo store on a ZFS mirror) gets its own weekly slot at 02:00 on Sunday so its 6–10 hour run never collides with anything else.
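
The one-hour stagger, including the wrap past midnight, can be sketched as follows. This only prints start times and does not configure any backup job. `stagger_slots` is a hypothetical helper, and the node order is an assumption, since the text lists the four times but not which node owns which slot.

```shell
#!/usr/bin/env bash
# Illustration only: prints start times, it does not configure any backup job.

# stagger_slots START_HOUR NODE... -> one "HH:00 node" line per node,
# each an hour after the previous one, wrapping past midnight.
# START_HOUR should be a plain integer such as 21 (no leading zero).
stagger_slots() {
  local hour=$1; shift
  local node
  for node in "$@"; do
    printf '%02d:00 %s\n' $(( hour % 24 )) "$node"
    hour=$(( hour + 1 ))
  done
}

# Node order is an assumption made for this example.
stagger_slots 21 pvedesktopmsi pvelaptop pvewyse5070 pvehp
```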