statevault e5bb4425d4 chore(zot): flip sm-zot retention dryRun → false
Applied today after the earlier dryRun=true period. Real safety
floors stay: delay=24h on candidate eligibility, keepTags.most
RecentlyPushedCount=20 per **/cache repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 18:34:09 +02:00
docs docs(stalwart): multi-domain consolidation runbook + mail.md section 2026-05-13 14:13:19 +02:00
images fix(searxng): keep git in runtime image — searx/version.py needs it 2026-05-09 23:33:41 +02:00
scripts fix(searxng): codify bind-mount source dir creation + bao-exec path lookup 2026-05-09 23:27:03 +02:00
stalwart fix(stalwart): add Stdout tracer so logs reach Loki via journal 2026-05-06 15:42:39 +02:00
templates chore(zot): flip sm-zot retention dryRun → false 2026-05-13 18:34:09 +02:00
.gitignore acme: switch to acme.sh HTTP-01 for LE cert issuance 2026-04-16 20:04:20 +02:00
.mcp.json chore(mcp): switch media MCP wiring to type:http (media-api integration) 2026-05-05 09:24:32 +02:00
.terraform.lock.hcl feat(stalwart): consolidate bingerin.de onto loop-coop.net mailboxes 2026-05-11 07:49:24 +02:00
.wiki-state.json chore(garage): remove Garage S3 / sccache infrastructure 2026-04-20 13:55:44 +02:00
acme.tf breaking(discovery): retire discovery + wg-peer LXCs 2026-05-07 12:22:14 +02:00
backup-images.tf feat(images): qi→SM copy short-cut + backup hook + workstation-only docs 2026-05-07 09:51:41 +02:00
backup-stalwart.tf feat(stalwart): age-encrypted daily backup to S3 2026-04-30 08:28:05 +02:00
bao.yml feat(bao): add bao.yml + scrub ~/projects refs in docs 2026-05-01 18:36:44 +02:00
CLAUDE.md docs: matrix-mail-bridge — full integration stack reference 2026-05-05 16:18:52 +02:00
coturn.tf feat(coturn): auto-reload on cert renewal via path-watcher 2026-04-30 08:47:06 +02:00
data.tf refactor(storage): single 100 GB Cloud Volume for MinIO + LXC caches 2026-04-25 08:24:47 +02:00
devpi.plan feat: incus server cert from bao PKI via cloud-init write_files 2026-04-16 10:09:40 +02:00
devpi.tf feat(images): wire qubes-incus build pipeline into tofu 2026-05-07 09:27:16 +02:00
dns.tf feat(dns): add tapir.loop-coop.net A record → ingress IP 2026-05-12 15:15:57 +02:00
floating-ip.tf fix(network): bind public floating IP locally on eth0 2026-04-27 08:37:20 +02:00
haproxy-cert-bootstrap.tf feat(acme): cert SAN + host_local for discovery.svc.loop-coop.net 2026-05-02 20:27:02 +02:00
haproxy.tf fix(haproxy): bootstrap self-signed PEM for ACME certs 2026-04-27 09:15:51 +02:00
image-builds.tf chore(devpi): bump devpi-server 6.14.0 → 6.20.0 2026-05-07 11:01:41 +02:00
images.tf feat(searxng): host SearXNG metasearch on service-machine LXC 2026-05-09 22:26:19 +02:00
inbox.tf feat(images): wire qubes-incus build pipeline into tofu 2026-05-07 09:27:16 +02:00
incus.tf fix(loop-portal): point BAO_ADDR at k8s bao via mesh-ingress + DNS forward 2026-05-03 11:51:17 +02:00
livekit.tf feat(livekit): host-mode SFU + matrix-rtc routing for Element Call 2026-04-30 08:31:31 +02:00
loop-portal.tf feat(images): wire qubes-incus build pipeline into tofu 2026-05-07 09:27:16 +02:00
main.tf breaking(discovery): retire discovery + wg-peer LXCs 2026-05-07 12:22:14 +02:00
mesh-agent.tf breaking(discovery): retire discovery + wg-peer LXCs 2026-05-07 12:22:14 +02:00
mesh-proxy.tf breaking(discovery): retire discovery + wg-peer LXCs 2026-05-07 12:22:14 +02:00
minio.tf feat(minio): scoped IAM client for matrix-relay sccache 2026-05-13 10:20:54 +02:00
mkdocs.yml feat(ingress): wire stalwart mail + coturn + livekit + matrix-rtc into host 2026-04-30 08:32:52 +02:00
monitoring.plan feat: incus server cert from bao PKI via cloud-init write_files 2026-04-16 10:09:40 +02:00
monitoring.tf feat(images): wire qubes-incus build pipeline into tofu 2026-05-07 09:27:16 +02:00
nftables.tf fix(nft): SNAT stalwart outbound mail to floating IP (FCrDNS, closes #26) 2026-05-03 21:19:46 +02:00
outputs.tf feat(network): own hcloud_network 'loadbalancer' in service-machine 2026-05-03 14:30:48 +02:00
pki.tf fix(pki): add *.dev.loop-coop.net SAN to haproxy_server cert 2026-04-27 19:53:09 +02:00
README.md feat(images): qi→SM copy short-cut + backup hook + workstation-only docs 2026-05-07 09:51:41 +02:00
searxng.tf fix(searxng): codify bind-mount source dir creation + bao-exec path lookup 2026-05-09 23:27:03 +02:00
squid.tf feat(images): wire qubes-incus build pipeline into tofu 2026-05-07 09:27:16 +02:00
ssh.tf feat(outputs): add gateway-compatible outputs + drop stale SAN 2026-04-24 07:57:42 +02:00
stalwart-plan.tf feat(stalwart): consolidate bingerin.de onto loop-coop.net mailboxes 2026-05-11 07:49:24 +02:00
stalwart.tf feat(images): wire qubes-incus build pipeline into tofu 2026-05-07 09:27:16 +02:00
unbound.tf feat(images): wire qubes-incus build pipeline into tofu 2026-05-07 09:27:16 +02:00
variables.tf feat(stalwart): consolidate bingerin.de onto loop-coop.net mailboxes 2026-05-11 07:49:24 +02:00
versions.tf chore(state): migrate backend infra/service-machine → mesh/service-machine 2026-05-01 17:58:19 +02:00

service-machine

Lean Incus node on Hetzner that joins the WireGuard mesh and hosts containerised services. Mesh hub + public ingress for the talos-hcloud-cluster. All workloads are native Incus system containers (no Podman inside) built from distrobuilder images.

Scope — cluster only. The caches exposed on 10.90.0.2:{3128,3141,5000,9200} (squid, devpi, Zot, MinIO) serve cluster-side workloads (Talos nodes, Woodpecker pods, in-cluster containers). Qube-local AppVMs at home must not proxy through service-machine — the home landline is bandwidth-constrained. Qubes use localhost:5000 (builder Zot via qrexec) / localhost:3141 (builder DevPI) or fetch direct. DNS, OTLP, log/metric pushes through the gateway are fine — low-volume, cluster is the destination.
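The scope rule above can be read as a small lookup: cluster-side callers get the mesh-IP caches, qubes get their local builder mirrors or fetch direct. The following is only a sketch of that rule; the function name and the `cluster` / `qube` labels are illustrative and do not exist in the repo.

```shell
# Sketch only: cache_endpoint and its caller labels are hypothetical.
# Addresses and ports are the ones listed in the scope note.
cache_endpoint() {
  # $1 = caller location (cluster|qube), $2 = service (squid|devpi|zot|minio)
  case "$1:$2" in
    cluster:squid) echo "10.90.0.2:3128" ;;
    cluster:devpi) echo "10.90.0.2:3141" ;;
    cluster:zot)   echo "10.90.0.2:5000" ;;
    cluster:minio) echo "10.90.0.2:9200" ;;
    qube:zot)      echo "localhost:5000" ;;  # builder Zot via qrexec
    qube:devpi)    echo "localhost:3141" ;;  # builder DevPI
    qube:*)        echo "direct" ;;          # no local mirror: fetch direct
    *)             echo "unknown caller or service: $1 $2" >&2; return 1 ;;
  esac
}
```

For example, `cache_endpoint qube zot` resolves to the qube-local registry, so a qube never ends up proxying through service-machine.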

Architecture

claude qube (10.90.0.12:8200 — OpenBao)
    │
    └── sys-wg (10.90.0.10) ── WireGuard mesh hub
            │
            └── service-machine (10.90.0.2, lb 10.91.0.3, pub 116.202.6.8)
                    │  host HAProxy        (public floating IP only)
                    │  Incus API :8443     (PKI-signed TLS, mesh-only via nft)
                    │  WireGuard :20226
                    │  btrfs Incus pool    (Hetzner Cloud Volume @ /var/lib/incuspool)
                    │
                    ├── services bridge: 10.92.0.0/24
                    │       ├── zot             10.92.0.12   → :5000
                    │       ├── alloy           10.92.0.13   → :4317/4318/12345
                    │       ├── node-exporter   10.92.0.14   → :9100
                    │       ├── devpi           10.92.0.15   → :3141
                    │       ├── unbound         10.92.0.16   → :53 (UDP/TCP), :853
                    │       ├── squid           10.92.0.17   → :3128
                    │       ├── discovery       10.92.0.18   → :3000 (gRPC), :2122
                    │       ├── wg-peer         10.92.0.19   → :51820/udp
                    │       ├── certbot         10.92.0.20   → acme.sh HTTP-01
                    │       ├── minio           10.92.0.21   → :9200 (S3)
                    │       └── mesh-proxy      10.92.0.22   → haproxy for mesh frontends
                    │
                    └── mesh-agent (bao-agent on host, reads qubes bao :8200)
                            ├── WireGuard mesh peers
                            ├── Host HAProxy (public ingress)
                            ├── Mesh-proxy LXC HAProxy (mesh frontends)
                            ├── Unbound DNS overrides
                            ├── Discovery TLS certs, bao-token
                            ├── Alloy config, Zot config, MinIO env
                            ├── Qubes CA trust anchor (system + containers)
                            └── /etc/backup-images/env (S3 creds for image backup)

Each LXC is a Debian 13 system container from a per-service distrobuilder image. No Podman inside, no security.nesting. Incus proxy devices expose every service on mesh IP 10.90.0.2 and LB IP 10.91.0.3 (for k8s pods on the Hetzner private network); nothing binds eth0 except host HAProxy + SSH + WireGuard + the public Let's Encrypt endpoint on :3000.
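As an illustration of the two-listener pattern, the sketch below prints the `incus config device add` commands that would give one instance a proxy device on each IP. The repo creates these devices through tofu rather than by hand, and the device names (`mesh-<port>`, `lb-<port>`) and the loopback connect address are assumptions made for the example.

```shell
# Sketch only: prints commands instead of running them.
# Device names and connect=tcp:127.0.0.1:<port> are hypothetical.
proxy_device_cmds() {
  # $1 = instance name, $2 = TCP port
  for pair in mesh=10.90.0.2 lb=10.91.0.3; do
    name=${pair%%=*}   # "mesh" or "lb"
    ip=${pair#*=}      # listen address on the host
    echo "incus config device add $1 $name-$2 proxy listen=tcp:$ip:$2 connect=tcp:127.0.0.1:$2"
  done
}

proxy_device_cmds zot 5000
```

With the default proxy settings the listen side binds on the host and the connect side is resolved inside the instance, which is why nothing in the container needs to know either host IP.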

Ports

Host HAProxy — floating IP 116.202.6.8

Port  Purpose
80    Public HTTP → cluster ingress + ACME HTTP-01 carve-out → certbot LXC (only for disco.*, id.*, mail.*, autoconfig.*, autodiscover.*, turn.*, matrix-rtc.*)
443   Public HTTPS TCP passthrough → cluster ingress-nginx + SNI splits to id.* (loop-portal LXC), mail.* / autoconfig.* / autodiscover.* (stalwart LXC), and matrix-rtc.* (host LiveKit + cluster lk-jwt-service path-split; see docs/livekit.md)
3000  Public discovery gRPC with LE cert → discovery LXC :3001

Host services bound directly on floating IP (not LXCs)

Port                                                             Service
25, 465, 993, 995, 4190                                          Stalwart mail listeners; see docs/mail.md
3478 udp+tcp, 5349 udp+tcp, 49152-49202 udp                      coturn TURN/STUN; see docs/coturn.md
7880 tcp (loopback only, fronted by HAProxy), 7881-7882 udp+tcp  LiveKit SFU (Matrix group calls); see docs/livekit.md

Mesh IP 10.90.0.2 (Incus proxy → LXC or mesh-proxy LXC)

Port       Service                                      Routed via
53, 853    Unbound DNS + DoT                            unbound LXC
2122       Discovery metrics                            discovery LXC
3000       Discovery gRPC                               discovery LXC
3128       Squid forward proxy                          squid LXC
3141       devpi                                        devpi LXC
3200       Tempo MCP                                    mesh-proxy LXC
4317       OTLP gRPC (Alloy)                            alloy LXC
4318       OTLP HTTP (Alloy)                            alloy LXC
5000       Zot OCI registry                             zot LXC
6443       k8s API (admin)                              mesh-proxy LXC
8070       Grafana MCP                                  mesh-proxy LXC
8202       openbao k8s mTLS                             mesh-proxy LXC
8404       HAProxy stats + Prometheus /metrics          mesh-proxy LXC
8443       Incus API + Web UI                           host (nft-gated to iif=mesh)
9000       Woodpecker gRPC                              mesh-proxy LXC
9100       node-exporter                                node-exporter LXC
9200       MinIO S3 API                                 minio LXC
12345      Alloy UI + /metrics                          alloy LXC
50000      Talos API (admin)                            mesh-proxy LXC
50100+     per-node Talos proxy                         host HAProxy (dynamic)
20226/udp  WireGuard mesh                               host kernel
51820/udp  WireGuard discovery overlay (172.16.0.0/24)  wg-peer LXC

LB IP 10.91.0.3 (Hetzner LB network, cluster-direct)

Same services, listened on 10.91.0.3:<port> via a second Incus proxy device per LXC. Talos workers route on the same subnet, so pods reach zot/devpi/etc. without going through HAProxy mesh-ingress.
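Because the same ports are published on both the mesh IP and the LB IP, a quick TCP probe from a mesh peer or a cluster node can confirm that the proxy devices are up. This is a minimal sketch, not a tool from the repo: the function names are hypothetical, the port list is a subset of the TCP entries in the table, and the probe relies on bash's `/dev/tcp` plus coreutils `timeout`.

```shell
# Sketch only: probe_tcp / probe_service_ports are illustrative names.
probe_tcp() {
  # $1 = host, $2 = port. Prints "<host>:<port> open" or "<host>:<port> closed".
  # A refused connection or a 2-second timeout both count as closed.
  if timeout 2 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null; then
    echo "$1:$2 open"
  else
    echo "$1:$2 closed"
  fi
}

probe_service_ports() {
  # $1 = IP to probe (defaults to the mesh IP)
  for port in 53 853 3128 3141 4317 4318 5000 9100 9200 12345; do
    probe_tcp "${1:-10.90.0.2}" "$port"
  done
}
```

Run `probe_service_ports 10.90.0.2` from a mesh peer or `probe_service_ports 10.91.0.3` from a node on the LB network. UDP-only listeners such as WireGuard are not covered by this check.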

Provisioning from scratch

Apply runs from a workstation, not from CI. Image building is wired into tofu via image-builds.tf → scripts/build-images-qi.sh, which orchestrates a build LXC on the qubes-incus AppVM and ships images to service-machine via the operator workstation's Incus remotes. The apply MUST run from a workstation that has both:

  • local-incus Incus remote (qrexec → qubes-incus AppVM at port 8443)
  • service-machine Incus remote (mesh IP at port 8443)

Verify with incus remote list before applying. CI runners don't have either remote and will fail at the build provisioner. claude / flex AppVMs are the canonical apply hosts.
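The remote check can be scripted as a pre-flight step. The sketch below reads the output of `incus remote list --format csv` on stdin so it can be exercised without a live client. It assumes the remote name is the first CSV column and that the default remote carries a ` (current)` suffix; the function name is hypothetical.

```shell
# Sketch only: missing_remotes is an illustrative helper, not part of the repo.
# Prints the name of each required remote that is absent from the CSV on stdin.
missing_remotes() {
  names=$(cut -d, -f1 | sed 's/ (current)$//')
  for want in local-incus service-machine; do
    printf '%s\n' "$names" | grep -qxF -- "$want" || echo "$want"
  done
}

# Usage on the workstation:
#   missing=$(incus remote list --format csv | missing_remotes)
#   [ -z "$missing" ] || { echo "missing Incus remotes: $missing" >&2; exit 1; }
```

An empty result means both remotes are configured; it does not prove that either endpoint is reachable.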

Prerequisites: bao-exec admin scope on the workstation, plus the two Incus remotes above.

cd /home/user/work/mesh/service-machine

# One-time cert for the Incus provider (idempotent)
scripts/init-incus-client-cert.sh

# Apply — terraform_data.image_builds runs build-images-qi.sh which:
#   1. ensures qi-builder LXC on qubes-incus
#   2. builds each sm-* image via distrobuilder (skipping any with
#      existing alias on BOTH qi and SM; FORCE_REBUILD=1 bypasses)
#   3. ships missing images to service-machine
#   4. persists every image on qubes-incus as a reproducible source
#   5. invokes backup-images.sh on SM for inline S3 backup
# Then creates volume + LXCs + host config.
bao-exec admin -- bash -c 'export VAULT_ADDR=$BAO_ADDR VAULT_TOKEN=$BAO_TOKEN; \
  bao-exec tofu-hcloud-privileged -- tofu apply'
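The skip rule in step 2 of the comments above (build unless the alias already exists on both qi and SM, with FORCE_REBUILD=1 overriding) can be sketched as follows. How build-images-qi.sh actually queries the remotes is not shown in this README, so `alias_exists` is a hypothetical stand-in built on `incus image alias list`, and the remote names are the two from the prerequisites.

```shell
# Sketch only: alias_exists and needs_build are illustrative, not the
# script's real functions.
alias_exists() {
  # $1 = Incus remote, $2 = image alias
  incus image alias list "$1:" --format csv 2>/dev/null \
    | cut -d, -f1 | grep -qxF -- "$2"
}

needs_build() {
  # $1 = image alias. Returns 0 when a build is needed.
  [ "${FORCE_REBUILD:-0}" = "1" ] && return 0
  alias_exists local-incus "$1" && alias_exists service-machine "$1" && return 1
  return 0
}
```

An image present on only one side still counts as needing work, which matches the "BOTH" wording: the missing side has to be built or shipped.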

Break-glass: if qubes-incus is unreachable, the legacy on-SM build path (scripts/build-images.sh) still works. Run with INCUS_REMOTE=local directly inside service-machine. The tofu wiring will still try qi first and fail; you'll need to comment out terraform_data.image_builds for the targeted apply.

For day-to-day operations, see docs/ops.md. For image building + S3 backup, see docs/images.md. For MinIO-specific operations, see docs/minio.md. For Stalwart mail server (mail.loop-coop.net), see docs/mail.md. For coturn TURN/STUN (turn.loop-coop.net, used by Matrix 1:1 WebRTC + reusable), see docs/coturn.md. For LiveKit SFU (matrix-rtc.loop-coop.net + Element Call at call.loop-coop.net, group calls), see docs/livekit.md.

Tofu files

File Purpose
main.tf Hetzner server, firewall, cloud-init hook, mesh seed
incus.tf Incus provider, Cloud Volume, btrfs storage pool, services bridge, profiles (services, privileged-host)
images.tf Per-service image-alias SHAs (sm-<svc>-<sha12>) computed from `sha1(base.yaml
monitoring.tf zot, alloy, node-exporter LXCs
minio.tf MinIO LXC + dedicated Cloud Volume + declarative buckets
devpi.tf devpi LXC
discovery.tf discovery + wg-peer LXCs
unbound.tf, squid.tf, acme.tf remaining apt-native LXCs
mesh-proxy.tf mesh-proxy LXC (HAProxy for mesh-IP frontends)
loop-portal.tf loop-portal LXC (Django identity self-service IdP)
stalwart.tf stalwart LXC (mail server — see docs/mail.md)
coturn.tf coturn host install (TURN/STUN, NOT an LXC — see docs/coturn.md)
livekit.tf LiveKit SFU host install (Matrix group calls, NOT an LXC — see docs/livekit.md)
haproxy.tf Host HAProxy config push (public ingress only)
mesh-agent.tf bao-agent install + env-seed + render templates
nftables.tf Host nftables (SNAT, SSH rate-limit, Incus API mesh-only)
pki.tf vault_pki_secret_backend_cert issuances from qubes bao
backup-images.tf s5cmd install + backup-images.service + .timer
backup-stalwart.tf Daily Stalwart RocksDB tar+S3 push
dns.tf Public DNS A/MX/TXT/SRV records via hcloud_zone_rrset

Bao secrets layout

secret/mesh/keys/service-machine/{private,public}       authoritative (tofu reads)
secret/mesh/peers/service-machine/config                written by tofu
secret/mesh/peers/sys-wg/service-machine                written by tofu
secret/infra/service-machine/
  ssh-key                                               written by tofu (gateway key)
  incus                                                 client cert + ca (tofu)
  incus-cert, incus-client                              tofu + init-incus-client-cert.sh
  mesh-agent                                            role_id, secret_id (AppRole, seed once)
  minio                                                 random root user + password
  discovery                                             mesh_cluster_id, mesh_cipher_key
  discovery-wg-key                                      WG keypair for the 172.16.0.0/24 overlay
  sys-wg-discovery                                      sys-wg's peer cred on the overlay
  haproxy-certs                                         PKI-issued HAProxy TLS (tofu)
secret/infra/s3                                         S3 creds consumed by backup-images.timer
secret/infra/gateway/haproxy/workers/*                  written by platform tofu
secret/infra/gateway/dnsmasq/hosts/*                    written by platform tofu
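For ad-hoc inspection, the per-host entries in this layout all share one prefix, so a small helper keeps the paths consistent. This is a sketch: the function name is hypothetical, and the usage line assumes a KV mount at `secret/` and a token with read access.

```shell
# Sketch only: sm_secret_path is an illustrative helper.
sm_secret_path() {
  # $1 = key under secret/infra/service-machine/ (e.g. minio, mesh-agent)
  case "$1" in
    ""|*/*|*..*) echo "invalid key: $1" >&2; return 1 ;;
  esac
  echo "secret/infra/service-machine/$1"
}

# Usage (reads the AppRole role_id listed in the layout above):
#   bao kv get -field=role_id "$(sm_secret_path mesh-agent)"
```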

Relationship to talos-hcloud-cluster

platform/ reads service-machine's mesh IP + SSH key from bao. Writes k8s worker node IPs back so HAProxy + Unbound update automatically on mesh-agent's 60-second render cycle. Service-machine persists independently across cluster rebuilds.

Known gotchas

  • Cloud-init is first-boot only on the Hetzner server. Template changes require tofu taint hcloud_server.node && tofu apply to rebuild. Host-side deltas after first boot come through SSH provisioners in haproxy.tf, nftables.tf, mesh-agent.tf, incus.tf, backup-images.tf.
  • Distrobuilder images live in the Incus pool, so replacing the pool (e.g. driver change) requires rebuilding every image: scripts/build-images.sh all.
  • Hetzner volume format only supports ext4/xfs. We create ext4, the SSH provisioner reformats to btrfs before first mount (idempotent).
  • Hetzner hcloud_volume delete_protection via tofu hits 403 with our current token scope. Set it manually: hcloud volume update --delete-protection <id>.
  • Incus aliases must not contain : — the provider parses foo:bar as <remote>:<alias>. We use sm-<svc>-<sha12>.
  • Hyphens (-) in systemd mount unit names are escaped to \x2d, which is easy to mangle in shell. The Incus pool path is /var/lib/incuspool (no hyphen) on purpose.
  • Floating IP cutover: hcloud_floating_ip_assignment in main.tf must not be applied while another server holds the same floating IP.
  • Incus client cert is in .incus/ (git-ignored); if lost re-run scripts/init-incus-client-cert.sh.
  • Loop-CA-signed qube cert (claude-ui trust entry) expires on the cluster PKI's schedule. Re-add after rotation via incus config trust add-certificate --name claude-ui-$(date +%Y%m%d) -.
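The alias gotcha above can be illustrated with a helper that builds an `sm-<svc>-<sha12>` name and refuses a colon. This is a sketch under assumptions: the function name is hypothetical, and hashing the concatenation of the given files is a stand-in, since the exact inputs images.tf feeds into sha1 are not reproduced here.

```shell
# Sketch only: image_alias is illustrative; the hashed inputs are an assumption.
image_alias() {
  # $1 = service name, remaining args = files hashed together
  [ "$#" -ge 2 ] || { echo "usage: image_alias <svc> <file>..." >&2; return 1; }
  svc=$1
  shift
  case "$svc" in
    *:*) echo "alias must not contain ':': $svc" >&2; return 1 ;;
  esac
  sha=$(cat -- "$@" | sha1sum | cut -c1-12)
  echo "sm-$svc-$sha"
}
```

Because the provider parses `foo:bar` as `<remote>:<alias>`, rejecting the colon up front gives a clearer error than a failed image lookup later.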