statevault 99f62f5b14 feat(backup): include opencloud data + config PVCs in daily Longhorn→S3
Both PVCs hold non-S3-replicable state:
- opencloud-opencloud-data: libregraph-idm boltdb (users, groups,
  identities) + storage-system spaces registry. Wiping it orphans the
  whole S3 bucket (S3 has blobs only, no filenames/folder tree).
- opencloud-opencloud-config: chart-rendered opencloud.yaml with the
  service-account credentials/secrets. Re-running the chart would
  regenerate these and break in-flight tokens.

Background: 2026-05-12 commit 07a929a in workloads/opencloud dropped
the data PVC entirely on the assumption that S3 is the full source
of truth. That re-mounted /var/lib/opencloud as emptyDir, every pod
restart wiped the IDM, and AIenv space content (43 MB in S3) became
unrecoverable orphan blobs. Re-enabled the data PVC same day; now
also rolling it into the daily backup loop so the next PVC mishap
restores in seconds.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 18:07:47 +02:00

talos-hcloud-cluster

OpenTofu module that provisions a Talos Linux Kubernetes cluster on Hetzner Cloud with full-mesh WireGuard encryption, a platform services stack, and integrated secrets management.

Architecture

gateway/                           stack/
  Mesh hub, NAT, HAProxy             operators (CRDs, OpenBao)
  dnsmasq, registry mirrors          bootstrap (one-time k8s bao setup)
  OpenBao raft peer                  restore (DR from raft snapshot)
                                     services (Helm releases)
platform/
  Hetzner servers                  database/
  WireGuard mesh                     shared-pg (3-instance CNPG HA)
  Talos bootstrap                    Valkey replication
  Firewall, DNS

Three root modules with separate state:

  • service-machine/ — standalone at ~/work/mesh/service-machine/. Mesh hub, NAT, HAProxy, OpenBao transit
  • platform/ — infrastructure, networking, cluster bootstrap
  • stack/ — Kubernetes workloads. Reads kubeconfig from platform output

What it deploys

Infrastructure — Hetzner servers, pre-allocated public IPs, cloud firewall, DNS records.

Gateway (10.90.0.2) — Debian server on the WireGuard mesh: HAProxy ingress proxy, dnsmasq, registry pull-through mirrors, OpenBao transit engine for auto-unseal.

Networking — Full-mesh WireGuard overlay. All intra-cluster traffic travels over encrypted tunnels. The API endpoint is a mesh IP and is never exposed publicly.

Platform stack (in dependency order):

Service                           Namespace         Purpose
Prometheus operator + Prometheus  monitoring        Metrics collection
Grafana                           monitoring        Dashboards, OIDC SSO
Tempo                             monitoring        Distributed tracing
Loki + Alloy                      monitoring        Log aggregation, OTLP collection
cert-manager                      cert-manager      TLS (Let's Encrypt)
ingress-nginx                     ingress-nginx     Ingress controller
CloudNative-PG                    cnpg-system       PostgreSQL operator
Longhorn                          longhorn-system   Distributed block storage (LUKS2)
OpenBao                           openbao           3-node raft HA, transit auto-unseal, OIDC provider
External Secrets                  external-secrets  OpenBao → k8s Secret sync

Cluster-scoped applications (this repo):

Service      URL                    Auth
Grafana      grafana.loop-coop.net  OpenBao OIDC
Loop Portal  id.loop-coop.net       Userpass + TOTP MFA (OIDC provider)

Build-env services (separate repo, deployed on top): Forgejo and Woodpecker live in k8s-build-env. This repo provides the generic cluster — k8s-build-env layers the git forge + CI on top, reusing shared-pg, OpenBao, ESO, cert-manager, and the monitoring stack defined here.

Container images and Helm charts are served from the Forgejo OCI registry at git.loop-coop.net/projects/*. Pull-through caching for public registries runs on the gateway via Zot at 10.90.0.2:5000.

Deploy flows

Fresh deploy

cd ~/work/mesh/service-machine && tofu apply        # gateway + NAT + mesh
cd ~/work/platform/talos-hcloud-cluster/platform && tofu apply  # nodes + Talos
cd ../stack && tofu apply -target=module.operators  # CRDs + OpenBao pods
kubectl exec -n openbao openbao-0 -- bao operator init -recovery-shares=1 -recovery-threshold=1
# Store root token in qubes bao: bao kv put secret/infra/k8s-bao ...
kubectl port-forward -n openbao pod/openbao-0 8201:8200 &
cd bootstrap && TF_VAR_openbao_token=<root_token> TF_VAR_openbao_addr=http://localhost:8201 tofu apply
cd .. && tofu apply                                 # services converge
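
The bootstrap step depends on the background port-forward actually accepting connections on local port 8201 before tofu apply runs. A minimal sketch of a readiness check follows; the helper name wait_for_port and its timeout values are illustrative assumptions, not part of this repo.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: the name and default timings are assumptions.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out; wait briefly and retry.
            time.sleep(interval)
    return False
```

Calling wait_for_port("127.0.0.1", 8201) between the port-forward and the bootstrap apply would catch a forward that never came up, rather than letting the apply fail with a connection error.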

Restore from backup

cd platform && tofu apply                        # infrastructure (gateway persists)
cd ../stack && tofu apply -target=module.operators # CRDs + OpenBao pods
cd restore && tofu apply                          # verify transit, restore raft from S3
cd .. && tofu apply                               # services (skip bootstrap)

Day-2

cd stack && tofu apply                            # single command

Module structure

Path                          Purpose
platform/                     Infrastructure, networking, cluster bootstrap
~/work/mesh/service-machine/  Standalone: HAProxy, dnsmasq, registry mirrors, NAT, OpenBao raft peer
platform/node/                Talos machine config generation
stack/                        Root module: operators → services
stack/bootstrap/              One-time k8s bao setup: secrets, policies, OIDC, ESO (no qubes bao)
stack/restore/                DR: verify transit key, download snapshot from S3, restore raft
stack/modules/operators/      CRDs, cert-manager, ingress, Longhorn, CNPG, ESO, OpenBao
stack/modules/services/       Applications, monitoring, backups, cert backup/restore
stack/dashboards/             Grafana dashboard JSON files
docs/                         Backup strategy, pipeline, tracing

Cluster networking

WireGuard mesh (KubeSpan)

Talos KubeSpan provides a full-mesh WireGuard overlay across all nodes. Each node gets a stable mesh IP in 10.90.0.0/16 assigned by the discovery service.

Discovery — nodes find each other via a self-hosted discovery-server (siderolabs-compatible gRPC) running as an Incus container on service-machine:

https://discovery.svc.loop-coop.net:3000   (public, Let's Encrypt TLS)

DNS: discovery.svc.loop-coop.net resolves directly to the service-machine public IP (178.104.189.138), so Talos nodes reach the discovery service without going through HAProxy or the mesh. WireGuard traffic (UDP 51820) is open inbound and outbound on the cluster firewall.

Config: platform/node/main.tf sets:

machine.network.kubespan = {
  enabled                      = true
  advertise_kubernetes_subnets = true
}
discovery.registries.service = {
  endpoint = "https://discovery.svc.loop-coop.net:3000"
}
discovery.registries.kubernetes = { disabled = true }

The Kubernetes registry is disabled — all peer exchange goes through the service registry only.
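
A quick sanity check of these settings could look like the sketch below. It assumes the relevant part of the rendered machine config has been loaded into a nested dict whose keys mirror the snippet above. That layout and the function name kubespan_problems are assumptions for illustration, not the module's actual data model.

```python
EXPECTED_ENDPOINT = "https://discovery.svc.loop-coop.net:3000"


def kubespan_problems(config: dict) -> list[str]:
    """Return a list of deviations from the KubeSpan/discovery settings described above.

    The dict layout is an assumption that mirrors the snippet's dotted keys.
    """
    problems = []
    kubespan = config.get("machine", {}).get("network", {}).get("kubespan", {})
    registries = config.get("discovery", {}).get("registries", {})

    if kubespan.get("enabled") is not True:
        problems.append("kubespan is not enabled")
    if registries.get("kubernetes", {}).get("disabled") is not True:
        problems.append("kubernetes registry is not disabled")
    if registries.get("service", {}).get("endpoint") != EXPECTED_ENDPOINT:
        problems.append("service registry endpoint mismatch")
    return problems
```

An empty list means the dict matches the three settings listed here; any other result names the setting that drifted.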

Gateway peer — service-machine (sys-wg, 10.90.0.0/16) is a static peer in every node's WireGuard config (NbqETvspFER/..., AllowedIPs 10.90.0.0/16), giving nodes a route back to the qubes mesh.
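
To make the AllowedIPs range concrete, the sketch below checks whether an address falls inside the mesh CIDR using the Python standard library. The function name in_mesh is an illustrative assumption; the addresses come from this README.

```python
import ipaddress

# Mesh range used for node mesh IPs and the gateway peer's AllowedIPs.
MESH_CIDR = ipaddress.ip_network("10.90.0.0/16")


def in_mesh(address: str) -> bool:
    """Return True if the address is inside the WireGuard mesh range."""
    return ipaddress.ip_address(address) in MESH_CIDR
```

With these values, the gateway address 10.90.0.2 is inside the range, while the service-machine public IP 178.104.189.138 is not, so only the former is routed through the tunnel.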

Secrets management

OpenBao provides centralized secrets via External Secrets Operator (ESO) and serves as OIDC provider for Grafana (and, via k8s-build-env, Forgejo). User-facing SSO (userpass + TOTP) is handled by Loop Portal at id.loop-coop.net, which authenticates against the same OpenBao identity.

OpenBao (KV v2)  ←──  K8s auth  ──→  ESO SecretStore (per namespace)
     │                                        │
     └── secret/data/infra/<app>/*            └── ExternalSecret CRs → K8s Secrets
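
The data/ segment in the path above comes from the KV v2 HTTP API, which inserts it between the mount and the logical path (the bao kv CLI adds it implicitly). A small sketch of that mapping follows; the helper name kv2_data_path and the example app and secret names are hypothetical.

```python
def kv2_data_path(app: str, name: str, mount: str = "secret") -> str:
    """Build the KV v2 API data path for a secret under infra/<app>/.

    Hypothetical helper; only the <mount>/data/<path> shape is KV v2 behavior.
    """
    parts = [mount, "data", "infra", app, name]
    cleaned = [p.strip("/") for p in parts]
    if not all(cleaned):
        raise ValueError("mount, app and name must be non-empty")
    return "/".join(cleaned)
```

For example, kv2_data_path("grafana", "oidc") yields secret/data/infra/grafana/oidc, where "grafana" and "oidc" are placeholder names rather than paths confirmed by this repo.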

Auto-unseal: Transit engine on gateway bao (10.90.0.2:8200). Pods auto-unseal on start — no manual intervention needed.

Identity: Userpass auth + OIDC provider with groups (admins, developers, ci). TOTP MFA via Loop Portal.

Backup & DR

Data                  Schedule        Encryption
OpenBao raft          Hourly          Vault barrier (AES-256-GCM)
Platform credentials  Daily           GPG (Claude + Stefan)
CNPG databases        Daily           GPG (Claude + Stefan)
Longhorn volumes      Daily           LUKS2 at rest
TLS certificates      24h PushSecret  Vault KV (in raft)

Full DR round-trip verified. See docs/backup.md.
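
The schedule above implies a maximum acceptable age for each backup. The sketch below flags backups older than their interval; the grace factor, the function name stale_backups, and the shape of the last_success input are assumptions, and no such check is claimed to exist in this repo.

```python
from datetime import datetime, timedelta

# Intervals taken from the schedule table above.
SCHEDULE = {
    "OpenBao raft": timedelta(hours=1),
    "Platform credentials": timedelta(days=1),
    "CNPG databases": timedelta(days=1),
    "Longhorn volumes": timedelta(days=1),
    "TLS certificates": timedelta(hours=24),
}


def stale_backups(last_success: dict, now: datetime, grace: float = 1.5) -> list[str]:
    """Return names whose last successful backup is missing or older than interval * grace.

    The grace factor (an assumption) tolerates jobs that start or finish late.
    """
    stale = []
    for name, interval in SCHEDULE.items():
        last = last_success.get(name)
        if last is None or now - last > interval * grace:
            stale.append(name)
    return stale
```

Feeding it the completion timestamps of the most recent backup jobs gives a list that could drive an alert; an empty list means every data class is within its window.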

CI/CD

Woodpecker itself is deployed by the k8s-build-env repo. This repo has pipelines in .woodpecker/ that run on the same Woodpecker instance once it's up:

Pipeline         Trigger   Purpose
ci.yml           push, PR  Format check + validate
deploy.yml       manual    Apply platform + stack
build-image.yml  manual    Build Talos image
release.yml      tag       Create Forgejo release

Security model

  • Kubernetes API binds to mesh IPs only — unreachable without WireGuard
  • Hetzner cloud firewall restricts public-facing ports
  • All data at rest encrypted (LUKS2, unique per node)
  • Transit auto-unseal — gateway bao key never leaves the gateway
  • Spread placement groups for physical host isolation
  • OIDC SSO for all user-facing services via OpenBao

Requirements

  • OpenTofu >= 1.11
  • Hetzner Cloud account with API token
  • Talos Linux snapshot on Hetzner
  • DNS zone on Hetzner Cloud
  • S3 credentials for backup bucket

Version

Current: v0.5.0 (2026-04-05)