talos-hcloud-cluster
OpenTofu module that provisions a Talos Linux Kubernetes cluster on Hetzner Cloud with full-mesh WireGuard encryption, a platform services stack, and integrated secrets management.
Architecture
- gateway/: Mesh hub, NAT, HAProxy; dnsmasq, registry mirrors; OpenBao raft peer
- stack/: operators (CRDs, OpenBao); bootstrap (one-time k8s bao setup); restore (DR from raft snapshot); services (Helm releases)
- platform/: Hetzner servers; WireGuard mesh; Talos bootstrap; Firewall, DNS
- database/: shared-pg (3-instance CNPG HA); Valkey replication
Three root modules with separate state:
- service-machine/: standalone at ~/work/mesh/service-machine/. Mesh hub, NAT, HAProxy, OpenBao transit
- platform/: infrastructure, networking, cluster bootstrap
- stack/: Kubernetes workloads. Reads kubeconfig from platform output
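Because the roots keep separate state, stack/ needs some way to read the kubeconfig that platform/ produces. One possible wiring is a `terraform_remote_state` data source, sketched below. The `local` backend, the state path, and the output name `kubeconfig` are assumptions for illustration and may not match what this repo does.

```hcl
# Sketch only: backend type, state path, and output name are assumptions.
data "terraform_remote_state" "platform" {
  backend = "local"
  config = {
    path = "../platform/terraform.tfstate"
  }
}

locals {
  # Assumes platform/ exposes the raw kubeconfig YAML as an output named "kubeconfig".
  kubeconfig = yamldecode(data.terraform_remote_state.platform.outputs.kubeconfig)
  cluster    = local.kubeconfig.clusters[0].cluster
  user       = local.kubeconfig.users[0].user
}

# Configure the Kubernetes provider from the decoded kubeconfig fields.
provider "kubernetes" {
  host                   = local.cluster.server
  cluster_ca_certificate = base64decode(local.cluster["certificate-authority-data"])
  client_certificate     = base64decode(local.user["client-certificate-data"])
  client_key             = base64decode(local.user["client-key-data"])
}
```

Decoding the kubeconfig in `locals` avoids writing it to disk, at the cost of assuming a single cluster and a single client-certificate user in the file.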
What it deploys
Infrastructure — Hetzner servers, pre-allocated public IPs, cloud firewall, DNS records.
Gateway (10.90.0.2) — Debian server on the WireGuard mesh: HAProxy ingress proxy, dnsmasq, registry pull-through mirrors, OpenBao transit engine for auto-unseal.
Networking — Full-mesh WireGuard overlay. All intra-cluster traffic over encrypted tunnels. The API endpoint is a mesh IP — never exposed publicly.
Platform stack (in dependency order):
| Service | Namespace | Purpose |
|---|---|---|
| Prometheus operator + Prometheus | monitoring | Metrics collection |
| Grafana | monitoring | Dashboards, OIDC SSO |
| Tempo | monitoring | Distributed tracing |
| Loki + Alloy | monitoring | Log aggregation, OTLP collection |
| cert-manager | cert-manager | TLS (Let's Encrypt) |
| ingress-nginx | ingress-nginx | Ingress controller |
| CloudNative-PG | cnpg-system | PostgreSQL operator |
| Longhorn | longhorn-system | Distributed block storage (LUKS2) |
| OpenBao | openbao | 3-node raft HA, transit auto-unseal, OIDC provider |
| External Secrets | external-secrets | OpenBao → k8s Secret sync |
Cluster-scoped applications (this repo):
| Service | URL | Auth |
|---|---|---|
| Grafana | grafana.loop-coop.net | OpenBao OIDC |
| Loop Portal | id.loop-coop.net | Userpass + TOTP MFA (OIDC provider) |
Build-env services (separate repo, deployed on top): Forgejo and Woodpecker live in
k8s-build-env. This repo provides the
generic cluster — k8s-build-env layers the git forge + CI on top, reusing shared-pg,
OpenBao, ESO, cert-manager, and the monitoring stack defined here.
Container images and Helm charts are served from the Forgejo OCI registry at git.loop-coop.net/projects/*. Pull-through caching for public registries runs on the gateway via Zot at 10.90.0.2:5000.
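For nodes to use the pull-through cache, each one has to be pointed at the gateway mirror. A minimal sketch of a Talos machine-config patch rendered from HCL follows. Only the mirror address comes from this README; the choice of `docker.io` as the mirrored registry, the plain-HTTP scheme, and the local name `registry_mirror_patch` are assumptions.

```hcl
# Sketch only: the mirrored registry and the http:// scheme are assumptions.
locals {
  registry_mirror_patch = yamlencode({
    machine = {
      registries = {
        mirrors = {
          # Pull docker.io images through the cache on the gateway.
          "docker.io" = {
            endpoints = ["http://10.90.0.2:5000"]
          }
        }
      }
    }
  })
}
```

A patch like this would typically be handed to the Talos machine configuration as one of its config patches, alongside the networking settings generated in platform/node/.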
Deploy flows
Fresh deploy
cd ~/work/mesh/service-machine && tofu apply # gateway + NAT + mesh
cd ~/work/platform/talos-hcloud-cluster/platform && tofu apply # nodes + Talos
cd ../stack && tofu apply -target=module.operators # CRDs + OpenBao pods
kubectl exec -n openbao openbao-0 -- bao operator init -recovery-shares=1 -recovery-threshold=1
# Store root token in qubes bao: bao kv put secret/infra/k8s-bao ...
kubectl port-forward -n openbao pod/openbao-0 8201:8200 &
cd bootstrap && TF_VAR_openbao_token=<root_token> TF_VAR_openbao_addr=http://localhost:8201 tofu apply
cd .. && tofu apply # services converge
Restore from backup
cd platform && tofu apply # infrastructure (gateway persists)
cd ../stack && tofu apply -target=module.operators # CRDs + OpenBao pods
cd restore && tofu apply # verify transit, restore raft from S3
cd .. && tofu apply # services (skip bootstrap)
Day-2
cd stack && tofu apply # single command
Module structure
| Path | Purpose |
|---|---|
| platform/ | Infrastructure, networking, cluster bootstrap |
| ~/work/mesh/service-machine/ | Standalone: HAProxy, dnsmasq, registry mirrors, NAT, OpenBao raft peer |
| platform/node/ | Talos machine config generation |
| stack/ | Root module: operators → services |
| stack/bootstrap/ | One-time k8s bao setup: secrets, policies, OIDC, ESO (no qubes bao) |
| stack/restore/ | DR: verify transit key, download snapshot from S3, restore raft |
| stack/modules/operators/ | CRDs, cert-manager, ingress, Longhorn, CNPG, ESO, OpenBao |
| stack/modules/services/ | Applications, monitoring, backups, cert backup/restore |
| stack/dashboards/ | Grafana dashboard JSON files |
| docs/ | Backup strategy, pipeline, tracing |
Cluster networking
WireGuard mesh (KubeSpan)
Talos KubeSpan provides a full-mesh WireGuard overlay across all nodes. Each node gets a stable mesh IP in 10.90.0.0/16 assigned by the discovery service.
Discovery — nodes find each other via a self-hosted discovery-server (siderolabs-compatible gRPC) running as an Incus container on service-machine:
https://discovery.svc.loop-coop.net:3000 (public, Let's Encrypt TLS)
The DNS name discovery.svc.loop-coop.net resolves directly to the service-machine public IP (178.104.189.138), so Talos nodes connect without going through HAProxy or the mesh. WireGuard traffic (UDP 51820) is open inbound and outbound on the cluster firewall.
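The UDP 51820 opening could be expressed with the hcloud provider's `hcloud_firewall` resource roughly as follows. This is a sketch under assumptions: the resource name, the firewall name, and the unrestricted address ranges are illustrative and not taken from the repo.

```hcl
# Sketch only: resource name, firewall name, and address ranges are assumptions.
resource "hcloud_firewall" "cluster" {
  name = "talos-cluster"

  # Inbound WireGuard from any peer.
  rule {
    direction  = "in"
    protocol   = "udp"
    port       = "51820"
    source_ips = ["0.0.0.0/0", "::/0"]
  }

  # Outbound WireGuard to any peer.
  rule {
    direction       = "out"
    protocol        = "udp"
    port            = "51820"
    destination_ips = ["0.0.0.0/0", "::/0"]
  }
}
```

On Hetzner Cloud, adding any outbound rule generally means other outbound traffic is no longer allowed implicitly, so a real firewall would need further outbound rules (for example HTTPS to the discovery endpoint).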
Config — platform/node/main.tf sets:
machine.network.kubespan = {
enabled = true
advertise_kubernetes_subnets = true
}
discovery.registries.service = {
endpoint = "https://discovery.svc.loop-coop.net:3000"
}
discovery.registries.kubernetes = { disabled = true }
The Kubernetes registry is disabled — all peer exchange goes through the service registry only.
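As a sketch of how these settings might land in an actual Talos machine-config patch, the block below renders them with `yamlencode` using Talos' own key names. Mapping `advertise_kubernetes_subnets` to Talos' `advertiseKubernetesNetworks`, and placing the discovery block under `cluster`, are assumptions about how platform/node/main.tf translates its inputs; the local name `kubespan_patch` is hypothetical.

```hcl
# Sketch only: the mapping from the settings above to Talos keys is an assumption.
locals {
  kubespan_patch = yamlencode({
    machine = {
      network = {
        kubespan = {
          enabled                     = true
          advertiseKubernetesNetworks = true
        }
      }
    }
    cluster = {
      discovery = {
        enabled = true
        registries = {
          # Peer exchange goes through the self-hosted discovery service only.
          service = {
            endpoint = "https://discovery.svc.loop-coop.net:3000"
          }
          kubernetes = {
            disabled = true
          }
        }
      }
    }
  })
}
```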
Gateway peer — service-machine (sys-wg, 10.90.0.0/16) is a static peer in every node's WireGuard config (NbqETvspFER/..., AllowedIPs 10.90.0.0/16), giving nodes a route back to the qubes mesh.
Secrets management
OpenBao provides centralized secrets via External Secrets Operator (ESO) and serves as OIDC provider for Grafana (and, via k8s-build-env, Forgejo). User-facing SSO (userpass + TOTP) is handled by Loop Portal at id.loop-coop.net, which authenticates against the same OpenBao identity.
OpenBao (KV v2) ←── K8s auth ──→ ESO SecretStore (per namespace)
│ │
└── secret/data/infra/<app>/* └── ExternalSecret CRs → K8s Secrets
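For a per-namespace SecretStore to read only its own subtree, the Kubernetes auth role it logs in with needs a matching policy. A minimal policy sketch for one app is shown below. The app name `grafana` is picked for illustration, and whether the repo scopes policies exactly this way is an assumption.

```hcl
# Sketch only: "grafana" stands in for <app>; actual policy scoping may differ.

# KV v2 reads go through the data/ prefix.
path "secret/data/infra/grafana/*" {
  capabilities = ["read"]
}

# Listing and metadata lookups go through the metadata/ prefix.
path "secret/metadata/infra/grafana/*" {
  capabilities = ["read", "list"]
}
```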
Auto-unseal: Transit engine on gateway bao (10.90.0.2:8200). Pods auto-unseal on start — no manual intervention needed.
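In OpenBao server configuration, transit auto-unseal is declared with a `seal "transit"` stanza. The sketch below uses the gateway address from this README. The key name, the mount path, and the `http://` scheme are assumptions and may differ from the deployed configuration.

```hcl
# Sketch only: key_name, mount_path, and the URL scheme are assumptions.
seal "transit" {
  address    = "http://10.90.0.2:8200"
  key_name   = "autounseal"
  mount_path = "transit/"
  # The token for the gateway bao is normally injected from a secret
  # rather than written into this file.
}
```

Because the unseal key stays in the gateway's transit engine, the cluster pods only ever hold a token that can encrypt and decrypt with it.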
Identity: Userpass auth + OIDC provider with groups (admins, developers, ci). TOTP MFA via Loop Portal.
Backup & DR
| Data | Schedule | Encryption |
|---|---|---|
| OpenBao raft | Hourly | Vault barrier (AES-256-GCM) |
| Platform credentials | Daily | GPG (Claude + Stefan) |
| CNPG databases | Daily | GPG (Claude + Stefan) |
| Longhorn volumes | Daily | LUKS2 at rest |
| TLS certificates | 24h PushSecret | Vault KV (in raft) |
Full DR round-trip verified. See docs/backup.md.
CI/CD
Woodpecker itself is deployed by the k8s-build-env repo. This repo has pipelines in
.woodpecker/ that run on the same Woodpecker instance once it's up:
| Pipeline | Trigger | Purpose |
|---|---|---|
| ci.yml | push, PR | Format check + validate |
| deploy.yml | manual | Apply platform + stack |
| build-image.yml | manual | Build Talos image |
| release.yml | tag | Create Forgejo release |
Security model
- Kubernetes API binds to mesh IPs only — unreachable without WireGuard
- Hetzner cloud firewall restricts public-facing ports
- All data at rest encrypted (LUKS2, unique per node)
- Transit auto-unseal — gateway bao key never leaves the gateway
- Spread placement groups for physical host isolation
- OIDC SSO for all user-facing services via OpenBao
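The spread placement point can be illustrated with the hcloud provider's `hcloud_placement_group` resource. The following is a sketch, not the module's actual code: the resource names, the variables, and the node count of three are assumptions.

```hcl
# Sketch only: names, variables, and node count are assumptions.
variable "server_type" { type = string }
variable "talos_image_id" { type = string }
variable "location" { type = string }

# "spread" asks Hetzner to place members on different physical hosts.
resource "hcloud_placement_group" "control_plane" {
  name = "talos-control-plane"
  type = "spread"
}

resource "hcloud_server" "control_plane" {
  count              = 3
  name               = "cp-${count.index + 1}"
  server_type        = var.server_type
  image              = var.talos_image_id
  location           = var.location
  placement_group_id = hcloud_placement_group.control_plane.id
}
```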
Requirements
- OpenTofu >= 1.11
- Hetzner Cloud account with API token
- Talos Linux snapshot on Hetzner
- DNS zone on Hetzner Cloud
- S3 credentials for backup bucket
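The OpenTofu version requirement can be pinned in each root module so that older binaries fail early. A minimal sketch follows. The provider list is an assumption based on the services named in this README, and version constraints for the providers are left out because they are not stated here.

```hcl
# Sketch only: the provider list is an assumption; only ">= 1.11" comes from the README.
terraform {
  required_version = ">= 1.11"

  required_providers {
    hcloud = {
      source = "hetznercloud/hcloud"
    }
    talos = {
      source = "siderolabs/talos"
    }
  }
}
```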
Version
Current: v0.5.0 (2026-04-05)