# K3s cluster
## CRDs
| Name | Description | Operator | Prometheus integration |
| ------------------------------------------------------------------------ | ----------------------------- | -------- | ---------------------- |
| [Traefik](https://doc.traefik.io/traefik/providers/kubernetes-ingress/) | Kubernetes Ingress Controller | No | Configured |
| [Prometheus](https://github.com/prometheus-operator/prometheus-operator) | Metrics scraping | Yes | Configured |
| [ArgoCD](https://argo-cd.readthedocs.io/en/stable/) | Declarative GitOps CD | No | Configured |
| [Longhorn](https://longhorn.io/) | Distributed block storage | No | Not configured |
| [MetalLB](https://metallb.universe.tf/) | Bare metal load-balancer | No | Not configured |
| [CloudNativePG](https://cloudnative-pg.io/) | PostgreSQL operator | Yes | Not configured |
| [SOPS](https://github.com/isindir/sops-secrets-operator) | Secret management | Yes | Not configured |
## Services
| Name | Usage | Accessibility | Host | DB type | Additional data | Backup configuration | Loki integration | Prometheus integration | Secret management | Status | Standalone migration |
| ----------------------- | ------------------------------------ | ---------------- | ---------- | ---------- | -------------------- | ---------------------- | ---------------- | ---------------------- | ---------------------- | ----------------------------- | --------------------- |
| Traefik | Reverse proxy and load balancer | Public & Private | [All] | - | - | - | Configured | Configured | - | Completed<sup>5</sup> | Backbone |
| ArgoCD | Declarative GitOps CD | Private | [Workers] | - | - | - | Configured | Configured | - | Completed | Backbone |
| Vaultwarden | Password manager | Public | [Workers] | PostgreSQL | - | - | Configured | Not available | Configured | Completed | Completed |
| Gitea | Version control system | Public | [Workers] | PostgreSQL | User created content | Configured<sup>9</sup> | Configured | Configured | Configured | Completed<sup>4</sup> | Completed |
| Synapse | Matrix server - Message centralizer | Public | [Workers] | PostgreSQL | User files | Configured<sup>9</sup> | Configured | Configured | Configured | Completed | Completed |
| Grafana | Graph visualizer | Private | [Workers] | - | - | - | Configured | Configured | Configured | Completed | Completed<sup>8</sup> |
| Prometheus | Metrics aggregator | Private | [Workers] | - | - | Configured<sup>9</sup> | Configured | Configured | - | Completed | Completed<sup>8</sup> |
| Loki | Log aggregator | Private | [Workers] | - | - | Configured<sup>9</sup> | Configured | Configured | - | Completed | Completed<sup>8</sup> |
| Adguard | DNS ad blocker and custom DNS server | Private | [Egress] | - | - | - | Configured | Configured | Configured | Completed | Completed |
| Home assistant | Home automation and monitoring | Private | [Workers] | PostgreSQL | Additional data | Configured<sup>9</sup> | Configured | Configured | Configured | Completed | Completed |
| Owncloud Infinity Scale | File hosting webUI | Public | [Workers] | ? | Drive files | Not configured | Configured | Not configured | Configured | Pending configuration | Awaiting |
| therbron.com | Personal website | Public | [Workers] | - | - | - | Not configured | Not configured | - | Awaiting configuration | Awaiting |
| Radarr | Movie collection manager | Private | [Workers] | PostgreSQL | - | - | Configured | Not configured | Not configured | Partial | Awaiting |
| Flaresolverr | Cloudflare proxy | Private | [Workers] | - | - | - | - | - | - | Completed | Awaiting |
| Sonarr | TV shows collection manager | Private | [Workers] | SQLite | - | Not configured | Configured | Not configured | Not configured | Partial | Awaiting |
| Prowlarr | Torrent indexer | Private | [Workers] | PostgreSQL | - | Not configured | Configured | Not available | Not configured | Partial | Awaiting |
| Jellyfin | Media streaming | Public | Archimedes | SQLite\*\* | - | - | Configured | Not configured | Configured<sup>6</sup> | Completed | Awaiting |
| Jellyseerr | Media requesting WebUI | Public | [Workers] | - | - | - | Not configured | Not available | Configured<sup>7</sup> | Awaiting configuration | Awaiting |
| Minecraft | Vanilla minecraft server for friends | Public | Archimedes | - | Game map | Not configured | Not configured | Not configured | - | Awaiting configuration | Awaiting |
| Satisfactory | Satisfactory server for friends | Public | Archimedes | - | Game map | Not configured | Not configured | Not configured | - | Not needed for v1 | Awaiting |
| Space engineers | Space engineers server for friends | Public | Archimedes | - | Game map | Not configured | Not configured | Not configured | - | Not needed for v1 | Awaiting |
| Raspsnir | Bachelor memorial website | Public | [Workers] | PostgreSQL | - | Not configured | Not configured | Not configured | - | Not needed for v1 | Awaiting |
| Vikunja | To-do and Kanban boards | Public | [Workers] | - | - | - | Not configured | Not configured | - | Migrate to Gitea | Awaiting |
| Wiki | Documentation manager | Public | [Workers] | - | - | - | Not configured | Not configured | - | Migrate to VuePress and Gitea | Awaiting |
| PaperlessNG | PDF viewer and organiser | Public | [Workers] | PostgreSQL | - | - | Not configured | Not configured | - | Research migration into OCIS | Awaiting |
\* Configuration panel only available internally<br>
\*\* The current implementation only supports SQLite, making manual backups a necessity<br>
<sup>4</sup> Configuration completed, awaiting data migration from Gitlab<br>
<sup>5</sup> Missing dashboard configuration<br>
<sup>6</sup> Done through volume backup, because not possible otherwise<br>
<sup>7</sup> Done, but needs a reimplementation using kustomize for secret separation from configmap<br>
<sup>8</sup> Done but included in a grouped project `Monitoring`<br>
<sup>9</sup> Handled by Longhorn<br>
## Backup management
### Databases
All services needing a database to function come with a sidecar pod running a crontab to automate individual database backups.
These backups are saved to a Longhorn volume, so that they can benefit from general snapshots later on.
Each sidecar pod can only mount the backup folder it has been linked with, and cannot see other services' backups.
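A minimal sketch of what such a sidecar's backup script could look like. It assumes PostgreSQL, a `/backups` mount point, date-named dumps and a retention of 7 dumps; none of these are specified above, and all function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sidecar backup helpers; paths, names and retention are assumptions.

# Build the dump path for one service: <root>/<service>/<date>.sql
backup_path() {
    root=$1; service=$2; day=$3
    printf '%s/%s/%s.sql\n' "$root" "$service" "$day"
}

# Keep only the newest $2 dumps in directory $1.
# Date-named files sort chronologically, so reverse order lists newest first.
prune_backups() {
    dir=$1; keep=$2
    find "$dir" -maxdepth 1 -name '*.sql' | sort -r | tail -n +"$((keep + 1))" |
    while IFS= read -r old; do
        rm -f -- "$old"
    done
}

# Dump one database; connection settings come from the standard PG* variables.
run_backup() {
    target=$(backup_path "${BACKUP_ROOT:-/backups}" "$1" "$(date +%F)")
    mkdir -p "$(dirname "$target")"
    pg_dump --no-owner > "$target"
    prune_backups "$(dirname "$target")" "${KEEP:-7}"
}
```

A crontab entry such as `0 3 * * * . /scripts/backup.sh && run_backup vaultwarden` (path and schedule are illustrative) would then produce one dump per day.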
### Additional data
All additional data needing to be backed up is mounted to a Longhorn volume, to also benefit from scheduled backups.
Example:
```
longhorn
└───backups
    └───vaultwarden
    │   └───<backup_date>.sql
    │   │   ...
    └───gitlab
        └───<backup_date>.sql
        │   ...
```
## TODO
- ~~Add AntiAffinities to `outsider` nodes~~
- ~~Migrate Homeassistant to PostgreSQL instead of MariaDB~~
- ~~Move Prometheus connection management to ServiceMonitors instead of ConfigMap~~
- Schedule longhorn S3 backups
- Schedule CloudNativePG S3 backups
- Restrict `metrics` endpoint on public services
- ~~Migrate Vaultwarden to PostgreSQL instead of MariaDB~~
- ~~Deploy PostgreSQL cluster using operator for database HA and easy maintenance~~ - To be tested properly
- Change host/deployment specific variables to use environment variables (using Kustomize)
- ~~Write CI/CD pipeline to create environment loaded files~~ Done with Kustomize migration
- ~~Write CI/CD pipeline to deploy cluster~~ Done with ArgoCD
- ~~Setup internal traefik with nodeport as reverse proxy for internal only services~~ Done through double ingress class and LB
- ~~Setup DB container sidecars for automated backups to Longhorn volume~~
- ~~Setup secrets configuration through CI/CD variable injection (using Kustomize)~~ Environment modified by SOPS implementation
- Figure out SOPS secret injection for absent namespaces
- Explore permission issues when issuing OVH API keys (not working for wildcard and `beta.halia.dev` subdomain)
- Setup default users for deployments
- ~~Setup log and metric monitoring~~
- ~~Define namespaces through yaml files~~
- ~~Look into CockroachDB for redundant database~~ Judged too complicated, moving to a 1 to 1 relationship between services and databases
- ~~Configure IP range accessibility through Traefik (Internal vs external services)~~ Impossible because of flannel ip-masq
- ~~Move secrets to separate, private Git repository ?~~ Done with SOPS
- ~~Configure NFS connection for media library~~
- ~~Research IPv6 configuration for outsider node~~ Impossible in Denmark while using YouSee as an ISP for now (no IPv6 support)
- Write small script for auto installation of the cluster, to split API calls into 2 stages (solves MetalLB API not found error)
- Migrate ingresses to traefik kind instead of k8s kind
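The two-stage installation mentioned in the list above could be sketched as follows. This is only an illustration: the two kustomize directories are hypothetical, and the CRD name assumes a MetalLB release that configures address pools through the `IPAddressPool` resource.

```shell
#!/bin/sh
# Hypothetical two-stage install; directory arguments are assumptions.
KUBECTL=${KUBECTL:-kubectl}

install_cluster() {
    # Stage 1: operators and their CRDs (MetalLB, Prometheus operator, ...).
    $KUBECTL apply -k "$1" || return 1
    # Block until the API server actually serves the MetalLB pool CRD.
    $KUBECTL wait --for=condition=established --timeout=120s \
        crd/ipaddresspools.metallb.io || return 1
    # Stage 2: resources that instantiate those CRDs.
    $KUBECTL apply -k "$2"
}
```

Calling `install_cluster ./backbone ./services` would apply the first directory, wait for the CRD, and only then apply the second, so the second stage cannot hit the "API not found" error.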
## Notes
### Cluster base setup
Set up the cluster's backbone:
```
make dev
# Include SOPS master secret generation
# ($HOME is used because the shell does not expand ~ after --from-file=)
kubectl create secret generic age-key --from-file="$HOME/.sops/key.txt" -n sops
```
NOTE: It might be required to update the MetalLB IP range as well as the Traefik LoadBalancerIPs.
### Convert helm chart to k3s manifest
`helm template chart stable/chart --output-dir ./chart`
### Gitlab backup process
Because GitLab does not offer the possibility to back up a container's data from an external container, a cronjob has been implemented in the custom image used for deployment.
NOTE: This no longer applies, as a migration to Gitea is planned.
### VPN configuration for Deluge
~~Instead of adding an extra networking layer to the whole cluster, it seems like a better idea to just integrate a WireGuard connection inside the Deluge image, and self-build everything within the GitLab registry.
This image could use Kubernetes secrets, including a "torrent-vpn" secret produced by the initial WireGuard configuration done via Ansible.
This Ansible script could create one (or more) additional client(s) depending on the inventory configuration, and keep the "torrent-vpn" configuration file within a k3s-formatted file, inside the auto-applied directory on the CP.<br>
Cf: https://docs.k3s.io/advanced#auto-deploying-manifests~~
After further reflection, it doesn't make sense for Deluge to be part of the cluster. It will be moved to the NAS, since it can only run when the NAS is running. This will also simplify the whole VPN configuration.
### Development domains
To access a service publicly when developing, its domain name should match `*.beta.halia.dev`.
To only expose a service internally, its domain name should match `*.beta.entos`.
### Ingresses
To separate external and internal services, two Traefik ingresses are implemented and selected through the `ingressclass` annotation.
`traefik-external` only allows external access to a given service, while `traefik-internal` restricts it to internal-only access.
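As an illustration, the sketch below picks the ingress class from the development domain and prints a standard Kubernetes `Ingress`. The helper names, the mapping from domain suffix to class, the `kubernetes.io/ingress.class` annotation key and the service name and port are all assumptions, not taken from this repository's manifests.

```shell
#!/bin/sh
# Hypothetical helpers; the suffix-to-class mapping is an assumption.

# Choose the ingress class from the development domain of the host.
ingress_class_for() {
    case $1 in
        *.beta.halia.dev) echo traefik-external ;;
        *.beta.entos)     echo traefik-internal ;;
        *) echo "unknown domain: $1" >&2; return 1 ;;
    esac
}

# Print an Ingress manifest for service $1, host $2 and service port $3.
render_ingress() {
    name=$1; host=$2; port=$3
    class=$(ingress_class_for "$host") || return 1
    cat <<EOF
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: $name
  annotations:
    kubernetes.io/ingress.class: $class
spec:
  rules:
    - host: $host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: $name
                port:
                  number: $port
EOF
}
```

For example, `render_ingress grafana grafana.beta.entos 3000 | kubectl apply -f -` would create an internal-only ingress for a service named `grafana`.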
### Secret management
All secrets are encrypted using SOPS and stored in a private secret repository.
Secrets are decrypted on the fly when applied to the cluster using the SOPS Operator.
Inject the AGE key in the cluster to allow the operator to decrypt secrets:
```
kubectl create secret generic age-key --from-file=<path_to_file> -n sops
```
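The command above fails if the `sops` namespace is missing, and it accepts any file as a key. A hedged pre-flight wrapper could look like the sketch below. The function names are hypothetical, and the check relies only on the fact that age identity files contain a line starting with `AGE-SECRET-KEY-1`.

```shell
#!/bin/sh
# Hypothetical pre-flight wrapper around the secret creation command.
KUBECTL=${KUBECTL:-kubectl}

# Succeed only if the file is readable and contains an age secret key line.
is_age_key_file() {
    [ -r "$1" ] && grep -q '^AGE-SECRET-KEY-1' "$1"
}

inject_age_key() {
    key_file=$1
    is_age_key_file "$key_file" || {
        echo "not an age identity file: $key_file" >&2
        return 1
    }
    # Create the namespace only if it does not exist yet.
    $KUBECTL get namespace sops >/dev/null 2>&1 || $KUBECTL create namespace sops
    $KUBECTL create secret generic age-key --from-file="$key_file" -n sops
}
```

Usage: `inject_age_key "$HOME/.sops/key.txt"`.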