feat: [torrust#33] add secret management strategy documents

josecelano · josecelano · commit 31f9374e5163 · 2025-08-13T21:21:38.000+01:00
diff --git a/docs/redesign/phase3-design/secret-management-strategy.md b/docs/redesign/phase3-design/secret-management-strategy.md
@@ -0,0 +1,129 @@
+# Secret Management Strategy
+
+## 1. Context
+
+The Torrust Tracker application requires the management of sensitive information (secrets) to
+operate correctly. These secrets include database credentials, API tokens, and other sensitive
+parameters.
+
+In the previous Proof of Concept (PoC), secrets were managed through a `.env` file stored on
+the host virtual machine (VM). This file was used by Docker Compose to inject secrets into
+running containers and was also sourced by host-level scripts (e.g., for database backups).
+
+This approach, while simple, stores secrets in plaintext, so anyone who can read the file on
+the host can read every secret. As we move to a production-grade design, we must formalize
+our secret management strategy, balancing security, operational simplicity, and the
+technical constraints of our chosen services.
+
+This decision is documented in
+**[ADR-004: Configuration Approach - Files vs Environment Variables](../adr/004-configuration-approach-files-vs-environment-variables.md)**.
+
+## 2. The Challenge: Service-Specific Configuration
+
+While the twelve-factor app methodology advocates for strict configuration via environment
+variables, not all services support this pattern. A key challenge in our stack is
+**Prometheus**, which does not support runtime environment variable substitution in its
+configuration files.
+
+As noted in ADR-004, this means that any secrets required by Prometheus (such as an API
+token for scraping a protected endpoint) must be embedded directly into the `prometheus.yml`
+file at deployment time. This technical constraint forces us to adopt a hybrid configuration
+strategy.
+
+## 3. Proposed Strategy: Centralized Plaintext Configuration
+
+We will adopt a strategy that centralizes secrets in plaintext files within a protected
+directory on the host VM. This approach acknowledges the limitations of our stack while
+providing a clear, maintainable, and operationally simple system.
+
+1. **Primary Secrets File (`.env`):**
+
+   - A primary `.env` file will be located at `/var/lib/torrust/compose/.env`.
+   - This file will contain the majority of secrets, such as database credentials,
+     Grafana passwords, and the tracker's admin token.
+   - Docker Compose will use this file to inject secrets into the relevant service
+     containers (Tracker, MySQL, Grafana, etc.) at runtime.
+
+2. **Service-Specific Configuration Files:**
+
+   - For services that do not support environment variables for secrets (i.e.,
+     Prometheus), the secrets will be embedded directly into their configuration files
+     (e.g., `/var/lib/torrust/prometheus/etc/prometheus.yml`).
+   - These configuration files will be generated from templates during the `app-deploy`
+     process, where secret values are substituted from the main environment
+     configuration.
+
+3. **Containerized Backups:**
+   - To avoid exposing database credentials to the host's `cron` system, database
+     backups will be performed by a dedicated, short-lived `torrust-backup` container.
+   - This container will be launched by a simple `cron` job on the host
+     (`docker compose run --rm torrust-backup`).
+   - The backup container will receive the necessary database credentials from the
+     `.env` file via Docker Compose, ensuring that secrets do not need to be read or
+     managed by host-level scripts.
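+
+As a rough illustration of the template step in item 2, the sketch below parses `.env`-style
+text and fills a Prometheus template with Python's `string.Template`. The variable name
+`TRACKER_ADMIN_TOKEN`, the scrape target, the metrics path, and the helper names are
+hypothetical; the actual `app-deploy` tooling may work differently.
+
+```python
+from string import Template
+
+
+def parse_env(text: str) -> dict[str, str]:
+    """Parse simple KEY=VALUE lines, skipping blank lines and '#' comments."""
+    values: dict[str, str] = {}
+    for raw_line in text.splitlines():
+        line = raw_line.strip()
+        if not line or line.startswith("#") or "=" not in line:
+            continue
+        key, _, value = line.partition("=")
+        # Drop optional surrounding quotes around the value.
+        values[key.strip()] = value.strip().strip('"').strip("'")
+    return values
+
+
+def render_config(template_text: str, secrets: dict[str, str]) -> str:
+    """Fill ${NAME} placeholders; raises KeyError if a secret is missing."""
+    return Template(template_text).substitute(secrets)
+
+
+# Hypothetical inputs standing in for the real `.env` and template files.
+ENV_TEXT = "# example only\nTRACKER_ADMIN_TOKEN=example-token\n"
+PROMETHEUS_TEMPLATE = """\
+scrape_configs:
+  - job_name: tracker_metrics
+    metrics_path: /api/v1/metrics
+    params:
+      token: ["${TRACKER_ADMIN_TOKEN}"]
+    static_configs:
+      - targets: ["tracker:1212"]
+"""
+
+rendered = render_config(PROMETHEUS_TEMPLATE, parse_env(ENV_TEXT))
+```
+
+`Template.substitute` raises `KeyError` when a placeholder has no value, so a missing secret
+fails the deployment step instead of producing a configuration with an empty token.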
+
+### Benefits of this Strategy
+
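+- **Operational Simplicity:** Easy for administrators to manage. Secrets can be rotated
+  by editing the `.env` file and recreating the affected containers (a plain restart does
+  not reload environment variables). Secrets embedded in generated files such as
+  `prometheus.yml` also require regenerating those files.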
+- **Self-Contained System:** The VM is fully self-sufficient after deployment. The
+  installer machine can be discarded.
+- **Handles Exceptions:** The strategy explicitly accounts for services like Prometheus
+  that cannot use environment variables for secrets.
+
+### The Prometheus Precedent
+
+The decision to embed secrets directly into configuration files for certain services is not
+merely a workaround but aligns with the design philosophy of major tools in our stack. The
+Prometheus development team has explicitly stated their position on this matter, confirming
+that the intended and supported method for providing secrets is through the configuration
+file itself.
+
+In a long-standing GitHub issue in the Alertmanager repository,
+**[Support for secrets set in ENV variables #504]**, the Prometheus team
+clarifies that they have chosen to support only one method for configuration to maintain
+simplicity and consistency. When asked about supporting environment variables for secrets, a
+core developer stated:
+
+> The chosen approach is to put them in the config file. There's many many possible ways
+> to provide configuration, for sanity we have to choose just one of them.
+
+[Support for secrets set in ENV variables #504]: https://github.com/prometheus/alertmanager/issues/504
+
+This official stance validates our hybrid approach. It confirms that for services like
+Prometheus, managing secrets via file-based configuration is the expected pattern, not an
+anti-pattern. Our strategy, therefore, is consistent with the operational principles of the
+tools we use.
+
+## 4. Security Considerations
+
+This strategy involves storing secrets in plaintext on the VM's filesystem. It is crucial
+to understand the security implications.
+
+If an attacker gains root-level or `torrust` user access to the host VM, they can
+compromise the application's secrets. The security of this model relies on the security of
+the host VM itself.
+
+An attacker with access to the host could:
+
+1. **Read Plaintext Files:** Directly read the contents of
+   `/var/lib/torrust/compose/.env` and any other configuration files containing secrets.
+2. **Inspect Running Containers:** Use `docker inspect` on any running container to view
+   all the environment variables that were passed to it.
+3. **Execute Commands in Containers:** Use `docker exec` to gain a shell inside a running
+   container and then use commands like `env` or `printenv` to list all environment
+   variables.
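+
+The first item can be narrowed, though not eliminated, with file permissions: other local
+accounts cannot read a file that only its owner may access, while root and the owning user
+still can. A minimal check might look like the sketch below, assuming the secret files are
+meant to be owner-only (mode `0600`, an assumption not stated in this document). The helper
+name and the temporary file are illustrative.
+
+```python
+import os
+import stat
+import tempfile
+
+
+def is_owner_only(path: str) -> bool:
+    """Return True if group and others have no permissions on the file."""
+    mode = stat.S_IMODE(os.stat(path).st_mode)
+    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
+
+
+# Demonstration on a temporary file standing in for the real `.env`.
+with tempfile.TemporaryDirectory() as tmp:
+    env_path = os.path.join(tmp, ".env")
+    with open(env_path, "w") as fh:
+        fh.write("EXAMPLE_SECRET=value\n")
+
+    os.chmod(env_path, 0o644)  # readable by group and others
+    loose = is_owner_only(env_path)
+
+    os.chmod(env_path, 0o600)  # readable and writable by the owner only
+    tight = is_owner_only(env_path)
+```
+
+A check like this could run at the end of deployment. It does not protect against the
+`docker inspect` and `docker exec` paths listed above.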
+
+This strategy prioritizes operational simplicity and compatibility with our service stack
+over achieving the highest possible level of security (which would require an external
+secrets manager like HashiCorp Vault). The primary defense is hardening the host VM itself
+through measures like:
+
+- A restrictive firewall (`ufw`).
+- SSH key-only authentication.
+- Intrusion prevention tools (`fail2ban`).
+- Regular security updates.
+
+This approach is deemed an acceptable risk for the project's scope, providing a
+significant improvement over the PoC by centralizing configuration and containerizing
+auxiliary tasks like backups.