From 618a9e9bce7ec91cff1d1d1292c4669552cefea8 Mon Sep 17 00:00:00 2001
From: maxime
Date: Wed, 18 Feb 2026 18:35:49 +0100
Subject: [PATCH] docs: add sprint-1 hardening and container observability runbook

---
 README.md                                     |  19 ++-
 SECURITY.md                                   |  26 +++-
 docs/README.md                                |   6 +-
 ...6-02-18_container_observability_runbook.md | 138 ++++++++++++++++++
 4 files changed, 184 insertions(+), 5 deletions(-)
 create mode 100644 docs/ops/2026-02-18_container_observability_runbook.md

diff --git a/README.md b/README.md
index a89ba693..639ee76f 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,7 @@
 ## Table of Contents
 
 - [Introduction](#introduction)
+- [2026-02-18 Hardening Notes](#2026-02-18-hardening-notes)
 - [Security & Performance](#security--performance-breakthrough-new)
 - [Enterprise Security](#️-enterprise-grade-hashicorp-vault-integration)
 - [Maximum Speed Trading](#-maximum-speed-trading)
@@ -30,6 +31,21 @@
 
 The **ELVIS** (**E**nhanced **L**everaged **V**irtual **I**nvestment **S**ystem) Trading Bot is a sophisticated, modular algorithmic trading system that leverages machine learning models for automated cryptocurrency trading. The system integrates multiple ML architectures, real-time data processing, risk management, and execution modules to facilitate intelligent trading strategies with comprehensive monitoring and visualization capabilities.
 
+## 2026-02-18 Hardening Notes
+
+The February 18, 2026 hardening update closed issues `#9`, `#10`, `#11`, `#13`, and `#16` and changed runtime defaults in ways operators need to know:
+
+- `VAULT_TOKEN` is no longer hardcoded or auto-populated.
+- `POSTGRES_PASSWORD` no longer has a hardcoded default.
+- Trade History API bind is now local by default:
+  - `TRADE_HISTORY_API_HOST=127.0.0.1`
+  - `TRADE_HISTORY_API_PORT=5050`
+- Repository hygiene policy now ignores local virtualenv/build trees by default (`env*/`, `venv*/`, `.venv/`, `tensorflow/`).
+
+Operational runbook and troubleshooting:
+- `docs/ops/2026-02-18_container_observability_runbook.md`
+- `SECURITY.md`
+
 ## 🚀 Current Status (July 2025)
 
 **✅ FULLY OPERATIONAL TRADING BOT WITH ENTERPRISE SECURITY**
@@ -112,7 +128,8 @@ git clone https://github.com/cluster2600/ELVIS.git
 cd ELVIS/ansible
 chmod +x run_setup.sh
 ./run_setup.sh --docker
-# Access at http://localhost:5050 when ready
+# API health: http://localhost:5050/health
+# Grafana: http://localhost:3001
 ```
 
 **Option 2: Secure Development Setup**
diff --git a/SECURITY.md b/SECURITY.md
index 131e2188..dc749179 100644
--- a/SECURITY.md
+++ b/SECURITY.md
@@ -3,6 +3,26 @@
 ## Overview
 ELVIS Trading Bot implements enterprise-grade security practices with HashiCorp Vault integration for secure secrets management and API key protection.
 
+## Sprint-1 Hardening Updates (2026-02-18)
+
+These security and exposure changes were shipped in the Sprint-1 hardening pass:
+
+- Removed hardcoded Vault token behavior.
+  - `VAULT_TOKEN` must now be provided externally when Vault-backed secret loading is required.
+- Removed hardcoded Postgres password fallback.
+  - `POSTGRES_PASSWORD` has no default and must be supplied via environment/secrets manager.
+- Changed Trade History API exposure to local-only by default.
+  - `TRADE_HISTORY_API_HOST` default: `127.0.0.1`
+  - `TRADE_HISTORY_API_PORT` default: `5050`
+- Added API key authentication to the Flask trade-history API (`X-API-Key` header).
+  - `/health` is exempt for health checks.
+
+Operational implications:
+- If Prometheus scrapes `/metrics` through the authenticated Flask API, you must either:
+  - configure Prometheus to send `X-API-Key`, or
+  - explicitly exempt `/metrics` in API auth middleware.
+- For remote/API access from outside localhost (for example in containerized deployments), set `TRADE_HISTORY_API_HOST=0.0.0.0` intentionally.
+
 ## 🔐 HashiCorp Vault Integration
 
 ### Security Architecture
@@ -195,8 +215,8 @@ def test_vault_security():
 
 ---
 
-**Last Updated**: July 20, 2025
+**Last Updated**: February 18, 2026
 **Security Review**: Complete
-**Next Review**: January 20, 2026
+**Next Review**: August 18, 2026
 
-> **Note**: This security implementation represents enterprise-grade protection for cryptocurrency trading operations. All security measures are actively monitored and regularly audited.
\ No newline at end of file
+> **Note**: This security implementation represents enterprise-grade protection for cryptocurrency trading operations. All security measures are actively monitored and regularly audited.
diff --git a/docs/README.md b/docs/README.md
index 50fb6857..6e23bb89 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -6,6 +6,10 @@
 - **[SECURITY.md](../SECURITY.md)** - Complete security implementation with HashiCorp Vault
 - **[VAULT_SETUP.md](VAULT_SETUP.md)** - Step-by-step Vault configuration guide
 
+### 🧰 Operations Runbooks
+- **[2026-02-18 Container Observability Runbook](ops/2026-02-18_container_observability_runbook.md)** - Container startup, Grafana/Prometheus "No data" troubleshooting, and post-hardening env requirements
+- **[2026-02-10 ELVIS No Data Debug](ops/2026-02-10_elvis_no_data_debug.md)** - Prior investigation notes for dashboard data issues
+
 ### 📊 System Architecture
 - **[API Monitoring](../utils/api_connection_tester.py)** - Real-time API health monitoring
 - **[Console Dashboard](../utils/console_dashboard.py)** - Live trading dashboard with visual indicators
@@ -116,4 +120,4 @@ Updated: 18:15:42
 
 **For detailed setup instructions, see [VAULT_SETUP.md](VAULT_SETUP.md)**
 **For security details, see [SECURITY.md](../SECURITY.md)**
-**For support, check the main [README.md](../README.md)**
\ No newline at end of file
+**For support, check the main [README.md](../README.md)**
diff --git a/docs/ops/2026-02-18_container_observability_runbook.md b/docs/ops/2026-02-18_container_observability_runbook.md
new file mode 100644
index 00000000..04a36519
--- /dev/null
+++ b/docs/ops/2026-02-18_container_observability_runbook.md
@@ -0,0 +1,138 @@
+# 2026-02-18 - Container observability and hardening runbook
+
+## Scope
+This runbook documents the post-hardening behavior introduced on February 18, 2026 (issues `#9`, `#10`, `#11`, `#13`, `#16`) and how to keep Grafana dashboards populated in container deployments.
+
+## Security-sensitive runtime defaults
+
+### Required environment behavior
+- `VAULT_TOKEN` has no hardcoded fallback.
+- `POSTGRES_PASSWORD` has no hardcoded fallback.
+- Trade History API defaults to local bind:
+  - `TRADE_HISTORY_API_HOST=127.0.0.1`
+  - `TRADE_HISTORY_API_PORT=5050`
+- Flask trade-history API now requires `X-API-Key: ` for all routes except `/health`.
+
+### Practical impact
+- If you do not provide `VAULT_TOKEN`, Vault-backed secret retrieval is unavailable.
+- If you do not provide `POSTGRES_PASSWORD`, DB auth may fail.
+- If you keep the default API host binding (`127.0.0.1`), scrapers running remotely or in another container cannot reach the API unless they share its network namespace.
+
+## Container startup (monitoring stack)
+
+From project root:
+
+```bash
+docker compose up -d postgres redis elvis-bot prometheus grafana loki promtail
+docker compose ps
+```
+
+Expected host endpoints:
+- Grafana: `http://localhost:3001`
+- Prometheus: `http://localhost:9090`
+- Trade API health: `http://localhost:5050/health`
+
+## Verify metrics pipeline end-to-end
+
+1. Verify ELVIS API process is healthy:
+
+```bash
+curl -s http://localhost:5050/health
+```
+
+2. Verify metrics endpoint responds:
+
+Without API auth:
+```bash
+curl -s -o /dev/null -w "%{http_code}\n" http://localhost:5050/metrics
+```
+
+With API auth:
+```bash
+curl -s -H "X-API-Key: ${API_KEY}" -o /dev/null -w "%{http_code}\n" http://localhost:5050/metrics
+```
+
+3. Verify Prometheus target status:
+
+```bash
+curl -s http://localhost:9090/api/v1/targets
+```
+
+For job `elvis`, confirm `health` is `up`.
+
+4. In Grafana, verify datasource and dashboard:
+- Datasource: `Prometheus` (provisioned via `grafana/provisioning/datasources/prometheus.yml`)
+- Dashboard folder: `ELVIS`
+
+## Grafana "No data" troubleshooting
+
+### 1) ELVIS service not running
+Symptom:
+- Prometheus target state is `down` with connection refused.
+
+Fix:
+```bash
+docker compose up -d elvis-bot
+docker compose logs --tail=200 elvis-bot
+```
+
+### 2) Scrape target mismatch
+If ELVIS runs inside the same Compose network as Prometheus:
+- Preferred target: `elvis-bot:5050`
+
+If ELVIS runs on host and Prometheus runs in container:
+- Target can be: `host.docker.internal:5050`
+
+Update `prometheus.yml` accordingly and restart Prometheus:
+```bash
+docker compose restart prometheus
+```
+
+### 3) API auth blocking `/metrics`
+Because API auth now applies to every route except `/health`, `/metrics` may return `401` or `503`.
+
+Fix option A (recommended):
+- Configure Prometheus to send `X-API-Key` via `http_headers` in the scrape config (this needs a Prometheus release that supports custom scrape headers; check the `scrape_config` reference for your version).
+
+Example:
+```yaml
+- job_name: 'elvis'
+  static_configs:
+    - targets: ['elvis-bot:5050']
+  metrics_path: '/metrics'
+  scheme: 'http'
+  scrape_interval: 10s
+  http_headers:
+    X-API-Key:
+      secrets: ['']
+```
+
+Fix option B:
+- Exempt `/metrics` from Flask auth middleware if your deployment model requires anonymous local scrape.
+
+### 4) API bound to localhost only
+If Prometheus reaches the API across a container or host boundary, the local-only bind can prevent access.
+
+Fix:
+- Set `TRADE_HISTORY_API_HOST=0.0.0.0` intentionally in deployment env.
+
+## Repository hygiene policy
+
+To prevent repo bloat and accidental secret/artifact commits:
+- Do not commit local environments (`env*/`, `venv*/`, `.venv/`).
+- Do not commit local ML source/build directories (for example `tensorflow/`).
+- Keep generated logs/data/model artifacts out of git unless explicitly versioned.
+
+If large local directories are accidentally tracked:
+
+```bash
+git rm -r --cached --ignore-unmatch env-coreml env-ydf .venv venv tensorflow
+git commit -m "chore(repo): stop tracking local environment/build artifacts"
+```
+
+## Related files
+- `README.md`
+- `SECURITY.md`
+- `docker-compose.yml`
+- `prometheus.yml`
+- `trading/utils/trade_history_api.py`
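
Review note on step 3 of the runbook's verification section: the raw `/api/v1/targets` response is verbose, so the "confirm `health` is `up`" check may be easier to script. The sketch below is a minimal stdlib-only illustration, not part of this patch. The helper names `job_health` and `fetch_targets` are hypothetical, and the sample payload (including its `lastError` text) is made up to mirror the shape of the Prometheus targets API.

```python
import json
from urllib.request import urlopen


def job_health(targets_payload: dict, job: str) -> dict:
    """Map each scrape URL of `job` to its reported health string."""
    active = targets_payload.get("data", {}).get("activeTargets", [])
    return {
        target.get("scrapeUrl", "?"): target.get("health", "unknown")
        for target in active
        if target.get("labels", {}).get("job") == job
    }


def fetch_targets(base_url: str = "http://localhost:9090") -> dict:
    """Fetch the Prometheus targets API response (needs a running Prometheus)."""
    with urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)


# Offline example: a trimmed, invented payload in the targets API shape.
sample = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "elvis", "instance": "elvis-bot:5050"},
                "scrapeUrl": "http://elvis-bot:5050/metrics",
                "health": "down",
                "lastError": "server returned HTTP status 401 UNAUTHORIZED",
            },
            {
                "labels": {"job": "prometheus", "instance": "localhost:9090"},
                "scrapeUrl": "http://localhost:9090/metrics",
                "health": "up",
                "lastError": "",
            },
        ]
    },
}

print(job_health(sample, "elvis"))
```

Against a live stack, `job_health(fetch_targets(), "elvis")` would report the same mapping for the real targets. An empty result suggests the `elvis` job is missing from `prometheus.yml` rather than failing to scrape.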
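
Review note on "API auth blocking `/metrics`": the patch describes the auth rule (every route except `/health` needs `X-API-Key`) but does not show it. The sketch below models that decision as a pure function so that fix options A and B can be compared side by side. It is an illustration under stated assumptions and not the code in `trading/utils/trade_history_api.py`: the name `auth_status` is hypothetical, and treating `503` as "no API key configured on the server" is a guess at why the runbook mentions that status.

```python
import hmac
from typing import Collection, Mapping, Optional

# Per the patch, only /health is exempt by default.
DEFAULT_EXEMPT_PATHS = frozenset({"/health"})


def auth_status(
    path: str,
    headers: Mapping[str, str],
    expected_key: Optional[str],
    exempt_paths: Collection[str] = DEFAULT_EXEMPT_PATHS,
) -> int:
    """Return the HTTP status the auth layer would produce for a request."""
    if path in exempt_paths:
        return 200
    if not expected_key:
        # Assumption: a server without a configured key refuses with 503.
        return 503
    supplied = headers.get("X-API-Key", "")
    # Constant-time comparison avoids leaking key contents through timing.
    if hmac.compare_digest(supplied.encode(), expected_key.encode()):
        return 200
    return 401


# Fix option A: the scraper sends the key.
print(auth_status("/metrics", {"X-API-Key": "example-key"}, "example-key"))
# Fix option B: /metrics is added to the exempt set.
print(auth_status("/metrics", {}, "example-key", {"/health", "/metrics"}))
# Neither fix applied: an anonymous scrape is rejected.
print(auth_status("/metrics", {}, "example-key"))
```

A plain `dict` lookup is case-sensitive, whereas real HTTP header access in Flask is not, so this stays a model of the rule rather than drop-in middleware.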
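
Review note on the repository hygiene policy: before running the `git rm --cached` cleanup, it can help to see which tracked paths actually sit under the ignored directory patterns. The sketch below is a hypothetical helper, not part of this patch. It assumes the four patterns listed in the README notes (`env*/`, `venv*/`, `.venv/`, `tensorflow/`) and matches them against every directory component, which approximates how a slash-free `.gitignore` directory pattern applies at any depth.

```python
import subprocess
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Directory-name patterns taken from the hardening notes (trailing slash dropped).
IGNORED_DIR_PATTERNS = ("env*", "venv*", ".venv", "tensorflow")


def tracked_artifacts(tracked_paths, patterns=IGNORED_DIR_PATTERNS):
    """Return the tracked paths that live under a directory matching a pattern."""
    hits = []
    for path in tracked_paths:
        directories = PurePosixPath(path).parts[:-1]  # drop the file name
        if any(fnmatchcase(d, pat) for d in directories for pat in patterns):
            hits.append(path)
    return hits


def list_tracked_files(repo_root="."):
    """List tracked files via `git ls-files` (requires git and a repository)."""
    result = subprocess.run(
        ["git", "ls-files"], cwd=repo_root, check=True,
        capture_output=True, text=True,
    )
    return result.stdout.splitlines()


# Offline example with invented paths.
example_paths = [
    "env-coreml/bin/python",
    "trading/utils/trade_history_api.py",
    ".venv/pyvenv.cfg",
    "docs/environment.md",
    "tensorflow/BUILD",
]
print(tracked_artifacts(example_paths))
```

In a checkout, `tracked_artifacts(list_tracked_files())` would list the candidates for the cleanup commit. Note that `env*` is broad and would also flag an unrelated directory such as `envoy/`, which is equally true of the ignore pattern itself.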