Automated Full-Stack Observability deployment resources for Uhstray.io
- Architecture
- Overview
- Contributing Guidelines
- Component Overview
- Getting Started
- Testing and Developing
- Troubleshooting
- Technology References
graph LR
%% Add direction controls to subgraphs
subgraph Observability Collection
subgraph Data Sources
apps[Applications]
containers[Containers]
servers[Servers]
logs[Logs]
end
subgraph Instrumentation
subgraph Exporters
otel_agent[OpenTelemetry Agent - Ports :4317, :4318]
nodeexp[Node Exporter - Ports :9090]
cadvisor[cAdvisor - Ports :9092]
promtail[Promtail - Ports :3100]
end
subgraph Local OTEL Collector
otel_col[OpenTelemetry Collector - Ports :4317, :4318]
end
subgraph Local Time Series Database
prom[Prometheus - Ports :9090]
end
end
end
subgraph Observability Pipeline
subgraph Logging
subgraph Logs Pipeline
alloy_logs{"Alloy<br>:4318"}
end
loki[Loki - Ports :3100]
end
subgraph Tracing
subgraph Tracing Pipeline
alloy_trace{"Alloy<br>:12345/:4319"}
end
tempoDistributor["Tempo Distributor"]
tempoIngesters["Tempo Ingesters"]
tempoQuery["Tempo Query Frontend<br>:3200"]
tempoQuerier["Tempo Querier"]
tempoCompactor["Tempo Compactor"]
tempoMetricsGen["Tempo Metrics Generator"]
end
subgraph Metrics Pipeline
mimirLB{"mimir Load Balancer<br>:9009"}
mimir1["mimir-1<br>:8080"]
mimir2["mimir-2<br>:8080"]
mimir3["mimir-3<br>:8080"]
end
end
subgraph Observability Analytics
subgraph Visualization and Analytics
grafana[Grafana - Ports :3000]
end
subgraph Profiling
pyroscope["Pyroscope<br>:4040"]
end
end
subgraph Data Storage and Recovery
subgraph Object Storage
minio[MinIO S3 Object Storage - Ports :9000]
end
subgraph Relational Storage
postgres[PostgreSQL - Ports :5432]
end
end
%% Data flow connections
apps --> otel_agent
containers --> cadvisor
servers --> nodeexp
logs --> promtail
otel_agent --> otel_col
nodeexp --> prom
cadvisor --> prom
tempoMetricsGen --> mimirLB
promtail --> alloy_logs
otel_col --> alloy_trace
prom --> mimirLB
alloy_logs --> loki
alloy_trace --> tempoDistributor
alloy_trace --> pyroscope
mimirLB --> mimir1
mimirLB --> mimir2
mimirLB --> mimir3
mimir1 --> minio
mimir2 --> minio
mimir3 --> minio
tempoDistributor --> tempoIngesters
tempoQuery --> tempoQuerier
tempoQuerier --> tempoIngesters
tempoCompactor --> minio
tempoIngesters --> minio
grafana --> mimirLB
grafana --> loki
grafana --> tempoQuery
grafana --> pyroscope
grafana --> postgres
This repository contains the deployment resources for our observability stack, including Grafana, Prometheus, Mimir, Tempo, Loki, and OpenTelemetry components. The stack provides comprehensive monitoring, logging, tracing, and metrics collection for Uhstray.io services.
- Prometheus: Time-series database for storing metrics
- Node Exporter: Hardware and OS metrics collection
- cAdvisor: Container metrics collection
- Mimir: Scalable, long-term metrics storage
- Loki: Log aggregation system
- Promtail: Log collection agent
- Tempo: Distributed tracing backend
- OpenTelemetry Collector: Trace collection and processing
- Alloy: Unified telemetry collector
- Grafana: Unified visualization platform for metrics, logs, and traces
- Pyroscope: Continuous profiling platform
- Docker and Docker Compose installed
- Minimum recommended resources: 8 CPU cores, 16GB RAM
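The prerequisites above can be checked before the first start. The following is a minimal sketch, not part of the repository: the function names are hypothetical, and it assumes the resources that matter are the ones the Docker engine reports through `docker info` (on Docker Desktop this is the VM allocation, not the host total).

```shell
#!/bin/sh
# Preflight sketch (hypothetical helper, not shipped with this repository).
# The minimums mirror the recommendation in this README.
MIN_CPUS=8
MIN_MEM_GB=16

# bytes_to_gib BYTES: print BYTES as whole GiB, rounded to the nearest integer.
bytes_to_gib() {
  awk -v b="$1" 'BEGIN { printf "%d", b / 1073741824 + 0.5 }'
}

# meets_minimum CPUS MEM_GB: succeed only when both values reach the minimums.
meets_minimum() {
  [ "$1" -ge "$MIN_CPUS" ] && [ "$2" -ge "$MIN_MEM_GB" ]
}

# preflight: verify Docker and Docker Compose exist, then compare the
# CPU count and memory that the Docker engine reports against the minimums.
preflight() {
  command -v docker >/dev/null 2>&1 || { echo "docker not found"; return 1; }
  docker compose version >/dev/null 2>&1 || { echo "docker compose not found"; return 1; }
  pf_cpus="$(docker info --format '{{.NCPU}}')"
  pf_mem="$(bytes_to_gib "$(docker info --format '{{.MemTotal}}')")"
  if meets_minimum "$pf_cpus" "$pf_mem"; then
    echo "ok: ${pf_cpus} CPUs, ${pf_mem} GiB available to Docker"
  else
    echo "below recommended minimum: ${pf_cpus} CPUs, ${pf_mem} GiB"
    return 1
  fi
}
```

After sourcing the file, run `preflight`. A failure here does not prevent the stack from starting; it only signals that the out-of-memory symptoms described under troubleshooting are more likely.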
Clone this repository and change into the o11y directory:
git clone https://github.com/uhstray-io/o11y.git
cd ./o11y
Run docker compose:
docker compose up -d
Navigate to the following dashboards:
- Grafana Dashboard: http://localhost:3000 (default credentials: admin/admin)
- Prometheus Dashboard: http://localhost:9090
- Mimir Dashboard: http://localhost:9009/
- cAdvisor Dashboard: http://localhost:9092/
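- Tempo API: http://localhost:3200 (Tempo has no standalone UI; explore traces through Grafana)
- Loki API: http://localhost:3100 (Loki has no standalone UI; explore logs through Grafana)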
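Once the containers are up, the endpoints listed above can be probed from the command line. This is a sketch under assumptions: the ports come from the list above, the paths are the standard health or readiness endpoints of each project, and whether this deployment exposes them unchanged (for example through the Mimir load balancer) has not been verified. The function names are hypothetical.

```shell
#!/bin/sh
# Readiness sketch (hypothetical helper, not shipped with this repository).

# endpoints: print one "name url" pair per line.
# Ports follow this README; paths are each project's usual health endpoint.
endpoints() {
  cat <<'EOF'
grafana http://localhost:3000/api/health
prometheus http://localhost:9090/-/ready
mimir http://localhost:9009/ready
cadvisor http://localhost:9092/healthz
tempo http://localhost:3200/ready
loki http://localhost:3100/ready
EOF
}

# check_all: probe every endpoint with curl and report the result.
# Returns non-zero if any endpoint fails to answer with a success status.
check_all() {
  ca_failed=0
  while read -r ca_name ca_url; do
    if curl -fsS --max-time 5 -o /dev/null "$ca_url" 2>/dev/null; then
      echo "ok   $ca_name"
    else
      echo "FAIL $ca_name ($ca_url)"
      ca_failed=1
    fi
  done <<EOF
$(endpoints)
EOF
  return "$ca_failed"
}
```

After sourcing the file, run `check_all`. Some services report not ready for a short period after startup, so a failure immediately after `docker compose up -d` is not necessarily a fault.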
Get the current logs from the deployment to triage:
docker compose logs
Spin the current deployment down:
docker compose down
Spin down the deployment and remove all volumes:
docker compose down -v
Spin down the deployment and remove all images and volumes:
docker compose down --rmi="all" -v
- Services fail to start: Check for port conflicts with docker ps -a and stop any conflicting services
- Out of memory errors: Increase Docker memory allocation in Docker Desktop settings
- Permission issues: Ensure proper file permissions for volume mounts
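The port-conflict check can be narrowed to the ports this stack uses. The sketch below is an illustration with hypothetical function names. It assumes every port named in this README is published on the host, and it only sees ports held by other running Docker containers, not by processes running outside Docker.

```shell
#!/bin/sh
# Port-conflict sketch (hypothetical helper, not shipped with this repository).

# stack_ports: ports named in this README (assumed to be published on the host).
stack_ports() {
  echo "3000 3100 3200 4040 4317 4318 5432 9000 9009 9090 9092"
}

# conflicts_for PORT: read "name|ports" lines on stdin and print the names
# of containers whose port listing publishes host port PORT.
conflicts_for() {
  awk -F '|' -v p=":$1->" 'index($2, p) { print $1 }'
}

# report_conflicts: list running containers that already publish a needed port.
# Run it before starting the stack, or after "docker compose down";
# otherwise the stack's own containers are reported.
report_conflicts() {
  rc_listing="$(docker ps --format '{{.Names}}|{{.Ports}}')"
  for rc_port in $(stack_ports); do
    printf '%s\n' "$rc_listing" | conflicts_for "$rc_port" | while read -r rc_name; do
      echo "port $rc_port is published by container $rc_name"
    done
  done
}
```

After sourcing the file, run `report_conflicts`. Published port ranges (for example `3000-3005`) are not matched by this simple string comparison.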
# View logs for a specific service
docker compose logs grafana
# Follow logs live
docker compose logs -f prometheus
- Grafana - Visualization platform
- Grafana Mimir - Scalable metrics storage
- Grafana Alloy - Unified telemetry collector
- Grafana Beyla - eBPF-based auto-instrumentation
- Grafana Pyroscope - Continuous profiling
- Prometheus - Metrics collection and storage
- Node Exporter - System metrics collection
- Promtail - Log collector
- Windows Exporter - Windows metrics collection
- PostgreSQL Exporter - PostgreSQL metrics
- OpenTelemetry Collector - Telemetry collection
- OTEL Protocol - Telemetry protocol specification
- OTEL Go Instrumentation - Go instrumentation
- OpenLLMetry - LLM observability
- Initial deployment with Grafana, Prometheus, and exporters
- Upgrade Grafana to use Mimir Prometheus TSDB
- Develop OpenTelemetry Collector Process for Wisbot
- Deploy OpenTelemetry o11y collector integrated with Grafana
- Upgrade Alertmanager Storage to use GitHub Actions driven Secrets
- Upgrade to Alloy Collector where necessary for production needs
- Migrate Mimir to Microservice Deployment Mode
- Determine Beyla eBPF Instrumentation Targets
- Add Pyroscope for Wisbot Profiling
- Set up relabeling to streamline service discovery | https://grafana.com/docs/loki/latest/send-data/promtail/scraping/
- Implement high availability configuration for production
- Add custom dashboards for Wisbot service monitoring
- Implement proper secrets management through environment variables
- Remove hardcoded credentials from configuration files
- Configure TLS for exposed services
- Review and secure default credentials
- Enable Alertmanager integration with proper configuration
- Develop comprehensive alerting rules beyond basic infrastructure monitoring
- Set up proper backup/restore procedures for persistent data
- Standardize configuration practices across components
- Complete Wisbot instrumentation to enable service-level metrics
- Implement distributed tracing for application components
- Configure application profiling via Pyroscope
- Create operational procedures and runbooks
- Develop dashboard usage guidelines
- Define metrics dictionary for key indicators
- Document architectural decisions for component choices
alertmanager_storage:
  backend: s3
  s3:
    access_key_id: {{ .Values.minio.rootUser }}
    bucket_name: {{ include "mimir.minioBucketPrefix" . }}-ruler
    endpoint: {{ template "minio.fullname" .Subcharts.minio }}.{{ .Release.Namespace }}.svc:{{ .Values.minio.service.port }}
    insecure: true
    secret_access_key: {{ .Values.minio.rootPassword }}