Replace DataDog/Terraform with open-source monitoring stack #76
Merged
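Cleanup work to remove DataDog monitoring and Terraform IaC ahead of the move to an open-source monitoring stack. Deletes the datadog-agent service block and its env files, and replaces the Terraform entries in .gitignore with monitoring volume entries.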
Remove ddtrace and switch to automatic instrumentation with opentelemetry-instrument. Add the django-prometheus middleware and a /metrics endpoint. Insert trace_id/span_id into the log format to support Loki-Jaeger correlation.
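For readers unfamiliar with django-prometheus, the middleware and /metrics wiring mentioned above usually looks like the fragment below. This is a minimal sketch based on the library's documented setup, not the project's actual settings: the other app and middleware entries are placeholders.

```python
# settings.py fragment (sketch). Only the django_prometheus entries matter here;
# the remaining entries are placeholders, not this project's real configuration.

INSTALLED_APPS = [
    "django.contrib.contenttypes",
    "django.contrib.auth",
    "django_prometheus",  # registers the exporter app
]

MIDDLEWARE = [
    # Listed first so it observes the request before any other middleware.
    "django_prometheus.middleware.PrometheusBeforeMiddleware",
    "django.middleware.security.SecurityMiddleware",
    "django.middleware.common.CommonMiddleware",
    # Listed last so it observes the final response.
    "django_prometheus.middleware.PrometheusAfterMiddleware",
]

# urls.py would then expose the scrape endpoint at /metrics:
#     from django.urls import include, path
#     urlpatterns = [path("", include("django_prometheus.urls"))]
```

The ordering is the point of the example: the two middleware classes are meant to bracket everything else so that request latency covers the full middleware chain.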
Add a monitoring stack made up of the OTel Collector, Jaeger, Prometheus, Grafana, Loki, Promtail, cAdvisor, mysqld-exporter, celery-exporter, and k6 services. Integrate the RabbitMQ prometheus plugin, auto-provision Grafana datasources, and support jumping to the matching Jaeger trace when a trace_id is clicked in Loki.
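…stack
Add MONITORING.md, covering the architecture, a collection of PromQL queries, OTel instrumentation details, log-trace correlation, and a GCP multi-instance deployment guide. In PERFORMANCE_TEST.md, replace the DataDog/InfluxDB references with Prometheus/Grafana and update k6 to send its results via Prometheus remote write.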
Add defaults for otelTraceID/otelSpanID in log formatter to prevent KeyError in services not wrapped by opentelemetry-instrument (e.g. Flower).
- Add k6 HTTP load test (load-test.js) for REST API endpoints
- Add MQTT load test (mqtt-load-test.py) for real IoT pipeline simulation
- Fix Loki 429 errors by increasing stream/ingestion limits
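The contents of mqtt-load-test.py are not shown in this PR page, so the following is only a rough sketch of what an MQTT load generator built on paho-mqtt might look like. The topic prefix, payload schema, device naming, and broker address are all hypothetical; `paho.mqtt.publish.multiple` is the one library call used.

```python
import json
import random
import time


def make_detection(device_id, rng):
    """Build one JSON payload for a simulated camera reading (hypothetical schema)."""
    return json.dumps(
        {
            "device_id": device_id,
            "speed_kmh": round(rng.uniform(30.0, 140.0), 1),
            "captured_at": time.time(),
        }
    )


def run_load(host, port, devices, rounds, topic_prefix="speedcam/detections"):
    """Publish one message per simulated device per round; return the count sent."""
    # Imported lazily so make_detection can be used without paho-mqtt installed.
    import paho.mqtt.publish as publish

    rng = random.Random(0)  # fixed seed keeps runs comparable
    sent = 0
    for _ in range(rounds):
        batch = [
            {
                "topic": f"{topic_prefix}/cam-{i}",  # hypothetical topic layout
                "payload": make_detection(f"cam-{i}", rng),
                "qos": 1,
            }
            for i in range(devices)
        ]
        publish.multiple(batch, hostname=host, port=port)
        sent += len(batch)
    return sent
```

A call such as `run_load("localhost", 1883, devices=50, rounds=100)` would push 5,000 messages through the broker, assuming an MQTT listener is reachable on that port.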
- Delete obsolete PERFORMANCE_TEST.md (referenced non-existent files)
- Delete unused rabbitmq.conf and enabled_plugins (plugins enabled via command)
- Fix docker-compose.monitoring.yml:
  - Remove dead MQTT env vars from k6 service
  - Fix celery-exporter: remove wrong depends_on, add restart policy
  - Add restart policy to mysqld-exporter for cross-file dependency
  - Move cAdvisor to linux profile (macOS incompatible)
- Add comprehensive PERFORMANCE_TEST_GUIDE.md with actual working procedures
- Fix OTEL_TRACES_SAMPLER value (parentbased_tracealways → parentbased_always_on)
- Fix k6 script paths in MONITORING.md (/scripts/tests/*.js → /scripts/load-test.js)
- Fix container names in PERFORMANCE_TEST_GUIDE.md (speedcam-ocr-worker → speedcam-ocr)
- Remove non-existent python Docker service reference
- Remove stale Make requirement from DEPLOYMENT.md
- Add uid to Jaeger datasource for Loki→Jaeger trace linking
- Fix promtail regex to match uppercase hex trace IDs
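To show why the last fix matters, the sketch below compares a lowercase-only hex pattern with one that also accepts uppercase. The project's actual promtail expression is not shown here, so both patterns and the `trace_id=` log fragment are assumptions; Python's `re` is used for illustration, and the named-group syntax `(?P<name>...)` is also accepted by the Go regex engine promtail uses.

```python
import re

# Assumed patterns; the real promtail pipeline expression may differ.
LOWER_ONLY = re.compile(r"trace_id=(?P<trace_id>[0-9a-f]{32})")
ANY_CASE = re.compile(r"trace_id=(?P<trace_id>[0-9a-fA-F]{32})")


def extract_trace_id(pattern, line):
    """Return the captured trace ID, or None when the line does not match."""
    match = pattern.search(line)
    return match.group("trace_id") if match else None


upper_id = "4BF92F3577B34DA6A3CE929D0E0E4736"
line = f"INFO app trace_id={upper_id} span_id=00F067AA0BA902B7 request done"

print(extract_trace_id(LOWER_ONLY, line))  # None: the label would be missing in Loki
print(extract_trace_id(ANY_CASE, line))    # the full 32-character ID
```

When the extraction returns nothing, the log line carries no trace_id field, so the Loki-to-Jaeger link has nothing to follow.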
Summary
Replaces the DataDog + Terraform based monitoring and infrastructure wholesale with an open-source stack, and tidies up the local test environment.
Key changes
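1. Remove DataDog/Terraform
- Remove datadog.env.example and the ddtrace dependency
- Delete Terraform (terraform/) and the Makefile

2. Build the open-source monitoring stack (docker-compose.monitoring.yml)
- Includes services gated by Compose profiles (profiles: [linux], profiles: [loadtest])

3. OpenTelemetry instrumentation
- ddtrace → opentelemetry-distro + auto-instrumentation packages (django, celery, pymysql, requests, logging)
- Auto-instrument Gunicorn/Celery with the opentelemetry-instrument CLI
- Inject trace_id/span_id into the log format (Loki↔Jaeger correlation)
- Add a defaults dict so that KeyError is avoided even without OTel

4. docker-compose.monitoring.yml stability improvements
- celery-exporter: remove depends_on: prometheus, add restart: unless-stopped
- mysqld-exporter: add restart: unless-stopped (handles the cross-file dependency)
- cAdvisor: profiles: [linux], so it is excluded automatically on macOS

5. Load test scripts
- docker/k6/load-test.js: HTTP REST API load test (k6)
- docker/k6/mqtt-load-test.py: IoT pipeline simulation (Python paho-mqtt)

6. Loki configuration fixes
- max_global_streams_per_user: 10000, which resolves the 429 Too Many Requests errors
- ingestion_burst_size_mb: 16, ingestion_rate_mb: 8

7. Documentation cleanup
- Delete docs/PERFORMANCE_TEST.md (a phantom document that referenced non-existent files)
- Add docs/PERFORMANCE_TEST_GUIDE.md: a test guide that actually works
- Add docs/MONITORING.md: monitoring stack guide
- Update docs/DEPLOYMENT.md: GCP multi-instance architecture
- Delete docker/rabbitmq/rabbitmq.conf and enabled_plugins (unused dead files)

Change statistics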
Commits (8)
- 2e92f01 Remove DataDog agent, Terraform, and Makefile
- b5e5486 Replace ddtrace with OpenTelemetry and django-prometheus
- 00773d8 Add open-source monitoring stack with docker-compose.monitoring.yml
- 232966e Add monitoring docs and update performance test guide
- e3b5dca Update deployment guide for multi-instance GCP architecture
- 4b1dd82 Fix OTel logging crash when running without opentelemetry-instrument
- b65bb1e Add load test scripts and fix Loki ingestion limits
- 846f4e0 Clean up monitoring stack and replace performance test docs

Test plan
- docker compose -f docker-compose.yml -f docker-compose.monitoring.yml up -d starts cleanly
- Confirm that traces are collected for speedcam-api and speedcam-ocr
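The trace check in the test plan could be scripted roughly as follows. This is a sketch under assumptions: it queries the `/api/services` endpoint of the Jaeger query service on the default UI port 16686, and it assumes the OTel service names match `speedcam-api` and `speedcam-ocr` as written in the test plan.

```python
import json
import urllib.request

# Names taken from the test plan; assumed to equal the OTel service names.
EXPECTED_SERVICES = {"speedcam-api", "speedcam-ocr"}


def missing_services(payload, expected=EXPECTED_SERVICES):
    """Return the expected services absent from a Jaeger service-list response."""
    reported = set(payload.get("data") or [])  # "data" can be null when empty
    return sorted(set(expected) - reported)


def check_jaeger(base_url="http://localhost:16686"):
    """Fetch the service list from Jaeger and report which services are missing."""
    with urllib.request.urlopen(f"{base_url}/api/services", timeout=5) as resp:
        return missing_services(json.load(resp))
```

An empty list from `check_jaeger()` would mean both services have reported at least one trace. Services only appear after they have handled traffic, so the check is best run after sending a few requests.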