fix: resolve gevent DB thread-safety issue and add STAR analysis document #82

Merged
lsh1215 merged 12 commits into develop from docs/gevent-db-thread-safety
Feb 18, 2026

Conversation

lsh1215 (Member) commented Feb 13, 2026

Summary

  • Fix the "DatabaseWrapper objects created in a thread can only be used in that same thread" error caused by the Celery gevent pool combined with OTel late monkey-patching
  • In notification_tasks.py, reset _thread_ident to the current greenlet ID at task start, then clean up stale connections
  • Add a STAR-method analysis document intended for a blog post (docs/GEVENT_DB_THREAD_SAFETY.md)

Root Cause

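The opentelemetry-instrument wrapper imports ssl/urllib3 before Celery runs gevent.monkey.patch_all(), so the patch comes too late to make threading.local() greenlet-local. As a result, all greenlets share the same django.db.connections dict, and the thread ID recorded on a connection no longer matches the greenlet that uses it.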
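
The sharing described above can be reproduced without gevent. The sketch below is a stdlib-only illustration, not the project's code: `get_connection`, `run_tasks`, and the task IDs are hypothetical stand-ins, and each task ID plays the role of the greenlet ID that a patched `get_ident()` would return.

```python
import threading

# Unpatched threading.local: storage is keyed by OS thread, not by greenlet.
_local = threading.local()


def get_connection(task_id):
    """Return the cached stand-in connection, creating it on first use.

    task_id is a hypothetical stand-in for a greenlet ID.
    """
    if not hasattr(_local, "conn"):
        # The first task to ask becomes the recorded owner.
        _local.conn = {"owner": task_id}
    return _local.conn


def run_tasks(task_ids):
    """Run hypothetical tasks one after another inside the same OS thread."""
    return [(task_id, get_connection(task_id)["owner"]) for task_id in task_ids]


results = run_tasks([101, 102, 103])
# Every task after the first sees a connection owned by someone else.
mismatches = [(task, owner) for task, owner in results if task != owner]
print(results)
print(mismatches)
```

Because all three tasks run in one OS thread, they receive the same cached object, which is the situation in which an owner check based on the current ident starts to fail.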

Fix

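import _thread

from django import db

# Hand ownership of every connection to the current greenlet so that Django's
# thread-sharing validation passes, then close the stale connections.
current_ident = _thread.get_ident()
for conn in db.connections.all():
    conn._thread_ident = current_ident  # private Django attribute
db.close_old_connections()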

Test plan

  • Deploy the code to the speedcam-alert container and restart it
  • Trigger Detection processing and confirm that no thread-sharing error occurs on retry
  • Confirm normal processing in the Jaeger trace

Document the DatabaseWrapper thread-sharing error that occurs when
the Celery alert worker uses the gevent pool with concurrency=100. Includes
root cause analysis, solution comparison, and reference materials.

Add Layer 1 root cause: opentelemetry-instrument imports ssl before
Celery calls gevent.monkey.patch_all(), causing incomplete patching.
Include actual container startup logs showing MonkeyPatchWarning and
_after_fork_in_child AssertionError. Add cause hierarchy diagram.

Restructure from verbose reference format to storytelling flow.
Remove deployment-specific instructions, reduce redundancy,
curate references to official docs + 3 GitHub issues + 3 enterprise blogs.
…nalysis

- Add 5 mermaid diagrams: prefork vs gevent comparison, sequence diagram
  for greenlet retry failure, late patching flow, cause layer diagram,
  quadrant chart for solution trade-offs
- Make capture placeholders visible (blockquote format, not HTML comments)
- Add personal Spring developer perspective on monkey-patching
- Expand solution section with detailed trade-off analysis per option
Replace structured method-by-method format with storytelling approach
where trade-offs emerge naturally through the elimination process.
…rs in prod

Local runs celery directly (proper monkey-patch order), deployment uses
opentelemetry-instrument wrapper which imports ssl before gevent patches.
OTel late patching prevents threading.local from being greenlet-local,
causing stale connections to be shared across greenlets. Transfer
connection ownership to current greenlet before close_old_connections()
so both task code and Celery's post-task cleanup pass validation.
close_old_connections() alone fails because close() also validates
thread sharing. Must reset _thread_ident to current greenlet first.
…ead-safety

# Conflicts:
#	tasks/notification_tasks.py
…cause fix)

Remove the existing workaround that depended on a Django private API (_thread_ident),
and fix the root cause by correcting the monkey-patch order with the OTel environment
variable (OTEL_PYTHON_AUTO_INSTRUMENTATION_EXPERIMENTAL_GEVENT_PATCH).
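
A worker launch with that variable set might be prepared as in the sketch below. Several details are assumptions that the commit message does not state: the value `patch_all`, the Celery app name `config`, and the helper names `worker_env` and `worker_command` are illustrative only. The gevent pool and concurrency of 100 come from the commits above.

```python
import os

# Variable name taken from the commit message; the value is an assumption.
GEVENT_PATCH_VAR = "OTEL_PYTHON_AUTO_INSTRUMENTATION_EXPERIMENTAL_GEVENT_PATCH"


def worker_env(base=None):
    """Return a copy of the environment with the gevent patch switch set."""
    env = dict(os.environ if base is None else base)
    env[GEVENT_PATCH_VAR] = "patch_all"  # assumed value
    return env


def worker_command(app="config"):
    """Build the worker command line; the app name is hypothetical."""
    return [
        "opentelemetry-instrument",
        "celery", "-A", app, "worker",
        "--pool=gevent",
        "--concurrency=100",
    ]


env = worker_env({})
command = worker_command()
print(env[GEVENT_PATCH_VAR], " ".join(command))
```

The resulting `env` and `command` could be passed to `subprocess.run(command, env=env)` from a launcher script, or expressed as an exported variable in the container's start script.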
- Kombu consumer receives detections.completed events (single thread)
- send_notification dispatched to Celery gevent pool via .delay()
- notification_tasks.py converted to @shared_task with autoretry
- fcm_queue added to celery.py for FCM task routing
- start_alert_worker.sh runs 2 processes (POSIX sh compatible)
- GEVENT_DB_THREAD_SAFETY.md cleaned up to focus on concurrency issue
@lsh1215 lsh1215 merged commit 650546d into develop Feb 18, 2026
6 checks passed