@Titas-Ghosh

Fixes #37

Hi @pradeeban, @karthiksathishjeemain,

This PR addresses Issue #37 by implementing the missing anonymization pipeline for DICOM and PDF files. While the architectural approach was previously discussed, the repository lacked a working implementation that handled cross-platform environments.

What was the problem?

  • DICOM files: We needed a way to protect patient identity while preserving "Longitudinal Linkage" (linking records across visits). Simple deletion makes this impossible.
  • PDF files: Medical reports often contain PII "burnt into" the pixel layer, which metadata scrubbing misses.
  • CI/CD instability: Paths to tools like Poppler and Tesseract that were hardcoded on Windows cause test failures on Linux CI servers.

How I solved it

I built a cross-platform pipeline that addresses these specific gaps:

1. Cryptographic Linkage (DICOM)
Instead of deleting the PatientID, I hash it with SHA-256. This maps each ID (e.g., 12345) to a consistent digest (a8f93...), so the same patient always receives the same pseudonym. Researchers can still track unique patients without seeing the raw identifier.
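
To make the linkage property concrete, here is a minimal sketch of the hashing step. It is not the code from advanced_anonymizer.py; the function names pseudonymize_patient_id and pseudonymize_dataset are hypothetical, and the sketch works on any object with a PatientID attribute rather than importing pydicom.

```python
import hashlib


def pseudonymize_patient_id(patient_id: str) -> str:
    """Return the SHA-256 hex digest of a patient ID (hypothetical helper).

    The same input always yields the same 64-character digest, which is
    what keeps records from different visits linkable.
    """
    return hashlib.sha256(patient_id.encode("utf-8")).hexdigest()


def pseudonymize_dataset(ds):
    """Replace ds.PatientID in place with its digest (hypothetical helper).

    `ds` is assumed to be any object exposing a PatientID attribute,
    for example a dataset returned by pydicom.dcmread().
    """
    ds.PatientID = pseudonymize_patient_id(str(ds.PatientID))
    return ds
```

With pydicom installed, a dataset loaded via `pydicom.dcmread(path)` could be passed to `pseudonymize_dataset` and written back with `ds.save_as(out_path)`. The 64-character hex digest fits the 64-character limit of the LO value representation used for PatientID. One design point that may be worth discussing in review: short numeric IDs come from a small space, so a keyed hash (for example `hmac` with a secret key) would make guessing the original ID harder than plain SHA-256 does.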

2. Visual Redaction (PDF)
I implemented a visual processing pipeline: PDF → Images → OCR → Redact → Recompile PDF. This physically draws black boxes over identified names and dates, ensuring burnt-in text is sanitized irreversibly.
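
The four stages can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the PR uses presidio-image-redactor to identify PII, whereas this sketch substitutes a simple caller-supplied term list. The names boxes_to_redact and redact_pdf are hypothetical. Third-party imports (pdf2image, pytesseract, Pillow) are kept inside redact_pdf so the box-selection helper can be exercised without them.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def boxes_to_redact(ocr: Dict[str, list], sensitive_terms: Iterable[str]) -> List[Box]:
    """Pick the OCR word boxes whose text matches a sensitive term.

    `ocr` is assumed to have the column layout that
    pytesseract.image_to_data(..., output_type=Output.DICT) produces:
    parallel lists under "text", "left", "top", "width", "height".
    """
    wanted = {term.lower() for term in sensitive_terms}
    boxes = []
    for text, left, top, width, height in zip(
        ocr["text"], ocr["left"], ocr["top"], ocr["width"], ocr["height"]
    ):
        word = text.strip().strip(".,;:").lower()
        if word and word in wanted:
            boxes.append((left, top, left + width, top + height))
    return boxes


def redact_pdf(
    src_pdf: str,
    dst_pdf: str,
    sensitive_terms: Iterable[str],
    poppler_path: Optional[str] = None,
) -> None:
    """PDF -> images -> OCR -> redact -> recompile PDF (hypothetical sketch)."""
    from pdf2image import convert_from_path
    from PIL import ImageDraw
    import pytesseract

    terms = list(sensitive_terms)
    # Stage 1: rasterize every page, so the output has no text layer left.
    pages = convert_from_path(src_pdf, dpi=200, poppler_path=poppler_path)
    if not pages:
        raise ValueError(f"no pages rendered from {src_pdf}")
    for page in pages:
        # Stage 2: OCR the page to get word-level bounding boxes.
        ocr = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
        # Stage 3: paint an opaque black rectangle over each matching word.
        draw = ImageDraw.Draw(page)
        for box in boxes_to_redact(ocr, terms):
            draw.rectangle(box, fill="black")
    # Stage 4: write the redacted page images back out as one PDF.
    pages[0].save(dst_pdf, "PDF", save_all=True, append_images=pages[1:])
```

Because the output is rebuilt from rasterized pages, the pixels under each box are overwritten rather than hidden behind an annotation, which is what makes the redaction irreversible.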

3. Dynamic Path Resolution
I refactored the backend to check environment variables for external tools (POPPLER_PATH, TESSERACT_PATH) first. If these tools are missing (as is common in basic CI runners), the tests now skip gracefully instead of crashing the build.
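
A minimal sketch of this lookup order, assuming a hypothetical helper named resolve_tool. The environment variable names come from this PR; the fallback to the system PATH and the executable names passed in the usage note are assumptions for illustration.

```python
import os
import shutil
from typing import Optional


def resolve_tool(env_var: str, executable: str) -> Optional[str]:
    """Locate an external tool without hardcoding a platform-specific path.

    Lookup order (the PATH fallback is an assumption of this sketch):
      1. the location named by `env_var`, if it is set and exists on disk;
      2. `executable` as found on the system PATH;
      3. None, meaning the tool is unavailable.
    """
    configured = os.environ.get(env_var)
    if configured and os.path.exists(configured):
        return configured
    return shutil.which(executable)
```

In a test module, the result could drive the skip condition, for example `@pytest.mark.skipif(resolve_tool("TESSERACT_PATH", "tesseract") is None, reason="Tesseract not installed")`, so a runner without the tool reports a skipped test instead of a failed build.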

Changes Implemented

  • python_backend/advanced_anonymizer.py: Added core logic for metadata hashing and visual redaction using pydicom and presidio-image-redactor.
  • python_backend/tests/test_advanced_anonymizer.py: Added a robust test suite that handles missing dependencies via @pytest.mark.skipif.
  • requirements.txt: Added necessary libraries (pdf2image, pytesseract, pydicom).

How to Test Locally

  1. Navigate to the backend:
    cd python_backend
  2. Install the new dependencies:
    pip install -r requirements.txt
  3. Run the specific test file:
    pytest -v tests/test_advanced_anonymizer.py

Evidence

1. DICOM Hashing Proof
Verifying that Patient IDs are cryptographically hashed rather than just deleted.

Screenshot 2026-01-25 024109

2. PDF Visual Redaction
Verifying that PII is visually masked with black boxes.

Screenshot 2026-01-25 024122

3. Test Suite Status
Verifying the new module passes checks.

Screenshot 2026-01-25 030319

PR Checklist


Note to Maintainers:

Since this is my first contribution to this specific module, I am very open to feedback! If you have any suggestions on the architectural decisions (specifically the hashing approach) or the code style, please let me know and I will iterate on it immediately.

@pradeeban pradeeban merged commit 7625c6a into healthyinc:dev Jan 26, 2026
0 of 4 checks passed
@pradeeban
Contributor

Thank you and congratulations on your first PR.
