feat: Implement dicom hashing and pdf visual redaction #101
Fixes #37
Hi @pradeeban, @karthiksathishjeemain,
This PR addresses Issue #37 by implementing the missing anonymization pipeline for DICOM and PDF files. While the architectural approach was previously discussed, the repository lacked a working implementation that handled cross-platform environments.
What was the problem?
Poppler and Tesseract cause test failures on Linux CI servers when the code was written on Windows.
How I solved it
I built a cross-platform pipeline that addresses these specific gaps:
1. Cryptographic Linkage (DICOM)
Instead of deleting the PatientID, I implemented SHA-256 hashing. This converts IDs (e.g., 12345) into secure, consistent hashes (a8f93...). This preserves the ability to track unique patients for research without exposing their raw identity.
2. Visual Redaction (PDF)
I implemented a visual processing pipeline: PDF → Images → OCR → Redact → Recompile PDF. This physically draws black boxes over identified names and dates, ensuring burnt-in text is sanitized irreversibly.
3. Dynamic Path Resolution
I refactored the backend to check environment variables for external tools (POPPLER_PATH, TESSERACT_PATH) first. If these tools are missing (as is common in basic CI runners), the tests now skip gracefully instead of crashing the build.
Changes Implementation
pydicom and presidio-image-redactor.
@pytest.mark.skipif.
(pdf2image, pytesseract, pydicom).
How to Test Locally
cd python_backend
Evidence
1. DICOM Hashing Proof
Verifying that Patient IDs are cryptographically hashed rather than just deleted.
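For reviewers who want to reproduce the idea without the DICOM fixtures, the hashing step can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in this PR: the function name `hash_patient_id` and the optional `salt` parameter are hypothetical and not taken from the diff.

```python
import hashlib


def hash_patient_id(patient_id: str, salt: str = "") -> str:
    """Return a deterministic SHA-256 hex digest for a patient identifier.

    `salt` is a hypothetical extra that is not described in the PR; with the
    default empty string this is a plain SHA-256 of the ID.
    """
    return hashlib.sha256((salt + patient_id).encode("utf-8")).hexdigest()


original = "12345"
hashed = hash_patient_id(original)

# The same input always yields the same digest, so records stay linkable.
assert hashed == hash_patient_id(original)
# The raw identifier no longer appears in the stored value.
assert hashed != original
# A SHA-256 hex digest is always 64 characters long.
assert len(hashed) == 64
```

With pydicom, the same function could presumably be applied as `ds.PatientID = hash_patient_id(str(ds.PatientID))` on a dataset loaded via `pydicom.dcmread`, assuming the tag is present.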
2. PDF Visual Redaction
Verifying that PII is visually masked with black boxes.
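The Redact step of the pipeline can be illustrated with a small, dependency-free sketch. It assumes OCR output in the dictionary shape that `pytesseract.image_to_data` returns with `output_type=Output.DICT` (parallel lists under `text`, `left`, `top`, `width`, `height`). The name list and the date regex are stand-ins for whatever PII detection the PR actually uses, and `boxes_to_redact` is a hypothetical helper.

```python
import re

# Hypothetical pattern: numeric dates such as 01/02/1990 or 1990-01-02.
DATE_PATTERN = re.compile(r"\b\d{1,4}[/-]\d{1,2}[/-]\d{1,4}\b")


def boxes_to_redact(ocr, names):
    """Return (left, top, right, bottom) boxes for words that look like PII.

    `ocr` is a dict of parallel lists keyed by "text", "left", "top",
    "width" and "height". `names` is a list of known names to mask.
    """
    lowered = {n.lower() for n in names}
    boxes = []
    for i, word in enumerate(ocr["text"]):
        token = word.strip()
        if not token:
            continue  # OCR output contains empty entries for layout blocks
        is_name = token.lower().strip(".,;:") in lowered
        is_date = DATE_PATTERN.search(token) is not None
        if is_name or is_date:
            left, top = ocr["left"][i], ocr["top"][i]
            boxes.append((left, top, left + ocr["width"][i], top + ocr["height"][i]))
    return boxes


# Made-up OCR result for one page image.
sample = {
    "text": ["Patient:", "Alice", "DOB", "01/02/1990", ""],
    "left": [10, 90, 10, 60, 0],
    "top": [10, 10, 40, 40, 0],
    "width": [70, 50, 40, 90, 0],
    "height": [20, 20, 20, 20, 0],
}
redactions = boxes_to_redact(sample, names=["Alice"])
```

Each returned box could then be filled on the page image with Pillow, for example `ImageDraw.Draw(image).rectangle(box, fill="black")`, before the images are recompiled into a PDF.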
3. Test Suite Status
Verifying the new module passes checks.
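The graceful skip relies on knowing whether Poppler and Tesseract can be found at all. A rough sketch of that lookup, using only the standard library, is below. It assumes each environment variable may hold either the path to the binary or the directory containing it, and that `pdftoppm` and `tesseract` are the executables of interest. The helper `resolve_tool` is hypothetical and not taken from the PR.

```python
import os
import shutil


def resolve_tool(env_var, executable):
    """Return a usable path for an external tool, or None if unavailable."""
    configured = os.environ.get(env_var)
    if configured:
        # The variable may point directly at the binary...
        if os.path.isfile(configured):
            return configured
        # ...or at the directory that contains it.
        candidate = shutil.which(executable, path=configured)
        if candidate:
            return candidate
    # Otherwise fall back to whatever is on the system PATH.
    return shutil.which(executable)


POPPLER = resolve_tool("POPPLER_PATH", "pdftoppm")
TESSERACT = resolve_tool("TESSERACT_PATH", "tesseract")

# Tests that need both tools can be skipped when either one is missing.
TOOLS_AVAILABLE = POPPLER is not None and TESSERACT is not None
```

A test could then be decorated with `@pytest.mark.skipif(not TOOLS_AVAILABLE, reason="Poppler/Tesseract not installed")`. The resolved values could also be handed to the libraries, for instance through the `poppler_path` argument of `pdf2image.convert_from_path` (which expects the directory) and `pytesseract.pytesseract.tesseract_cmd`.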
PR Checklist
dev branch.
(black, isort, and flake8).
Note to Maintainers:
Since this is my first contribution to this specific module, I am very open to feedback! If you have any suggestions on the architectural decisions (specifically the hashing approach) or the code style, please let me know and I will iterate on it immediately.