AnonLFI 2.0: Extensible Architecture for PII Pseudonymization in CSIRTs with OCR and Technical Recognizers
AnonLFI 2.0 is a modular pseudonymization framework for CSIRTs that resolves the conflict between data confidentiality (GDPR/LGPD) and analytical utility. It uses HMAC-SHA256 to generate strong, consistent pseudonyms that support controlled de-anonymization, natively preserves XML and JSON structures, and integrates an OCR pipeline and specialized technical recognizers to handle PII in complex security artifacts. This allows sensitive incident data to be used safely for threat analysis, detection engineering, and training AI (LLM) models.
The tool is designed with a modular, layered architecture to separate responsibilities and allow for extensibility. The following diagram illustrates the main components and workflows.
```mermaid
%% Anonymization Architecture Flow (AnonLFI 2.0)
graph TD
    A[User] -- "uv run anon.py <file> [args]" --> B(anon.py CLI);

    subgraph "1. Orchestration and Selection"
        B -- "Reads arguments (--lang, --preserve...)" --> B;
        B -- "Instantiates" --> Eng(AnonymizationOrchestrator);
        B -- "get_processor(file, engine)" --> F["Processor Factory (processors.py)"];
    end

    subgraph "2. Processing by Type (processors.py)"
        F -- ".pdf" --> P_PDF(PdfFileProcessor);
        F -- ".json" --> P_JSON(JsonFileProcessor);
        F -- ".txt" --> P_TXT(TextFileProcessor);
        F -- ".docx" --> P_DOCX(DocxFileProcessor);
        F -- ".png" --> P_IMG(ImageFileProcessor);
        F -- "..." --> P_ETC(...);
    end

    subgraph "3. Extraction and OCR (e.g., PDF / DOCX)"
        P_PDF -- "Reads file (PyMuPDF)" --> T1[Plain Text];
        P_PDF -- "Extracts Embedded Images" --> IMG[Images];
        IMG -- "Processes w/" --> OCR(Tesseract OCR);
        OCR -- "OCR Text" --> T2[Image Text];
        T1 & T2 -- "Concatenates Content" --> TXT_BRUTO(Raw Text);
    end

    subgraph "4. Anonymization Engine (engine.py)"
        TXT_BRUTO -- "orchestrator.anonymize_text()" --> ENG_A(Presidio Analyzer);
        ENG_A -- "Loads Models (spaCy, Transformer)" --> MOD(models/);
        ENG_A -- "Identifies Entities (PII, technical...)" --> ENG_B(CustomSlugAnonymizer);
        ENG_B -- "HMAC-SHA256(text, SECRET_KEY)" --> HASH[Secure Hash];
        HASH -- "Saves Mapping (original, hash)" --> DB[(db/entities.db)];
        HASH -- "Generates Slug [TYPE_hash...]" --> SLUG[Anonymized Slug];
        SLUG -- "Replaces PII in Text" --> TXT_ANON(Anonymized Text);
    end

    subgraph "5. Output Generation"
        TXT_ANON -- "Returns to" --> P_PDF;
        P_PDF -- "Writes .txt file (or preserves .json, .csv)" --> OUT(output/anon_file...);
        B -- "Writes Report" --> LOG(logs/report.txt);
    end
```
- **Structure-Preserving Processing**: Natively processes `.json` and `.xml` files to preserve their original hierarchy, while also supporting `.txt`, `.csv`, `.pdf`, `.docx`, and `.xlsx` (see the structure-preservation sketch after this list).
- **OCR for Images**: Automatically extracts and anonymizes text embedded in images within PDF and DOCX files. Also supports direct anonymization of image files like `.png`, `.jpeg`, `.gif`, `.bmp`, `.tiff`, `.webp`, and more.
- **Advanced Entity Recognition**: Uses Presidio and a Transformer model (`Davlan/xlm-roberta-base-ner-hrl`) for high-accuracy entity detection.
- **Cybersecurity-Focused Recognizers**: Includes custom logic to detect specific patterns like IP addresses, URLs, hostnames, hashes, UUIDs, and more.
- **Consistent & Secure Anonymization**: Generates stable HMAC-SHA256-based slugs for each unique entity.
- **Controlled De-anonymization**: A separate script allows for retrieving original data from a slug, protected by the same secret key.
- **Configurable**: Allows preserving specific entity types, adding terms to an allow-list, and customizing the anonymized slug length.
- **Directory Processing**: Can process a single file or recursively process all supported files in a directory.
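To make "structure-preserving" concrete: a JSON document can be walked recursively so that only string values pass through the anonymizer, while keys, nesting, and non-string values survive untouched. This is a minimal sketch of the idea under that assumption, not the actual `JsonFileProcessor` logic:

```python
import json

def anonymize_json(node, anonymize_text):
    """Recursively pseudonymize string values, keeping keys and hierarchy intact."""
    if isinstance(node, dict):
        return {key: anonymize_json(value, anonymize_text) for key, value in node.items()}
    if isinstance(node, list):
        return [anonymize_json(item, anonymize_text) for item in node]
    if isinstance(node, str):
        return anonymize_text(node)
    return node  # numbers, booleans, and null pass through unchanged

doc = json.loads('{"analyst": "John Doe", "hosts": ["Server-01"]}')
# A placeholder anonymizer stands in for the real engine here.
print(json.dumps(anonymize_json(doc, lambda text: "[REDACTED]")))
```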
This tool is built on top of a powerful stack of open-source libraries:
- Presidio: Core engine for PII identification and anonymization.
- spaCy & Hugging Face Transformers: For state-of-the-art NLP and Named Entity Recognition (NER).
- Pandas: For efficient processing of structured data formats like CSV and XLSX.
- PyMuPDF & python-docx: For parsing PDF and DOCX files.
- Pytesseract: For OCR capabilities to extract text from images.
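Step 3 of the diagram (plain-text extraction plus OCR of embedded images) can be approximated with these same libraries. The snippet below is a simplified sketch of that pipeline, assuming a local Tesseract installation; it is not the exact `PdfFileProcessor` implementation.

```python
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

def extract_raw_text(pdf_path: str) -> str:
    """Concatenate a PDF's plain text with OCR text from its embedded images."""
    doc = fitz.open(pdf_path)
    parts = []
    for page in doc:
        parts.append(page.get_text())  # plain-text layer
        for image_info in page.get_images(full=True):
            xref = image_info[0]
            image_bytes = doc.extract_image(xref)["image"]
            image = Image.open(io.BytesIO(image_bytes))
            parts.append(pytesseract.image_to_string(image))  # OCR text layer
    return "\n".join(parts)
```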
The integrity of the anonymization process is guaranteed by a secure and consistent hashing mechanism. For each sensitive entity detected (e.g., a person's name), the system performs the following steps:
- The entity's text is normalized to remove extra whitespace.
- An HMAC-SHA256 hash is generated from the normalized text, using `ANON_SECRET_KEY` as the secret key. This makes the hash deterministic yet computationally infeasible to recreate without the key.
- The full hash (64 hex characters) is used as a unique and persistent identifier in the database.
- A "slug" (a prefix of the full hash, with a length customizable via `--slug-length`) is substituted into the text, making the output more readable.

This process ensures that the same entity (e.g., "John Doe") is always replaced by the same slug (e.g., `[PERSON_a1b2c3d4]`), maintaining referential consistency in the anonymized data, which is crucial for training AI models.
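In code, the core of this mechanism fits in a few lines. This is a minimal sketch of the scheme described above; the function name and exact normalization rule are assumptions.

```python
import hashlib
import hmac
import os

# Like the tool itself, this sketch refuses to run without the secret key.
SECRET_KEY = os.environ["ANON_SECRET_KEY"].encode()

def make_slug(entity_text: str, entity_type: str, slug_length: int = 64) -> str:
    # 1. Normalize: trim and collapse internal whitespace.
    normalized = " ".join(entity_text.split())
    # 2. Keyed hash: deterministic for the same key, infeasible to forge without it.
    full_hash = hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    # 3. Slug: a configurable-length prefix of the 64-character hex digest.
    return f"[{entity_type}_{full_hash[:slug_length]}]"

# The same entity always yields the same slug, regardless of spacing:
assert make_slug("John  Doe", "PERSON", 8) == make_slug("John Doe", "PERSON", 8)
```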
The tool uses a SQLite database (`db/entities.db`) to persist the mapping between original entities and their anonymized slugs. The main table, `entities`, has the following structure:

| Column | Type | Description |
|---|---|---|
| `id` | `INTEGER` | Primary key. |
| `entity_type` | `TEXT` | The type of the entity (e.g., `PERSON`, `LOCATION`). |
| `original_name` | `TEXT` | The original text of the detected entity. |
| `slug_name` | `TEXT` | The short hash (slug) displayed in the anonymized text. |
| `full_hash` | `TEXT` | The full HMAC-SHA256 hash, used as a unique identifier. |
| `first_seen` | `TEXT` | Timestamp of when the entity was first seen. |
| `last_seen` | `TEXT` | Timestamp of when the entity was last seen. |
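The schema above maps directly onto a small amount of `sqlite3` code. The following is an illustrative sketch only; the exact DDL and upsert logic used by `config.py` are assumptions, including the `UNIQUE` constraint derived from `full_hash`'s role as unique identifier.

```python
import os
import sqlite3

os.makedirs("db", exist_ok=True)
conn = sqlite3.connect("db/entities.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS entities (
        id            INTEGER PRIMARY KEY,
        entity_type   TEXT NOT NULL,
        original_name TEXT NOT NULL,
        slug_name     TEXT NOT NULL,
        full_hash     TEXT NOT NULL UNIQUE,
        first_seen    TEXT NOT NULL,
        last_seen     TEXT NOT NULL
    )
""")
# Re-seeing a known entity only refreshes last_seen (placeholder values shown).
conn.execute(
    "INSERT INTO entities (entity_type, original_name, slug_name, full_hash, first_seen, last_seen) "
    "VALUES (?, ?, ?, ?, datetime('now'), datetime('now')) "
    "ON CONFLICT(full_hash) DO UPDATE SET last_seen = datetime('now')",
    ("PERSON", "John Doe", "a1b2c3d4", "a1b2c3d4" + "0" * 56),
)
conn.commit()
```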
The tool's effectiveness was validated in two representative case studies from the research paper, demonstrating high precision in complex scenarios:
| Scenario | Description | Precision | Recall | F1-Score |
|---|---|---|---|---|
| PDF with OCR | An incident report with PII in text and embedded terminal screenshots. | 100% | 61.9% | 76.5% |
| OpenVAS XML | A vulnerability report with nested technical entities (hashes, certs, etc.). | 100% | 85.42% | 92.13% |
The results confirm the engine's accuracy and the value of the specialized OCR and technical recognizers.
By default, the tool is configured to detect and anonymize a wide range of PII and cybersecurity-related entities:
`PERSON`, `LOCATION`, `ORGANIZATION`, `EMAIL_ADDRESS`, `PHONE_NUMBER`, `IP_ADDRESS`, `URL`, `HOSTNAME`, `HASH` (e.g., SHA256, MD5), `UUID`, `CERT_SERIAL` (Certificate Serials), `CPE_STRING` (Common Platform Enumeration), and `CERT_BODY` (Base64 Certificate Bodies).

This list can be retrieved by running `uv run anon.py --list-entities`.
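Custom technical recognizers of this kind can be registered through Presidio's standard `PatternRecognizer` API. The following is a hedged sketch of how one such recognizer could be added; the hostname regex is deliberately simplified for illustration, and the tool's actual recognizers in `engine.py` are more elaborate.

```python
from presidio_analyzer import AnalyzerEngine, Pattern, PatternRecognizer

# Simplified hostname pattern for illustration only; real-world hostname
# matching needs a far more careful regular expression.
hostname_recognizer = PatternRecognizer(
    supported_entity="HOSTNAME",
    patterns=[Pattern(name="hostname", regex=r"\b(?:[a-z0-9-]+\.)+(?:com|net|org|local)\b", score=0.6)],
)

analyzer = AnalyzerEngine()  # loads Presidio's default spaCy English model
analyzer.registry.add_recognizer(hostname_recognizer)

results = analyzer.analyze(text="Beaconing observed to c2.evil-domain.com.", language="en")
print(results)  # one HOSTNAME match covering "c2.evil-domain.com"
```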
The tool is pre-configured for 24 languages:
| Code | Language |
|---|---|
| `ca` | Catalan |
| `zh` | Chinese |
| `hr` | Croatian |
| `da` | Danish |
| `nl` | Dutch |
| `en` | English |
| `fi` | Finnish |
| `fr` | French |
| `de` | German |
| `el` | Greek |
| `it` | Italian |
| `ja` | Japanese |
| `ko` | Korean |
| `lt` | Lithuanian |
| `mk` | Macedonian |
| `nb` | Norwegian Bokmål |
| `pl` | Polish |
| `pt` | Portuguese |
| `ro` | Romanian |
| `ru` | Russian |
| `sl` | Slovenian |
| `es` | Spanish |
| `sv` | Swedish |
| `uk` | Ukrainian |

For a full list of supported languages, run `uv run anon.py --list-languages`.
```
.
├── anon.py              # Main CLI entry point for the tool
├── scripts/             # Utility scripts (e.g., deanonymize.py)
├── src/anon/            # Main application source code
│   ├── __init__.py
│   ├── config.py        # Configuration, constants, and database functions
│   ├── engine.py        # Core anonymization logic and Presidio orchestration
│   └── processors.py    # File-specific processing classes (PDF, DOCX, etc.)
├── tests/               # Integration and unit tests
├── db/                  # (Generated) SQLite database for entity storage
├── logs/                # (Generated) Execution reports
├── models/              # (Generated) Downloaded NLP models
├── output/              # (Generated) Anonymized output files
├── pyproject.toml       # Project dependencies for uv
├── uv.lock              # Locked dependency versions
└── README.md            # This file
```
- **uv**:
  - Windows: `powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"`
  - Linux/macOS: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- **Tesseract OCR**: Required to extract text from images in PDF and DOCX files.
  - Ubuntu/Debian: `sudo apt update && sudo apt install tesseract-ocr`
  - macOS (Homebrew): `brew install tesseract`
  - Windows: Download the installer from the Tesseract documentation and add the installation path to the `PATH` environment variable.
- **Clone the repository**:
  `git clone https://github.com/AnonShield/AnonLFI2.0.git`
- **Set the Secret Key (Mandatory)**: The system's security depends on a secret key; the tool will not run without it.
  - Linux/macOS: `export ANON_SECRET_KEY='your-super-secret-key-here'`
  - Windows (PowerShell): `$env:ANON_SECRET_KEY='your-super-secret-key-here'`
- **Install Dependencies**:
  `uv sync`
  On the first run, the required AI models will be downloaded, which may take a few minutes.
Anonymized files are saved in the `output/` directory with the format `anon_{original_filename}.ext`.

**Original Text:**
```
Analyst John Doe (john.doe@email.com) reported a failure on Server-01.
```

**Anonymized Text (with `--slug-length 8`):**
```
Analyst [PERSON_a1b2c3d4] ([EMAIL_ADDRESS_b2c3d4e5]) reported a failure on Server-01.
```
**Anonymize a file or directory:**
```bash
# Process a single file
uv run anon.py path/to/your/file.txt

# Process an entire directory
uv run anon.py path/to/your/directory/
```

**De-anonymize a Slug:**
```bash
uv run scripts/deanonymize.py "[PERSON_...hash...]"
```

**Command-Line Options:**
- `file_path`: The path to the target file or directory to be anonymized.
- `--lang <code>`: Sets the document's language (e.g., `en`, `pt`). Default: `en`.
- `--preserve-entities <TYPES>`: A comma-separated list of entity types to not anonymize (e.g., `"LOCATION,HOSTNAME"`).
- `--allow-list <TERMS>`: A comma-separated list of terms to ignore.
- `--slug-length <NUM>`: Sets the character length of the hash displayed in the slug (1-64). If not specified, it defaults to 64 (the full hash), which guarantees no collisions.
- `--list-entities`: Lists all supported entity types and exits.
- `--list-languages`: Lists all supported languages and exits.

**Example with Options:**
```bash
uv run anon.py incident_report.pdf --lang en --preserve-entities "HOSTNAME" --slug-length 12
```

To verify the tool's integrity, run the test suite:
```bash
uv run python -m unittest tests/test_anon_integration.py
```

The `scripts/` directory contains several helper scripts for analysis and management.
**`deanonymize.py`**
Function: Reverses the anonymization of a single slug, revealing the original text. Requires `ANON_SECRET_KEY` to be set and supports audit logging.
Usage:
```bash
uv run scripts/deanonymize.py "[PERSON_a1b2c3d4...]"
```
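Conceptually, de-anonymization is a lookup in the entity database keyed by the slug's hash prefix. The sketch below is an assumption about how such a lookup could work, not the actual script, which additionally enforces `ANON_SECRET_KEY` and audit logging:

```python
import sqlite3

def lookup_slug(slug: str, db_path: str = "db/entities.db"):
    """Resolve a slug like [PERSON_a1b2c3d4] back to its original text."""
    # rpartition keeps multi-word types such as EMAIL_ADDRESS intact.
    entity_type, _, hash_prefix = slug.strip("[]").rpartition("_")
    conn = sqlite3.connect(db_path)
    row = conn.execute(
        "SELECT original_name FROM entities "
        "WHERE entity_type = ? AND full_hash LIKE ?",
        (entity_type, hash_prefix + "%"),
    ).fetchone()
    conn.close()
    return row[0] if row else None
```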
**`get_metrics.py`**
Function: Reads all execution reports from the `logs/` folder, aggregates the data, and displays performance statistics.
Usage:
```bash
uv run scripts/get_metrics.py
```

**`get_runs_metrics.py`**
Function: Runs the main anonymization script multiple times on a test set to collect aggregate performance metrics. Results are saved to `metrics_runs.csv`.
Usage:
```bash
uv run scripts/get_runs_metrics.py <path_to_test_folder>
```

**`get_ticket_count.py`**
Function: Counts the number of "tickets" in a directory. For `.csv` and `.xlsx` files, each row is one ticket; for all other file types, each file is a single ticket.
Usage:
```bash
uv run scripts/get_ticket_count.py <path_to_directory>
```

**`count_eng.py`**
Function: Scans `.csv` files in a directory and counts the number of cells containing common English words.
Usage:
```bash
uv run scripts/count_eng.py <path_to_csv_folder>
```

If you use AnonLFI 2.0 in your research or project, please cite the following paper:
Cristhian Kapelinski, Douglas Lautert, Beatriz Machado, and Diego Kreutz. "AnonLFI 2.0: Extensible Architecture for PII Pseudonymization in CSIRTs with OCR and Technical Recognizers". In Anais da XXII Escola Regional de Redes de Computadores (ERRC), pp. 81-87. Porto Alegre, RS, Brasil, 2025. SBC. DOI: 10.5753/errc.2025.17784.
```bibtex
@inproceedings{Kapelinski2025_AnonLFI,
  author    = {Cristhian Kapelinski and Douglas Lautert and Beatriz Machado and Diego Kreutz},
  title     = {AnonLFI 2.0: Extensible Architecture for PII Pseudonymization in CSIRTs with OCR and Technical Recognizers},
  booktitle = {Anais da XXII Escola Regional de Redes de Computadores (ERRC)},
  year      = {2025},
  pages     = {81--87},
  publisher = {SBC},
  address   = {Porto Alegre, RS, Brasil},
  doi       = {10.5753/errc.2025.17784},
  url       = {https://sol.sbc.org.br/index.php/errc/article/view/39186}
}
```

This tool is licensed under the GPL-3.0.