Question: How much reasoning capability do LLMs lose when you mask PII?
Answer: With generic redaction (`<PERSON>`), context retention drops to 17-27% in our benchmarks. With semantic masking (`{Name_hash}`), it stays at 92-100%.
You want to use LLMs on sensitive documents (HR files, support tickets, medical records). Compliance says you can't send raw PII. So you mask it.
But masking destroys context:
Original: "John's manager Sarah approved the request."
Masked: "`<PERSON>`'s manager `<PERSON>` approved the request."
Now the LLM can't answer "Who approved the request?" — everyone is `<PERSON>`.
Replace entities with distinguishable placeholders:
Semantic: "{Name_a3f2}'s manager {Name_b7c9} approved the request."
The LLM answers {Name_b7c9}. We unmask it → Sarah. ✅
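As an illustrative sketch only (not the actual `privalyse-mask` API, which detects entities automatically via Presidio; here the names are hand-labeled), the round trip is a reversible mapping — mask before sending to the LLM, unmask its answer afterwards:

```python
# Minimal sketch of semantic masking: replace each entity with a
# distinguishable placeholder and keep a reverse map for unmasking.
# Placeholder suffixes here are sequential for clarity; the real
# library derives hash-style suffixes like {Name_a3f2}.

def mask(text: str, entities: list[str]) -> tuple[str, dict[str, str]]:
    mapping = {}
    for i, name in enumerate(entities):
        tag = f"{{Name_{i:04x}}}"   # e.g. {Name_0000}, {Name_0001}
        mapping[tag] = name
        text = text.replace(name, tag)
    return text, mapping

def unmask(text: str, mapping: dict[str, str]) -> str:
    for tag, name in mapping.items():
        text = text.replace(tag, name)
    return text

masked, mapping = mask("John's manager Sarah approved the request.",
                       ["John", "Sarah"])
print(masked)   # {Name_0000}'s manager {Name_0001} approved the request.
# The LLM, asked "Who approved the request?", answers with a placeholder:
print(unmask("{Name_0001}", mapping))   # Sarah
```

The key property: the LLM only ever sees placeholders, and the mapping never leaves your side of the API boundary.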
You can check out the repo here: https://github.com/Privalyse/privalyse-mask
Test: Can the LLM track "who did what" across a document with multiple people?
| Strategy | Context Retention |
|---|---|
| Original (baseline) | 100% |
| Generic Redaction (`<PERSON>`) | 27% |
| Semantic Masking (`{Name_hash}`) | 100% |
Script: context_research/01_coreference_benchmark.py
Results: results/coref_benchmark.json
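The failure mode behind the 27% number is easy to reproduce without any LLM at all: generic redaction collapses every person into one indistinguishable token, so the coreference chain is destroyed before the model ever sees the text. A quick sketch (synthetic sentence, hand-listed names — not the benchmark script itself):

```python
import re

doc = "John's manager Sarah approved the request after Tom escalated it."
names = ["John", "Sarah", "Tom"]

# Generic redaction: every name becomes the same token.
generic = doc
for n in names:
    generic = generic.replace(n, "<PERSON>")

# Semantic masking: every name becomes a *distinct* token.
semantic = doc
for i, n in enumerate(names):
    semantic = semantic.replace(n, f"{{Name_{i}}}")

# Count how many distinct "people" survive each strategy.
distinct_generic = len(set(re.findall(r"<PERSON>|\{Name_\d+\}", generic)))
distinct_semantic = len(set(re.findall(r"<PERSON>|\{Name_\d+\}", semantic)))
print(distinct_generic)   # 1 -- all three people collapsed into one entity
print(distinct_semantic)  # 3 -- "who did what" is still answerable
```

Both strategies remove the real names; only one preserves the structure the question depends on.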
Test: After retrieving a masked document, can the LLM answer relationship questions?
| Strategy | Context Retention |
|---|---|
| Original (baseline) | 100% |
| Generic Redaction | 17% |
| Semantic Masking | 92% |
Script: context_research/02_rag_qa_benchmark.py
Results: results/rag_qa_benchmark.json
```shell
# Install
pip install privalyse-mask presidio-analyzer presidio-anonymizer openai

# Set API key (for LLM evaluation)
export OPENAI_API_KEY="sk-..."

# Run Coreference Benchmark
python context_research/01_coreference_benchmark.py

# Run RAG QA Benchmark
python context_research/02_rag_qa_benchmark.py
```

- Seed: 42 (all randomness is seeded)
- Data: 100% synthetic (no real PII)
- Evaluator: GPT-4o-mini (temperature=0)
- Embedding: text-embedding-3-small
```
privalyse-research/
├── README.md                        # This file
├── context_research/
│   ├── 01_coreference_benchmark.py  # Entity tracking test
│   └── 02_rag_qa_benchmark.py       # RAG QA test
├── results/
│   └── rag_qa_benchmark.json        # Latest results
└── _archive/                        # Old experiments (for reference)
```
The LLM doesn't need to know WHO the person is.
It just needs to know that Person A ≠ Person B.
Semantic placeholders preserve the relationship graph while removing the actual identities.
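One way to get placeholders that are both stable and distinguishable (a sketch — the exact scheme `privalyse-mask` uses may differ) is to derive the suffix from a salted hash of the name. The same person then always maps to the same tag, so every node in the relationship graph stays linked to itself, while different people get different tags:

```python
import hashlib

def placeholder(name: str, salt: str = "doc-123") -> str:
    # Same (salt, name) pair -> same tag, so all mentions of one person
    # stay linked. Different names -> different digests, so Person A and
    # Person B remain distinct nodes in the relationship graph.
    # Note: a 4-hex-char truncation can collide in principle; a real
    # implementation needs collision handling or longer suffixes.
    digest = hashlib.sha256(f"{salt}:{name}".encode()).hexdigest()[:4]
    return f"{{Name_{digest}}}"

print(placeholder("Sarah") == placeholder("Sarah"))  # True: deterministic
print(placeholder("John") != placeholder("Sarah"))   # True: distinguishable
```

Varying the salt per document also prevents linking the same person across unrelated documents, if that is a threat model you care about.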
MIT