        Input text (from you)
                  │
                  ▼
    ┌─────────────────────────────┐
    │ Phase 1: Known PII          │  Config-driven dictionary replacement
    │ (config.py + phase1.py)     │  Your name, address, email → tokens
    └─────────────┬───────────────┘
                  │
                  ▼
    ┌─────────────────────────────┐
    │ Phase 2: Auto-detection     │  Microsoft Presidio NER scan
    │ (phase2.py)                 │  Unknown names, phones, locations → tokens
    └─────────────┬───────────────┘
                  │
                  ▼
    ┌─────────────────────────────┐
    │ Token Map                   │  Bidirectional lookup table
    │ (token_map.py)              │  real value ↔ [TOKEN_N]
    └─────────────┬───────────────┘
                  │
                  ├──────────────────────────────┐
                  ▼                              ▼
           Obfuscated text ──────► ┌──────────────────────┐
                  │                │ Supabase             │
                  │                │ privacy_shield_tokens│
                  │                │ (persistent storage) │
                  │                └──────────────────────┘
                  │
                  ▼
    Cloud AI (Claude/Gemini/etc.)
                  │
                  ▼
      AI response (with tokens)
                  │
                  ▼
    ┌─────────────────────────────┐
    │ De-obfuscation              │  Reverse token → real value
    │ (deobfuscate.py)            │  Using the same token map
    └─────────────┬───────────────┘
                  │
                  ▼
    Clean response (back to you)
config.py reads your personal_data.yaml file containing known PII entries and validates its structure and categories.
Input: YAML file path
Output: Structured dictionary of PII entries by category
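As a rough illustration of the validation step, the sketch below checks an already-parsed config, i.e. the kind of dictionary a YAML loader such as `yaml.safe_load` returns. The category names and the `validate_config` function are assumptions made for this example; the project's real categories and function names may differ.

```python
from typing import Any

# Hypothetical category names; the project's actual list may differ.
ALLOWED_CATEGORIES = {"person", "address", "email", "phone", "org"}


def validate_config(raw: Any) -> dict[str, list[str]]:
    """Return PII entries grouped by category, or raise ValueError."""
    if not isinstance(raw, dict):
        raise ValueError("config must map category names to lists of values")

    entries: dict[str, list[str]] = {}
    for category, values in raw.items():
        if category not in ALLOWED_CATEGORIES:
            raise ValueError(f"unknown category: {category!r}")
        if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
            raise ValueError(f"{category!r} must be a list of strings")
        # Drop blank entries so they never become replacement patterns.
        entries[category] = [v.strip() for v in values if v.strip()]
    return entries


# Parsed form of a small personal_data.yaml (values are made up).
parsed = {"person": ["Jane Example"], "email": ["jane@example.com", "  "]}
config = validate_config(parsed)
```

Rejecting unknown categories early keeps a typo in the YAML from silently leaving a value unprotected.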
token_map.py stores the mapping between real values and placeholder tokens. It supports lookup in both directions, keeps tokens consistent across a conversation, and persists all mappings to Supabase for cross-session retrieval and audit.
Token format: [CATEGORY_N], e.g. [PERSON_1], [ADDRESS_2], [ORG_1]
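A minimal in-memory sketch of such a bidirectional map is shown below. The `TokenMap` class and its method names are illustrative, not the project's actual API, and treating differently-cased spellings of a value as the same entry is an assumption.

```python
from typing import Optional


class TokenMap:
    """Bidirectional map between real values and [CATEGORY_N] tokens."""

    def __init__(self) -> None:
        self._value_to_token: dict = {}
        self._token_to_value: dict = {}
        self._counters: dict = {}

    def token_for(self, value: str, category: str) -> str:
        """Return the existing token for a value, or mint the next one."""
        key = value.casefold()  # assumption: case variants share one token
        if key in self._value_to_token:
            return self._value_to_token[key]
        n = self._counters.get(category, 0) + 1
        self._counters[category] = n
        token = f"[{category.upper()}_{n}]"
        self._value_to_token[key] = token
        self._token_to_value[token] = value
        return token

    def value_for(self, token: str) -> Optional[str]:
        """Reverse lookup; None if the token was never issued."""
        return self._token_to_value.get(token)


tokens = TokenMap()
first = tokens.token_for("Jane Example", "person")
again = tokens.token_for("jane example", "person")  # reuses the same token
other = tokens.token_for("Acme Ltd", "org")
```

Keeping a separate counter per category is what produces the [PERSON_1], [ORG_1] numbering rather than one global sequence.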
Key behaviours:
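- Persists mappings to the privacy_shield_tokens table
- Uses session_id for grouping

Token mappings are stored in the privacy_shield_tokens table in Supabase. This enables persistent token maps, an audit trail, and cross-session retrieval.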
Table: privacy_shield_tokens
| Column | Type | Purpose |
|---|---|---|
| id | UUID | Auto-generated primary key |
| session_id | UUID | Groups tokens from the same conversation |
| token | TEXT | The placeholder, e.g. [PERSON_1] |
| real_value | TEXT | The actual personal data |
| category | TEXT | Type: person, address, email, etc. |
| source | TEXT | Which phase caught it (phase1 or phase2) |
| confidence | FLOAT | How sure Phase 2 was (null for Phase 1) |
| created_at | TIMESTAMPTZ | When the mapping was created |
Connection: via the Supabase REST API, using the project's API key (stored in an environment variable, never committed).
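One way the write path could look is sketched below. `build_row` and `save_mapping` are hypothetical helpers, and leaving `id` and `created_at` to database defaults is an assumption. The client is passed in so the code does not depend on a live connection; it only needs the `table(...).insert(...).execute()` chain that supabase-py clients provide.

```python
import uuid
from typing import Any, Optional

TABLE = "privacy_shield_tokens"


def build_row(session_id: str, token: str, real_value: str, category: str,
              source: str, confidence: Optional[float] = None) -> dict:
    """Build one row; id and created_at are assumed to be filled by the database."""
    if source not in ("phase1", "phase2"):
        raise ValueError("source must be 'phase1' or 'phase2'")
    if source == "phase1":
        confidence = None  # the table keeps confidence null for Phase 1
    return {
        "session_id": session_id,
        "token": token,
        "real_value": real_value,
        "category": category,
        "source": source,
        "confidence": confidence,
    }


def save_mapping(client: Any, row: dict) -> None:
    """Insert one row through a supabase-py style client."""
    client.table(TABLE).insert(row).execute()


session_id = str(uuid.uuid4())  # one id per conversation
row = build_row(session_id, "[PERSON_1]", "Jane Example", "person", "phase1")
```

With supabase-py installed, the client would typically come from `create_client(url, key)`, with both values read from environment variables (the variable names are up to the project).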
phase1.py replaces all known PII from the config file. It uses case-insensitive matching with word-boundary awareness to avoid partial replacements (e.g. "Jon" is not replaced inside "Jonathan"; "Jonathan" is only replaced if it is configured as its own entry).
Input: Raw text + config entries
Output: Text with known PII replaced + updated token map
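The matching rule can be sketched with a single compiled regular expression. This is an illustration under assumptions, not the project's implementation: `replace_known_pii` is a hypothetical name, and `known` is taken to be a plain dictionary from real value to token.

```python
import re


def replace_known_pii(text: str, known: dict) -> str:
    """Replace each configured value with its token.

    `known` maps real value -> token, e.g. {"Jon": "[PERSON_1]"}.
    """
    if not known:
        return text
    # Longest values first so "Jonathan Smith" wins over "Jonathan".
    values = sorted(known, key=len, reverse=True)
    # (?<!\w) and (?!\w) behave like word boundaries but also work when a
    # value starts or ends with punctuation, where \b would not.
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(v) for v in values) + r")(?!\w)",
        re.IGNORECASE,
    )
    lookup = {v.casefold(): tok for v, tok in known.items()}
    # Fall back to the matched text if case folding disagrees with the regex.
    return pattern.sub(
        lambda m: lookup.get(m.group(0).casefold(), m.group(0)), text
    )


result = replace_known_pii(
    "Jon met JONATHAN and jon@example.com replied.",
    {"Jon": "[PERSON_1]", "jon@example.com": "[EMAIL_1]"},
)
# result == "[PERSON_1] met JONATHAN and [EMAIL_1] replied."
```

Building one alternation instead of looping over `str.replace` means each position in the text is matched once, so a token inserted for one entry can never be re-matched by another.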
phase2.py scans Phase-1-processed text for remaining PII using Microsoft Presidio. Presidio combines spaCy's Named Entity Recognition with pattern-based detectors for structured data (phone numbers, credit cards, etc.).
Confidence threshold: configurable (default 0.7). A higher threshold produces fewer false positives but may miss real PII; a lower one catches more but risks replacing words that are not PII.
Supported entity types:
Input: Phase-1-processed text + existing token map
Output: Further obfuscated text + updated token map
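The thresholding and replacement step might look like the sketch below. To stay runnable without Presidio installed, it uses a small `Detection` dataclass as a stand-in for Presidio's `RecognizerResult`, which exposes the same `entity_type`, `start`, `end` and `score` attributes. `apply_detections` and its token numbering are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Stand-in for a Presidio RecognizerResult (same four attributes)."""
    entity_type: str
    start: int
    end: int
    score: float


def apply_detections(text, detections, threshold=0.7):
    """Replace spans scoring >= threshold; return (new_text, token -> value)."""
    kept = sorted((d for d in detections if d.score >= threshold),
                  key=lambda d: d.start)
    mapping, by_value, counters = {}, {}, {}
    replacements, last_end = [], 0
    for d in kept:
        if d.start < last_end:  # skip spans overlapping one already accepted
            continue
        last_end = d.end
        value = text[d.start:d.end]
        token = by_value.get(value)
        if token is None:  # repeated values reuse their token
            counters[d.entity_type] = counters.get(d.entity_type, 0) + 1
            token = f"[{d.entity_type}_{counters[d.entity_type]}]"
            by_value[value] = token
            mapping[token] = value
        replacements.append((d.start, d.end, token))
    # Apply right to left so earlier offsets stay valid.
    for start, end, token in reversed(replacements):
        text = text[:start] + token + text[end:]
    return text, mapping


text = "Call Maria on 555-0100 in Paris."


def span(entity_type, value, score):
    start = text.index(value)
    return Detection(entity_type, start, start + len(value), score)


found = [span("PERSON", "Maria", 0.85),
         span("PHONE_NUMBER", "555-0100", 0.75),
         span("LOCATION", "Paris", 0.4)]  # below the 0.7 default, left alone
obfuscated, mapping = apply_detections(text, found)
# obfuscated == "Call [PERSON_1] on [PHONE_NUMBER_1] in Paris."
```

With Presidio installed, the detections would come from `AnalyzerEngine().analyze(text=text, language="en")` rather than being built by hand.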
deobfuscate.py reverses all token replacements in the AI's response using the token map. It handles tokens in any position (start, middle, or end of a sentence).
Input: AI response text + token map
Output: Clean text with real values restored
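A minimal sketch of the reverse step, assuming tokens follow the [CATEGORY_N] format and the map is available as a plain token-to-value dictionary (`restore` is a hypothetical name):

```python
import re

# Matches tokens such as [PERSON_1] or [PHONE_NUMBER_12].
TOKEN_PATTERN = re.compile(r"\[[A-Z_]+_\d+\]")


def restore(text: str, token_to_value: dict) -> str:
    """Swap every known token back; leave unknown tokens untouched."""
    return TOKEN_PATTERN.sub(
        lambda m: token_to_value.get(m.group(0), m.group(0)), text
    )


reply = "[PERSON_1], your order ships to [ADDRESS_1]. Thanks, [PERSON_1]! ([ORG_9])"
restored = restore(
    reply,
    {"[PERSON_1]": "Jane Example", "[ADDRESS_1]": "1 Example Road"},
)
# restored == "Jane Example, your order ships to 1 Example Road. Thanks, Jane Example! ([ORG_9])"
```

Matching whole bracketed tokens in one pass gives each token a single lookup, and a token the map does not know (here [ORG_9]) is passed through rather than guessed at.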
shield.py is the single entry point tying everything together. It provides three operations:
- obfuscate(text): runs Phase 1 → Phase 2, returns obfuscated text
- deobfuscate(text): reverses tokens in the AI response
- review(text): shows what would be changed without sending anything

| Component | Choice | Reason |
|---|---|---|
| Language | Python 3.12 | Matches existing dev stack |
| Phase 2 engine | Microsoft Presidio | Purpose-built for PII, wraps spaCy, adds pattern detectors |
| NER model | spaCy en_core_web_sm | 12MB, runs on any machine, good enough for NER |
| Config format | YAML | Easy to read and edit |
| Package manager | uv | Already installed on system |
| Testing | pytest | Standard Python testing |
| Token storage | Supabase (PostgreSQL) | Persistent token maps, audit trail, cross-session retrieval |
| Supabase client | supabase-py | Official Python client for Supabase REST API |
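
To show how the obfuscate, deobfuscate and review operations described earlier might fit together, here is a small facade sketch. The `Shield` class, the `Step` callable type and the toy phases are all assumptions for illustration; the real phases would be the Phase 1 and Phase 2 modules.

```python
import re
from typing import Callable

TOKEN = re.compile(r"\[[A-Z_]+_\d+\]")
Step = Callable[[str, dict], str]  # (text, token -> value map) -> new text


class Shield:
    def __init__(self, phase1: Step, phase2: Step) -> None:
        self._phases = [phase1, phase2]
        self.tokens: dict = {}  # token -> real value

    def obfuscate(self, text: str) -> str:
        for phase in self._phases:
            text = phase(text, self.tokens)
        return text

    def deobfuscate(self, text: str) -> str:
        return TOKEN.sub(lambda m: self.tokens.get(m.group(0), m.group(0)), text)

    def review(self, text: str) -> list:
        """Dry run: (real value, token) pairs, leaving self.tokens untouched."""
        scratch = dict(self.tokens)
        for phase in self._phases:
            text = phase(text, scratch)
        seen = dict.fromkeys(TOKEN.findall(text))  # unique, in order
        return [(scratch[t], t) for t in seen if t in scratch]


# Toy stand-ins for the real phases.
def toy_phase1(text: str, tokens: dict) -> str:
    if "Jane Example" in text:
        tokens["[PERSON_1]"] = "Jane Example"
    return text.replace("Jane Example", "[PERSON_1]")


def toy_phase2(text: str, tokens: dict) -> str:
    return text  # no-op placeholder for automatic detection


shield = Shield(toy_phase1, toy_phase2)
preview = shield.review("Email Jane Example today.")
sent = shield.obfuscate("Email Jane Example today.")
back = shield.deobfuscate("Done, [PERSON_1] has been emailed.")
```

Running `review` against a copy of the map keeps the dry run from minting tokens that a later real call would then have to account for.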

    privacy-shield/
    ├── pyproject.toml
    ├── config/
    │   └── personal_data.yaml.example
    ├── src/
    │   └── privacy_shield/
    │       ├── __init__.py
    │       ├── config.py
    │       ├── phase1.py
    │       ├── phase2.py
    │       ├── shield.py
    │       ├── token_map.py
    │       ├── storage.py
    │       └── deobfuscate.py
    └── tests/
        ├── test_config.py
        ├── test_phase1.py
        ├── test_phase2.py
        ├── test_shield.py
        ├── test_token_map.py
        └── test_deobfuscate.py