Practical Pseudonymization Techniques for GDPR Compliance

What Is Pseudonymization?

Pseudonymization is the processing of personal data so that it can no longer be attributed to a specific individual without the use of additional information. That additional information must be kept separately and protected by technical and organizational measures.

GDPR defines pseudonymization in Article 4(5) and explicitly encourages it as a data protection safeguard (Articles 25 and 32, Recital 28). Pseudonymized data is still personal data under GDPR (Recital 26) -- unlike truly anonymized data -- but it carries a reduced risk profile and allows more flexible processing options.

Pseudonymization vs. Anonymization

Feature	Pseudonymization	Anonymization
Reversible	Yes, with additional information	No (irreversible)
Still personal data under GDPR	Yes	No
GDPR applies	Yes, but with benefits	No
Data utility	High (can be re-linked)	Variable (may lose granularity)
Suitable for analytics	Yes	Yes, but limited
Suitable for individual-level processing	Yes (when re-linked)	No

The key distinction: pseudonymized data can be re-identified using the separately stored mapping, while truly anonymized data cannot be re-identified by any reasonably available means.

Benefits of Pseudonymization Under GDPR

Risk reduction: Pseudonymized data poses less risk if breached, as the attacker cannot immediately identify individuals
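Broader processing grounds: Article 6(4)(e) lists pseudonymization among the safeguards to weigh when assessing whether further processing is compatible with the original purpose, and Recital 29 encourages its use for general analysis within the same controller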
DPIA mitigation: Pseudonymization is recognized as a risk mitigation measure in Data Protection Impact Assessments
Breach notification: Under Article 34(3)(a), a breach may not require individual notification if the data is unintelligible to unauthorized parties; for pseudonymized data this depends on the mapping or key not having been exposed as well
Data minimization: Pseudonymization supports the data minimization principle by reducing the identifiability of data in systems that do not need to identify individuals

Pseudonymization Techniques

1. Token Replacement (Tokenization)

Replace identifying values with randomly generated tokens. Maintain a separate lookup table that maps tokens back to original values.

How it works:

Original: john.smith@email.com becomes TKN-8f3a2b1c
The mapping TKN-8f3a2b1c -> john.smith@email.com is stored in a secured, separate system

Best for: Structured data in databases where you need to maintain referential integrity across tables

Considerations:

The token mapping table is the critical asset -- it must be secured with the highest controls
Tokens should be random, not derived from the original data
One-to-one mapping preserves uniqueness for joins and analytics
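
The token-and-lookup flow above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production vault: the class name TokenVault, the 8-hex-character token length (chosen to match the TKN-8f3a2b1c example), and the in-memory dictionaries are all hypothetical. A real deployment would hold the mapping in a separate, access-controlled store and would likely use longer tokens.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault (illustration only)."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so the mapping stays one-to-one.
        existing = self._token_by_value.get(value)
        if existing is not None:
            return existing
        # Tokens come from a CSPRNG and are not derived from the value.
        token = "TKN-" + secrets.token_hex(4)
        while token in self._value_by_token:  # retry on a rare collision
            token = "TKN-" + secrets.token_hex(4)
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Raises KeyError if the token is unknown.
        return self._value_by_token[token]


vault = TokenVault()
token = vault.tokenize("john.smith@email.com")
```

Because the same input always returns the same token, joins across tables keep working, while the token itself reveals nothing about the original value.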

2. Hashing

Apply a cryptographic hash function to identifying values to produce a fixed-length output.

How it works:

Original: john.smith@email.com
SHA-256 hash: a fixed-length digest of 64 hexadecimal characters (256 bits)

Best for: Scenarios where you need consistent pseudonymization (same input always produces same output) without needing to reverse it

Considerations:

Unkeyed hashing is vulnerable to dictionary and rainbow table attacks, especially for guessable, low-entropy inputs like email addresses
Always use a keyed hash (HMAC) or a secret salt so that an attacker without the secret cannot reverse pseudonyms by hashing candidate inputs
Hashing is deterministic, which enables linking records across datasets -- this can be a benefit or a risk depending on context
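
To make the first consideration concrete, the sketch below hashes an email address with plain SHA-256 and then recovers it by hashing a list of guesses. The function names and the candidate list are assumptions made for the example; a real attacker would use a far larger list of known or generated addresses.

```python
import hashlib
from typing import Iterable, Optional


def sha256_hex(value: str) -> str:
    """Unsalted SHA-256 digest of a UTF-8 string, as 64 hex characters."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def dictionary_attack(target_digest: str, candidates: Iterable[str]) -> Optional[str]:
    """Return the candidate whose digest matches, or None if none does."""
    for candidate in candidates:
        if sha256_hex(candidate) == target_digest:
            return candidate
    return None


pseudonym = sha256_hex("john.smith@email.com")

# Hypothetical guess list; no secret is needed to run this attack.
guesses = ["jane.doe@email.com", "john.smith@email.com"]
recovered = dictionary_attack(pseudonym, guesses)
```

Here recovered equals the original address, which is why an unkeyed hash on its own is a weak pseudonym for guessable identifiers.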

3. Keyed Hashing (HMAC)

A more secure variant of hashing that incorporates a secret key.

How it works:

HMAC-SHA256 with a secret key produces a pseudonym that cannot be reversed without the key
Different keys produce different pseudonyms for the same input

Best for: Cross-dataset linkage where you control the key, research scenarios, analytics pipelines

Considerations:

The secret key must be managed with the same rigor as an encryption key
Rotating the key requires re-pseudonymizing all affected data
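
A minimal sketch of keyed hashing with Python's standard hmac module follows. The function name hmac_pseudonym and the normalization step (trimming whitespace and lowercasing) are assumptions for the example; whether to normalize, and how, depends on the identifier type. In practice the key would come from a key management system rather than being generated inline.

```python
import hashlib
import hmac
import secrets


def hmac_pseudonym(key: bytes, value: str) -> str:
    """HMAC-SHA256 pseudonym of a normalized identifier, as hex."""
    # Assumed normalization so trivially different spellings link together.
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# Keys generated inline for illustration only.
key_a = secrets.token_bytes(32)
key_b = secrets.token_bytes(32)

p1 = hmac_pseudonym(key_a, "john.smith@email.com")
p2 = hmac_pseudonym(key_a, "John.Smith@email.com ")  # same person, same key
p3 = hmac_pseudonym(key_b, "john.smith@email.com")   # same person, other key
```

With the same key, p1 and p2 match, so records can be linked across datasets. With a different key, p3 does not match, which lets you issue unlinkable pseudonyms to different recipients.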

4. Format-Preserving Encryption (FPE)

Encrypt data while preserving the format and length of the original value.

How it works:

Original credit card: 4532-1234-5678-9012
FPE output: 8271-6543-2109-3847

Best for: Legacy systems that validate data formats (credit card numbers, phone numbers, postal codes)

Considerations:

Uses NIST-specified algorithms such as FF1 (NIST SP 800-38G) or FF3-1 (draft SP 800-38G Rev. 1)
Reversible with the encryption key
Preserves format constraints, which is valuable for system compatibility
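
The idea of format preservation can be illustrated with a toy Feistel construction over decimal digits. This sketch is NOT FF1 or FF3-1 and must not be used to protect real data; it only shows how a keyed, reversible transformation can map digits to digits of the same length. All function names, the round count, and the HMAC-based round function are assumptions for the example. Production systems should use a vetted implementation of a standardized algorithm.

```python
import hashlib
import hmac

ROUNDS = 10  # illustrative round count, not taken from any standard


def _round_value(key: bytes, round_no: int, half: str, modulus: int) -> int:
    """Keyed round function: HMAC-SHA256 reduced modulo `modulus`."""
    msg = f"{round_no}:{half}".encode("ascii")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest, "big") % modulus


def _check(digits: str) -> None:
    if not (digits.isascii() and digits.isdigit()) or len(digits) % 2:
        raise ValueError("expects an even-length string of ASCII digits")


def toy_fpe_encrypt(key: bytes, digits: str) -> str:
    _check(digits)
    width = len(digits) // 2
    modulus = 10 ** width
    left, right = digits[:width], digits[width:]
    for i in range(ROUNDS):
        mixed = (int(left) + _round_value(key, i, right, modulus)) % modulus
        left, right = right, str(mixed).zfill(width)
    return left + right


def toy_fpe_decrypt(key: bytes, digits: str) -> str:
    _check(digits)
    width = len(digits) // 2
    modulus = 10 ** width
    left, right = digits[:width], digits[width:]
    for i in reversed(range(ROUNDS)):
        unmixed = (int(right) - _round_value(key, i, left, modulus)) % modulus
        left, right = str(unmixed).zfill(width), left
    return left + right


def encrypt_keeping_separators(key: bytes, value: str) -> str:
    """Encrypt only the digits; hyphens and spaces stay where they are."""
    digits = "".join(ch for ch in value if ch.isdigit())
    out = iter(toy_fpe_encrypt(key, digits))
    return "".join(next(out) if ch.isdigit() else ch for ch in value)


key = b"example-only-key-do-not-reuse"
cipher = encrypt_keeping_separators(key, "4532-1234-5678-9012")
plain = toy_fpe_decrypt(key, cipher.replace("-", ""))
```

The output keeps the 4-4-4-4 digit layout, so a legacy format validator would still accept it. Note that this sketch ignores details a real scheme must handle, such as odd lengths, tweaks, and check digits.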

5. Data Masking

Replace parts of a data value with placeholder characters while retaining some original information.

How it works:

Original email: john.smith@email.com becomes j***@email.com
Original phone: +44 7700 900123 becomes +44 7700 ***123

Best for: Display purposes, customer service screens, reports where partial identification is sufficient

Considerations:

Static masking permanently replaces data; dynamic masking applies at query time
Partial masking may not be sufficient pseudonymization if the remaining visible data allows re-identification
Dynamic masking requires database or application-level support
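
The two masking examples above can be reproduced with simple string handling. This is a sketch at the application layer; the function names and the choice of how many characters stay visible are assumptions, and database-level dynamic masking would be configured in the database rather than coded like this.

```python
def mask_email(email: str) -> str:
    """Keep the first character of the local part and the full domain."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "***"  # not a recognizable address: hide everything
    return f"{local[0]}***@{domain}"


def mask_phone(phone: str, visible: int = 3, hidden: int = 3) -> str:
    """Hide `hidden` characters just before the last `visible` ones."""
    if len(phone) < visible + hidden:
        return "*" * len(phone)
    cut = len(phone) - visible - hidden
    return phone[:cut] + "*" * hidden + phone[cut + hidden:]


masked_email = mask_email("john.smith@email.com")
masked_phone = mask_phone("+44 7700 900123")
```

Keeping the domain and the trailing digits helps support staff confirm they have the right record, but the more that stays visible, the easier re-identification becomes.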

6. Generalization

Replace specific values with broader categories.

How it works:

Age 34 becomes age range 30-39
Postal code EC2A 4NE becomes EC2A
Date of birth 1991-03-15 becomes 1991

Best for: Analytics and statistical processing where exact values are not needed

Considerations:

Reduces data utility with each level of generalization
May not qualify as pseudonymization on its own if the generalized data is still identifying in context
Often used in combination with other techniques
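
The three generalizations above are straightforward to express in code. The helper names below are hypothetical, and the postcode helper assumes UK-style postcodes, whose inward code is always the last three characters.

```python
from datetime import date


def age_band(age: int, width: int = 10) -> str:
    """Map an exact age to a band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def outward_code(postcode: str) -> str:
    """Keep only the outward part of a UK postcode (assumed format)."""
    compact = postcode.replace(" ", "").upper()
    return compact[:-3]  # drop the three-character inward code


def birth_year(dob: str) -> int:
    """Reduce an ISO 8601 date of birth to its year."""
    return date.fromisoformat(dob).year


band = age_band(34)
area = outward_code("EC2A 4NE")
year = birth_year("1991-03-15")
```

Wider bands and shorter prefixes lower the re-identification risk at the cost of analytical precision, so the band width is usually tuned per dataset.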

Implementation Architecture

Separation of Mapping Data

The mapping between pseudonyms and original identifiers is the most sensitive component. Protect it with:

Storage in a separate, access-controlled system
Encryption at rest with customer-managed keys
Strict access controls (minimal number of authorized users)
Comprehensive audit logging of all access
Geographic separation from the pseudonymized data where practical

Pseudonymization Service Pattern

Build a centralized pseudonymization service that:

Accepts original identifiers and returns pseudonyms
Maintains the mapping securely
Supports reverse lookup for authorized re-identification
Enforces access policies based on the requester's role and purpose
Logs all pseudonymization and re-identification operations
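
The responsibilities listed above could be combined in a single service along the following lines. This is a sketch under stated assumptions: the class name, the PSN- prefix, the role and purpose values, and the in-memory storage are all hypothetical. A real service would sit behind authenticated APIs and persist both the mapping and the audit log in protected stores.

```python
import secrets
from datetime import datetime, timezone

# Hypothetical policy: (role, purpose) pairs allowed to re-identify.
ALLOWED_REIDENTIFICATION = {
    ("dpo", "data_subject_request"),
    ("fraud_analyst", "fraud_investigation"),
}


class PseudonymizationService:
    def __init__(self):
        self._pseudonym_by_id = {}
        self._id_by_pseudonym = {}
        self.audit_log = []

    def _audit(self, action, role, purpose, pseudonym, allowed):
        # Log the pseudonym, never the original identifier.
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "action": action,
            "role": role,
            "purpose": purpose,
            "pseudonym": pseudonym,
            "allowed": allowed,
        })

    def pseudonymize(self, identifier, role, purpose):
        pseudonym = self._pseudonym_by_id.get(identifier)
        if pseudonym is None:
            pseudonym = "PSN-" + secrets.token_hex(8)
            self._pseudonym_by_id[identifier] = pseudonym
            self._id_by_pseudonym[pseudonym] = identifier
        self._audit("pseudonymize", role, purpose, pseudonym, True)
        return pseudonym

    def reidentify(self, pseudonym, role, purpose):
        allowed = (role, purpose) in ALLOWED_REIDENTIFICATION
        # Record the attempt before enforcing, so denials are logged too.
        self._audit("reidentify", role, purpose, pseudonym, allowed)
        if not allowed:
            raise PermissionError(f"{role!r} may not re-identify for {purpose!r}")
        return self._id_by_pseudonym[pseudonym]


service = PseudonymizationService()
pseudonym = service.pseudonymize("john.smith@email.com", "etl_job", "analytics_load")
original = service.reidentify(pseudonym, "dpo", "data_subject_request")
```

Centralizing these operations means there is one place to enforce policy and one log to review, instead of re-identification logic scattered across applications.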

Key Rotation and Re-Pseudonymization

For techniques that use keys (HMAC, FPE), establish a key rotation schedule:

Rotate keys annually or upon suspected compromise
Re-pseudonymize affected data with the new key
Securely destroy old keys after re-pseudonymization is complete
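
For HMAC-based pseudonyms, rotation cannot be done from the pseudonyms alone, because the output is not reversible. One possible approach, sketched below, recomputes pseudonyms from the original identifiers held in the controlled source system and builds a temporary old-to-new translation table. The function names and the record layout are assumptions for the example.

```python
import hashlib
import hmac
import secrets


def pseudonym(key: bytes, identifier: str) -> str:
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def build_translation(identifiers, old_key: bytes, new_key: bytes) -> dict:
    """Map each old pseudonym to its replacement under the new key."""
    return {pseudonym(old_key, i): pseudonym(new_key, i) for i in identifiers}


def apply_translation(records, translation: dict) -> list:
    """Return copies of the records with the subject pseudonym replaced."""
    return [{**r, "subject": translation[r["subject"]]} for r in records]


# Keys generated inline for illustration only.
old_key = secrets.token_bytes(32)
new_key = secrets.token_bytes(32)

# Original identifiers come from the secured source system (assumed).
source_identifiers = ["john.smith@email.com", "jane.doe@email.com"]

# Hypothetical pseudonymized dataset keyed by a 'subject' pseudonym.
records = [
    {"subject": pseudonym(old_key, "john.smith@email.com"), "orders": 3},
    {"subject": pseudonym(old_key, "jane.doe@email.com"), "orders": 1},
]

translation = build_translation(source_identifiers, old_key, new_key)
rotated = apply_translation(records, translation)
```

The translation table links old and new pseudonyms, so it should be treated as sensitive and destroyed together with the old key once the rotation is verified.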

Choosing the Right Technique

Scenario	Recommended Technique
Database records with cross-table references	Tokenization
Analytics pipeline without need for re-identification	Keyed hashing (HMAC)
Legacy systems requiring format compatibility	Format-preserving encryption
Customer service screens	Dynamic data masking
Statistical reporting	Generalization
Research datasets	Combination of generalization + keyed hashing

Pseudonymization and Data Residency

Pseudonymization becomes especially powerful when combined with data residency controls. By pseudonymizing personal data before it leaves a jurisdiction, you can process it in other regions while the re-identification mapping stays within the original jurisdiction.

GlobalDataShield's region-specific hosting complements pseudonymization strategies by ensuring that the sensitive mapping data -- the keys to re-identification -- remains within the geographic boundaries you define, while pseudonymized data can be used more flexibly for analytics and processing across your infrastructure.
