# PII anonymization patterns

Personally identifiable information (PII) appears throughout enterprise data. Support tickets contain customer names and contact details. HR records contain employee personal information. Sales records contain prospect data. PII categories include names, email addresses, phone numbers, national ID numbers, health data, and financial account details. Genies retrieve, process, and store this data and pass it through the LLM, which means an external model processes it.

This raises a compliance question for many organizations: should PII reach the LLM at all? The answer depends on the use case, the data sensitivity, and the organization's regulatory obligations. This page covers the three layers at which PII can be managed, when to apply each layer, and how to implement each one.

# PII three-layer model

PII anonymization in genie workflows has three potential intervention points. Each operates at a different layer and serves a different purpose. The right approach uses one, two, or all three layers depending on the sensitivity of the data and the specific use case requirements.

  • Layer 1: PII is removed or replaced before content is written to the knowledge base. The knowledge base never contains PII. The LLM retrieves anonymized content.
  • Layer 2: Skills retrieve data from external systems and anonymize it before returning the result to the genie. The LLM receives anonymized data and reasons about it without seeing the original PII.
  • Layer 3: The LLM is instructed to remove PII from user input or retrieved data before passing it to a skill or storing it. This is the most flexible layer but also the least reliable. It depends on LLM instruction-following rather than deterministic processing.

# Layer 1: Anonymization before knowledge base ingestion

The following sections describe when to use Layer 1 anonymization and how to implement it.

# When to use Layer 1

Use Layer 1 anonymization when:

  • The content to be ingested contains PII that isn't necessary for the genie's retrieval tasks
  • The knowledge base is accessed by users who shouldn't see the original PII
  • Regulatory requirements prohibit storing PII in an LLM-accessible vector store

Examples:

  • Ingesting closed support tickets that contain customer names and contact details. Anonymize before ingestion so the ticket content is searchable but customer PII isn't stored in the knowledge base.
  • Ingesting HR records for an HR assistant knowledge base. Remove employee personal details that aren't needed for the policy retrieval use case.

# How to implement Layer 1

Layer 1 anonymization happens in the knowledge base ingestion recipe, before the content is written to the knowledge base.

The standard implementation uses an on-premises agent with Python scripts for anonymization. The on-premises agent runs within your network, processes the raw content from the source system, applies anonymization transformations, and passes the anonymized content to the knowledge base ingestion step.

Common anonymization techniques at this layer include:

  • Named entity replacement: Replace detected names, email addresses, phone numbers, and other identifiers with generic placeholders. "John Smith at john.smith@example.com reported..." becomes "The employee at [EMAIL_REDACTED] reported..."
  • Consistent pseudonymization: Replace PII with consistent pseudonyms rather than generic placeholders so the same person always maps to the same pseudonym within a document or across related documents. This preserves the ability to reason about a specific person's history without exposing their real identity. Customer_A consistently refers to the same person across all ingested tickets.
  • Data minimization: Remove fields or sections of documents that contain PII not needed for the retrieval use case. A support ticket ingested for troubleshooting purposes doesn't need the customer's billing address; remove that section entirely before ingestion.

Python libraries commonly used for entity detection and replacement at this layer include spaCy, Presidio, and similar NLP tools designed for PII detection.
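As a concrete sketch of these techniques, the following Python fragment combines regex-based placeholder replacement with consistent pseudonymization. The patterns and the `anonymize_for_ingestion` helper are illustrative only, not part of any product API; a production ingestion script would use Presidio or spaCy for detection rather than hard-coded regexes and a supplied name list.

```python
import re

def anonymize_for_ingestion(text, known_names):
    """Redact emails and phone numbers, then map each known name
    to a stable pseudonym so one person keeps one alias."""
    # Deterministic placeholders for pattern-detectable PII.
    # Redact emails first so their digits don't confuse the phone pattern.
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL_REDACTED]", text)
    text = re.sub(r"\+?\d[\d\s().-]{7,}\d", "[PHONE_REDACTED]", text)

    # Consistent pseudonymization: the same person always maps to the
    # same alias within the processed content (Customer_A, Customer_B, ...).
    pseudonyms = {name: f"Customer_{chr(65 + i)}"
                  for i, name in enumerate(known_names)}
    for name, alias in pseudonyms.items():
        text = text.replace(name, alias)
    return text

ticket = ("Jane Doe (jane.doe@example.com, +1 555 010 7788) reported "
          "the issue. Jane Doe confirmed the fix.")
print(anonymize_for_ingestion(ticket, ["Jane Doe"]))
```

Because the pseudonym mapping is built once and reused for every occurrence, "Jane Doe" becomes "Customer_A" in both sentences, preserving the ability to reason about one person's history.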

# Limitations

Layer 1 anonymization is a pre-processing step. It doesn't affect PII that reaches the genie through other channels, such as skill outputs or the user's own messages in the conversation. Layer 1 only protects PII in ingested knowledge base content.

# Layer 2: Anonymization in skill output before it reaches the LLM

The following sections describe when to use Layer 2 anonymization and how to implement it.

# When to use Layer 2

Use Layer 2 anonymization when:

  • Skills retrieve data that contains PII from external systems
  • The genie doesn't need the actual PII values to complete its task, only the anonymized or aggregated information
  • The organization's data governance policy prohibits passing certain PII categories to the LLM

Examples:

  • A support ticket retrieval skill fetches ticket details including customer name, email, and phone number. The genie only needs the ticket subject, description, and status for its analysis. Anonymize the customer contact fields before returning to the genie.
  • An HR data skill fetches employee records that include salary, health plan enrollment, and personal address. The genie only needs the employee name and leave balance. Strip all other fields before returning.

# How to implement Layer 2

Layer 2 anonymization happens inside the skill recipe, between the external system API call and the Return response to Genie step.

Retrieve data from the external system and add a data transformation step that applies anonymization before the output is mapped to the Return response step. The transformation can be implemented as:

  • Field exclusion: Don't include PII fields in the skill's output mapping. If the genie doesn't need the customer's email address, don't include it in the return schema. This isn't anonymization in the strict sense, but it achieves the same result. The PII never reaches the LLM.
  • Replacement in a Workato recipe step: Use formula steps or custom Ruby/Python functions in the recipe to replace specific field values with anonymized equivalents before mapping them to the output.
  • On-premises agent processing: For more complex anonymization requirements, such as detecting PII in unstructured text fields like ticket descriptions or call notes, route the output through an on-premises agent running Python anonymization scripts before returning to the genie. This requires more infrastructure but handles cases where PII appears in free-text fields that can't be excluded entirely.
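As an illustration of the first two options, the following Python sketch shows a post-fetch transformation that maps a raw ticket record to the skill's return schema. The field names and the `to_genie_payload` helper are hypothetical, not part of any connector API.

```python
# Hypothetical Layer 2 transformation between the external API call
# and the "Return response to Genie" step.
def to_genie_payload(ticket):
    """Map a raw ticket record to the skill's return schema:
    PII fields are excluded by design, and the reporter's email
    is replaced rather than passed through."""
    return {
        # Field exclusion: name, phone, and address are simply not mapped
        "subject": ticket["subject"],
        "description": ticket["description"],
        "status": ticket["status"],
        # Replacement: keep the fact that a contact exists, not its value
        "reporter_contact": "[EMAIL_REDACTED]" if ticket.get("email") else None,
    }

raw = {
    "subject": "Login failure",
    "description": "User cannot sign in since Tuesday.",
    "status": "open",
    "customer_name": "Jane Doe",
    "email": "jane.doe@example.com",
    "phone": "+1 555 010 7788",
}
print(to_genie_payload(raw))
```

The same shape applies regardless of where the transformation runs: in a formula step, a custom Ruby/Python function, or an on-premises script.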

# The output schema principle

The most reliable approach to Layer 2 anonymization is designing the skill output schema from the start to exclude PII the genie doesn't need. Returning only what the genie needs for the task is a core skill design principle. This principle has a security dimension for PII-sensitive data. A field excluded from the output schema is protected by design, not by anonymization.

Before adding anonymization logic to a skill recipe, ask whether the genie actually needs the field. Exclude it from the output schema if it isn't needed. Anonymization is the right approach only for fields where the genie needs the information but should receive it in anonymized form.
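One way to make exclusion-by-design mechanical is an allowlist filter. This is a minimal sketch, assuming a hypothetical `enforce_schema` step in the recipe:

```python
# Hypothetical allowlist enforcing the output schema principle:
# only fields the genie actually needs can leave the skill.
ALLOWED_FIELDS = {"subject", "description", "status"}

def enforce_schema(record):
    """Drop every field not explicitly allowlisted, so PII columns
    added upstream later are excluded by default."""
    return {k: v for k, v in record.items() if k in ALLOWED_FIELDS}

print(enforce_schema({"subject": "Login failure", "status": "open",
                      "email": "jane.doe@example.com"}))
```

The design choice matters: an allowlist fails closed, whereas a denylist of known PII fields would silently pass any new column added to the source system.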

# Limitations

Layer 2 anonymization doesn't affect PII that users introduce into the conversation themselves. A user who types their own personal information, or a colleague's contact details, into the chat bypasses Layer 2 controls entirely. It also doesn't affect PII that the genie retrieves from the knowledge base. That is covered by Layer 1.

# Layer 3: LLM-level anonymization

The following sections describe when to use Layer 3 anonymization and how to implement it.

# When to use Layer 3

Use Layer 3 anonymization when:

  • The use case involves processing user-provided content that may contain PII before passing it to a skill or storing it
  • The genie needs to strip PII from its own output before responding, for example in a summarization use case where the input contains customer data that shouldn't appear verbatim in the response
  • The organization wants a defense-in-depth layer that catches PII not handled by Layers 1 and 2

Examples:

  • A genie that processes customer feedback. The user pastes raw feedback text that contains customer names and contact details. The genie is instructed to anonymize the feedback before summarizing and storing it.
  • A genie that summarizes call transcripts. The transcript contains customer names and contact details. The genie is instructed to replace identifiable information with generic references in the summary before passing it to a storage skill.

# How to implement Layer 3

Layer 3 anonymization is implemented through job description instructions that tell the genie to identify and replace PII before specific actions.


PII HANDLING

Before passing any data to a skill or storing any content, apply the following anonymization rules:

- Replace customer names with "the customer" or "Customer_[number]" if multiple customers are involved
- Replace email addresses with [EMAIL_REDACTED]
- Replace phone numbers with [PHONE_REDACTED]
- Replace national ID numbers, account numbers, and financial identifiers with [ID_REDACTED]
- Replace physical addresses with [ADDRESS_REDACTED]

Apply these replacements consistently within a single processing task. If "John Smith" is referred to later in the same content as "Mr. Smith" or "John", apply the same replacement to all references.

Do not apply anonymization to:

- The requesting user's own authenticated identity from the skill trigger context
- Names of employees within your organization when used in their professional capacity

# Layer 3 limitations

Layer 3 depends on the LLM following anonymization instructions consistently. LLMs are probabilistic. They occasionally miss a PII instance, apply replacements inconsistently, or fail to recognize an identifier as PII in an ambiguous context.

Layer 3 should never be the only anonymization control for high-sensitivity PII. It's a defense-in-depth layer, valuable for catching cases that Layers 1 and 2 don't cover, but not reliable enough to stand alone.

For regulated data categories, including health information, financial account data, national identity numbers, and data subject to GDPR, HIPAA, or CCPA, apply Layer 1 or Layer 2 controls as the primary anonymization mechanism. Use Layer 3 as a supplementary layer.

# Choose the right layers for your use case

The right combination of layers depends on the sensitivity of the data, the regulatory requirements, and where in the workflow PII appears.

Use the following table to help you choose the right layers for your use case:

| PII source | Recommended layer | Notes |
| --- | --- | --- |
| Knowledge base content ingested from documents | Layer 1 | Anonymize before ingestion using an on-premises agent |
| Structured data returned by skills | Layer 2 | Exclude unnecessary PII fields from the output schema and apply replacements for required but sensitive fields |
| Free-text fields in skill output (descriptions, notes) | Layer 2 | Route through an on-premises anonymization script before returning to the genie |
| User-provided content in conversation | Layer 3 | LLM instruction only. Supplement with Layer 2 if the content is passed to a skill |
| PII in genie output (summaries, reports) | Layer 3 | LLM instruction to anonymize before responding |
| Regulated health or financial data | Layers 1 and 2 | Don't rely on Layer 3 alone for regulated data categories. Use a combination of layers |

# Test anonymization before deployment

Test anonymization controls with realistic data containing actual PII patterns before deploying a genie that handles PII. A test that uses "[CUSTOMER_NAME]" as the test PII won't reveal whether the anonymization logic handles "Dr. Sarah Williams" correctly.

Test each layer independently:

  • Layer 1: Ingest the data and search the knowledge base for known PII from the source content and verify it has been replaced or removed.
  • Layer 2: Call the skill directly in a test recipe with realistic source data and inspect the output to confirm PII fields are excluded or replaced before the Return response step.
  • Layer 3: Switch to Test mode and provide the genie with content containing realistic PII patterns and ask it to summarize or process the content. Inspect the response and the skill inputs to confirm PII was handled according to the instructions.
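The Layer 2 check can be scripted. This is a hedged sketch with a deliberately naive stand-in redactor (`redact_free_text` is hypothetical; a real pipeline would use an NER-based detector), but it illustrates the testing principle above: assert against realistic PII, not placeholder tokens.

```python
import re

# Naive illustrative stand-in for the skill's free-text redaction step;
# a real pipeline would use an NER-based detector, not these patterns.
def redact_free_text(text):
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL_REDACTED]", text)
    return re.sub(r"\b(?:Dr|Mr|Ms|Mrs)\.\s+\w+(?:\s+\w+)?",
                  "[NAME_REDACTED]", text)

# Test with realistic PII patterns: a placeholder like [CUSTOMER_NAME]
# would pass trivially and prove nothing about real inputs.
note = "Dr. Sarah Williams (s.williams@example.org) asked for a callback."
redacted = redact_free_text(note)
assert "Sarah Williams" not in redacted
assert "example.org" not in redacted
print(redacted)
```

Run checks like these against each layer's output before go-live, and keep the realistic test inputs out of any shared knowledge base.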

