Introduction

Blueguard

(a data privacy product from Bluegennx.ai that helps enterprises adopt GenAI/ChatGPT applications securely and safely)

This is the end-user documentation for the Blueguard Privacy API. The documentation is organized as follows:

Getting Started explains how to begin using the API.
API Reference contains full details for the Blueguard Privacy REST API, including code samples.

Background

This section covers basic concepts, such as the definition of PII, and provides background on the Blueguard Privacy API.

What is PII?

PII stands for "Personally Identifiable Information" and encompasses any form of information that could be used to identify someone. Common examples of PII include names, phone numbers and credit card numbers. These directly identify someone and are hence called 'direct identifiers'.

In addition to direct identifiers, PII also includes 'quasi-identifiers', which on their own cannot uniquely identify a person, but can greatly increase the likelihood of re-identifying an individual when combined. Examples of quasi-identifiers include nationality, religion and prescribed medications. For example, consider a company with 10,000 customers. Knowing that a particular customer lives in Bangalore isn't likely to allow for re-identification, but knowing that they live in Bangalore, are an atheist, are female, have Irish nationality and have been diagnosed with cancer probably is!
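The narrowing effect in the example above can be sketched in a few lines of Python. Everything here is hypothetical: the attribute values, the uniformly random customer records and the function names are assumptions made for illustration, not real data or part of the Blueguard Privacy API.

```python
import random

# Hypothetical attribute values, chosen only for illustration.
ATTRIBUTES = {
    "city": ["Bangalore", "Mumbai", "Delhi", "Chennai"],
    "religion": ["atheist", "hindu", "christian", "muslim"],
    "gender": ["female", "male"],
    "nationality": ["Irish", "Indian", "French", "Japanese"],
    "diagnosis": ["cancer", "diabetes", "asthma", "none"],
}


def make_customers(n, seed=0):
    """Build n made-up customer records with uniformly random attributes."""
    rng = random.Random(seed)
    return [
        {name: rng.choice(values) for name, values in ATTRIBUTES.items()}
        for _ in range(n)
    ]


def count_matches(customers, known):
    """Count the customers that match every known attribute value."""
    return sum(all(c[k] == v for k, v in known.items()) for c in customers)


def narrowing(customers, known):
    """Reveal the known attributes one at a time and record how many
    candidate customers remain after each one."""
    sizes, partial = [], {}
    for key, value in known.items():
        partial[key] = value
        sizes.append(count_matches(customers, partial))
    return sizes


if __name__ == "__main__":
    target = {
        "city": "Bangalore",
        "religion": "atheist",
        "gender": "female",
        "nationality": "Irish",
        "diagnosis": "cancer",
    }
    print(narrowing(make_customers(10_000), target))
```

Each additional quasi-identifier can only shrink the set of candidates, never grow it. Real populations are far less uniform than this toy data, so rare combinations of attributes can isolate an individual even faster.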

Several regulations, such as the DPDPA, CCPA and GDPR, define what 'personal information' is. India's Digital Personal Data Protection Act (DPDPA) 2023 defines personal data as any data that relates to an identified or identifiable individual. This encompasses a wide array of information including, but not limited to:

Basic identifiers: name, address, phone number and email address.
Sensitive personal data: financial information, health data, biometric data, sexual orientation, and religious or political beliefs.
Digital identifiers: IP addresses, location data, and any other data that could directly or indirectly identify an individual.

(source: DPDPA website)

Even the party that the information relates to, identifies or could be linked to differs between laws (the 'data subject' in the GDPR vs. 'you or your household' in the CCPA).

What is De-identification?

De-identification is the process of obscuring information that might reveal someone's identity. De-identification plays a key role in data minimization, which means collecting only the personal data that is absolutely necessary. Not only does that protect individuals' privacy from the data collector (e.g., a corporation or government), but it also prevents significant harm to individuals and data collectors in the event of a data breach.

Whether redaction, anonymization and de-identification actually work is a topic of debate. The skepticism is largely due to a number of high-profile, improperly de-identified datasets released by companies that claimed they were anonymized. Another key reason is that legacy de-identification systems rely on rule-based PII detectors, which are usually made up of regular expressions (regexes) and therefore miss PII that does not match one of their fixed patterns.
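To make the limitation of rule-based detection concrete, here is a minimal sketch of a regex-based redactor. The two patterns and the `redact` helper are assumptions written for this illustration; they are deliberately simplistic and do not represent how any particular product, including Blueguard, detects PII.

```python
import re

# Illustrative patterns only; real-world formats vary far more than this.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact(text):
    """Replace every regex match with a [LABEL] marker."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    # Matches the expected formats, so both values are redacted.
    print(redact("Call 555-123-4567 or mail jane@example.com"))
    # Same information written informally; neither pattern matches.
    print(redact("Call five five five, one two three, or mail jane at example dot com"))
```

The first sentence is redacted because it matches the expected formats. The second carries the same kind of information in an informal form and passes through unchanged, which is the sort of miss that rule-based detectors are prone to.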

Why is Privacy Important in Machine Learning?

Modern neural network models such as transformers excel at memorizing training data and can leak sensitive information at inference time. A well-known example of things going wrong is the sensitive data leak with ChatGPT, where some users may have inadvertently seen other users' billing information when clicking on their own 'Manage Subscription' page (source: OpenAI website). For these reasons, neural network models such as transformers should never be trained on personal data without some privacy mitigation steps being taken.

Removing all identifying information also helps improve fairness. A model can't discriminate on the basis of age and gender if they have been removed from the input data!

Why Synthetic PII?

Generating synthetic PII has two key advantages. Firstly, any PII that the detector misses becomes much harder for an attacker to find, because it is hidden among realistic synthetic values. An attacker must first work out which PII is real, and only then use it to re-identify target subjects. Secondly, synthetic PII reduces data shift between training and inference. Transformer models in particular are typically pre-trained on large corpora of natural text, and text containing synthetic PII still reads like natural text. This limits the data shift between pre-training and fine-tuning and reduces any accuracy loss that might be induced by redaction markers.
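The first advantage can be illustrated with a small sketch that compares marker-based redaction with synthetic replacement. The detected spans, the pools of synthetic values and the function names are all hypothetical; a real system would obtain spans from a PII detector and generate replacements more carefully.

```python
import random

# Hypothetical pools of synthetic values, keyed by entity label.
SYNTHETIC = {
    "NAME": ["Priya Nair", "Tom Becker", "Aiko Sato"],
    "CITY": ["Pune", "Lyon", "Osaka"],
}


def replace_spans(text, spans, make_value):
    """Rebuild text, swapping each (start, end, label) span for make_value(label)."""
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])
        out.append(make_value(label))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


def redact(text, spans):
    """Replace each detected span with a [LABEL] marker."""
    return replace_spans(text, spans, lambda label: f"[{label}]")


def synthesize(text, spans, seed=0):
    """Replace each detected span with a randomly chosen synthetic value."""
    rng = random.Random(seed)
    return replace_spans(text, spans, lambda label: rng.choice(SYNTHETIC[label]))


if __name__ == "__main__":
    text = "Asha Rao moved to Bangalore with Vikram Shah."
    # Assume the detector found the first name and the city but missed "Vikram Shah".
    spans = [(0, 8, "NAME"), (18, 27, "CITY")]
    print(redact(text, spans))
    print(synthesize(text, spans))
```

In the redacted output, the missed name is the only name left in the sentence, so it is obviously real. In the synthesized output, it sits next to other plausible names and places, and a reader cannot tell which values are genuine without extra knowledge.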