The growing demand for data has placed the topic of privacy-enhancing technologies (PETs) at the forefront of discussions on responsible data sharing. Government institutions, businesses and even the general public are keen to know more about both the possibilities and limitations of these technologies. In a previous Tech-Know blog, we focused on two PETs in particular: federated learning and differential privacy. In this blog, we move to a new topic: synthetic data.
Synthetic data is an old de-identification technique that has recently undergone a sea change in functionality and scope of application. Early versions of it date back to the 1980s.Footnote 1 Like many fields, however, it has since leveraged advancements in artificial intelligence (AI) and machine learning (ML) to increase its data processing and analytics capabilities.
These advancements have enabled synthetic data to make significant progress towards addressing a long-standing challenge in de-identification. With more traditional de-identification techniques, it was virtually impossible to de-identify “big data” without significantly reducing data utility. With the help of AI/ML tools and methods, synthetic data is now better able to capture the statistical properties of complex high-dimensional datasets while helping to protect the identities of individuals.
The implications of this are potentially significant. AI/ML systems require access to large amounts of data to train their algorithms. By using synthetic data as a de-identification technique, organizations would have greater flexibility to share “fake” big data sets, which in turn could promote further research and development of AI/ML applications.
Given this role as a potential enabler of AI/ML, it is perhaps not surprising that synthetic data has received a lot of attention as of late. Forrester has named it one of five key advances to realizing the next level of AI for businesses. Gartner has predicted, “By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated.”
Yet, what is the reality of synthetic data? Does it really represent a wholesale advancement over more traditional de-identification techniques, or in the end do the same or similar trade-offs between privacy and utility apply? Is its role as a potential enabler of AI/ML systems a strictly neutral phenomenon, or does this aspect of it raise additional considerations whose role and importance in more traditional de-identification techniques were perhaps less pronounced?
In this blog, we will explore some aspects of synthetic data in the hopes of shedding light on these questions. To clarify, this document does not provide guidance on the application of synthetic data under federal privacy laws. Our aim is to discuss some pros and cons to help contextualize our understanding of synthetic data from a technical perspective, as opposed to a legal or policy perspective.
Our use of the term “de-identification” throughout this document is meant in a broadly technical sense to denote the application of tools and techniques to personal information with the aim of rendering the information non-identifiable, beyond the simple removal of direct identifiers. For the purposes of this document, the terms “de-identification” and “anonymization” may be used interchangeably in this specific sense.
IMPORTANT NOTE: The government tabled legislation in June 2022 that would update Canada’s federal private sector privacy law. If passed, Bill C-27 would include legal definitions for anonymized data and de-identified data that do not exist under the current law. (Under Bill C-27, anonymized data would be considered modified to such an extent and in such a manner that it is no longer considered personal information, while de-identified data would still be considered personal information.) This blog post does not provide a legal or policy position on whether and/or under what circumstances synthetic data would be considered de-identified or anonymized as defined under C-27.
What is synthetic data?
Before turning to a discussion of pros and cons, we need to first do some level setting and define what exactly synthetic data is and how it works.
Essentially, synthetic data is fake data produced by an algorithm whose goal is to retain the same statistical properties as some real data, but with no one-to-one mapping between records in the synthetic data and the real data. In terms of output, the main difference in comparison to other de-identification techniques is that synthetic data looks like unmodified identifiable data. Even though it is fake, it retains the same structure and level of granularity as the original.
In terms of functionality, there are four components to be aware of:
- The source data. This is the original data set whose statistical properties the synthetic data attempts to emulate. Other than the removal of variables with no analytic utility (that is, variables deemed not useful for secondary analysis), it does not undergo any data transformation. This means that, if the source data is about individuals, it will likely contain personal information. It will almost certainly contain quasi-identifiers (for example, age, gender or race) but may even contain some direct identifiers (for example, facial image, address or DNA profile).
- The generative model. This is the statistical model used to generate the synthetic data. It is derived from the source data. Multiple methods have been developed over the years.Footnote 2 However, the go-to method today is the use of AI/ML tools, including more advanced “deep learning” techniques. Using AI/ML, a generative model is able to “learn” the statistical properties of the source data without making strong assumptions about the underlying distributions of variables and correlations among them. The details differ depending on the AI/ML architecture used. However, the framework of this approach enables it to capture more complex, nonlinear relationships by default.
- The synthetic data. This is the data generated from the generative model. Typically, it is produced by taking random samples of data points drawn directly from the joint distribution of the generative model. However, in the case of deep learning methods, the process is slightly different. The samples are first randomly drawn from a distribution specific to the training of the model and then fed into the learned model to generate the synthetic data.
- Privacy and utility metrics. In effect, these measure the “distance,” that is, the amount of similarity or difference, between the joint distributions or the statistical properties of the source data and the synthetic data. There are multiple metrics available. For example, the distance can be measured by comparing single variable distributions, correlations among variables, the Euclidean distance between nearest neighbours in each data set, the accuracy of multivariate prediction models and the extent to which it is possible for a trained model to distinguish between real and synthetic records. In general, no single metric is appropriate for every use case.
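To make the four components above more concrete, the following is a minimal sketch in Python. It is not a recommended method. It assumes a made-up two-column source data set (age and income) and uses a simple classical model (a bivariate normal distribution) in place of the AI/ML models discussed above. All function names and values are hypothetical and chosen for illustration only.

```python
import math
import random

def fit_model(rows):
    """Generative model: means, standard deviations and correlation of two columns."""
    n = len(rows)
    mx = sum(x for x, _ in rows) / n
    my = sum(y for _, y in rows) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in rows) / n)
    sy = math.sqrt(sum((y - my) ** 2 for _, y in rows) / n)
    cov = sum((x - mx) * (y - my) for x, y in rows) / n
    return {"mean": (mx, my), "sd": (sx, sy), "rho": cov / (sx * sy)}

def sample(model, n, rng):
    """Synthetic data: random draws from the fitted joint distribution."""
    (mx, my), (sx, sy), rho = model["mean"], model["sd"], model["rho"]
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        rows.append((mx + sx * z1, my + sy * (rho * z1 + residual * z2)))
    return rows

def nearest_distance(synthetic, source, sd):
    """Privacy metric: smallest standardized Euclidean distance between
    any synthetic record and any source record."""
    sx, sy = sd
    return min(
        math.hypot((a - c) / sx, (b - d) / sy)
        for a, b in synthetic
        for c, d in source
    )

rng = random.Random(0)

# Source data (made up for this sketch): age, and an income that rises with age.
source = []
for _ in range(500):
    age = rng.uniform(20, 70)
    source.append((age, 1000 * age + rng.gauss(0, 5000)))

model = fit_model(source)            # the generative model
synthetic = sample(model, 500, rng)  # the synthetic data

# Utility metric: how far the correlation in the synthetic data drifts from the source.
correlation_gap = abs(model["rho"] - fit_model(synthetic)["rho"])
# Privacy metric: a value near zero would suggest a near-copy of a source record.
closest = nearest_distance(synthetic, source, model["sd"])
print(f"correlation gap: {correlation_gap:.3f}, closest distance: {closest:.4f}")
```

The two metrics pull in opposite directions: a small correlation gap indicates that utility is preserved, while a very small nearest-neighbour distance may indicate that a synthetic record sits uncomfortably close to a real one.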
In addition to the above, there are two general types of synthetic data:
- Fully synthetic data. This is where the full set of variables in the source data are synthetically generated.
- Partially synthetic data. This is where only the quasi-identifiers or other sensitive variables in the source data are synthetically generated. The remaining variables are present in their original form.
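The difference between the two types can be sketched in a few lines of Python. The example below is a simplified illustration of partially synthetic data only. It assumes a tiny made-up data set in which age is treated as the quasi-identifier to be synthesized and diagnosis is left in its original form. The function name and the choice of a per-group normal distribution as the generative model are assumptions made for this sketch.

```python
import random
import statistics
from collections import defaultdict

def partially_synthesize(rows, synth_col, group_col, rng):
    """Replace one column with synthetic values and keep every other column as-is.

    The (deliberately crude) generative model is a normal distribution
    fitted to synth_col within each value of group_col.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[synth_col])
    params = {
        group: (statistics.mean(values), statistics.pstdev(values))
        for group, values in groups.items()
    }
    result = []
    for row in rows:
        mu, sigma = params[row[group_col]]
        new_row = dict(row)  # copy, so the source data is left untouched
        new_row[synth_col] = round(rng.gauss(mu, sigma))
        result.append(new_row)
    return result

# Made-up source data for illustration.
source = [
    {"age": 34, "diagnosis": "asthma"},
    {"age": 41, "diagnosis": "asthma"},
    {"age": 29, "diagnosis": "asthma"},
    {"age": 67, "diagnosis": "diabetes"},
    {"age": 72, "diagnosis": "diabetes"},
    {"age": 58, "diagnosis": "diabetes"},
]

partial = partially_synthesize(source, "age", "diagnosis", random.Random(0))
for row in partial:
    print(row)
```

Fully synthetic data would instead generate every column, including diagnosis, from the model.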
What are the pros?
The buzz surrounding synthetic data is not without merit. In comparison to other de-identification techniques, it offers a number of advantages. There are three main ones to consider.
- Fully synthetic data can protect against traditional re-identification attacks. In the past, most successful re-identification attacks have exploited two types of failures in the de-identification process.
The first is that the releasing organization fails to properly identify which variables should be treated as quasi-identifiers. This is what happened in the oft-cited Netflix Prize attack.Footnote 3 Netflix did not realize that individuals’ movie ratings were also available on the IMDb website, which made it easy to re-identify some of the customers in its data set, namely those who had rated the same movies on both sites.
The second is that the releasing organization fails to apply sufficiently robust de-identification techniques to the variables it has identified. This is what happened to New York City when it released data on historical trips and fare logs from taxis. The city used a (non-salted) one-way hash to suppress the driver’s licence number and the medallion number of the taxi for each trip. However, because the total count of possible values for each number was small from a computational perspective, it wasn’t difficult for a computer scientist to compute all possible hashes and determine the original licence and medallion number for each trip.Footnote 4
Because fully synthetic data synthesizes all variables in the source data and applies the same generation process to each, it generally protects against these types of re-identification attacks by default.
- It can capture the statistical properties of high-dimensional data sets. In general, de-identification techniques work in one of two ways. Either they use generalization techniques to protect individuals by giving effect to the idea of hiding in a crowd or they use randomization techniques to implement a form of plausible deniability. Synthetic data can be viewed as a combination of both approaches. It hides individuals in the statistical properties of the source data, while providing them with a form of plausible deniability through the generation process. By taking a kind of “best of both worlds” approach, synthetic data is better able to capture the statistical properties of complex high-dimensional data sets, while helping to protect the identities of individuals.
- The de-identification process can be automated to a greater degree. The above two points lead to a third. In general, if a de-identification technique does not depend on assumptions regarding which variables should be considered quasi-identifiers, and if its scope of application includes data sets of varying complexity and sizes, then the ability of that technique to automate the overall de-identification process increases. An automated process can facilitate more complex and varied tasks in less time and at less cost. This applies in particular to fully synthetic data.
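The taxi example in the first point above shows why an unsalted hash offers little protection when the set of possible values is small. The sketch below illustrates the general idea in Python. The five-digit identifier format, the sample values and the choice of SHA-256 are assumptions made for illustration; they are not the actual format or hash function used in the New York City release.

```python
import hashlib

def unsalted_hash(value: str) -> str:
    """One-way hash with no salt: the same input always gives the same output."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

# What the releasing organization publishes (hypothetical identifiers).
released = [unsalted_hash("04821"), unsalted_hash("93110")]

# What an attacker can do: hash every possible identifier once
# and build a reverse lookup table.
lookup = {}
for i in range(100_000):
    candidate = f"{i:05d}"
    lookup[unsalted_hash(candidate)] = candidate

recovered = [lookup[h] for h in released]
print(recovered)
```

With only 100,000 possible values, building the lookup table takes well under a second on ordinary hardware. Fully synthetic data avoids this particular failure because no transformed copy of the original identifier is released at all.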
What are the cons?
Despite these advantages, synthetic data also raises a number of issues and concerns. These fall into two categories. Some are specific to de-identification; others arise through the close connection between synthetic data and the development of AI/ML systems.
With respect to de-identification, synthetic data raises many of the same concerns as more traditional de-identification techniques, but with different details. There are three main issues to consider:
- Re-identification is still possible if records in the source data appear in the synthetic data. At first glance, it may appear as though synthetic data has solved the re-identification problem. However, upon closer inspection, it becomes clear that the risk remains, albeit in a different form. If the generative model learns the statistical properties of the source data too closely or too exactly, that is, if it “overfits” the data, then the synthetic data will simply replicate the source data, making re-identification easy. Even in the case where the generative model does not suffer from overfitting, replication of records may still happen by chance, albeit with lower likelihood. Thus, this risk remains in some form regardless of whether the modelling was done properly or not.
As Hundepool and others remark, if a synthetic record matches an individual’s real information, simply telling the individual that the record was generated synthetically is unlikely to be a satisfying explanation.Footnote 5 Further, empirical evaluations suggest that some synthetic generation tools produce synthetic data that is concerningly close to the source data by default.Footnote 6
- Outliers are at risk of membership inference attacks. Recent research into the security of AI/ML models has led to the establishment of a new class of re-identification attacks.Footnote 7 One such attack is what is known as a “membership inference” attack. In the case of synthetic data, this is where an attacker attempts to learn whether an individual’s record was present in the source data by analyzing properties of the synthetic data. Sometimes even membership in a data set can reveal sensitive information. For example, if a data set is specific to individuals with dementia or HIV, then the mere fact that an individual’s record was included in it would reveal personal information about them. Synthetic data does not fully protect against membership inference attacks. In particular, research suggests that outliers or records in the source data with attribute values outside the 95-percent quantile remain at high risk.Footnote 8
- In general, it does not protect against attribute disclosure. Re-identification is one of two forms of privacy risk associated with de-identified data. The other is what is known as “attribute disclosure.” This is where an attacker is able to learn the value of a confidential attribute for a given individual without necessarily identifying the individual or the individual’s record. Typically, it happens by linking the individual to membership in a group with a common attribute, either deterministically or probabilistically.
In general, synthetic data does not protect against attribute disclosure, and thus carries the risk that sensitive information can be derived from the published data.Footnote 9 Of course, to what extent privacy laws should regulate attribute disclosure without re-identification is an ongoing debate.Footnote 10 Nonetheless, synthetic data casts the issue in a new light, given the increased potential it has to reveal sensitive correlations in the source data that are not publicly known. Some researchers have suggested the use of ethics reviews as a mechanism to help protect against the risks of attribute disclosure.Footnote 11
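Attribute disclosure can be illustrated with a short Python sketch. The example assumes a made-up synthetic data set in which age band and region act as quasi-identifiers and condition is the confidential attribute. The function name and all values are hypothetical.

```python
from collections import Counter, defaultdict

def attribute_inference_table(rows, quasi_ids, sensitive):
    """For each combination of quasi-identifier values, return the most common
    sensitive value and the share of records in that group that carry it."""
    groups = defaultdict(Counter)
    for row in rows:
        key = tuple(row[q] for q in quasi_ids)
        groups[key][row[sensitive]] += 1
    table = {}
    for key, counts in groups.items():
        value, count = counts.most_common(1)[0]
        table[key] = (value, count / sum(counts.values()))
    return table

# Made-up synthetic records: none of them corresponds to a real person.
synthetic = (
    [{"age_band": "70-79", "region": "North", "condition": "dementia"}] * 9
    + [{"age_band": "70-79", "region": "North", "condition": "none"}]
    + [{"age_band": "30-39", "region": "South", "condition": "none"}] * 3
    + [{"age_band": "30-39", "region": "South", "condition": "asthma"}] * 2
)

table = attribute_inference_table(synthetic, ["age_band", "region"], "condition")
for group, (value, share) in table.items():
    print(group, value, f"{share:.0%}")
```

In this sketch, an attacker who knows only that a person is in their seventies and lives in the North could infer a likely diagnosis with high confidence, without re-identifying any record. The inference works because the synthetic data preserves the correlation present in the source data, which is exactly the property that makes it useful.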
In addition, given that synthetic data is an enabler of AI/ML systems, it also raises additional considerations whose role and importance in more traditional de-identification techniques were perhaps less pronounced. In contrast to general statistics and insights, AI/ML is more fine-grained and capable of making individual-level predictions and decisions that may significantly affect the rights and freedoms of individuals. For this reason, it is important to think of synthetic data within the broader context of AI/ML systems, since it plays a role in contributing to their development. There is one main consideration to mention:
- It may reproduce biases in AI/ML systems. The promise of synthetic data is that it will make big data sets more widely available for the purposes of training, validating and testing AI/ML systems. However, when used as a de-identification technique, synthetic data doesn’t address the main issue with training data, namely that it may contain historical or other types of biases that would then be learned and ultimately reified in the AI/ML systems it helped to create.Footnote 12 While synthetic data helps to protect the identities of individuals whose statistical properties and characteristics make up the training data, any biases present in the source data about them will be reproduced by default. Ultimately, this would affect the fairness and/or accuracy of the AI/ML system.
On this point above, it is important to note that synthetic data can also be used beyond de-identification as a tool to help address issues of bias in training data. For example, it can be used as a data augmentation tool to improve imbalanced data sets by generating more examples of minority classes.Footnote 13 However, even here care must be taken not to reproduce biases. If the data augmentation only strengthens the signal already in the data and the signal itself is flawed, then the synthetic data may further exacerbate biases instead of reducing them.Footnote 14 This use of synthetic data as a de-biasing tool is an emerging area of research.
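The caution about data augmentation can be seen in a small Python sketch. The example below generates additional minority-class records by interpolating between random pairs of existing minority records. This is a simplified, hypothetical variant of interpolation-based oversampling, not a specific published method, and the records, labels and function name are made up for illustration.

```python
import random

def augment_minority(records, label, target_count, rng):
    """Create synthetic records for one class by interpolating between
    random pairs of existing records of that class."""
    minority = [features for features, y in records if y == label]
    synthetic = []
    while len(minority) + len(synthetic) < target_count:
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the line segment from a to b
        synthetic.append(tuple(ai + t * (bi - ai) for ai, bi in zip(a, b)))
    return [(features, label) for features in synthetic]

# Made-up imbalanced data set: (income in thousands, years employed) and an outcome.
records = [
    ((52.0, 4.0), "approved"), ((61.0, 6.5), "approved"), ((48.0, 3.0), "approved"),
    ((75.0, 9.0), "approved"), ((66.0, 7.0), "approved"), ((58.0, 5.0), "approved"),
    ((31.0, 1.0), "denied"), ((27.0, 0.5), "denied"), ((35.0, 2.0), "denied"),
]

new_rows = augment_minority(records, "denied", 6, random.Random(1))
balanced = records + new_rows
for features, label in new_rows:
    print(features, label)
```

The classes are now balanced in number, but every new record lies between existing minority records. The augmentation adds no information that was not already in the data, so if the original pattern reflects a historical bias, the additional records reinforce it.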
What about combining synthetic data with other de-identification techniques?
At this point, a key question to ask is whether synthetic data can be combined with other de-identification techniques to help address some of the privacy risks inherent in it. If by default synthetic data continues to raise many of the same concerns as more traditional de-identification techniques, then perhaps a combination of techniques will prove more effective.
The answer to this question is, of course, yes. While the issues relating to potential biases would not be addressed (at least not directly), additional de-identification techniques could be applied at different stages of the generation process to reduce privacy risks. There are three stages to consider:
- Before the generative model is trained. De-identification techniques such as generalization or suppressionFootnote 15 can be applied to the source data to remove or reduce the presence of outliers that may be at risk of membership inference attacks.
- While the generative model is being trained. Differential privacy can be applied to the statistical distributions learned by the generative model to help protect against membership inference attacks as well as data replication through overfitting.Footnote 16
- After the generative model is trained. Suppression can be applied to the synthetic data to remove records that are too close or similar to ones in the source data.Footnote 17
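As an illustration of the third stage, the following Python sketch removes synthetic records that fall within a chosen distance of any source record. The records, the Euclidean distance measure and the threshold value are all assumptions made for this example. In practice, variables would normally be scaled first, and the threshold would need to be justified for the data at hand.

```python
import math

def suppress_close_records(synthetic, source, threshold):
    """Keep only synthetic records whose nearest source record is at least
    `threshold` away (Euclidean distance on the raw values)."""
    kept = []
    for record in synthetic:
        nearest = min(math.dist(record, original) for original in source)
        if nearest >= threshold:
            kept.append(record)
    return kept

# Made-up records: (age, income in thousands).
source = [(34, 52.0), (58, 71.5), (41, 60.0)]
synthetic = [(34, 52.0), (45, 66.0), (58, 71.0), (29, 40.0)]

released = suppress_close_records(synthetic, source, threshold=1.0)
print(released)
```

Here the exact copy `(34, 52.0)` and the near copy `(58, 71.0)` are suppressed. The same trade-off discussed throughout applies: a larger threshold removes more risky records but also removes more of the data that made the synthetic set useful.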
However, a new question now arises: How different is synthetic data from other de-identification techniques really? If the application of it requires the use of other de-identification techniques, then don’t the same or similar trade-offs between privacy and utility apply to it as well, including in the case of high-dimensional data sharing?
This appears to be what research is saying. Despite advancements over more traditional de-identification techniques, synthetic data is “not a silver bullet.”Footnote 18 According to Stadler and others, “If a synthetic dataset preserves the characteristics of the original data with high accuracy, and hence retains data utility for the use cases it is advertised for, it simultaneously enables adversaries to extract sensitive information about individuals.”Footnote 19
In this blog, we explored various aspects of synthetic data in the hopes of getting closer to the truth behind the buzz surrounding it. Based on the above, it appears that the reality is a complex mix of pros and cons. Synthetic data offers advantages over more traditional de-identification techniques, but it also raises a unique set of issues and concerns. As always with de-identification, the key is to be mindful of the risks and trade-offs!
- Synthetic data is an advanced de-identification technique with pros and cons.
- On the one hand:
  - It can protect against traditional re-identification attacks.
  - It can capture the statistical properties of high-dimensional data sets.
  - The de-identification process can be automated to a greater degree.
- On the other hand:
  - Re-identification is still possible if records in the source data appear in the synthetic data.
  - Outliers are at risk of membership inference attacks.
  - In general, it does not protect against attribute disclosure.
  - Also, it may reproduce biases in AI/ML systems.
  - Combining it with other de-identification techniques raises the same or similar trade-offs between privacy and utility.