Pan-Canadian De-Identification Guidelines for Personal Health Information
Children’s Hospital of Eastern Ontario Research Institute
This study examines the risk of re-identifying anonymized Personal Health Information (PHI) when the data is combined with information from public databases or with inferential data (for example, predicting gender and year of birth from first names and graduation years). The study found that, in some circumstances, the success rate of a re-identification attack on anonymized health data can be quite high. It also found that such re-identification risks are not trivial, especially among job seekers, who may post enough personal information about themselves on publicly accessible websites to permit some simple re-identification attacks.
Based on the findings, the research team formulated practical guidelines and a concrete data anonymization tool that will allow health information custodians to manage re-identification risks in their data releases and to protect the privacy of Canadians. The major focus is the anonymization of quasi-identifiers such as gender, date of birth and postal/ZIP codes.
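To illustrate what anonymizing quasi-identifiers can look like in practice, here is a minimal Python sketch of generalization: date of birth is coarsened to year of birth, and a Canadian postal code is truncated to its first three characters (the forward sortation area). This is not the tool described in the report. The function name `generalize`, the field names and the sample record are assumptions made for the example.

```python
from datetime import date


def generalize(record):
    """Coarsen quasi-identifiers in one record (illustrative sketch only).

    Hypothetical field names: 'gender', 'date_of_birth' (a datetime.date)
    and 'postal_code' (a Canadian postal code string).
    """
    # Normalize the postal code, then keep only the first three characters.
    postal = record["postal_code"].replace(" ", "").upper()
    return {
        "gender": record["gender"],
        # Keep the year only; month and day are dropped.
        "year_of_birth": record["date_of_birth"].year,
        "region": postal[:3],
    }


# Invented sample record, for illustration only.
example = {
    "gender": "F",
    "date_of_birth": date(1975, 3, 9),
    "postal_code": "k1h 8l1",
}
generalized = generalize(example)
```

Coarser values place more people in the same group, which lowers re-identification risk at the cost of analytic detail. How far to generalize depends on the risk threshold the custodian adopts.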
The report notes U.S. research by Latanya Sweeney finding that 87 per cent of the U.S. population can be uniquely identified through public data sources using three quasi-identifier variables: ZIP Code, gender and date of birth. Gender, date of birth and city, town or municipality of residence can also uniquely identify 53 per cent of the U.S. population.
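The idea behind such uniqueness figures can be sketched in a few lines of Python: group records by their combination of quasi-identifier values and count how many records sit alone in their group. This is an illustrative sketch, not the method used in the cited research. The function name `unique_fraction` and the toy records are invented for the example.

```python
from collections import Counter


def unique_fraction(records, quasi_identifiers):
    """Return the share of records whose quasi-identifier combination is unique."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys)


# Hypothetical toy records, invented for illustration only.
records = [
    {"zip": "02138", "gender": "F", "dob": "1970-01-02"},
    {"zip": "02138", "gender": "F", "dob": "1970-01-02"},
    {"zip": "02138", "gender": "M", "dob": "1981-05-17"},
    {"zip": "02139", "gender": "F", "dob": "1990-11-30"},
]

all_three = unique_fraction(records, ["zip", "gender", "dob"])
gender_only = unique_fraction(records, ["gender"])
```

In this toy data, half of the records are unique on all three variables combined, while only a quarter are unique on gender alone. Adding variables can only split groups further, so the unique share never falls as quasi-identifiers are combined.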
The authors set out to determine what the parallel situation might be using quasi-identifiers and databases available in Ontario. To examine what identification databases might be available to a re-identification attacker, the researchers looked into what datasets are available from 29 Ontario Ministries, commercial information brokers, genealogical sources, professional societies, Statistics Canada and Elections Canada. They also tested the ability to link data about individuals through various publicly available sources, resulting in some hard statistical findings about the ability to link lists of Ontario physicians and lawyers to home postal codes and date of birth, and the ability to obtain date of birth, home phone numbers and the gender of home owners in Ottawa and Toronto. The report also examines inference attacks – particularly the accuracy with which gender and year of birth can be inferred using genderizing software and other predictive methods. There is also detailed analysis of the ability to predict a person’s home postal code from another postal code – for example a work address or a doctor’s address. The researchers considered urban and rural postal codes in Alberta, Ontario and Nova Scotia.
Based on this research, the authors concluded that region (postal code) alone, gender alone, year of birth alone, and the combination of gender and region were quasi-identifiers representing a consistently low risk of re-identification of anonymized data. They warn, however, that this only applies in the specific circumstances of their attack scenario and assumptions about risk threshold. They suggest that further work should consider record level re-identification risk.
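One common way to express record-level risk is to assign each record a risk of one divided by the number of records sharing its quasi-identifier values, then compare that against a chosen threshold. The Python sketch below assumes this measure; the report's own metric and threshold may differ. The function names, the sample rows and the 0.5 threshold are hypothetical.

```python
from collections import Counter


def record_level_risk(records, quasi_identifiers):
    """Return one risk value per record: 1 / size of its equivalence class."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    sizes = Counter(keys)
    return [1.0 / sizes[k] for k in keys]


def exceeds_threshold(records, quasi_identifiers, threshold):
    """Return True if any single record's risk is above the threshold."""
    return any(
        risk > threshold
        for risk in record_level_risk(records, quasi_identifiers)
    )


# Invented rows, for illustration only.
rows = [
    {"region": "K1H", "gender": "F", "yob": 1970},
    {"region": "K1H", "gender": "F", "yob": 1970},
    {"region": "K1H", "gender": "M", "yob": 1970},
    {"region": "K1H", "gender": "M", "yob": 1982},
]

# Hypothetical threshold of 0.5, chosen only for the example.
low = exceeds_threshold(rows, ["region", "gender"], 0.5)
high = exceeds_threshold(rows, ["region", "gender", "yob"], 0.5)
```

Here the combination of region and gender stays within the threshold, while adding year of birth isolates two records and pushes their risk to 1.0. A check of this kind looks at the worst-off record rather than an average across the data set.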
The study contains recent research findings on the extent of personal information (name, address, postal code, telephone number and an age indicator) that Canadians were willing to post on the Internet in job resumes, and on what personal information could be recovered from sold-off hard drives. Using a file recovery utility, the researchers were able to recover personal information from 39 of 60 drives they acquired from used computer equipment vendors, despite repartitioning and reformatting. The vast majority of drives with recovered data had personal information on them, including salary information and tax returns, personal correspondence, information on life insurance policies and inheritances, employee payroll data, police record checks, divorce documents, and personal health information, including one drive with highly sensitive mental health information about a number of individuals.
All of this leads to a recommended decision-making process for anonymizing a data set, with detailed bulleted considerations for different quasi-identifiers.
OPC Funded Project
This project received funding support through the Office of the Privacy Commissioner of Canada’s Contributions Program. The opinions expressed in the summary and report(s) are those of the authors and do not necessarily reflect those of the Office of the Privacy Commissioner of Canada. Summaries have been provided by the project authors. Please note that the projects appear in their language of origin.
Children's Hospital of Eastern Ontario
401 Smyth Road