Data at the Tipping Point: Managing and Mitigating the Risk of Re-identification


Remarks at the 6th Electronic Documents and Records Management Course organized by the Federated Press

January 25, 2011
Ottawa, Ontario

Address by Patricia Kosseim
General Counsel, Office of the Privacy Commissioner of Canada

(Check against delivery)


Introduction

On behalf of the Office of the Privacy Commissioner of Canada, I would like to thank Federated Press for the invitation to be here today and in particular, to discuss the privacy implications of de-identifying and re-identifying data with my colleague Dr. Khaled El Emam. This is a very important and timely subject that is attracting a great deal of interest.

Privacy laws are based on the concept of personal information. Canada’s two federal privacy laws have slightly different definitions but they are both based on the notion of “personal information” as a defined and distinct concept. If you are collecting, using or disclosing personal information, the laws apply; if you are not, the laws don’t apply.

But this simple vision is being challenged on at least two fronts.

First of all, this all-or-nothing approach to personal information may no longer be adequate. Certain types of information are not always linked to specific individuals and may not necessarily meet a strict definition of personal information, but they are still worthy of some measure of protection through proper safeguarding. Take as examples much of the information we generate as we trawl through the Internet: our search habits, our online "window browsing" and our geospatial location.

Secondly, the belief that one can de-identify personal information and thus use it freely is being increasingly challenged. Paul Ohm, in a recent paper, has provocatively argued that "(d)ata can either be useful or perfectly anonymous but never both". Ohm refers to the false promise of anonymization, suggesting that we can no longer rely on it to protect data. His claim represents one side of a very vigorous debate now taking place, and my co-presenter Khaled is an important participant in that debate.

Today, I would like to focus on some of the privacy risks associated with potential re-identification and share some of our reflections on when de-identified data approaches the “tipping point” of identifiability. To set the stage, I will discuss some of the recent experiences our Office has had in dealing with complaints and audits that raise potential identifiability issues.  I will share with you some recommended approaches for organizations, including what I might term “mitigating strategies.” Finally, I would like to explain what our Office, as the federal privacy regulator, is doing to stay ahead of the curve on some of these fast-changing challenges posed by re-identification technology.

Privacy Risks Associated with Potential Re-Identification

The Internet and new business models have facilitated massive collection and distribution of information which may be useful, even essential to the efficient operation of business and government. However, what initially appears to be “innocuous” anonymous or de-identified information can in some cases be combined with information from other sources, and then manipulated using powerful database technologies to produce data that can be linked back to specific individuals.

We all leave data trails, whether from publicly available sources, or social networks, or various records that document our online purchases or recreational activities. Some of these data alone might be non-identifying and organizations may therefore take the view that they fall outside the scope of data protection legislation. But depending on the information and how it is used or combined with other data, this may not always be the case. Technology can now piece together a picture of an individual based on this increasingly granular data and the spectre of potential re-identification is raising the stakes in terms of privacy implications.
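The linking attack described above can be sketched in a few lines. This is a purely illustrative example: the datasets, field names and records are all invented, and real attacks involve far larger and messier data. The point is that neither dataset contains a name-plus-diagnosis pairing on its own, yet joining them on shared "quasi-identifiers" produces exactly that.

```python
# Hypothetical illustration of re-identification by linkage: two
# datasets that each look non-identifying are joined on shared
# quasi-identifiers (postal code, birth year, sex). All records
# below are invented for illustration.

# An "anonymized" dataset released for research: no names, but
# quasi-identifiers remain.
health_records = [
    {"postal": "K1A0B1", "birth_year": 1975, "sex": "F", "diagnosis": "asthma"},
    {"postal": "K2P1L4", "birth_year": 1982, "sex": "M", "diagnosis": "diabetes"},
]

# A separate, publicly available list (e.g. a directory or voters' roll).
public_list = [
    {"name": "A. Tremblay", "postal": "K1A0B1", "birth_year": 1975, "sex": "F"},
    {"name": "B. Singh", "postal": "K2P1L4", "bir_year": 1982, "sex": "M"},
]
public_list[1] = {"name": "B. Singh", "postal": "K2P1L4", "birth_year": 1982, "sex": "M"}

def link(records, public):
    """Join the two datasets on the shared quasi-identifiers,
    re-attaching names to the 'anonymous' health records."""
    matches = []
    for r in records:
        for p in public:
            if (r["postal"], r["birth_year"], r["sex"]) == (
                p["postal"], p["birth_year"], p["sex"]
            ):
                matches.append({"name": p["name"], "diagnosis": r["diagnosis"]})
    return matches

print(link(health_records, public_list))
```

When the combination of quasi-identifiers is unique within the population, a single join like this suffices; this is why removing only direct identifiers does not, by itself, anonymize a dataset.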

Through our investigations, audit and policy work, the OPC has had significant experience with the privacy risks flowing from re-identification. This issue has arisen in our work on subjects as diverse as Internet technology and electronic devices; electronic health records; geospatial privacy; statistics and demographics; and data matching.

Deep Packet Inspection

Deep packet inspection (DPI) is a good example. DPI is a technology used by Internet Service Providers to manage traffic on telecommunications networks large and small. Network managers argue that it is one tool among many that helps them ensure networks can accommodate the varying bandwidth demands of their users. It can be used to search for viruses, spam and peer-to-peer applications with a view to ensuring network integrity, optimal traffic flow and quality of service.

Privacy advocates, however, are concerned with the capacity of providers to also use DPI technology to peer inside packets and reassemble fragments of information beyond transmission-related data, inspecting user-generated content or "payload data" that flows through the networks – whether mp3 files, personal e-mail messages or corporate documents. If configured and deployed this way, DPI technology can be of enormous value to marketers as a means of targeting advertisements based on "real" user data.

In the online context, we often hear that Internet Protocol (IP) addresses are not personal information. At issue in a recent complaint to our Office was whether sender and recipient IP addresses collected as part of deep packet inspection technology deployed by Bell Canada to manage its networks constituted a collection of personal information under the Personal Information Protection and Electronic Documents Act – PIPEDA.

Our investigation revealed that Bell assigns a dynamic IP address to a Sympatico subscriber each time they log on to the network. However, because each dynamic IP address is linked to an invariable "subscriber user ID", Bell retains the ability to trace a dynamic IP address back to an individual Sympatico subscriber at a given time. The OPC concluded that an IP address can constitute personal information if it can be associated with an identifiable individual. At least with regard to its own subscribers, Bell could determine which individual was associated with a dynamic IP address at any given time; in this particular context, therefore, the IP addresses did constitute personal information.
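The mechanics at issue can be illustrated with a small sketch. The log format, subscriber IDs, addresses and timestamps below are invented for illustration and do not depict Bell's actual systems; the point is simply that a "dynamic" address is reversible whenever the provider logs who held which address and when.

```python
# Hypothetical sketch of why a dynamic IP address can still be
# personal information: if the provider keeps a log of which
# subscriber held which address over which interval, the mapping
# can be reversed. All identifiers below are invented.

ip_assignment_log = [
    # (subscriber_user_id, ip_address, session_start, session_end)
    ("sub-001", "203.0.113.7", "2011-01-25T08:00", "2011-01-25T12:00"),
    ("sub-002", "203.0.113.7", "2011-01-25T12:30", "2011-01-25T18:00"),
]

def subscriber_for(ip, timestamp):
    """Trace a dynamic IP address, at a point in time, back to the
    subscriber who held it. ISO-8601 strings compare correctly
    lexicographically, so plain string comparison suffices here."""
    for sub_id, addr, start, end in ip_assignment_log:
        if addr == ip and start <= timestamp <= end:
            return sub_id
    return None  # no match: address unknown or outside any session

print(subscriber_for("203.0.113.7", "2011-01-25T09:30"))  # "sub-001"
```

The same IP address resolves to different subscribers at different times, which is exactly why identifiability turns on what the organization holding the logs can do, not on the address in isolation.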

In this case, the Assistant Commissioner recommended that Bell provide clear information to its customers about deep packet inspection and how the company’s traffic management practices affect the privacy of customers. She also noted that any use of deep packet inspection by Bell that would expand the collection, use or disclosure of personal information beyond the current purpose of managing network traffic, would require renewed, meaningful and informed consent.

In sum, the determination of whether an IP address is personal information under PIPEDA is largely contextual, and will depend in part on an organization’s ability to associate an IP address with a specific individual in the circumstances. Future developments, including the move towards the use of static IP addresses, coupled with the ability and incentive to facilitate the linking of an IP address to an individual, may result in IP addresses being considered, more often than not, personal information.

Street View Imaging

Most of you are familiar with Google Street View, its street-imaging and mapping application.

Our Office began monitoring this issue in 2007, when we learned that Google was photographing the streets of some Canadian cities for the eventual launch of its Street View application in Canada. Our Office was concerned that many of the images at the time were of sufficient resolution, and taken at close enough range, to identify individuals, discern their activities and situate their geographic whereabouts. This was being done without the apparent knowledge or consent of the individuals, in potential contravention of PIPEDA.

Following a number of discussions with Google and a Canadian-based company offering similar services, Canpages, both organizations agreed to blur the faces of individuals and also provide easy, transparent and timely means for individuals to have images of themselves taken down. Both companies also agreed not to retain unblurred images (or at least not to do so beyond a reasonable time), to safeguard them and safely dispose of them.  They also undertook to notify people through public means prior to sending their cars through neighborhoods. Our understanding is that although the blurring technology is less than perfect, the companies continue to look for ways to improve this technology.

As you also may have heard, our Office conducted a related investigation of Google after it was discovered that the company, in an effort to collect information about Wi-Fi access points to enhance its location-based services, had inadvertently also collected payload data from unsecured wireless networks. Several jurisdictions, including Canada, undertook parallel investigations into the matter. One issue with which several data protection authorities contended was whether or not Media Access Control ("MAC") addresses (manufacturer-assigned codes that allow devices to speak to one another), either alone or in combination with other information, constitute personal information. The OPC did not have to address this question since it found, as a matter of fact, clear examples of personal information among the payload data, including complete e-mails, user names and passwords, and even medical conditions of specified individuals. However, as in the case of IP addresses, the question of whether or not MAC addresses constitute personal information is highly contextual and must be considered in conjunction with what other information is also available to different users in different circumstances.

Matching Publicly Available Data for Marketing Purposes

In another investigation, the OPC examined whether, or at what point, the combination of publicly available databases may create newly constituted personal information. This case involved an organization that supplies lists of consumer-related data to businesses for direct-marketing purposes. The complainants alleged that the organization created personalized demographic information by matching White Pages phonebook information with Statistics Canada census-level demographic data, and that the use and disclosure of this newly created "personal information" required the consent of those to whom the information related.

Our investigation determined that the organization’s process of imputing aggregate socio-demographic data to individual consumers living in a common census area did not necessarily change the status of the telephone book information. The mere sorting of consumer names and addresses according to census-level socio-demographic data did not convert publicly available information into personal information within the meaning of PIPEDA.

We therefore determined that the combined lists consisted of information about neighbourhoods, rather than about identifiable individuals, and found the consent complaint to be not well-founded. Nevertheless, the case is an example of how the Office was challenged to think about the consequences of merging two sets of data. In particular, it raised legitimate questions about how close data can come to the "tipping point", particularly in homogeneous neighbourhoods, where the probability that group characteristics also represent individual characteristics may be higher.
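The kind of imputation at issue can be sketched as follows. The area codes, statistics, names and figures are invented for illustration; the point is that each listing carries the *area's* aggregate characteristics, not the individual's, yet in a highly homogeneous neighbourhood the aggregate may closely approximate the individual.

```python
# Hypothetical sketch of imputing census-area aggregates to
# individual directory listings. All data below are invented.

# Area-level statistics, as might be derived from published census data.
census_area_stats = {
    "K1A": {"median_income": 68000, "pct_homeowners": 0.62},
}

# Publicly available directory listings (name and area only).
phonebook = [
    {"name": "C. Nguyen", "postal_prefix": "K1A"},
    {"name": "D. Roy", "postal_prefix": "K1A"},
]

def impute(listings, stats):
    """Attach each listing's census-area aggregates to the listing.
    The result describes the neighbourhood, not the person - but the
    narrower the area and the more homogeneous its residents, the
    closer the group figure comes to an individual fact."""
    return [dict(entry, **stats[entry["postal_prefix"]]) for entry in listings]

enriched = impute(phonebook, census_area_stats)
print(enriched)
```

Whether such an enriched list has crossed the tipping point thus depends less on the mechanics of the join than on how tightly the aggregate tracks the individuals within it.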

Public Health Registries

Another example of the privacy challenges associated with potential re-identifiability stemmed from an access to information complaint which eventually made its way before the courts. The 2008 Federal Court case, Gordon v. Canada, in which the OPC intervened, involved a CBC journalist who was seeking access to information contained in Health Canada's Canadian Adverse Drug Reactions Information System. The case arose from a complaint before the Access to Information Commissioner, wherein Health Canada agreed to publicly release more than 80 fields of information, but refused to disclose 12 remaining fields on the grounds that these fields, if released in combination with the others, would cross the tipping point and become identifiable personal information. More particularly, at issue before the Court was whether the "province" field – which indicated the province from which an adverse drug reaction report was received – should be released, or exempt from access.

Justice Gibson found “substantial evidence” that disclosure of the province field could lead to a “serious possibility that an individual could be identified through the use of that information, alone or in combination with other available information.” The “serious possibility” test is the current standard for assessing the risk of re-identification and ultimately defining what constitutes “personal information” under the terms of both the Privacy Act and PIPEDA. In its intervention before Justice Gibson, the OPC argued that “serious possibility” means something more than a frivolous chance but also something less than a statistical probability.

Mitigating Strategies

One of the best strategies for mitigating privacy risks associated with unintended or inappropriate re-identification is an approach that has become widely known as "privacy by design." Privacy protection must be built into projects, programs and initiatives at the front end – as they are being conceived and designed, and well before deployment. Google's Wi-Fi collection is an example of what happens when you don't. In that case, the engineer who developed code to sample categories of publicly broadcast Wi-Fi data also included code allowing for the collection of payload data, believing that this type of information might be useful to Google in the future. The engineer had identified what he believed to be "superficial" privacy concerns but, contrary to company procedure, failed to bring them to the attention of Google's Product Counsel, whose responsibility it would have been to address and resolve those concerns prior to product development. In other words, there was never a front-end assessment of the project's privacy impact.

Privacy by design is really about preventing such privacy breaches from happening. It involves the following principles:

  • being proactive, not reactive, and preventive not remedial;
  • setting privacy protection as the default;
  • embedding privacy protection right into the design of the system or technology;
  • striving for full product functionality and privacy protection as a positive-sum, not a zero-sum game;
  • providing for privacy protection throughout the entire information life cycle;
  • being visible and transparent about the systems and technologies being planned and deployed and, ultimately,
  • ensuring respect for user privacy.

Last October, at the International Data Protection Commissioners’ Conference in Jerusalem, the Privacy Commissioner and her counterparts from around the world adopted an international resolution – championed by the Information and Privacy Commissioner of Ontario, Dr. Ann Cavoukian – which recognized and affirmed the privacy-by-design concept.

Staying ahead of the curve

To stay ahead of the curve, data protection authorities such as the OPC must understand the nature and potential of emerging technologies and remain alive to their privacy risks and implications. 

One way we have done this is through a public consultation process launched in 2010, in which we consulted stakeholders across the country on the privacy issues involved in online tracking, profiling and targeting, and in cloud computing. A draft report of the consultations was made available on the OPC website to solicit feedback, and the final report will be released this spring.

The purpose of these consultations was to shine a spotlight on technological trends and to engage stakeholders in a discussion about the privacy implications of the online world and what is being done, or should be done, to protect privacy in this evolving environment. To do this, we constructed scenarios based on the real-life daily activities of Canadians, in an effort to make otherwise abstract concepts such as online tracking, behavioral targeting and cloud computing more tangible. The questions of what constitutes personal information, how information can be compiled to construct profiles of individuals, and what this information can be used for figured prominently in these discussions.

Our Office tries to stay ahead of the curve in another way – through the OPC Research Contributions Program. Through this program we encourage academics and others to help us all understand the implications of these technologies. Several of our recently funded studies are directly relevant to de- and re-identification, including a study by Khaled and his colleagues at the Children's Hospital of Eastern Ontario Research Institute on Pan-Canadian De-Identification Guidelines for Personal Health Information and a study by the Public Interest Advocacy Centre on Privacy Risks of De-identified and Aggregated Consumer Data.

Another way our Office tries to stay trained on the privacy implications of new technological developments is through a strategic, dedicated and deliberate focus on four policy priorities, including, “Identity Management and Protection”, which examines the evolving concept of personal information.

Finally, the Privacy Commissioner can look forward to new legislative tools, enacted in December under Canada’s new Anti-Spam Act. By virtue of these recent amendments, the Privacy Commissioner may now exchange information with her international counterparts and collaborate in cross-border investigation efforts for the purposes of enforcing Canadian privacy laws. Today’s global challenges in regulating online technologies across borders make it critically important to adopt a common and united front.

Conclusion

Far from being academic, the challenges of situating data along the spectrum of identifiability, and of assessing how close they come to the "tipping point" of personal information, are among the most profound practical challenges facing our Office today. Apart from the strong commercial incentives to re-identify data and link them back to specific consumers, the stakes will only get higher as government seeks lawful access to some of these same data through new legislative initiatives. Notable among these are Bills C-50, 51 and 52, which would require Internet Service Providers to disclose IP addresses of users for law enforcement purposes.

In short, our Office remains alive to the risks of re-identification and welcomes ongoing discussion with stakeholders on how to enable technological innovation while also protecting consumers' data and ultimately maintaining public trust in markets and governments. And on that note, I welcome you to visit the OPC website this week as we prepare to mark Data Privacy Day this Friday, January 28th, and engage Canadians on some of these issues.
