This post was written by David Rebollo-Monedero, Senior Researcher at Polytechnic University of Catalonia (UPC), Spain
Barcelona, May 26th
The most extensively studied aspects of privacy in any information system deal with unauthorized access to sensitive data, addressed through authentication, data-access control policies, and confidentiality mechanisms implemented as cryptographic protocols. However, providing confidentiality against unintended observers fails to address the practical dilemma that arises when the intended recipient of the information is not fully trusted, even more so when the collected database is to be made accessible to external parties, or openly published for scientific studies correlating sensitive information with demographics.
It was famously shown that 87% of the population of the United States could be unequivocally identified solely from the triple consisting of their date of birth, gender and 5-digit ZIP code, according to 1990 census data. This is in spite of the fact that, in that year, the U.S. had a population of over 248 million. This notorious fact illustrates the discriminative potential of combining just a few demographic attributes which, considered individually, would hardly pose a real anonymity risk. Ultimately, this simple observation means that the mere elimination of identifiers such as first and last name, or social security number (SSN), is grossly insufficient when it comes to effectively protecting the anonymity of the participants in published statistical studies containing confidential data linked to demographic information.
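To see concretely how combining attributes shrinks anonymity sets, here is a minimal Python sketch over a small synthetic table (all values invented purely for illustration, not drawn from any census data): each record carries a birth year, gender and truncated ZIP code, and we count how many records share each combination of selected attributes.

```python
from collections import Counter

# Synthetic records: (birth_year, gender, zip_prefix) -- invented values.
records = [
    (1980, "F", "021"), (1980, "F", "021"), (1980, "M", "021"),
    (1975, "F", "100"), (1975, "F", "021"), (1980, "M", "100"),
]

def anonymity_sets(records, attribute_indices):
    """Count how many records share each combination of the chosen attributes."""
    keys = [tuple(r[i] for i in attribute_indices) for r in records]
    return Counter(keys)

# A single attribute yields large groups...
print(anonymity_sets(records, [1]))        # gender alone
# ...but the full triple produces mostly unique tuples.
print(anonymity_sets(records, [0, 1, 2]))  # birth year + gender + ZIP
```

Even in this tiny table, gender alone leaves every record hidden in a group of two or more, while the full triple isolates most records uniquely, which is exactly the reidentification risk described above.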
Statistical disclosure control (SDC) concerns the postprocessing of the demographic portion of the statistical results of surveys containing sensitive personal information, in order to effectively safeguard the anonymity of the participating respondents. In the SDC terminology, a microdata set is a database table whose records carry information concerning individual respondents, either people or companies. This database commonly contains a set of attributes that may be classified into identifiers, quasi-identifiers and confidential attributes.
- Firstly, identifiers allow the unequivocal identification of individuals. This is the case of full names, SSNs or medical record numbers, which would be removed before the publication of the microdata set, in order to preserve the anonymity of its respondents.
- Secondly, quasi-identifiers are those attributes that, in combination, may be linked with external, usually publicly available information to reidentify the respondents to whom the records in the microdata set refer. Examples include age, address, gender, job, and physical features such as height and weight.
- Finally, the dataset contains confidential attributes with sensitive information on the respondent, such as salary, political affiliation, religion, and health condition. The classification of attributes as quasi-identifiers or confidential may ultimately depend on the specific application and the privacy requirements the microdata set is intended for.
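The three attribute classes above suggest an obvious pre-release step: drop the identifiers outright before any further treatment of the quasi-identifiers. A minimal Python sketch, with hypothetical field names chosen only for illustration:

```python
# Hypothetical attribute classification for a medical microdata set.
IDENTIFIERS = {"name", "ssn", "record_no"}       # removed before publication
QUASI_IDENTIFIERS = {"age", "gender", "zip"}     # to be perturbed later
CONFIDENTIAL = {"diagnosis", "salary"}           # sensitive payload

def strip_identifiers(record):
    """Remove directly identifying attributes before publication."""
    return {k: v for k, v in record.items() if k not in IDENTIFIERS}

row = {"name": "Alice", "ssn": "000-00-0000", "age": 34,
       "gender": "F", "zip": "08034", "diagnosis": "asthma"}
print(strip_identifiers(row))
```

As the 87% result above makes clear, this step alone is insufficient: the quasi-identifiers that remain must still be perturbed.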
The primary application of SDC in CIPSEC is the collection and disclosure of patient data, a setting in which confidential information linkable to demographic variables may provide enormous data utility while at the same time requiring special privacy measures. We illustrate the main idea with a hypothetical example of vital signs of a few patients in an operating room, shown in Fig. 1. In this example, sex, age, weight (in pounds) and body mass index (BMI) are quasi-identifiers. The confidential attributes included in the data records are heart rate (in beats per minute), blood pressure (BP, in mmHg), peripheral oxygen saturation (SpO2, as a percent), and an electrocardiogram (EKG).
Fig. 1. Hypothetical example of vital signs collected for a few patients in an operating room, containing sex, age, weight (in pounds) and body mass index (BMI) as quasi-identifiers, and heart rate (in beats per minute), blood pressure (BP, in mmHg), peripheral oxygen saturation (SpO2, as a percent), and an electrocardiogram (EKG), as confidential attributes.
The figure also illustrates how the analysis of those measurements constitutes a privacy risk, as the combination of demographic and confidential attributes recorded may reveal a number of sensitive health conditions and specific behavioral patterns. For instance, the low BMI and high heart rate of patient B might be consistent with untreated hyperthyroidism. The respiratory insufficiency evidenced by the high heart rate and low oxygen saturation of patient C, taking also into consideration her age, might be indicative of an asthmatic crisis. The amplitude and frequency of the EKG of patient D reveal the patient’s unusual fondness for sports. The BMI, high heart rate, and low oxygen saturation of patient E indicate obesity, hypertension and obesity hypoventilation syndrome.
Intuitively, the perturbation of numerical or categorical quasi-identifiers enables us to preserve privacy to a certain extent, at the cost of losing some data utility, in the sense of accuracy with respect to the unperturbed version. k-Anonymity is the requirement that each tuple of quasi-identifier values be identically shared by at least k records in the dataset. This may be achieved through the microaggregation approach illustrated by the synthetic example depicted in Fig. 2. Rather than making the original table available, we publish a k-anonymous version containing aggregated records, in the sense that all quasi-identifying values within each group are replaced by a common representative tuple. As a result, a record cannot be unambiguously linked to the corresponding record in any external source mapping identifiers to quasi-identifiers. In principle, this prevents a privacy attacker from ascertaining the identity of a patient for a given record in the microaggregated database, which contains confidential information.
Fig. 2. Conceptual example of k-anonymous microaggregation of medical data with k=3. Rather than making the original table available, we publish a k-anonymous version without identifiers and containing aggregated records, in the sense that all quasi-identifying values within each group are replaced by a common representative tuple. As a result, a record cannot be unambiguously linked to the corresponding record in any external source mapping identifiers to quasi-identifiers.
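Checking whether a published table actually satisfies k-anonymity reduces to counting how often each quasi-identifier tuple occurs. A minimal Python sketch, over invented records loosely echoing the aggregated table of Fig. 2:

```python
from collections import Counter

def is_k_anonymous(records, qi_keys, k):
    """A table is k-anonymous w.r.t. the quasi-identifiers qi_keys if
    every quasi-identifier tuple is shared by at least k records."""
    counts = Counter(tuple(r[key] for key in qi_keys) for r in records)
    return all(c >= k for c in counts.values())

# Invented aggregated records: "age" and "zip" are quasi-identifiers,
# "bp" (blood pressure) is a confidential attribute left unperturbed.
table = [
    {"age": 30, "zip": "080", "bp": 120},
    {"age": 30, "zip": "080", "bp": 135},
    {"age": 30, "zip": "080", "bp": 118},
    {"age": 45, "zip": "081", "bp": 140},
    {"age": 45, "zip": "081", "bp": 150},
    {"age": 45, "zip": "081", "bp": 128},
]
print(is_k_anonymous(table, ["age", "zip"], 3))  # True: two cells of size 3
print(is_k_anonymous(table, ["age", "zip"], 4))  # False
```

Note that only the quasi-identifiers enter the count; the confidential attributes are published unmodified within each k-anonymous cell.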
Ideally, microaggregation algorithms strive to introduce the smallest perturbation possible in the quasi-identifiers, in order to preserve the statistical quality of the published data. More technically speaking, these algorithms are designed to find a partition of the sequence of quasi-identifying tuples in k-anonymous cells, while reducing as much as possible the distortion incurred when replacing each original tuple by the representative value of the corresponding cell. Data utility is measured inversely as the distortion resulting from the perturbation of quasi-identifiers.
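As a toy illustration of this trade-off, the following Python sketch microaggregates a single numeric quasi-identifier by sorting, partitioning into consecutive cells of at least k values, and replacing each value by its cell centroid, with distortion measured as the sum of squared errors. Real multivariate microaggregation algorithms (e.g., MDAV) are considerably more elaborate; this univariate scheme is only a sketch of the principle.

```python
def microaggregate(values, k):
    """Sort values, partition them into consecutive cells of size >= k,
    and replace each value by its cell centroid (toy univariate scheme)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        # If fewer than 2k values remain, put them all in one final cell,
        # so that no cell ever falls below k elements.
        j = len(order) if len(order) - i < 2 * k else i + k
        cell = order[i:j]
        centroid = sum(values[t] for t in cell) / len(cell)
        for t in cell:
            out[t] = centroid
        i = j
    return out

def sse(values, perturbed):
    """Distortion as the sum of squared errors (lower means higher utility)."""
    return sum((v - p) ** 2 for v, p in zip(values, perturbed))

ages = [23, 25, 24, 40, 41, 39, 60, 62, 61]  # invented quasi-identifier values
anon = microaggregate(ages, k=3)
print(anon)            # each age replaced by its cell centroid
print(sse(ages, anon))
```

Because the values are sorted first, each cell groups nearby ages, which is precisely what keeps the distortion, and hence the utility loss, small.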
Clearly, a complete version of the patients’ electronic records must be kept, with stringent access permissions. But this is not mutually exclusive with the compilation of anonymized versions with perturbed quasi-identifiers, which could be accessed more widely or even shared across hospitals for a variety of medical and pharmacological studies. In general, SDC would allow keeping several versions of the same data, with different levels of perturbation, that is, protection along the privacy-utility trade-off, with concordantly different access restrictions and uses. More precisely, increasing degrees of k-anonymity, corresponding to decreasing degrees of data utility, could be concurrently employed, and the resulting data made available to a hierarchy of health professionals with appropriate access level. The original, unperturbed data could be available to physicians for accurate diagnosis, while perturbed versions could be released under mild access restrictions for more general medical and statistical studies, with privacy guarantees that would facilitate the patients’ consent and would conform to legal directives and ethical standards.
Sophisticated data perturbation methods in the field of SDC, such as k-anonymous microaggregation, thus pave the way for more flexible privacy policies allowing partial data access.
CIPSEC project results receive funding from the European Union's Horizon 2020 Research and Innovation Programme, under Grant Agreement no. 700378.
The opinions expressed and arguments employed in this publication do not necessarily reflect the official views of the Research Executive Agency (REA) nor the European Commission.