Data Privacy and Anonymisation for PostgreSQL

Technology
10 Feb 2022 • 11:00 AM MYT
DSA

Data & Storage Asean News Portal

Extra tags: Database

Authored by: Shilpa Oswal, Principal Technical Officer at the Centre for Development of Advanced Computing (C-DAC)

Public, private and government organisations are now, more than ever, concerned about how to implement data privacy. This concern is understandable given the many countries that have begun regulating data privacy by creating national standards and laws to protect sensitive information from being disclosed.

Among the many techniques at their disposal, one that companies would do well to adopt is anonymising their data, when appropriate.

In fact, personal data that has been properly anonymised can fall outside the scope of privacy regulation altogether. “You don't need to abide by the privacy laws when you're using the truly anonymised data,” explains Shilpa Oswal, Principal Technical Officer at the Centre for Development of Advanced Computing (C-DAC). “Once data is truly anonymised and individuals are no longer identifiable, the data will not fall within the scope of the GDPR.”

Formally speaking, data anonymisation is the process of protecting private or sensitive information by erasing, encrypting, or masking identifiers that connect an individual to stored data. Examples of personally identifiable data include names, social security numbers, mobile numbers, etc.

“The truly anonymised data sets, which do not relate to an identified or identifiable natural person, can be published or shared with any party without legal obligations,” continues Shilpa. “We don't need user consent to share it or use it.”

Techniques to anonymise data
Shilpa works for a premier R&D organisation under the Ministry of Electronics and Information Technology in the Government of India, and has had experience implementing anonymising procedures for e-government systems in the country. She was speaking at an event organised by EDB (formerly EnterpriseDB), a company that “supercharges” the open-source database PostgreSQL, so enterprises can harness its benefits at scale.

Fortunately, database providers are cognizant of the need for anonymisation, and she highlights how, for example, PostgreSQL has an extension available to implement anonymisation.

However, anonymisation is not as easy as flicking a switch, and care must be taken over how it is done, depending on what the data is needed for.

For example, a decision needs to be made as to whether the anonymisation will be static or dynamic. Static means that the data is changed permanently on the database (or more usually, a copy of it). Dynamic means that the change is applied to the results of the query, and not the entire data set.
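The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration on in-memory records, not how any particular PostgreSQL extension works; the sample data, the role names and the helper functions `mask_phone`, `anonymise_static` and `query_dynamic` are all hypothetical.

```python
import copy

# Hypothetical sample data standing in for a database table.
RECORDS = [
    {"id": 1, "name": "Bob", "phone": "0123456789"},
    {"id": 2, "name": "Alice", "phone": "0198765432"},
]


def mask_phone(phone):
    """Keep the last two digits and replace the rest with 'x'."""
    return "x" * (len(phone) - 2) + phone[-2:]


def anonymise_static(records):
    """Static: produce a permanently masked copy of the whole data set."""
    masked = copy.deepcopy(records)
    for row in masked:
        row["name"] = "REDACTED"
        row["phone"] = mask_phone(row["phone"])
    return masked


def query_dynamic(records, role):
    """Dynamic: stored data is untouched; masking is applied to query results.

    The 'admin' role is an assumption made for this sketch.
    """
    if role == "admin":
        return [dict(row) for row in records]
    return [
        {**row, "name": "REDACTED", "phone": mask_phone(row["phone"])}
        for row in records
    ]
```

In the static case, only the masked copy would be handed to analysts or testers. In the dynamic case, the original rows stay in place and what a caller sees depends on who is asking.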

Shilpa says that most industries use static anonymisation because it is ‘once-and-done’, with the added benefit that once it is anonymised, it doesn’t matter what happens to the data, even if it is stolen. “Whereas dynamic anonymisation is a less mature technology for the moment, and there are very few customer success stories for this,” she says.

Another consideration is how you anonymise the data. There are several different techniques available, each with its own benefits:

  • Attribute or record suppression means deleting the attribute or record directly from the data set. There is no risk of re-identification from the deleted values, but there is permanent data loss.

  • Pseudonymisation is the use of fake or pseudo identifiers. Pseudo identifiers are created with a one-to-one mapping to the original identifiers, which means the pseudo data can be “translated” back to the original.

  • Generalisation makes the data more generic by grouping values into broad ranges. For example, although Bob is 28 years old, it is recorded only that Bob's age is between 20 and 30 years. However, the more the data is generalised, the less useful it becomes.

  • Synthetic data uses completely artificial data to replace the original. It is suitable for testing purposes, and there is no risk of re-identification. However, large datasets may require high computing resources, so cost may become a factor.

  • Data perturbation is when the data is modified by adding random noise. It is mostly suitable for numeric values.

  • Data swapping is when attribute values are rearranged across records, essentially a reshuffling of data. However, it may create implausible combinations (e.g., if the male and female gender of patients are swapped in a medical database).
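Several of the techniques above can be illustrated with a minimal Python sketch on in-memory records. The function names, the `ID-0001` pseudonym format, the ten-year age bands and the uniform noise are assumptions chosen for illustration, not a description of any specific tool mentioned at the event.

```python
import random


def suppress_attribute(records, attr):
    """Suppression: drop one attribute from every record (permanent loss)."""
    return [{k: v for k, v in row.items() if k != attr} for row in records]


def pseudonymise(records, attr, prefix="ID"):
    """Pseudonymisation: replace values with pseudo identifiers, one-to-one.

    Returns the treated records and a reverse lookup table, which is what
    allows the pseudo data to be translated back to the original.
    """
    mapping = {}
    treated = []
    for row in records:
        original = row[attr]
        if original not in mapping:
            mapping[original] = f"{prefix}-{len(mapping) + 1:04d}"
        treated.append({**row, attr: mapping[original]})
    reverse = {pseudo: original for original, pseudo in mapping.items()}
    return treated, reverse


def generalise_age(age, width=10):
    """Generalisation: report an age band instead of the exact age."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def perturb(values, max_noise, rng):
    """Perturbation: add bounded random noise to numeric values."""
    return [v + rng.uniform(-max_noise, max_noise) for v in values]


def swap_attribute(records, attr, rng):
    """Swapping: shuffle one attribute's values across the records."""
    values = [row[attr] for row in records]
    rng.shuffle(values)
    return [{**row, attr: value} for row, value in zip(records, values)]
```

For instance, `generalise_age(28)` places Bob in a ten-year band rather than recording his exact age, while `pseudonymise` keeps a lookup table that must itself be protected, since anyone holding it can reverse the mapping.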

Despite the benefits that anonymisation brings, Shilpa notes that it can also bring disadvantages, especially as websites strive to deliver a more personalised experience to visitors. “Many websites are doing that, and it is not possible if we have only anonymised data,” she said. “We cannot use this anonymised data for marketing efforts.”

Nevertheless, Shilpa believes that anonymisation is a key tool in trying to protect data privacy. She thinks that organisations of all types and sizes should endeavour to implement data privacy as best they can to protect the digital identities of their users.

Ill-intentioned people know that digital identities can be stolen. If organisations don’t double down on improving data privacy, digital identities of individuals will inevitably become compromised. “When this happens, the consequences would have serious implications for the individuals whose identities are stolen and for the organisations suffering the breach, including lack of customer trust, negative brand exposure, and potential litigation due to non-compliance with data privacy regulation,” concludes Shilpa.