Publication date: April 11, 2023
In the data collection process, the personal data protection framework imposes on controllers a relative prohibition on automated decision-making (Article 22 of the GDPR), the obligation to implement data protection by design (Article 25(1) of the GDPR) and data protection by default (Article 25(2) of the GDPR), as well as the duty to carry out a data protection impact assessment (Article 35 of the GDPR). The GDPR serves a protective purpose: to safeguard the rights and freedoms of data subjects in connection with the processing of their personal data, while observing the processing principles laid down in Article 5 of the GDPR, in particular fairness and transparency, data minimisation, and the risk-based approach.
The essence of data protection by design within the meaning of Article 25 of the GDPR is that the controller must take the protection of personal data into account already at the design stage of a given solution, service or artificial intelligence system. This is intended to ensure, among other things, that personal data protection becomes an inherent element of every project from the moment of its creation.
The first technique discussed is noise addition, also known as data perturbation. It consists of modifying the data so that they become less accurate while the overall distribution is preserved. It belongs to the family of randomization techniques, and it is worth emphasizing that techniques of this kind do not by themselves remove the singularity of a record: they can protect against attacks or inference, but each record still relates to a single individual. An example of noise addition is replacing an exact value, such as a person's height, with one that is accurate only to within 5 or 10 centimetres. When applying noise addition, some calculations are necessary: the amount of added noise should depend on the level of protection we want to achieve and the risk we can accept. A value that is too low may result in insufficient protection, while a value that is too high may make the data useless because they become too general. Noise addition tends to work best for continuous data. Three basic variants can be distinguished:
additive noise (Z = X + ε),
multiplicative noise (Z = X · ε) and
logarithmic multiplicative noise (ln Z = ln X + ε),
where X denotes the original data, Z the perturbed data, and ε a continuous random variable with a standard normal distribution N(0, 1).
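The following Python sketch illustrates the three variants listed above using NumPy; the sample height values, the noise scales and the random seed are illustrative assumptions rather than values taken from the cited sources.

# A sketch of the three noise variants listed above, using NumPy. The sample
# heights, the noise scales and the random seed are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(seed=42)

# Original attribute values X, e.g. body height in centimetres.
x = np.array([162.0, 171.5, 178.0, 184.5, 192.0])

# Additive noise: Z = X + eps, with eps ~ N(0, 1) scaled to control the perturbation.
z_additive = x + 5.0 * rng.standard_normal(x.shape)

# Multiplicative noise: Z = X * eps_m, with eps_m centred around 1 so that the
# perturbed values stay in a plausible range.
z_multiplicative = x * (1.0 + 0.03 * rng.standard_normal(x.shape))

# Logarithmic multiplicative noise: ln Z = ln X + eps, i.e. Z = X * exp(eps).
z_log_multiplicative = x * np.exp(0.02 * rng.standard_normal(x.shape))

print(z_additive)
print(z_multiplicative)
print(z_log_multiplicative)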
Another problem we may encounter when adding noise is handling outliers properly. Continuing with the height example: if the database contains an unusually large height value, there is a fairly high probability that, despite the perturbation, it will still be the largest value, which significantly increases the chance of it being used for re-identification. If we simply increase the overall noise level, we reduce the risk of identifying the outlier, but at the cost of losing more of the other information. The solution is to detect the outlier and apply a higher noise level to it than to the remaining values.
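A minimal sketch of this outlier-aware approach is shown below; the sample data, the z-score threshold and both noise scales are assumed for illustration only.

# Outlier-aware noise addition: the unusually tall record gets a larger noise
# scale than the rest. Sample data, the threshold and the scales are assumptions.
import numpy as np

rng = np.random.default_rng(seed=7)

heights = np.array([162.0, 171.5, 178.0, 184.5, 210.0])  # 210 cm is the outlier

# Flag outliers with a simple z-score rule (1.5 is chosen only for this tiny sample).
z_scores = np.abs((heights - heights.mean()) / heights.std())
is_outlier = z_scores > 1.5

# Perturb ordinary values with a small noise scale and outliers with a larger one.
base_scale, outlier_scale = 2.0, 10.0
scales = np.where(is_outlier, outlier_scale, base_scale)
perturbed = heights + scales * rng.standard_normal(heights.shape)

print(perturbed)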
Using the noise addition technique is therefore a balancing act between data usability and privacy. This is not an easy task, so it is suggested to use noise addition as a complement to other techniques.
The second data depersonalization technique discussed is permutation, i.e. reshuffling the data: values are reorganized so that their range and distribution remain the same, but the associations between values and the respective individuals change. Permutation is considered a special form of noise addition: it makes re-identification more difficult without changing the ranges and distributions of attribute values. It removes relationships and correlations between attributes by dividing the data set into subsets of records (groups or partitions) and then shuffling the values appropriately within them, as in the sketch below.
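Below is a minimal Python sketch of within-group permutation, assuming a pandas DataFrame with hypothetical "department" and "salary" columns; the column names and values do not come from the cited sources.

# Within-group permutation: the "salary" attribute is shuffled independently
# inside each "department" partition, so ranges and distributions are preserved
# while the link between a record and its original value is broken.
import pandas as pd

df = pd.DataFrame({
    "department": ["A", "A", "A", "B", "B", "B"],
    "salary":     [4200, 5100, 6900, 3800, 4500, 7200],
})

def permute_within_groups(frame, group_col, value_col, seed=0):
    """Shuffle value_col independently inside each partition defined by group_col."""
    out = frame.copy()
    out[value_col] = (
        frame.groupby(group_col)[value_col]
             .transform(lambda s: s.sample(frac=1.0, random_state=seed).to_numpy())
    )
    return out

anonymised = permute_within_groups(df, "department", "salary")
print(anonymised)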
Permutation provides a high level of data usability; however, like other noise addition techniques, it should not be used as the sole anonymization technique. It is suggested to combine it with the removal of obvious attributes contained in the database.
Sources:
https://www.nask.pl/pl/raporty/raporty/5110,Analiza-rozwiazan-w-zaresie-anonymizator-danych-i-generowania-danych-syntetyczn.html
Article 29 Working Party – Opinion 05/2014 on Anonymisation Techniques