Data Anonymization Part 3: Results on public datasets

We are now ready to share our findings!

The data comes from the UCI Machine Learning Repository:
https://archive.ics.uci.edu/ml/index.php

Each result was obtained with AutoML, using a 60/40 train/test split. The goal was to get a general idea of performance.
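As an illustration of the 60/40 split, here is a minimal sketch that uses only the Python standard library. The AutoML tool and the exact splitting procedure behind our results are not described here, so the function name split_rows, the seed and the toy data are assumptions made for the example.

```python
import random


def split_rows(rows, train_fraction=0.6, seed=0):
    """Shuffle a copy of `rows` and split it into (train, test).

    Hypothetical helper: it only illustrates a 60/40 split and is not
    the procedure used by any particular AutoML tool.
    """
    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy data standing in for the rows of a dataset.
rows = list(range(10))
train, test = split_rows(rows)
```

With ten rows, six go to the training set and four to the test set. Fixing the seed makes it possible to compare the original and the anonymized versions of a dataset on the same split.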

Note on "Wine" data: The Y variable was grouped before the classification task. Here is the mapping in Python: {3: 0, 4: 0, 5: 1, 6: 1, 7: 2, 8: 3, 9: 3}.
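The grouping above can be sketched in plain Python as follows. The dictionary is the one given in the note; the names QUALITY_TO_CLASS and group_quality and the sample scores are assumptions made for the example.

```python
# Mapping from the original Y values (3 to 9) to the four grouped classes,
# as given in the note above.
QUALITY_TO_CLASS = {3: 0, 4: 0, 5: 1, 6: 1, 7: 2, 8: 3, 9: 3}


def group_quality(scores):
    """Return the grouped class for each original score.

    Hypothetical helper; a score outside 3..9 raises KeyError.
    """
    return [QUALITY_TO_CLASS[score] for score in scores]


# Made-up sample scores, not taken from the dataset.
sample_scores = [3, 5, 6, 7, 9]
grouped = group_quality(sample_scores)
```

Grouping the scores this way reduces the task from seven possible labels to four classes.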

Note on "Census" data: The variables education-num and fnlweight were removed because they do not bring new information.

Note on "Student performance" data: The Portuguese dataset is used (Math is also available). There are no results in the K-Anonymity column because the algorithm did not find a solution. This can be explained by the small number of observations relative to the number of variables.
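To see why few observations and many variables make K-Anonymity hard to reach, here is a minimal sketch that checks the property on a toy table. It is not the algorithm we used: it only measures the smallest group of identical records and does not search for a generalization. The function names, the column names and the records are assumptions made for the example.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same values
    on the given quasi-identifier columns."""
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


def is_k_anonymous(records, quasi_identifiers, k):
    """A table is k-anonymous if every group contains at least k records."""
    return smallest_group_size(records, quasi_identifiers) >= k


# Made-up toy records, not taken from the dataset.
records = [
    {"age": "15-16", "school": "GP", "sex": "F"},
    {"age": "15-16", "school": "GP", "sex": "M"},
    {"age": "15-16", "school": "MS", "sex": "F"},
    {"age": "15-16", "school": "MS", "sex": "M"},
]

one_column = smallest_group_size(records, ["age"])
two_columns = smallest_group_size(records, ["age", "school"])
three_columns = smallest_group_size(records, ["age", "school", "sex"])
```

On this toy table the smallest group shrinks from 4 to 2 to 1 as columns are added, so the table is 2-anonymous on two columns but not on three. With many variables and few rows, most records end up alone in their group, and an algorithm may fail to find a generalization that satisfies the constraint.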

Impact of anonymization strategies on ML performance

Securing the data costs only a small fraction of performance. There is a lot of room for improvement in the K-Anonymity results. The optimization used depends on the implementation, so testing several versions of this algorithm might improve the results.

The techniques used at ExplorAI depend on your level of comfort in transmitting your data. Feel free to reach out to us if you want more details!

This research on anonymization was made possible by the support of Mitacs through their Business Strategy Internship program.

Lyes Ould-Ramoul
