
Master Thesis Sharada Sowmya

Last modified Mar 16, 2021

Implementation of K-Anonymity for the Use Case of the Automotive Industry in Big Data Context

 

Abstract

Advances in modern technology have led to the creation of data on an unprecedented
scale. This has resulted in challenges of different scales for most industries. One of
them is evolving business models to make the most of such large data sets.


The automotive industry, like most other industries, faces similar challenges and
opportunities. Modern automobiles contain a network of sensors that gather a
multitude of data. For the automotive industry, one of the biggest side effects of
adapting business models to this data is the concern regarding privacy and security.
Data is continually gathered from users at a very granular level, and in many cases the
data is highly sensitive. There is therefore a need to protect the privacy of the users
from whom the data is gathered, and data anonymization is one way to do so.


The primary focus of this thesis was the implementation of an approach to achieve
data anonymization in the context of big data for the Automotive Industry. Prior
to the implementation, an analysis of anonymization techniques such as Mondrian,
Incognito, and Datafly was conducted. Discussions with research partners were held to
understand the use case and requirements. The knowledge gained from the literature
research combined with the list of requirements gathered from the discussions set the
basis for the implementation. The initial period of the implementation phase involved
the generation of multiple data sets based on the snapshot of a car data set from a
European OEM (Original Equipment Manufacturer). To anonymize the large data sets
in an efficient manner, data partitioning became a prerequisite that was met through the
use of Apache Spark. Each partitioned data set was then anonymized
using the ARX API. To maximize throughput, anonymization of the data chunks was
executed asynchronously using the Java ExecutorService. The final step involved merging
the anonymized data sets into a single one which could then be used for different
business requirements.
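
The partition, anonymize, and merge steps described above can be illustrated with a minimal Java sketch. It is not the thesis prototype: the in-memory `partition` method stands in for the Apache Spark partitioning, and the toy `generalizeAges` function stands in for the per-chunk ARX call. The class name `ChunkedAnonymizer` and all method names are hypothetical. Only the standard `java.util.concurrent` API is used.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration class; not part of the thesis prototype.
final class ChunkedAnonymizer {

    // Splits the records into consecutive chunks of at most chunkSize rows.
    // Assumption: stands in for the Spark-based partitioning step.
    static <T> List<List<T>> partition(List<T> records, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, records.size());
            chunks.add(new ArrayList<>(records.subList(start, end)));
        }
        return chunks;
    }

    // Submits one anonymization task per chunk to a fixed thread pool,
    // then merges the results in submission order.
    static <T, R> List<R> anonymizeAndMerge(List<List<T>> chunks,
                                            Function<List<T>, List<R>> anonymizeChunk,
                                            int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<R>>> pending = new ArrayList<>();
            for (List<T> chunk : chunks) {
                Callable<List<R>> task = () -> anonymizeChunk.apply(chunk);
                pending.add(pool.submit(task));
            }
            List<R> merged = new ArrayList<>();
            for (Future<List<R>> future : pending) {
                merged.addAll(future.get()); // blocks until this chunk is done
            }
            return merged;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while anonymizing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A chunk failed to anonymize", e.getCause());
        } finally {
            pool.shutdown();
        }
    }

    // Toy generalization: maps each age to its decade range, e.g. 23 -> "20-29".
    // Assumption: a placeholder for the real per-chunk anonymization (ARX in the thesis).
    static List<String> generalizeAges(List<Integer> ages) {
        List<String> generalized = new ArrayList<>();
        for (int age : ages) {
            int low = (age / 10) * 10;
            generalized.add(low + "-" + (low + 9));
        }
        return generalized;
    }

    public static void main(String[] args) {
        List<Integer> ages = Arrays.asList(23, 27, 31, 38, 45);
        List<List<Integer>> chunks = partition(ages, 2);
        List<String> merged = anonymizeAndMerge(chunks, ChunkedAnonymizer::generalizeAges, 2);
        System.out.println(merged);
    }
}
```

The toy generalization does not itself enforce any value of k. If each chunk were made k-anonymous over the same quasi-identifiers, however, concatenating the chunks could only enlarge equivalence classes, never shrink them, so the merged data set would remain k-anonymous.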


Lastly, the distributed approach to data anonymization through partitioning was
validated by benchmarking the prototype on data sets containing up to 15 million
records. The distributed approach improved the prototype's performance: results were
generated in less time than with the centralized approach.

Research Questions

RQ1: What are the properties of current k-anonymity implementations/algorithms?

RQ2: What are the requirements for k-anonymity implementations in a big data context?

RQ3: What can a k-anonymity implementation in the context of big data look like?
