Master's Thesis Kaan Türeyyen

Last modified Jan 21

The rapid advancement and increasing use of machine learning models for natural language processing (NLP) tasks have led to a surge in the use and publication of textual data. To perform accurately and effectively, these models require access to large volumes of training data, which underscores the importance of data privatization: real-world datasets contain various types of personal attributes. Consequently, there is a pressing need for robust text-to-text privatization methods that safeguard the sensitive attributes of individuals to prevent privacy breaches while maintaining the coherence and utility of the text for NLP tasks. Benchmarking is a widely adopted practice for comparing performance across different NLP use cases. However, the inherently contextual nature of natural language complicates determining what information should be protected and from whom. This complexity poses a significant challenge: no universally accepted benchmark for text-to-text privatization techniques has been established. As a result, text privatization methods are often tested only within the context of their specific use cases, leading to a lack of comparability across studies.

The objective of this thesis is to propose and test a threat-based benchmarking option for measuring the capabilities of text-to-text privatization methods. First, we evaluate which proxy privacy tasks are suitable for assessing the notions of privacy and utility and determine appropriate evaluation metrics. We then propose a comprehensive benchmarking framework that addresses the privacy risks associated with text-to-text privatization while reasoning about its properties. Finally, we implement and execute the proposed benchmark to obtain initial results for various privatization techniques on publicly available datasets from diverse domains.

Our experimental results revealed that privacy in text cannot be measured through a single metric; it requires a multi-faceted evaluation approach that considers utility preservation, sensitive data identification, re-identification resistance, and coherent generation capabilities. The findings highlight that privacy evaluation is inherently task-specific, with different scenarios requiring distinct evaluation metrics and methodologies. Furthermore, the results demonstrated that successful text privatization methods must balance immediate privacy risks, such as sensitive data exposure, against more subtle threats, such as stylometric changes and inferred attributes.
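The multi-faceted evaluation described above can be sketched as a small scoring routine. The following is a minimal illustration, not the thesis's actual implementation: the facet names, the choice of taking the worst-case of leakage and re-identification as the privacy score, and the harmonic-mean trade-off are all hypothetical assumptions made for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class PrivacyScores:
    """Per-facet scores, each normalized to [0, 1] (hypothetical facets)."""
    utility: float        # downstream task performance retained
    leakage: float        # fraction of sensitive entities still exposed
    reident_rate: float   # attacker re-identification success rate
    coherence: float      # fluency/coherence of the privatized text

def benchmark_report(scores: PrivacyScores) -> dict:
    """Report each facet alongside a simple aggregate; no single
    number suffices, so the individual facets are kept visible."""
    # Worst-case view of privacy: either exposed entities or a
    # successful re-identification attack breaks privacy.
    privacy = 1.0 - max(scores.leakage, scores.reident_rate)
    total = scores.utility + privacy
    return {
        "utility": scores.utility,
        "privacy": privacy,
        "coherence": scores.coherence,
        # Harmonic-style trade-off: low if either facet is low.
        "tradeoff": (2 * scores.utility * privacy / total) if total > 0 else 0.0,
    }

report = benchmark_report(
    PrivacyScores(utility=0.8, leakage=0.1, reident_rate=0.2, coherence=0.9)
)
print(report)  # privacy = 0.8, tradeoff = 0.8
```

A single aggregate hides which facet failed, so a report like this keeps the per-facet scores next to any combined number, matching the observation that privacy evaluation is task-specific.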

Research Questions:

  • RQ1: Which privacy task/dataset combinations should be considered to evaluate text privatization methods, and how is their "success" measured?
  • RQ2: What does a holistic and dynamic benchmark that measures "privatization" look like?
  • RQ3: To what degree is such a benchmark applicable for comparing text privatization approaches?

Files and Subpages

Name                                         Size     Last Modification
240708 Türeyyen MT Kickoff.pdf               869 KB   21.01.2025
241210 Türeyyen MT Final.pdf                 1.08 MB  21.01.2025
250120 Türeyyen MT Final Presentation.pdf    1.95 MB  21.01.2025