
Bachelor's Thesis Chaeeun Lee

Last modified Apr 25

A Linguistics-based Approach for Achieving Sentence-level Differential Privacy  


Abstract

 

Differential privacy (DP) is increasingly recognized as a critical framework for preserving privacy in natural language processing (NLP). Despite attempts to apply DP to textual data, most approaches spend the privacy budget at the word level. As a result, a research gap exists regarding how to handle a limited privacy budget within a sentence. In particular, how to distribute the privacy budget reasonably across the individual words of a sentence while preserving privacy remains an open question.

The main goal of this thesis is to establish a concept of sentence-level privacy by leveraging linguistics-based analysis. Through a comprehensive review of research on DP concepts, this study examines the research gap. We then propose a new approach to achieving sentence-level DP that addresses the distribution problem under a limited privacy budget. The underlying hypothesis is that words carrying more information are more likely to warrant privacy protection. Therefore, for the distribution mechanism, we integrate linguistic methodologies to quantify word informativeness. We develop a prototype of a privacy budget distribution framework that distributes given epsilon values across the individual tokens of a sentence based on the integrated linguistic analysis. The usage of the prototype is demonstrated with examples.
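As an illustration of the distribution idea only, the following minimal sketch splits a sentence-level epsilon across tokens. It is not the thesis prototype. Several points are assumptions that the abstract does not state: the function name `distribute_epsilon`, the inverse weighting rule `1 / (1 + score)`, the reading that a more informative token should receive a smaller epsilon (stronger protection), the use of sequential composition so that the per-token epsilons sum to the sentence budget, and the example informativeness scores.

```python
from typing import List


def distribute_epsilon(scores: List[float], total_epsilon: float) -> List[float]:
    """Split a sentence-level budget across tokens (hypothetical rule).

    Assumption: higher informativeness -> smaller per-token epsilon,
    i.e. more noise and stronger protection for that token.
    Assumption: per-token epsilons add up to total_epsilon
    (sequential composition over the tokens of the sentence).
    """
    if total_epsilon <= 0:
        raise ValueError("total_epsilon must be positive")
    if any(s < 0 for s in scores):
        raise ValueError("informativeness scores must be non-negative")
    if not scores:
        return []

    # Inverse weighting; the +1 keeps the weight finite for a score of 0.
    weights = [1.0 / (1.0 + s) for s in scores]
    weight_sum = sum(weights)
    return [total_epsilon * w / weight_sum for w in weights]


# Made-up informativeness scores, for illustration only.
tokens = ["Alice", "visited", "the", "clinic"]
scores = [3.0, 1.0, 0.0, 3.0]
epsilons = distribute_epsilon(scores, total_epsilon=8.0)

for token, eps in zip(tokens, epsilons):
    print(f"{token}: epsilon = {eps}")
```

Under these assumptions the function word "the" receives the largest share of the budget, while the more informative tokens "Alice" and "clinic" receive the smallest and are therefore perturbed most strongly by a downstream word-level DP mechanism.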

The impact of the epsilon distribution on privacy preservation and utility, when applied to different DP mechanisms for NLP, is evaluated on various datasets. The results show that, in most cases, data perturbed with the proposed approach achieve better privacy preservation while maintaining utility scores.

By introducing privacy budget distribution based on sentence-level analysis, this thesis contributes to advancing the concept of sentence-level differential privacy with a linguistic approach. Moreover, this work provides a practical solution for applying DP in NLP under a limited privacy budget. Finally, we suggest directions for future research to improve the approach presented here.

 

 

The research questions:

RQ1: How can Differential Privacy be effectively applied at the sentence level within Natural Language Processing, considering the intelligent distribution of privacy budgets for individual words within a sentence?

 

RQ2: How can the theoretical concept of sentence-level privacy with informativeness analysis be translated into an implementable framework?

 

RQ3: How well does the suggested differential privacy framework protect private data while preserving the utility of the text data?

 

Files and Subpages

Name                                   Size      Last Modification
240212 BT_kickoff_ChaeeunJoyLee.pdf    1,58 MB   25.04.2024
240415 Chaeeun Lee _ BT.pdf            1,25 MB   25.04.2024
240422 BT_final_ChaeeunJoyLee.pdf      2,68 MB   25.04.2024