Abstract
In Natural Language Processing (NLP), the advancement and refinement of Large Language Models (LLMs) rely heavily on the quality and quantity of textual data. However, users, as the primary source of such data, face inherent privacy risks when the text they share contains sensitive information. Drawing a parallel between Differential Privacy (DP) mechanisms employed in database systems and their application to text underscores a shared emphasis on privacy preservation. Yet the question arises: how can we identify and assess DP rewriting mechanisms with respect to their efficacy in privatizing textual content?
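To make the parallel to database systems concrete, the classic DP building block there is the Laplace mechanism: noise calibrated to a query's sensitivity and the privacy budget epsilon is added to a numeric result. A minimal sketch (the function name is illustrative, not from any specific library):

```python
import numpy as np

def laplace_count(true_count, epsilon, rng):
    """Release a counting-query result under epsilon-DP via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so the noise scale is 1 / epsilon.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Example: a noisy count; smaller epsilon means more noise and stronger privacy.
rng = np.random.default_rng(0)
noisy = laplace_count(100, epsilon=0.5, rng=rng)
```

Text privatization mechanisms transfer this idea from numeric query results to words, sentences, or whole documents.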
To address this question, a systematic literature review will be conducted to identify classic and metric DP mechanisms, which will then be categorized according to their granularity: word-, sentence-, or document-level DP. Once a representative sample of these mechanisms has been identified, their privacy budgets (epsilon) must be aligned across granularities to ensure comparability. The next step of the research methodology involves implementing the selected mechanisms and aggregating them into a repository. Finally, the chosen mechanisms will be evaluated by users on a set of textual samples across a range of epsilon values.
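As an illustration of the word-level granularity, a common metric-DP approach (in the style of the mechanism of Feyisetan et al.) perturbs a word's embedding with noise whose density is proportional to exp(-epsilon * ||z||) and then outputs the vocabulary word nearest to the noisy point. A minimal sketch, assuming toy two-dimensional embeddings (the vocabulary and function names here are illustrative):

```python
import numpy as np

def metric_dp_noise(dim, epsilon, rng):
    """Sample noise with density proportional to exp(-epsilon * ||z||).

    Standard construction: a uniformly random direction scaled by a
    magnitude drawn from Gamma(shape=dim, scale=1/epsilon).
    """
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)
    magnitude = rng.gamma(shape=dim, scale=1.0 / epsilon)
    return direction * magnitude

def privatize_word(word, vocab_embeddings, epsilon, rng):
    """Replace a word by the vocabulary word nearest to its noised embedding."""
    vec = vocab_embeddings[word]
    noisy = vec + metric_dp_noise(len(vec), epsilon, rng)
    return min(vocab_embeddings,
               key=lambda w: np.linalg.norm(vocab_embeddings[w] - noisy))

# Toy vocabulary (illustrative embeddings, not from a trained model).
toy_vocab = {
    "cat": np.array([0.0, 0.0]),
    "dog": np.array([1.0, 0.0]),
    "car": np.array([5.0, 5.0]),
}
replacement = privatize_word("cat", toy_vocab, epsilon=0.5,
                             rng=np.random.default_rng(0))
```

At large epsilon the noise is small and words tend to map to themselves; at small epsilon replacements become more random, which is exactly the privacy/utility trade-off the planned user study would probe.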
Research Questions
1. How can one systematically structure currently available implementations of differential privacy text privatization mechanisms?
2. How can a representative sample of the identified mechanisms be implemented and evaluated in a user study?
3. What insights can be gained about the factors influencing user perception of differentially private text privatization?
Expected Outcome