
Master's Thesis Alisha Riecker

Last modified Dec 20, 2023

Studying the Privacy-Utility Trade-off of Word Embedding Perturbations with a Focus on Sensitivity Analysis and Vector Mapping

Text data is a common medium for communicating information and forms the basis of many natural language processing (NLP) tasks. A successful approach to using text data in NLP tasks is to represent individual words as word embedding vectors. However, such vector representations can pose privacy risks, as they may reveal information about the text or its author. Differential Privacy (DP) is a useful countermeasure to mitigate these risks. One class of DP methods privatizes word embeddings by adding noise to perturb the vector representations. Although adding more noise yields stronger privacy guarantees, it can also impair the word semantics encoded in the original embedding and thereby reduce the utility of the vectors for downstream NLP tasks. Finding a suitable trade-off between privacy and utility is therefore crucial for creating effective private word embeddings. The goal of this thesis is to explore the privacy-utility trade-off in the context of embedding vector perturbation methods. The focus is on investigating the impact of two key factors on privacy and utility: i) different approaches for estimating sensitivity and ii) mapping noisy word embeddings to similar embedding vectors associated with real words. A better understanding of their impact will help adjust the trade-off between privacy and utility for downstream NLP tasks. The following research questions have been defined to guide the achievement of this goal:

  1. What approaches are there to privatize word embeddings by perturbing word vector representations?
  2. How can we make these privatized word embeddings more effective?
  3. What is the effect of different approaches to estimating sensitivity on privacy and utility for downstream NLP tasks?
  4. What are the implications for privacy and utility of mapping noisy word embeddings to similar embedding vectors associated with real words?
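To make the two mechanisms studied here concrete, the following is a minimal sketch of perturbing a word vector with calibrated noise and then mapping the noisy result back to the nearest real-word embedding. All names, the toy vocabulary, and the per-dimension Laplace noise with a fixed sensitivity are illustrative assumptions, not the thesis's actual method (which investigates different sensitivity estimates):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with random embeddings; in practice these would come
# from a pretrained model such as GloVe or word2vec.
vocab = ["king", "queen", "apple", "banana"]
embeddings = rng.normal(size=(len(vocab), 4))

def perturb(vec, epsilon, sensitivity=1.0):
    """Add per-dimension Laplace noise scaled by sensitivity / epsilon.

    Smaller epsilon (or larger sensitivity) means more noise: stronger
    privacy, but the perturbed vector drifts further from the original.
    """
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon, size=vec.shape)
    return vec + noise

def map_to_nearest(noisy_vec, embeddings, vocab):
    """Map a noisy vector to the closest real-word embedding (Euclidean)."""
    dists = np.linalg.norm(embeddings - noisy_vec, axis=1)
    idx = int(np.argmin(dists))
    return vocab[idx], embeddings[idx]

original = embeddings[0]                    # embedding of "king"
noisy = perturb(original, epsilon=5.0)      # privatized vector
word, mapped = map_to_nearest(noisy, embeddings, vocab)
```

The sketch shows why the two research factors interact: the sensitivity estimate sets the noise scale in `perturb`, while `map_to_nearest` discretizes the noisy vector back onto real words, which can recover utility but may also change the effective privacy guarantee.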
