With the rapid growth of large‑scale natural language processing (NLP) systems, ever larger volumes of unstructured text, often containing sensitive personal information, are being collected and processed, raising privacy concerns. While differential privacy has emerged as a robust framework for protecting individual data in structured settings (e.g., databases), its direct application to unstructured text, such as documents, remains challenging. A central challenge is linguistic context: perturbing individual words in isolation frequently produces grammatically incorrect or semantically incoherent outputs, degrading both fluency and downstream utility. To address this, we explore alternative text decomposition strategies, ranging from words, sentences, and entire documents to collocations, which group words whose meanings depend on one another, and evaluate their impact on the quality of the privatized text. Building on recent work (“A Collocation‑based Method for Addressing Challenges in Word‑level Metric Differential Privacy”, Meisenbacher et al., 2024), we investigate how the choice of linguistic unit size affects the privacy‑utility trade‑off.
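As a concrete illustration of these strategies, the sketch below decomposes a toy document at the word, sentence, and collocation level. It is a minimal example rather than the method of Meisenbacher et al. (2024): the tokenization is deliberately naive, and collocations are approximated with NLTK's bigram finder ranked by pointwise mutual information, assumed here as a stand‑in for a proper collocation extraction step.

```python
# Minimal sketch of interchangeable text decomposition strategies.
# Tokenization is deliberately naive (regex-based) to keep the example
# self-contained; a real pipeline would likely use spaCy or NLTK tokenizers.
import re
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

def decompose_words(text: str) -> list[str]:
    """Split the document into individual word tokens."""
    return re.findall(r"\w+", text.lower())

def decompose_sentences(text: str) -> list[str]:
    """Split the document into sentences on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def decompose_collocations(text: str, top_k: int = 5) -> list[tuple[str, str]]:
    """Group words into bigram collocations ranked by pointwise mutual information."""
    tokens = decompose_words(text)
    finder = BigramCollocationFinder.from_words(tokens)
    bigram_measures = BigramAssocMeasures()
    return finder.nbest(bigram_measures.pmi, top_k)

doc = "The patient was admitted to the intensive care unit. She recovered quickly."
print(decompose_words(doc))
print(decompose_sentences(doc))
print(decompose_collocations(doc))
```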
A second challenge lies in allocating the overall privacy budget across these units. The overall budget dictates how strongly a given document is perturbed, and thus how strongly it is privatized, and it must be distributed across the linguistic units in the document. The standard approach, uniform distribution, assumes all words are equally sensitive, yet content words (e.g., nouns, adjectives) often carry far more information than function words (e.g., articles, conjunctions). Drawing on “Spend Your Budget Wisely: Towards an Intelligent Distribution of the Privacy Budget in Differentially Private Text Rewriting” (Meisenbacher et al., 2025), we propose and evaluate non‑uniform budget allocation schemes that prioritize semantically important units.
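The following sketch illustrates one simple non‑uniform allocation: under sequential composition, the per‑unit budgets sum to the total document budget, and content words receive a larger share than function words. The function‑word list and the 3:1 weighting are placeholder choices for illustration, not the scheme proposed by Meisenbacher et al. (2025).

```python
# Illustrative sketch of a non-uniform privacy budget allocation.
# Per-token epsilons are normalized so that they sum to the total
# document budget; content words are weighted higher than function words.
# FUNCTION_WORDS and the weights below are placeholder values.
FUNCTION_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "was"}

def allocate_budget(tokens: list[str], total_epsilon: float,
                    content_weight: float = 3.0,
                    function_weight: float = 1.0) -> list[float]:
    weights = [function_weight if t.lower() in FUNCTION_WORDS else content_weight
               for t in tokens]
    weight_sum = sum(weights)
    # Normalize so the per-token epsilons sum exactly to total_epsilon.
    return [total_epsilon * w / weight_sum for w in weights]

tokens = "The patient was admitted to the intensive care unit".split()
budgets = allocate_budget(tokens, total_epsilon=10.0)
for tok, eps in zip(tokens, budgets):
    print(f"{tok:10s} eps = {eps:.2f}")
assert abs(sum(budgets) - 10.0) < 1e-9
```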
To systematically compare these approaches, we develop a modular evaluation pipeline with interchangeable decomposition and budget‑allocation methods. We empirically assess each combination on standard DP text‑rewriting benchmarks, measuring both privacy protection and utility preservation, and provide guidance on selecting suitable strategies for different application scenarios.
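The sketch below outlines how such a pipeline could expose interchangeable components; the interfaces, names, and trivial stand‑in components are illustrative assumptions and do not reflect the actual codebase.

```python
# Sketch of a modular rewriting pipeline: decomposition, budget allocation,
# and the DP rewriting mechanism are pluggable components.
from typing import Protocol

class Decomposer(Protocol):
    def __call__(self, text: str) -> list[str]: ...

class BudgetAllocator(Protocol):
    def __call__(self, units: list[str], total_epsilon: float) -> list[float]: ...

class Mechanism(Protocol):
    def __call__(self, unit: str, epsilon: float) -> str: ...

def rewrite(text: str, total_epsilon: float,
            decompose: Decomposer,
            allocate: BudgetAllocator,
            perturb: Mechanism) -> str:
    """Privatize a document with pluggable decomposition and budget allocation."""
    units = decompose(text)
    budgets = allocate(units, total_epsilon)
    return " ".join(perturb(u, eps) for u, eps in zip(units, budgets))

# Example with trivial stand-in components (the mechanism is a no-op placeholder):
print(rewrite("The patient was admitted.", 10.0,
              decompose=lambda t: t.split(),
              allocate=lambda units, e: [e / len(units)] * len(units),
              perturb=lambda unit, eps: unit))
```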
RQ1: What are the current methods for privacy budget distribution and for text decomposition?
RQ2: How can text decomposition and budget allocation methods be modularly combined for differentially private text rewriting?
RQ3: How can text decomposition and privacy budget allocation methods be combined to optimize the privacy‑utility trade‑off in differentially private text rewriting?