Guided Research Tejas Srinivasan

Last modified Apr 18

Applications of Differential Privacy (DP) in Natural Language Processing have so far operated at the word, sentence, and document levels [1,2]. However, a persistent issue of DP on text is that the privatized output fails to preserve semantic context: adding noise to text data to achieve Differential Privacy can degrade the semantic meaning and coherence of the text [3]. This is particularly problematic for NLP tasks such as machine translation, summarization, and sentiment analysis, where preserving the semantic content of the input text is crucial for task performance. The innate connectedness of graphs offers a way to produce privatized sentences that largely retain their context.
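To make the word-level setting concrete, the following is a minimal sketch of a word-level metric-DP mechanism: perturb a word's embedding with Laplace noise, then snap to the nearest vocabulary word. The vocabulary, its 2-D embeddings, and the per-dimension noise are toy assumptions for illustration, not part of any cited method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary with hypothetical 2-D embeddings (illustration only).
vocab = {
    "good": np.array([1.0, 0.2]),
    "great": np.array([0.9, 0.3]),
    "bad": np.array([-1.0, -0.1]),
    "terrible": np.array([-0.9, -0.3]),
}

def privatize_word(word, epsilon):
    """Word-level DP sketch: add Laplace noise to the word's embedding,
    then replace the word with its nearest vocabulary neighbor."""
    noisy = vocab[word] + rng.laplace(scale=1.0 / epsilon, size=2)
    return min(vocab, key=lambda w: float(np.linalg.norm(vocab[w] - noisy)))

print(privatize_word("good", epsilon=100.0))  # high epsilon: usually unchanged
print(privatize_word("good", epsilon=0.1))    # low epsilon: often replaced
```

Even this toy example shows the core problem the proposal targets: each word is privatized independently, so nothing constrains the output sentence to stay semantically coherent.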

Graph DP methods have proven successful at the node, edge, and graph levels, offering a tunable tradeoff between privacy and utility [4]. Graph-based representations can capture the relationships and dependencies between different pieces of text data. By representing text data as nodes in a graph and their relationships as edges, Differential Privacy can be applied at a more granular level, such as to individual nodes or edges, yielding finer-grained privacy guarantees while preserving the overall structure and utility of the data.
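As a minimal sketch of edge-level privatization, the snippet below builds a token co-occurrence graph and applies the Laplace mechanism to each edge weight. The window size, the sensitivity of 1 (one record changing one co-occurrence count), and the example sentence are assumptions for illustration, not choices made in the cited work.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(42)

def cooccurrence_graph(tokens, window=2):
    """Build a weighted co-occurrence graph: tokens are nodes, and an
    edge's weight counts co-occurrences within a sliding window."""
    edges = defaultdict(float)
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            edges[tuple(sorted((t, tokens[j])))] += 1.0
    return dict(edges)

def privatize_edges(edges, epsilon, sensitivity=1.0):
    """Edge-weight DP sketch: Laplace mechanism on each edge weight,
    assuming sensitivity 1 per co-occurrence count."""
    return {e: w + rng.laplace(scale=sensitivity / epsilon)
            for e, w in edges.items()}

g = cooccurrence_graph("the patient reported severe pain".split())
noisy_g = privatize_edges(g, epsilon=1.0)
```

Note that the graph structure (which nodes are connected) survives the perturbation; only the weights are noised, which is one route to retaining context while still offering a DP guarantee on the counts.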

Graphs can also enable scalable privacy-preserving NLP by supporting efficient processing and analysis of large-scale graph data. By leveraging distributed graph processing frameworks and graph-based privacy mechanisms, DP can be applied to large-scale NLP datasets and models in a computationally efficient manner.

Our proposed pipeline would take sensitive text as input, preprocess it (stop-word removal, lemmatization, NER, etc.), and create contextual embeddings. From these, a text graph would be constructed, and DP noise would be added to produce a privatized graph. Privatized rewritten text would then be reconstructed from this graph and passed through post-processing steps using LLMs. Finally, the tradeoff between utility and privacy would be evaluated.
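The stages above can be sketched end to end as follows. Every function here is a hypothetical placeholder standing in for the real component (the "graph" is just a chain of consecutive tokens, and the DP and LLM stages are identity functions); the sketch only fixes the interfaces between the stages.

```python
def preprocess(text):
    # Placeholder: the real stage would do stop-word removal,
    # lemmatization, NER, and contextual embedding.
    return text.lower().split()

def build_text_graph(tokens):
    # Placeholder text graph: a chain of consecutive-token edges.
    return list(zip(tokens, tokens[1:]))

def add_dp_noise(graph, epsilon):
    # Placeholder for a graph DP mechanism (identity here).
    return graph

def graph_to_text(graph):
    # Placeholder reconstruction: walk the chain back into a sentence.
    words = [graph[0][0]] + [b for _, b in graph] if graph else []
    return " ".join(words)

def llm_postprocess(draft):
    # Placeholder for LLM-based fluency repair (identity here).
    return draft

def privatize_text(text, epsilon):
    tokens = preprocess(text)
    graph = build_text_graph(tokens)
    noisy_graph = add_dp_noise(graph, epsilon)
    return llm_postprocess(graph_to_text(noisy_graph))
```

With identity stages the output equals the lower-cased input; the research questions below are precisely about what happens to overhead and utility once the real graph construction, noise, and reconstruction components are plugged in.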

The research conducted as part of this Guided Research would be as follows: (1) investigate approaches to represent textual data in graph-structured formats, (2) test methods of achieving DP in graph-structured representations of text, and (3) explore and evaluate methods for transforming graph-structured representations of text back to plain text. The resulting pipeline will be evaluated in a series of utility and privacy tests, where the proposed new approach will be compared to previous state-of-the-art DP text rewriting methods.

 

Research Questions:

1. How would a text graph be constructed in such a way that the context of the initial text is preserved?
2. What is the overhead of the graph creation and reconstruction process?
3. What metrics can we possibly use to evaluate this graph-based text privatization approach?
4. What is the tradeoff between privacy and utility of this approach?
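For question 3, one candidate utility metric is the cosine similarity between sentence embeddings of the original and privatized texts, as a proxy for semantic preservation. The embeddings below are toy vectors assumed for illustration; in practice they would come from a sentence encoder applied to both texts.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy sentence embeddings (hypothetical values, for illustration only).
original = np.array([0.8, 0.1, 0.5])
privatized = np.array([0.7, 0.2, 0.4])

print(round(cosine_similarity(original, privatized), 3))
```

A score near 1 would suggest the privatized rewrite kept the semantic content; the privacy side of the tradeoff would be measured separately, e.g. against the epsilon budget or empirical re-identification attacks.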

 

References:

[1] Mattern, Justus, Benjamin Weggenmann, and Florian Kerschbaum. "The Limits of Word Level Differential Privacy." In Findings of the Association for Computational Linguistics: NAACL 2022, pp. 867-881. 2022.

[2] Meehan, Casey, Khalil Mrini, and Kamalika Chaudhuri. "Sentence-level Privacy for Document Embeddings." In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3367-3380. 2022.

[3] Klymenko, Oleksandra, Stephen Meisenbacher, and Florian Matthes. "Differential Privacy in Natural Language Processing: The Story So Far." In Proceedings of the Fourth Workshop on Privacy in Natural Language Processing, pp. 1-11. 2022.

[4] Mueller, Tamara T., Dmitrii Usynin, Johannes C. Paetzold, Daniel Rueckert, and Georgios Kaissis. "SoK: Differential privacy on graph-structured data." arXiv preprint arXiv:2203.09205. 2022.

[5] Koncel-Kedziorski, Rik, Dhanush Bekal, Yi Luan, Mirella Lapata, and Hannaneh Hajishirzi. "Text Generation from Knowledge Graphs with Graph Transformers." In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 2284-2293. 2019.
