How to Evaluate Disparities in the Quality of Post-hoc Explanations on NLP Models?
Abstract
Research in explainable artificial intelligence (XAI) is driven by the need to identify, quantify, and mitigate bias in machine learning models, ultimately striving for fairness. This stems from the recognition that black-box AI models can produce biased predictions because they learn flawed reasoning from biased data. Transparency therefore becomes crucial for uncovering and understanding these biases. To this end, researchers often employ post-hoc explainability methods such as LIME, SHAP, and TCAV to investigate model behavior. However, these methods are susceptible to "fairwashing": their explanations may not faithfully represent the model's internal processes, potentially inspiring false confidence in model performance and perpetuating disparities, especially across social groups.
In the context of natural language processing (NLP) and machine learning, disparate system performance can hinder language diversity and understanding among social groups. For instance, African American Language is misclassified as hate speech more frequently than Standard American English, which highlights the importance of fairness in language-based models. A crucial aspect that therefore requires attention is evaluating the fairness of these explainability methods, especially on text data, which is abundant and increasingly used both for research and for improving language models. We consider this an important gap to fill for fairer research in the future.
In this Guided Research, we briefly review trends in post-hoc explainability methods and in research on disparate fairness in NLP. We then identify evaluation metrics for these explainability tools and measure their performance on those metrics to determine whether they exhibit disparities in the quality of their explanations.
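The concrete metrics are developed in the main body; as a minimal sketch of the kind of disparity check we have in mind, the snippet below trains a toy text classifier, uses LIME to rank token importances, and compares a simple faithfulness proxy (the drop in predicted probability after removing the top-ranked tokens) across two hypothetical groups of texts. The training data, group split, and the comprehensiveness helper are illustrative assumptions, not part of the proposed study.

```python
# Sketch: comparing a faithfulness-style score of LIME explanations across
# two hypothetical groups of texts. Data and group labels are placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from lime.lime_text import LimeTextExplainer

# Toy corpus standing in for a real hate-speech / toxicity dataset.
train_texts = ["I love this", "great and kind words", "awful hateful remark",
               "terrible and rude comment", "what a wonderful day", "this is disgusting"]
train_labels = [0, 0, 1, 1, 0, 1]  # 0 = benign, 1 = flagged

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

explainer = LimeTextExplainer(class_names=["benign", "flagged"])

def comprehensiveness(text, top_k=3):
    """Drop in the predicted 'flagged' probability after removing the top-k
    tokens that LIME ranks as most important (higher = more faithful)."""
    exp = explainer.explain_instance(text, model.predict_proba, num_features=top_k)
    top_tokens = {tok for tok, _ in exp.as_list()}
    reduced = " ".join(w for w in text.split() if w not in top_tokens)
    full_p = model.predict_proba([text])[0, 1]
    reduced_p = model.predict_proba([reduced if reduced else " "])[0, 1]
    return full_p - reduced_p

# Hypothetical group split (e.g., texts in different language varieties).
group_a = ["awful hateful remark", "this is disgusting"]
group_b = ["terrible and rude comment"]

for name, texts in [("group_a", group_a), ("group_b", group_b)]:
    scores = [comprehensiveness(t) for t in texts]
    print(name, np.mean(scores))
```

A systematic gap in such scores between groups would suggest that the explanation method, not only the underlying model, behaves unequally; the full study replaces this toy setup with established models, corpora annotated with language variety, and a broader set of evaluation metrics.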