How to Evaluate Disparities in the Quality of Post-hoc Explanations on NLP Models?
Abstract
Research in explainable artificial intelligence (XAI) is driven by the need to identify, quantify, and mitigate bias in machine learning models, ultimately striving for fairness. This stems from the recognition that black-box AI models can produce biased predictions because they learn flawed reasoning from biased data. Transparency therefore becomes crucial for uncovering and understanding these biases. To this end, researchers often employ post-hoc explainability methods such as LIME, SHAP, and TCAV to investigate model behavior. However, these methods are susceptible to "fairwashing": their explanations may not faithfully represent the model's internal processes, potentially inspiring false confidence in model performance and perpetuating disparities, especially across social groups.
In the context of natural language processing (NLP) and machine learning, disparate system performance can hinder language diversity and understanding among social groups. For instance, African American Language is misclassified as hate speech more frequently than Standard American English, which highlights the importance of fairness in language-based models. A crucial aspect that therefore requires attention is evaluating the fairness of these explainability methods, especially on text data, which is abundant and increasingly used both for research and for improving language models. We consider this an important gap to fill for fairer research in the future.
In this Guided Research, we briefly review trends in post-hoc explainability methods and in research on disparate fairness in NLP. We then identify evaluation metrics for these explainability tools and measure their performance on those metrics to determine whether they exhibit disparities in the quality of their explanations.
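The concrete metrics are developed in the main body; as a minimal sketch of the kind of disparity check we have in mind, the snippet below trains a toy text classifier, uses LIME to rank token importances, and compares a simple faithfulness proxy (the drop in predicted probability after removing the top-ranked tokens) across two hypothetical groups of texts. The training data, group split, and the comprehensiveness helper are illustrative assumptions, not part of the proposed study.

```python
# Sketch: comparing a faithfulness-style score of LIME explanations across
# two hypothetical groups of texts. Data and group labels are placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from lime.lime_text import LimeTextExplainer

# Toy corpus standing in for a real hate-speech / toxicity dataset.
train_texts = ["I love this", "great and kind words", "awful hateful remark",
               "terrible and rude comment", "what a wonderful day", "this is disgusting"]
train_labels = [0, 0, 1, 1, 0, 1]  # 0 = benign, 1 = flagged

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)

explainer = LimeTextExplainer(class_names=["benign", "flagged"])

def comprehensiveness(text, top_k=3):
    """Drop in the predicted 'flagged' probability after removing the top-k
    tokens that LIME ranks as most important (higher = more faithful)."""
    exp = explainer.explain_instance(text, model.predict_proba, num_features=top_k)
    top_tokens = {tok for tok, _ in exp.as_list()}
    reduced = " ".join(w for w in text.split() if w not in top_tokens)
    full_p = model.predict_proba([text])[0, 1]
    reduced_p = model.predict_proba([reduced if reduced else " "])[0, 1]
    return full_p - reduced_p

# Hypothetical group split (e.g., texts in different language varieties).
group_a = ["awful hateful remark", "this is disgusting"]
group_b = ["terrible and rude comment"]

for name, texts in [("group_a", group_a), ("group_b", group_b)]:
    scores = [comprehensiveness(t) for t in texts]
    print(name, np.mean(scores))
```

A systematic gap in such scores between groups would suggest that the explanation method, not only the underlying model, behaves unequally; the full study replaces this toy setup with established models, corpora annotated with language variety, and a broader set of evaluation metrics.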