Privacy preservation is a crucial component of any real-world application. Yet, in applications relying on machine learning backends, this is challenging because models often capture more than a designer may have envisioned, resulting in the potential leakage of sensitive information. For example, emotion recognition models are susceptible to learning patterns between the target variable and other sensitive variables, patterns that can be maliciously re-purposed to obtain protected information. In this paper, we concentrate on using interpretable methods to evaluate a model's efficacy in preserving privacy with respect to sensitive variables. We focus on saliency-based explanations, which highlight regions of the input text and thus allow us to understand how model explanations shift when models are trained to preserve privacy. We show that certain commonly used methods that seek to preserve privacy may not align with human perception of privacy preservation. We also show that some of these methods induce spurious correlations in the model between the input and both the primary and secondary tasks, even when the improvement in the evaluation metric is significant. Such correlations can therefore lead to false assurances about a model's perceived privacy, especially when the model is used in cross-corpus conditions. We conduct crowdsourcing experiments to evaluate evaluators' inclination to choose a particular model for a given task when model explanations are provided, and find that the correlation of interpretation differences with sociolinguistic biases can be used as a proxy for user trust.
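To make the notion of saliency-based explanation concrete, the following is a minimal sketch of gradient-times-input token saliency for a text classifier. The vocabulary, toy model, and example sentence are hypothetical stand-ins for illustration, not the paper's actual setup or models.

```python
# A minimal sketch of saliency-based explanation for text, assuming a
# toy PyTorch classifier; all names here are illustrative placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)

VOCAB = {"<pad>": 0, "i": 1, "feel": 2, "so": 3, "happy": 4, "today": 5}
NUM_CLASSES = 2  # e.g., a binary emotion label

class TinyClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, num_classes=NUM_CLASSES):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, token_ids):
        # Keep the embedding output so we can take gradients w.r.t. it.
        self.last_embeds = self.embed(token_ids)
        self.last_embeds.retain_grad()
        pooled = self.last_embeds.mean(dim=1)  # mean-pool over tokens
        return self.fc(pooled)

model = TinyClassifier(len(VOCAB))
tokens = ["i", "feel", "so", "happy", "today"]
ids = torch.tensor([[VOCAB[t] for t in tokens]])

logits = model(ids)
# Gradient of the predicted class score w.r.t. the input embeddings.
logits[0, logits.argmax()].backward()

# Gradient-x-input saliency: per-token attribution, summed over the
# embedding dimension, then normalized for display.
saliency = (model.last_embeds.grad * model.last_embeds).sum(-1).abs()[0]
saliency = saliency / saliency.max()
for tok, score in zip(tokens, saliency.tolist()):
    print(f"{tok:>6}: {score:.2f}")
```

Comparing such per-token saliency scores before and after privacy-preserving training is one way to inspect whether the regions a model attends to shift away from, or remain anchored on, tokens indicative of sensitive variables.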