Hollywood Identity Bias Dataset: A Context Oriented Bias Analysis of Movie Dialogues

Sandhya Singh, Prapti Roy, Nihar Sahoo, Niteesh Mallela, Himanshu Gupta, Pushpak Bhattacharyya, Milind Savagaonkar, Nidhi, Roshni Ramnani, Anutosh Maitra, Shubhashis Sengupta

Movies reflect society and also hold the power to transform opinions. Because of their reach, the social biases and stereotypes present in movies can cause extensive damage. These biases are not always required by the storyline; they can also creep in as the author's own bias. Movie production houses would prefer to verify that any bias present in a script is demanded by the story. Now that deep learning models can reach human-level accuracy on multiple tasks, an AI solution that identifies biases in a script at the writing stage can help production houses avoid the inconvenience of stalled releases, lawsuits, etc. Since AI solutions are data intensive and no domain-specific data exists to address the problem of biases in scripts, we introduce a new dataset of movie scripts annotated for identity bias. The dataset contains dialogue turns and offers (i) bias labels for seven categories, viz., gender, race/ethnicity, religion, age, occupation, LGBTQ, and other, where "other" covers biases such as body shaming and personality bias, (ii) labels for sensitivity, stereotype, sentiment, emotion, and emotion intensity, (iii) context-aware annotation of all labels, (iv) target groups and reasons for the bias labels, and (v) an expert-driven group-validation process for high-quality annotations. We also report various baseline performances for bias identification and category detection on our dataset.
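To make the annotation layers listed above concrete, the following is a minimal sketch of how a single annotated dialogue turn could be represented. It is not the dataset's released format: the class name `AnnotatedTurn`, every field name, and the validation rule are assumptions made for illustration. Only the seven category names come from the abstract, and the value sets for sensitivity, stereotype, sentiment, emotion, and emotion intensity are left open because the abstract does not specify them.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The seven bias categories named in the abstract.
BIAS_CATEGORIES = (
    "gender", "race/ethnicity", "religion", "age",
    "occupation", "LGBTQ", "other",
)


@dataclass
class AnnotatedTurn:
    """Hypothetical record for one dialogue turn (all field names are assumed)."""

    text: str                                          # the dialogue turn itself
    context: List[str] = field(default_factory=list)   # preceding turns used as context
    is_biased: bool = False                            # bias identification label
    bias_categories: List[str] = field(default_factory=list)  # category detection labels
    target_groups: List[str] = field(default_factory=list)    # groups the bias targets
    reason: Optional[str] = None                       # annotator's reason for the bias label
    # Label value sets are not given in the abstract, so these stay free-form.
    sensitivity: Optional[str] = None
    stereotype: Optional[str] = None
    sentiment: Optional[str] = None
    emotion: Optional[str] = None
    emotion_intensity: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject category names outside the seven listed in the abstract.
        unknown = [c for c in self.bias_categories if c not in BIAS_CATEGORIES]
        if unknown:
            raise ValueError(f"unknown bias categories: {unknown}")


# Placeholder strings only; no real dataset content is reproduced here.
turn = AnnotatedTurn(
    text="<dialogue turn>",
    context=["<previous turn>"],
    is_biased=True,
    bias_categories=["gender"],
    target_groups=["<target group>"],
    reason="<annotator rationale>",
)
print(turn.bias_categories)
```

Keeping the preceding turns alongside each record mirrors the context-aware annotation described in the abstract, so a downstream classifier can be given the same context the annotators saw.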
