Personality and demographics are important variables in the social sciences, while in NLP they can aid interpretability and the removal of societal biases. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first large-scale dataset of Reddit comments labeled with three personality models (including the well-established Big 5 model) and demographics (age, gender, and location) for more than 10k users. We showcase the usefulness of this dataset in three experiments, in which we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.