A Data Fusion Framework for Multi-Domain Morality Learning

Language models can be trained to recognize the moral sentiment of text, creating new opportunities to study the role of morality in human life. As interest in language and morality has grown, several ground-truth datasets with moral annotations have been released. However, these datasets differ in data collection method, domain, topics, annotator instructions, and other respects. Simply aggregating such heterogeneous datasets during training can yield models that fail to generalize well. We describe a data fusion framework for training on multiple heterogeneous datasets that improves performance and generalizability. The model uses domain adversarial training to align the datasets in feature space and a weighted loss function to deal with label shift. We show that the proposed framework achieves state-of-the-art performance on different datasets compared with prior work on morality inference.
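
As a rough illustration of the weighted-loss idea mentioned in the abstract, the sketch below reweights a cross-entropy loss by inverse class frequency so that rare labels count more. The abstract does not specify the weighting scheme, so inverse-frequency weighting and the function names `inverse_frequency_weights` and `weighted_cross_entropy` are assumptions made for illustration, not the authors' method. The domain adversarial component is not shown.

```python
import math


def inverse_frequency_weights(labels, num_classes):
    """Return one weight per class, proportional to 1 / class frequency.

    Assumed scheme (not taken from the paper): n / (num_classes * count).
    Classes that never occur get weight 0.0.
    """
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    n = len(labels)
    return [n / (num_classes * c) if c else 0.0 for c in counts]


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of the negative log-likelihood of the true class.

    probs   : list of per-example probability lists (each sums to 1)
    labels  : list of integer class indices
    weights : per-class weights, e.g. from inverse_frequency_weights
    """
    total = 0.0
    weight_sum = 0.0
    for p, y in zip(probs, labels):
        # Clamp to avoid log(0) on degenerate predictions.
        total += -weights[y] * math.log(max(p[y], 1e-12))
        weight_sum += weights[y]
    return total / weight_sum


# Hypothetical toy data: three examples of class 0, one of class 1.
labels = [0, 0, 0, 1]
weights = inverse_frequency_weights(labels, num_classes=2)
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
loss = weighted_cross_entropy(probs, labels, weights)
```

With these weights the single class-1 example contributes as much total weight as the three class-0 examples combined, which is one simple way to counter a skewed label distribution when datasets are pooled.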

Related content

Dataset

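A data set (or dataset) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set in question and lists a value for each variable, such as the height and weight of an object or the value of a random number. Each value is known as a datum. The data set may contain data for one or more members, corresponding to the number of rows.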