Intimacy is an essential element of human relationships, and language is a crucial means of conveying it. Textual intimacy analysis can reveal social norms in different contexts and serve as a benchmark for testing computational models' ability to understand social information. In this paper, we propose WADER, a novel weak-labeling strategy for data augmentation in text regression tasks. WADER uses data augmentation to address data imbalance and data scarcity, and provides a method for data augmentation in cross-lingual, zero-shot tasks. We benchmark state-of-the-art pre-trained multilingual language models using WADER and analyze the use of sampling techniques to mitigate bias in the data and to select augmentation candidates optimally. Our results show that WADER outperforms the baseline model and provides a direction for mitigating data imbalance and scarcity in text regression tasks.