Machine-learned models for author profiling in social media often rely on data acquired via self-report psychometric tests (questionnaires) filled out by social media users. This is an expensive but accurate data collection strategy. A less costly alternative, which however yields potentially noisier and more biased data, is to rely on labels inferred from publicly available information in users' profiles, for instance self-reported diagnoses or test results. In this paper, we explore a third strategy, namely to directly use a corpus of items from validated psychometric tests as training data. Items from psychometric tests are typically first-person sentences (e.g., "I make friends easily."). Such corpora of test items constitute 'small data', but their availability for many concepts makes them a rich resource. We investigate this approach for personality profiling: we evaluate BERT classifiers fine-tuned on such psychometric test items for the Big Five personality traits (openness, conscientiousness, extraversion, agreeableness, neuroticism) and analyze various augmentation strategies regarding their potential to address the challenges posed by such a small corpus. Our evaluation on a publicly available Twitter corpus shows performance comparable to in-domain training for 4 of the 5 personality traits when T5-based data augmentation is used.
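The core idea of the third strategy can be illustrated with a minimal sketch: each test item is treated as a labeled training sentence for the trait it measures, with reverse-keyed items providing negative examples. The items and helper names below are illustrative placeholders, not the actual corpus used in the paper.

```python
# Hedged sketch: building per-trait training data directly from
# psychometric test items. The item texts below are illustrative
# examples in the style of public-domain Big Five inventories.

# Each entry: (item text, trait, label). Label 1 means the item is
# keyed positively for the trait; 0 means it is reverse-keyed.
ITEMS = [
    ("I make friends easily.", "extraversion", 1),
    ("I keep in the background.", "extraversion", 0),
    ("I have a vivid imagination.", "openness", 1),
    ("I am always prepared.", "conscientiousness", 1),
    ("I sympathize with others' feelings.", "agreeableness", 1),
    ("I get stressed out easily.", "neuroticism", 1),
]

BIG_FIVE = ["openness", "conscientiousness", "extraversion",
            "agreeableness", "neuroticism"]

def per_trait_dataset(trait):
    """Collect (text, label) pairs for one binary trait classifier."""
    return [(text, label) for text, t, label in ITEMS if t == trait]

# One small labeled dataset per trait; each would then be used to
# fine-tune a separate BERT classifier (optionally after augmentation).
datasets = {trait: per_trait_dataset(trait) for trait in BIG_FIVE}
```

In this setup, each trait gets its own binary classifier; data augmentation (e.g., T5-based paraphrasing, as evaluated in the paper) would expand these few dozen items into a larger training set before fine-tuning.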