Laughter is considered one of the most overt signals of joy. Although laughter is well recognized as a multimodal phenomenon, it is most commonly detected from its sound. It remains unclear how the perception and annotation of laughter differ when it is annotated from other modalities, such as video of the associated body movements. In this paper we take a first step in this direction by asking if, and how well, laughter can be annotated when annotators have access to audio only, video only (containing full-body movement information), or both modalities. We ask whether annotations of laughter are congruent across modalities, and compare the effect that the labeling modality has on machine learning model performance. We compare annotations and models for laughter detection, intensity estimation, and segmentation, three tasks common in previous studies of laughter. Our analysis of more than 4000 annotations from 48 annotators revealed evidence of incongruity between modalities in the perception of laughter and of its intensity. Further analysis against consolidated audiovisual reference annotations showed that recall was, on average, lower in the video condition than in the audio condition, but tended to increase with the intensity of the laughter samples. Our machine learning experiments compared the performance of state-of-the-art unimodal (audio-based, video-based, and acceleration-based) and multimodal models across combinations of input modality, training label modality, and testing label modality. Models with video and acceleration inputs performed similarly regardless of training label modality, suggesting that it may be entirely appropriate to train models for detecting laughter from body movements using video-acquired labels, despite their lower inter-rater agreement.
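To make the recall comparison concrete, the following is a minimal sketch of how recall of a single annotator's labels against a consolidated audiovisual reference could be computed. It assumes frame-level binary laughter labels on a common time grid; the function name `frame_recall`, the arrays, and the numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def frame_recall(annotation: np.ndarray, reference: np.ndarray) -> float:
    """Fraction of reference laughter frames that the annotator also marked.

    Both inputs are binary arrays of per-frame labels (1 = laughter),
    sampled on a common time grid.
    """
    ref_frames = reference == 1
    if not ref_frames.any():
        return float("nan")  # recall is undefined if the reference has no laughter
    return float((annotation[ref_frames] == 1).mean())

# Hypothetical 10-frame clip: the consolidated audiovisual reference
# marks frames 3-7 as laughter.
reference  = np.array([0, 0, 0, 1, 1, 1, 1, 1, 0, 0])
audio_only = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0])  # misses 1 of 5 frames
video_only = np.array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0])  # misses 3 of 5 frames

print(frame_recall(audio_only, reference))  # 0.8
print(frame_recall(video_only, reference))  # 0.4
```

Under this kind of measure, lower video-condition recall would mean that annotators watching only body movements recover a smaller share of the reference laughter frames than annotators listening to audio, consistent with the pattern reported above.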