In the domain of social signal processing, audio event detection is a promising avenue for accessing daily behaviors that contribute to health and well-being. However, despite advances in mobile computing and machine learning, audio behavior detection models remain largely constrained to data collected in controlled settings, such as call centers. This is problematic, as their performance is unlikely to generalize to real-world applications. In this paper, we present a novel dataset of infant distress vocalizations compiled from over 780 hours of real-world audio data, collected via recorders worn by infants. We develop a model that combines deep spectrum and acoustic features to detect and classify infant distress vocalizations, and which dramatically outperforms models trained on equivalent real-world data (F1 score of 0.630 vs. 0.166). We close by discussing how dataset size can facilitate such gains in accuracy, a critical consideration for noisy and complex naturalistic data.
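To make the feature-fusion idea concrete, the sketch below illustrates the general pattern of combining learned spectrogram-based features with hand-crafted acoustic features for audio event classification. This is not the paper's implementation: the pooled log-spectrogram stands in for a CNN-derived deep spectrum embedding, the synthetic "cry"/"other" clips stand in for the real-world recordings, and all function names are illustrative. It uses only NumPy and scikit-learn.

```python
# Hedged sketch: fuse spectrogram-derived features with simple acoustic
# descriptors, then classify. All names and the synthetic data are
# illustrative stand-ins, not the paper's actual pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

def spectrogram(x, n_fft=256, hop=128):
    # magnitude spectrogram via short-time FFT with a Hann window
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))

def deep_spectrum_features(spec):
    # stand-in for a CNN embedding: log-spectrogram pooled over time
    return np.log1p(spec).mean(axis=0)

def acoustic_features(x):
    # simple hand-crafted descriptors: RMS energy and zero-crossing rate
    rms = np.sqrt(np.mean(x ** 2))
    zcr = np.mean(np.abs(np.diff(np.sign(x)))) / 2
    return np.array([rms, zcr])

def featurize(x):
    # early fusion: concatenate both feature sets into one vector
    return np.concatenate([deep_spectrum_features(spectrogram(x)),
                           acoustic_features(x)])

def make_clip(label, n=4000, sr=8000):
    # synthetic stand-in audio: "distress" = tone + noise, "other" = noise
    t = np.arange(n) / sr
    noise = rng.normal(0, 0.3, n)
    return np.sin(2 * np.pi * 500 * t) + noise if label else noise

y = np.array([i % 2 for i in range(80)])
X = np.stack([featurize(make_clip(label)) for label in y])

clf = LogisticRegression(max_iter=1000).fit(X[:60], y[:60])
print("held-out F1:", f1_score(y[60:], clf.predict(X[60:])))
```

On this toy data the two classes are easily separable, so the held-out F1 is high; the point is only the structure of the pipeline, in which the fused feature vector feeds a single downstream classifier.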