We propose an automatic data processing pipeline for extracting vocal productions from large-scale natural audio recordings. Through a series of computational steps (windowing, creation of a noise class, data augmentation, re-sampling, transfer learning, Bayesian optimisation), it automatically trains a neural network to detect various types of natural vocal productions in a noisy data stream without requiring a large sample of labeled data. We test it on two different data sets, one from a group of Guinea baboons recorded at a primate research center and one from human babies recorded at home. The pipeline trains a model on 72 and 77 minutes of labeled audio recordings, reaching accuracies of 94.58% and 99.76%, respectively. The trained models are then used to process 443 and 174 hours of natural continuous recordings, yielding two new databases of 38.8 and 35.2 hours, respectively. We discuss the strengths and limitations of this approach, which can be applied to any massive audio recording.
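To make the first two steps (windowing and creation of a noise class) concrete, the following is a minimal sketch, not the authors' implementation. It assumes annotations are given as (start, end, label) triples in seconds. The window length, hop size, overlap threshold, and the names `label_windows` and `"noise"` are hypothetical choices made for illustration only.

```python
from typing import List, Tuple

# (start_s, end_s, label) for one annotated vocalization; format is an assumption.
Segment = Tuple[float, float, str]


def label_windows(
    duration_s: float,
    segments: List[Segment],
    win_s: float = 1.0,         # assumed window length in seconds
    hop_s: float = 0.5,         # assumed hop between window starts
    min_overlap: float = 0.5,   # assumed fraction of a window a segment must cover
    noise_label: str = "noise",
) -> List[Segment]:
    """Slice [0, duration_s) into fixed-length windows and label each one.

    A window receives the label of the annotated segment that overlaps it most,
    provided the overlap covers at least `min_overlap` of the window. Every
    other window falls into the noise class.
    """
    if duration_s < win_s:
        return []
    n_windows = int((duration_s - win_s) / hop_s) + 1
    windows: List[Segment] = []
    for i in range(n_windows):
        start = i * hop_s
        end = start + win_s
        label, best = noise_label, 0.0
        for seg_start, seg_end, seg_label in segments:
            # Length of the intersection between the window and the segment.
            overlap = min(end, seg_end) - max(start, seg_start)
            if overlap / win_s >= min_overlap and overlap > best:
                label, best = seg_label, overlap
        windows.append((start, end, label))
    return windows


if __name__ == "__main__":
    # A 4-second toy recording with one annotated call between 1 s and 2 s.
    for window in label_windows(4.0, [(1.0, 2.0, "bark")]):
        print(window)
```

In this sketch the windows that do not sufficiently overlap any annotation form the noise class. In a full pipeline these labeled windows would then feed the augmentation, re-sampling, and training steps listed above.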