We apply the vision transformer, a deep machine learning model built around the attention mechanism, to mel-spectrogram representations of raw audio recordings. By adding mel-based data augmentation techniques and sample weighting, we achieve comparable performance on both tasks of ComParE21 (the PRS and CCS challenges), outperforming most single-model baselines. We further introduce overlapping vertical patching and evaluate the influence of parameter configurations.

Index Terms: audio classification, attention, mel-spectrogram, unbalanced datasets, computational paralinguistics
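The overlapping vertical patching mentioned above can be illustrated with a minimal sketch. This is an assumption about the scheme, not the paper's implementation: each patch spans the full mel-frequency axis and a fixed number of time frames, and a stride smaller than the patch width makes consecutive patches overlap. The function name, patch width, and stride are hypothetical.

```python
import numpy as np

def overlapping_vertical_patches(mel, patch_width=16, stride=8):
    """Split a mel-spectrogram (n_mels x n_frames) into vertical patches.

    Each patch covers the full frequency axis; a stride smaller than
    patch_width produces overlap along the time axis.
    """
    n_mels, n_frames = mel.shape
    patches = [
        mel[:, start:start + patch_width]
        for start in range(0, n_frames - patch_width + 1, stride)
    ]
    return np.stack(patches)  # shape: (n_patches, n_mels, patch_width)

# Toy mel-spectrogram: 128 mel bins, 64 time frames.
mel = np.random.rand(128, 64)
patches = overlapping_vertical_patches(mel)
print(patches.shape)  # (7, 128, 16)
```

With a patch width of 16 frames and a stride of 8, adjacent patches share half their frames; the resulting patch sequence would then be flattened and fed to the transformer in place of the usual square image patches.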