Audio-based classification of body sounds has long been studied to support diagnostic decisions, particularly for pulmonary diseases. In response to the urgency of the COVID-19 pandemic, a growing number of models have been developed to identify COVID-19 patients from acoustic input. Most of these models focus on cough, because a dry cough is the best-known symptom of COVID-19. However, other body sounds, such as breathing and speech, have also been shown to correlate with COVID-19. In this work, rather than relying on a single body sound, we propose Fused Audio Instance and Representation for COVID-19 Detection (FAIR4Cov). It constructs a joint feature vector from multiple body sounds in both waveform and spectrogram representations. The core component of FAIR4Cov is a self-attention fusion unit trained to establish the relations among multiple body sounds and audio representations and to integrate them into a compact feature vector. We evaluate different combinations of body sounds using the waveform representation only, the spectrogram representation only, and a joint representation of both. Our findings show that using self-attention to combine features extracted from cough, breathing, and speech sounds yields the best performance, with an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.8658, a sensitivity of 0.8057, and a specificity of 0.7958. This AUC is 0.0227 higher than that of models trained on spectrograms only and 0.0847 higher than that of models trained on waveforms only. These results demonstrate that combining the spectrogram and waveform representations enriches the extracted features and outperforms models that use a single representation.
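To make the fusion idea concrete, the following is a minimal sketch of a self-attention fusion unit in PyTorch. It is an illustrative assumption rather than the authors' exact implementation: the class name, feature dimension, head count, residual layout, and mean pooling are all hypothetical choices; the abstract only specifies that self-attention relates per-instance features (body sound x representation) and integrates them into one compact vector.

```python
# Hypothetical sketch of a self-attention fusion unit; names and
# dimensions are assumptions, not the paper's exact architecture.
import torch
import torch.nn as nn

class SelfAttentionFusion(nn.Module):
    """Fuses per-instance feature vectors (one per body sound and
    representation, e.g. cough/breath/speech x waveform/spectrogram)
    into a single compact feature vector."""

    def __init__(self, feat_dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Multi-head self-attention relates every instance to every other.
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_instances, feat_dim); num_instances = 6 for
        # 3 body sounds x 2 representations (waveform + spectrogram).
        attended, _ = self.attn(feats, feats, feats)  # instance-to-instance relations
        fused = self.norm(feats + attended)           # residual connection
        return fused.mean(dim=1)                      # pool into one compact vector

# Usage sketch: six instance embeddings per sample are fused into one
# 256-d vector, which a small head maps to a COVID-19 probability.
fusion = SelfAttentionFusion()
classifier = nn.Sequential(nn.Linear(256, 1), nn.Sigmoid())
x = torch.randn(8, 6, 256)    # batch of 8 samples, 6 instances, 256-d features
prob = classifier(fusion(x))  # (8, 1) predicted probabilities
```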