This work presents an outer product-based approach to fuse the embedded representations generated from the spectrograms of cough, breath, and speech samples for the automatic detection of COVID-19. To extract deep learned representations from the spectrograms, we compare the performance of a CNN trained from scratch with that of a ResNet18 architecture fine-tuned for the task at hand. Furthermore, we investigate whether the patients' sex and the use of contextual attention mechanisms are beneficial. Our experiments use the dataset released as part of the Second Diagnosing COVID-19 using Acoustics (DiCOVA) Challenge. The results suggest the suitability of fusing breath and speech information to detect COVID-19. An Area Under the Curve (AUC) of 84.06% is obtained on the test partition when using a CNN trained from scratch with contextual attention mechanisms. When using the ResNet18 architecture for feature extraction, the baseline model achieves the highest performance, with an AUC of 84.26%.
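To illustrate the outer product-based fusion of two modality embeddings, the following is a minimal sketch, not the authors' exact architecture: the class name, the embedding dimension, and the linear classification head are illustrative assumptions.

```python
import torch
import torch.nn as nn


class OuterProductFusion(nn.Module):
    """Hypothetical sketch: fuse two modality embeddings via their outer product."""

    def __init__(self, embed_dim: int, num_classes: int = 2):
        super().__init__()
        # The flattened outer product has embed_dim * embed_dim entries.
        self.classifier = nn.Linear(embed_dim * embed_dim, num_classes)

    def forward(self, breath_emb: torch.Tensor, speech_emb: torch.Tensor) -> torch.Tensor:
        # breath_emb, speech_emb: (batch, embed_dim) vectors produced by the
        # spectrogram encoders (e.g. a CNN trained from scratch or ResNet18).
        # Per-sample outer product: (batch, embed_dim, embed_dim).
        fused = torch.einsum("bi,bj->bij", breath_emb, speech_emb)
        # Flatten the fused matrix and classify (COVID-19 positive vs. negative).
        return self.classifier(fused.flatten(start_dim=1))


if __name__ == "__main__":
    # Random tensors stand in for real encoder outputs.
    fusion = OuterProductFusion(embed_dim=128)
    breath = torch.randn(4, 128)
    speech = torch.randn(4, 128)
    print(fusion(breath, speech).shape)  # torch.Size([4, 2])
```

The outer product captures pairwise interactions between every dimension of the two embeddings, which is what motivates its use over simple concatenation when fusing breath and speech information.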