The purpose of this paper is to compare different learnable frontends on medical acoustics tasks. A framework has been implemented to classify human respiratory sounds and heartbeats into two categories, i.e. healthy or affected by pathologies. After obtaining two suitable datasets, we proceeded to classify the sounds using two learnable state-of-the-art frontends -- LEAF and nnAudio -- plus a non-learnable baseline frontend, i.e. Mel-filterbanks. The computed features are then fed into two different CNN models, namely VGG16 and EfficientNet. The frontends are carefully benchmarked in terms of the number of parameters, computational resources, and effectiveness. This work demonstrates how integrating learnable frontends into neural audio classification systems may improve performance, especially in the field of medical acoustics. However, such frontends increase the amount of training data required. Consequently, they are useful only when the available training data is large enough to support the feature-learning process.
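To make the non-learnable baseline concrete, a Mel-filterbank frontend amounts to projecting an FFT power spectrum onto a bank of triangular filters spaced linearly on the mel scale. The following is a minimal NumPy sketch of the filterbank construction only; the sample rate, FFT size, and filter count are illustrative assumptions, not the settings used in the paper:

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr=4000, n_fft=512, n_mels=40, fmin=20.0, fmax=None):
    """Build (n_mels, n_fft//2 + 1) triangular mel filters over FFT bins."""
    fmax = fmax or sr / 2.0
    # n_mels + 2 edge points, equally spaced in mel
    mel_pts = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()
print(fb.shape)  # -> (40, 257)
```

Applying `fb @ power_spectrum` (followed by a log compression) yields the fixed features that the learnable frontends replace with trainable parameters.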