Convolutional neural networks (CNNs) and transformers are two of the most widely used model families in computer vision. However, CNNs mainly capture local features, whereas transformers mainly capture global features. To address the performance limitation caused by this incomplete feature coverage, we develop CECT, a novel classification network built as a controllable ensemble of a CNN and a transformer. CECT is composed of a convolutional encoder block, a transposed-convolutional decoder block, and a transformer classification block. Unlike existing methods, CECT captures features at both multiple local scales and the global scale without bells and whistles. Moreover, the contribution of local features at different scales can be controlled through the proposed ensemble coefficients. We evaluate CECT on two public COVID-19 datasets, where it outperforms existing state-of-the-art methods. Given its strong feature capture ability, we believe CECT can be extended to other medical image classification scenarios as a diagnostic assistant. Code is available at https://github.com/NUS-Tim/CECT.
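The role of the ensemble coefficients can be illustrated with a minimal sketch. It assumes that the local feature maps from different scales have already been brought to a common shape (for instance by the decoder) and that the coefficients act as scalar weights in a sum. Both points are assumptions made for illustration. The abstract does not specify the fusion rule, and the name `ensemble_features` is hypothetical rather than taken from the CECT repository.

```python
from typing import List, Sequence

FeatureMap = List[List[float]]  # a single-channel 2D feature map


def ensemble_features(
    feature_maps: Sequence[FeatureMap],
    coefficients: Sequence[float],
) -> FeatureMap:
    """Fuse same-shaped feature maps as a coefficient-weighted sum.

    Assumption: one scalar coefficient per scale; this is an
    illustrative fusion rule, not the confirmed CECT formulation.
    """
    if not feature_maps:
        raise ValueError("at least one feature map is required")
    if len(feature_maps) != len(coefficients):
        raise ValueError("need exactly one coefficient per feature map")

    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    for fm in feature_maps:
        if len(fm) != rows or any(len(row) != cols for row in fm):
            raise ValueError("all feature maps must share the same shape")

    # Accumulate each scale's contribution, scaled by its coefficient.
    fused = [[0.0] * cols for _ in range(rows)]
    for fm, coeff in zip(feature_maps, coefficients):
        for i in range(rows):
            for j in range(cols):
                fused[i][j] += coeff * fm[i][j]
    return fused


# Two hypothetical scales: a fine-scale map and a coarse-scale map.
fine = [[1.0, 2.0], [3.0, 4.0]]
coarse = [[10.0, 10.0], [10.0, 10.0]]
fused = ensemble_features([fine, coarse], [1.0, 0.5])
```

Under this reading, raising or lowering a coefficient changes how strongly that scale's local features contribute to the representation passed on to the transformer classification block, and a coefficient of zero removes the scale entirely.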