Learning spatiotemporal features is an important task for efficient video understanding, especially in medical imaging such as echocardiography. Convolutional neural networks (CNNs) and, more recently, vision transformers (ViTs) are the most commonly used methods, each with its own limitations. CNNs are good at capturing local context but fail to learn global information across video frames. Vision transformers, on the other hand, can incorporate global details and long sequences but are computationally expensive and typically require more data to train. In this paper, we propose a method that addresses the limitations typically faced when training on medical video data such as echocardiographic scans. The proposed algorithm (EchoCoTr) combines the strengths of CNNs and vision transformers to estimate the left ventricular ejection fraction (LVEF) from ultrasound videos. We demonstrate that the proposed method outperforms state-of-the-art work to date on the EchoNet-Dynamic dataset, with an MAE of 3.95 and an $R^2$ of 0.82. These results show a noticeable improvement over all previously published research. In addition, we present extensive ablations and comparisons with several algorithms, including ViT and BERT. The code is available at https://github.com/BioMedIA-MBZUAI/EchoCoTr.
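To make the CNN-plus-transformer idea concrete, the sketch below shows one possible hybrid for video-based LVEF regression: a CNN backbone extracts per-frame local features, and a transformer encoder models global context across frames before a regression head predicts a single EF value. This is a minimal illustrative sketch and not the authors' exact EchoCoTr architecture; the ResNet-18 backbone, dimensions, and module names are assumptions.

```python
# Minimal sketch of a CNN + transformer hybrid for LVEF regression.
# NOT the authors' exact EchoCoTr model; backbone choice, widths, and
# frame count are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18


class HybridEFRegressor(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4, n_frames=36):
        super().__init__()
        backbone = resnet18(weights=None)   # per-frame local feature extractor (CNN)
        backbone.fc = nn.Identity()         # drop classifier head -> 512-d frame features
        self.cnn = backbone
        self.proj = nn.Linear(512, d_model)  # project CNN features to transformer width
        self.pos = nn.Parameter(torch.zeros(1, n_frames, d_model))  # learned temporal positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)  # global context across frames
        self.head = nn.Linear(d_model, 1)   # regress a single ejection-fraction value

    def forward(self, video):               # video: (B, T, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))          # (B*T, 512) frame-wise CNN features
        tokens = self.proj(feats).view(b, t, -1) + self.pos[:, :t]
        tokens = self.temporal(tokens)                 # attend across the whole clip
        return self.head(tokens.mean(dim=1)).squeeze(-1)  # (B,) predicted LVEF


# Example usage on a random clip of 36 frames at 112x112 resolution:
# ef = HybridEFRegressor()(torch.randn(2, 36, 3, 112, 112))
```

The key design point this sketch illustrates is the division of labour: the CNN handles local spatial structure cheaply per frame, while attention is applied only over the much shorter sequence of frame tokens, keeping the transformer's cost manageable for medical video data.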