Deep machine learning models, including Convolutional Neural Networks (CNNs), have been successful in detecting Mild Cognitive Impairment (MCI) from medical images, questionnaires, and videos. This paper proposes a novel Multi-branch Classifier-Video Vision Transformer (MC-ViViT) model to distinguish participants with MCI from those with normal cognition by analyzing facial features. The data come from I-CONECT, a behavioral intervention trial aimed at improving cognitive function through frequent video chats. MC-ViViT extracts spatiotemporal features of the videos in one branch and augments these representations through the MC module. The I-CONECT dataset is challenging because it is imbalanced, containing Hard-Easy and Positive-Negative samples, which impedes the performance of MC-ViViT. We propose a loss function for Hard-Easy and Positive-Negative samples (HP Loss), which combines Focal loss and AD-CORRE loss to address this imbalance. Our experimental results on the I-CONECT dataset show the great potential of MC-ViViT for predicting MCI, reaching an accuracy of 90.63\% on some of the interview videos.
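As a rough illustration of how such a composite objective is typically assembled (the additive form and the weight $\lambda$ here are expository assumptions, not values reported above), HP Loss can be sketched as
$$\mathcal{L}_{\mathrm{HP}} = \mathcal{L}_{\mathrm{Focal}} + \lambda \, \mathcal{L}_{\mathrm{AD\text{-}CORRE}},$$
where the standard Focal loss, $\mathcal{L}_{\mathrm{Focal}} = -\alpha \, (1 - p_t)^{\gamma} \log p_t$ with $p_t$ the predicted probability of the true class, down-weights well-classified (easy) samples so training concentrates on hard ones, and the AD-CORRE term regularizes the feature embeddings by encouraging higher intra-class and lower inter-class correlation.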