In this technical report, we introduce TempT, a novel method for test-time adaptation on videos that enforces temporal coherence of predictions across sequential frames. TempT is a broadly applicable tool for computer vision tasks, including facial expression recognition (FER) in videos. We evaluate TempT's performance on the AffWild2 dataset as part of the Expression Classification Challenge at the 5th Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Our approach uses only the unimodal visual portion of the data and a popular 2D CNN backbone, in contrast to larger sequential or attention-based models. Our experimental results show that TempT is competitive with performances reported in previous years, and its efficacy provides a compelling proof of concept for its use in a variety of real-world applications.
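The core idea, penalizing frame-to-frame fluctuation of a model's predictions at test time, can be sketched as follows. This is a minimal illustration assuming a squared-difference coherence penalty over per-frame class probabilities; the report's actual loss and adaptation procedure may differ.

```python
import numpy as np

def temporal_coherence_loss(probs):
    """Mean squared difference between consecutive frames' predicted
    class probabilities. `probs` has shape (T, C): T frames, C classes.
    Hypothetical sketch of the temporal-coherence idea; in test-time
    adaptation, this loss would be minimized w.r.t. model parameters
    on the unlabeled test video."""
    diffs = probs[1:] - probs[:-1]  # (T-1, C) consecutive-frame deltas
    return float(np.mean(diffs ** 2))

# Temporally stable predictions incur zero penalty.
stable = np.tile([[0.7, 0.2, 0.1]], (4, 1))

# Flickering predictions (class flips between frames) incur a positive penalty.
flicker = np.array([[0.9, 0.05, 0.05],
                    [0.1, 0.80, 0.10],
                    [0.9, 0.05, 0.05],
                    [0.1, 0.80, 0.10]])

assert temporal_coherence_loss(stable) == 0.0
assert temporal_coherence_loss(flicker) > 0.0
```

During adaptation, this penalty (or a variant of it) would serve as an unsupervised objective: the backbone's predictions on a clip are encouraged to agree across neighboring frames, exploiting the fact that facial expressions change smoothly over short time spans.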