Batch Normalization's (BN) unique property of depending on other samples in a batch is known to cause problems in several tasks, including sequential modeling. Yet, BN-related issues are hardly studied for long video understanding, despite the ubiquitous use of BN in CNNs for feature extraction. Especially in surgical workflow analysis, where the lack of pretrained feature extractors has led to complex, multi-stage training pipelines, limited awareness of BN issues may have hidden the benefits of training CNNs and temporal models end to end. In this paper, we present and analyze known as well as novel pitfalls of BN in video learning, including issues specific to online tasks such as a 'cheating' effect in anticipation. We observe that BN's properties create major obstacles for end-to-end learning. However, with BN-free backbones, even simple CNN-LSTMs beat the state of the art in two surgical tasks when trained with adequate end-to-end strategies that maximize temporal context. We conclude that awareness of BN's pitfalls is crucial for effective end-to-end learning in surgical tasks. By reproducing results on natural-video datasets, we hope our insights will benefit other areas of video learning as well. Code: \url{https://gitlab.com/nct_tso_public/pitfalls_bn}.