This paper presents an investigation into long-tail video recognition. We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties. Most critically, they lack few-shot classes in their tails. In response, we propose new video benchmarks that better assess long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT. We then propose Long-Tail Mixed Reconstruction (LMR), a method that reduces overfitting to instances from few-shot classes by reconstructing them as weighted combinations of samples from head classes. LMR then employs label mixing to learn robust decision boundaries. It achieves state-of-the-art average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and VideoLT-LT. Benchmarks and code are available at: tobyperrett.github.io/lmr
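To make the core idea concrete, below is a minimal NumPy sketch of reconstructing a few-shot (tail) instance as a weighted combination of head-class samples, with the training label mixed in proportion to the same weights. This is an illustrative assumption, not the paper's implementation: the similarity-softmax weighting, the blending coefficient `alpha`, and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 8-dim features, 5 head-class samples, 6 classes total.
head_feats = rng.normal(size=(5, 8))       # features of head-class samples
head_labels = np.eye(6)[[0, 1, 2, 3, 4]]   # their one-hot labels

tail_feat = rng.normal(size=8)             # a few-shot (tail) instance
tail_label = np.eye(6)[5]                  # its one-hot label (class 5)

# Weights from similarity between the tail instance and each head sample
# (softmax over dot-product similarities; an assumed weighting scheme).
sims = head_feats @ tail_feat
w = np.exp(sims - sims.max())
w /= w.sum()

# Reconstruct the tail feature as a weighted combination of head features,
# blended with the original (alpha controls reconstruction strength).
alpha = 0.5
recon_feat = alpha * tail_feat + (1 - alpha) * (w @ head_feats)

# Label mixing: the training target mixes the tail label with head labels
# using the same reconstruction weights, so the target stays a valid
# probability distribution.
mixed_label = alpha * tail_label + (1 - alpha) * (w @ head_labels)
```

Training on `(recon_feat, mixed_label)` pairs exposes the classifier to tail instances that borrow structure from well-populated head classes, which is the mechanism the abstract describes for reducing overfitting to scarce tail samples.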