Deep learning has shown remarkable progress on a wide range of problems. However, efficient training of such models requires large-scale datasets, and obtaining annotations for them can be challenging and costly. In this work, we explore the use of freely available, user-generated labels from web videos for video understanding. We create a benchmark dataset consisting of around 2 million videos with associated user-generated annotations and other meta information. We use the collected dataset for action classification and demonstrate its usefulness on the existing small-scale annotated datasets UCF101 and HMDB51. We study different loss functions and two pretraining strategies: simple pretraining and self-supervised learning. We also show how a network pretrained on the proposed dataset improves robustness to video corruption and label noise in downstream datasets. We present this as a benchmark dataset for noisy learning in video understanding. The dataset, code, and trained models will be made publicly available for future research.