The goal of this work is to understand how actions are performed in videos. That is, given a video, we aim to predict an adverb indicating a modification applied to the action (e.g. cut "finely"). We cast this problem as a regression task. We measure textual relationships between verbs and adverbs to generate a regression target representing the action change we aim to learn. We test our approach on a range of datasets and achieve state-of-the-art results on both adverb prediction and antonym classification. Furthermore, we outperform previous work when we lift two commonly assumed conditions: the availability of action labels during testing and the pairing of adverbs as antonyms. Existing datasets for adverb recognition are either noisy, which makes learning difficult, or contain actions whose appearance is not influenced by adverbs, which makes evaluation less reliable. To address this, we collect a new high-quality dataset: Adverbs in Recipes (AIR). We focus on instructional recipe videos, curating a set of actions that exhibit meaningful visual changes when performed differently. Videos in AIR are more tightly trimmed and were manually reviewed by multiple annotators to ensure high labelling quality. Results show that models learn better from AIR thanks to its cleaner videos. At the same time, adverb prediction on AIR remains challenging, demonstrating that there is considerable room for improvement.