We aim to understand how actions are performed and to identify subtle differences, such as 'fold firmly' vs. 'fold gently'. To this end, we propose a method that recognizes adverbs across different actions. However, such fine-grained annotations are difficult to obtain, and their long-tailed nature makes it challenging to recognize adverbs in rare action-adverb compositions. Our approach therefore uses semi-supervised learning with multiple adverb pseudo-labels to leverage videos with only action labels. Combined with adaptive thresholding of these pseudo-adverbs, this lets us make efficient use of the available data while tackling the long-tailed distribution. Additionally, we gather adverb annotations for three existing video retrieval datasets, which allows us to introduce the new tasks of recognizing adverbs in unseen action-adverb compositions and unseen domains. Experiments demonstrate the effectiveness of our method, which outperforms both prior work on adverb recognition and semi-supervised methods adapted for adverb recognition. We also show how adverbs can relate fine-grained actions.
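The abstract does not spell out how the multiple adverb pseudo-labels or the adaptive thresholds are computed, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes a model has already produced per-adverb confidence scores for videos that carry only action labels. The function names (`adaptive_thresholds`, `select_pseudo_adverbs`), the base threshold `base_tau`, and the rule for lowering thresholds on rarely predicted adverbs are all hypothetical choices made for illustration.

```python
import numpy as np


def adaptive_thresholds(probs, base_tau=0.8):
    """Hypothetical per-adverb thresholds (an assumption, not the paper's rule).

    probs: array of shape (num_videos, num_adverbs) with scores in [0, 1].
    Adverbs that are rarely predicted with high confidence get a lower bar,
    so tail adverbs can still receive pseudo-labels.
    """
    # Count, per adverb, how many videos pass the fixed base threshold.
    counts = (probs >= base_tau).sum(axis=0).astype(float)
    # Normalise by the best-covered adverb so the ratio lies in [0, 1].
    ratio = counts / max(counts.max(), 1.0)
    # Well-covered adverbs keep base_tau; uncovered ones drop to base_tau / 2.
    return base_tau * (0.5 + 0.5 * ratio)


def select_pseudo_adverbs(probs, thresholds):
    """Boolean mask of shape (num_videos, num_adverbs).

    A video may receive several pseudo-adverbs at once, one for every
    adverb whose score reaches that adverb's own threshold.
    """
    return probs >= thresholds


# Toy scores for 3 videos and 3 adverbs (made-up numbers for illustration).
probs = np.array([
    [0.90, 0.50, 0.10],
    [0.85, 0.45, 0.20],
    [0.95, 0.30, 0.60],
])
tau = adaptive_thresholds(probs, base_tau=0.8)
mask = select_pseudo_adverbs(probs, tau)
```

In this toy example only the first adverb ever passes the fixed threshold, so its threshold stays at the base value while the other two are halved. This lets the less confident adverbs contribute pseudo-labels and allows one video to carry more than one pseudo-adverb.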