We introduce EQUI-VOCAL: a new system that automatically synthesizes queries over videos from limited user interactions. The user provides only a handful of positive and negative examples of what they are looking for. EQUI-VOCAL uses these initial examples, together with additional ones collected through active learning, to efficiently synthesize complex user queries. Our approach enables users to find events without database expertise, with limited labeling effort, and without writing declarative specifications or sketches. Core to EQUI-VOCAL's design are the use of spatio-temporal scene graphs in its data model and query language and a novel query synthesis approach that works on large and noisy video data. Our system outperforms two baseline systems in terms of F1 score, synthesis time, and robustness to noise, and it can flexibly synthesize complex queries that the baselines do not support.
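To make the data model concrete, the following is a minimal Python sketch of evaluating a compositional query over per-frame scene graphs. The predicate names (left_of, near), the object attributes, and the sequencing semantics here are illustrative assumptions, not EQUI-VOCAL's actual query language or API.

```python
# Hypothetical sketch: a video is a sequence of scene graphs (one per frame),
# each holding objects whose spatial relationships can be derived on demand.
# A compositional query is a sequence of per-frame predicates that must hold
# in temporal order.
from dataclasses import dataclass
from math import hypot
from typing import Callable

@dataclass
class Obj:
    oid: int      # object identity, tracked across frames
    color: str    # object attribute
    x: float      # position
    y: float

# One scene graph per frame: the objects present in that frame.
Frame = list[Obj]
Predicate = Callable[[Obj, Obj], bool]

def left_of(a: Obj, b: Obj) -> bool:
    # Spatial relationship derived from object positions.
    return a.x < b.x

def near(a: Obj, b: Obj, thresh: float = 1.0) -> bool:
    return hypot(a.x - b.x, a.y - b.y) <= thresh

def holds(frame: Frame, pred: Predicate) -> bool:
    """True if some pair of distinct objects in the frame satisfies pred."""
    return any(pred(a, b) for a in frame for b in frame if a.oid != b.oid)

def matches_sequence(video: list[Frame], stages: list[Predicate]) -> bool:
    """True if the stage predicates hold in temporal order, possibly with
    gaps: stage i must be satisfied at some frame before stage i+1."""
    i = 0
    for frame in video:
        if i < len(stages) and holds(frame, stages[i]):
            i += 1
    return i == len(stages)

# Example query: "one object is left of another, then they become near."
video = [
    [Obj(0, "red", 0.0, 0.0), Obj(1, "blue", 5.0, 0.0)],  # left_of holds
    [Obj(0, "red", 2.0, 0.0), Obj(1, "blue", 4.0, 0.0)],
    [Obj(0, "red", 3.5, 0.0), Obj(1, "blue", 4.0, 0.0)],  # near holds
]
print(matches_sequence(video, [left_of, near]))  # True
```

Under this (assumed) representation, synthesis amounts to searching over compositions of such predicates and sequencing operators for a query consistent with the user's positive and negative examples; active learning then selects additional video segments for the user to label where that search is most uncertain.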