Multimedia event detection is the task of detecting a specific event of interest in user-generated videos on websites. The most fundamental challenges of this task are the enormous variation in video quality and the inherently high-level semantic abstraction of events. In this paper, we decompose each video into several segments and model complex event detection as a multiple instance learning problem, representing each video as a "bag" of segments in which each segment is an instance. Instead of treating the instances equally, we associate each instance with a reliability variable that indicates its importance, and then select reliable instances for training. To measure the reliability of the varying instances precisely, we propose a visual-semantic guided loss that combines low-level features from visual information with high-level semantic features based on instance-event similarity. Motivated by curriculum learning, we introduce a negative elastic-net regularization term so that training of the classifier starts with instances of high reliability and gradually takes instances with relatively low reliability into consideration. An alternating optimization algorithm is developed to solve the resulting non-convex, non-smooth problem. Experimental results on standard datasets, i.e., TRECVID MEDTest 2013 and TRECVID MEDTest 2014, demonstrate the effectiveness of the proposed method and its superiority over the baseline algorithms.
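The easy-to-hard training schedule with per-instance reliability variables can be illustrated with a minimal self-paced learning sketch. This is not the method proposed in the paper: the abstract does not give the form of the visual-semantic guided loss or of the negative elastic-net term, so the sketch substitutes a plain squared-error loss, a ridge-regularized linear classifier, and the simple hard-threshold self-paced rule (keep an instance only if its loss is below a threshold `lam`). It also assumes each segment inherits the label of its video. All function names and parameters (`fit_weighted_ridge`, `update_reliability`, `self_paced_fit`, `lam`, `growth`, `ridge`) are hypothetical.

```python
import numpy as np


def fit_weighted_ridge(X, y, v, ridge=1e-2):
    """Solve min_w sum_i v_i * (y_i - x_i . w)^2 + ridge * ||w||^2 in closed form."""
    d = X.shape[1]
    A = X.T @ (v[:, None] * X) + ridge * np.eye(d)
    b = X.T @ (v * y)
    return np.linalg.solve(A, b)


def update_reliability(losses, lam):
    """Hard-threshold rule (an assumption, not the paper's regularizer):
    v_i = 1 if the instance loss is below lam, else 0."""
    return (losses < lam).astype(float)


def self_paced_fit(X, y, lam=0.5, growth=1.5, n_rounds=5, ridge=1e-2):
    """Alternate between updating reliabilities v and classifier weights w.

    X: (n_instances, n_features) segment features.
    y: (n_instances,) labels in {-1, +1}, assumed inherited from the video.
    """
    v = np.ones(X.shape[0])                 # initially trust every instance
    w = fit_weighted_ridge(X, y, v, ridge)
    for _ in range(n_rounds):
        losses = (y - X @ w) ** 2           # per-instance loss with w fixed
        v = update_reliability(losses, lam)  # v-step: select reliable instances
        w = fit_weighted_ridge(X, y, v, ridge)  # w-step: refit on selected ones
        lam *= growth                       # admit harder instances next round
    return w, v
```

Each round fixes one block of variables and solves for the other, and the threshold `lam` grows so that instances with larger loss are admitted later. A soft reliability in [0, 1] would require a different regularizer than the one assumed here.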