There is a growing trend of placing video advertisements on social platforms for online marketing, which demands automatic approaches to understand the content of advertisements effectively. Taking the 2021 TAAC competition as an opportunity, we developed a multimodal system to improve the structured analysis of advertising video content. In our framework, we break down the video structuring problem into two tasks, i.e., scene segmentation and multimodal tagging. In scene segmentation, we build upon a temporal convolution module for temporal modeling to predict whether adjacent frames belong to the same scene. In multimodal tagging, we first compute clip-level visual features by aggregating frame-level features with NeXt-SoftDBoF. The visual features are further complemented with textual features derived using a global-local attention mechanism, which extracts useful information from OCR (Optical Character Recognition) and ASR (Automatic Speech Recognition) outputs. Our solution achieved a score of 0.2470, a metric that jointly accounts for localization and prediction accuracy, ranking fourth on the 2021 TAAC final leaderboard.
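To illustrate the scene segmentation idea, the sketch below predicts scene boundaries by checking whether adjacent frames look alike. It is a simplified stand-in for the paper's temporal convolution module, not the authors' implementation: adjacent-frame cosine similarity is smoothed with a small hand-picked 1-D convolution kernel, and positions where similarity drops below a hypothetical threshold are treated as boundaries between scenes. All names, the kernel, and the threshold are illustrative assumptions.

```python
import numpy as np

def scene_boundaries(frame_feats, kernel=np.array([0.25, 0.5, 0.25]), threshold=0.6):
    """Sketch of boundary prediction from frame-level features.

    frame_feats: (num_frames, dim) array of per-frame visual features.
    Returns indices of frames that start a new scene. The kernel and
    threshold are illustrative choices, not values from the paper.
    """
    # L2-normalize so the dot product of adjacent rows is cosine similarity.
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    sim = np.sum(feats[:-1] * feats[1:], axis=1)       # similarity of each adjacent pair
    smoothed = np.convolve(sim, kernel, mode="same")   # 1-D temporal smoothing
    # A low smoothed similarity between frames i and i+1 marks frame i+1
    # as the start of a new scene.
    return np.where(smoothed < threshold)[0] + 1
```

On a toy clip of ten frames where the first five share one feature vector and the last five share an orthogonal one, the similarity dips only at the transition, so frame 5 is reported as the start of a new scene. The real system replaces the fixed kernel with learned temporal convolutions and a classifier over richer multimodal features.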