In Defense of Structural Symbolic Representation for Video Event-Relation Prediction

Understanding event relationships in videos requires a model to understand the underlying structure of events (i.e., the event type, the associated argument roles, and the corresponding entities) and to have factual knowledge for reasoning. Structural symbolic representation (SSR) based methods take event types and the associated argument roles/entities directly as inputs for reasoning. However, the state-of-the-art video event-relation prediction system shows the necessity of using continuous feature vectors from input videos; existing methods based solely on SSR inputs fail completely, even when given oracle event types and argument roles. In this paper, we conduct an extensive empirical analysis to answer the following questions: 1) why SSR-based methods failed; 2) how to properly understand the evaluation setting of video event-relation prediction; 3) how to uncover the potential of SSR-based methods. We first identify suboptimal training settings as the cause of the failure of previous SSR-based video event prediction models. Then, through qualitative and quantitative analysis, we show that evaluation that takes only video as input is currently infeasible, and that accurate evaluation relies on oracle event information. Based on these findings, we propose to further contextualize the SSR-based model into an Event-Sequence Model and to equip it with more factual knowledge through a simple yet effective method: reformulating external visual commonsense knowledge bases into an event-relation prediction pretraining dataset. The resulting model sets a new state of the art, with a 25% boost in Macro-accuracy.
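To make the notion of an SSR input concrete, the following is a minimal sketch of how an event (its type plus argument roles and entities) might be linearized into a token sequence for a text-based model. The `Event` class, the bracketed marker tokens, the `[SEP]` separator, and the example roles are all hypothetical choices made for illustration; the abstract does not specify the actual input format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Event:
    """Hypothetical container for one event's structural symbolic representation."""
    event_type: str
    # Maps an argument role (e.g. "agent") to the entity filling it.
    arguments: Dict[str, str] = field(default_factory=dict)


def linearize(event: Event) -> str:
    """Flatten an event into a single string of marker tokens and symbols."""
    parts = [f"[EVENT] {event.event_type}"]
    for role, entity in event.arguments.items():
        parts.append(f"[{role.upper()}] {entity}")
    return " ".join(parts)


def build_pair_input(first: Event, second: Event, sep: str = " [SEP] ") -> str:
    """Join two linearized events into one input for relation prediction."""
    return linearize(first) + sep + linearize(second)


# Illustrative events; the types, roles, and entities are invented examples.
chase = Event("chase", {"agent": "police officer", "patient": "suspect"})
arrest = Event("arrest", {"agent": "police officer", "patient": "suspect"})
pair_text = build_pair_input(chase, arrest)
```

A sequence of more than two events could be joined the same way, which is one plausible reading of contextualizing the input as an event sequence, though the exact construction is an assumption here.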
