Obtaining viewer responses to videos can be useful for creators and streaming platforms to analyze video performance and improve the future user experience. In this report, we present our method for the 2021 Evoked Expression from Videos Challenge. In particular, our model uses both audio and image modalities as inputs to predict viewers' emotion changes. To model long-range emotion changes, we use a GRU-based model to predict a sparse signal at 1 Hz. We observe that the emotion changes are smooth; therefore, the final dense prediction is obtained by linearly interpolating the sparse signal, which is robust to prediction fluctuation. Albeit simple, the proposed method achieves a Pearson's correlation score of 0.04430 on the final private test set.
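The following is a minimal sketch (not the authors' released code) of the described pipeline: a GRU consumes per-second audio/image features, emits one prediction per second (1 Hz), and the dense output is obtained by linear interpolation. The feature dimension, hidden size, number of emotion classes, and the 6 Hz target rate are assumptions for illustration only.

```python
import numpy as np
import torch
import torch.nn as nn

class SparseEmotionGRU(nn.Module):
    """GRU over per-second fused audio+image features, one output per second."""
    def __init__(self, feat_dim=2048, hidden=256, n_emotions=15):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, feats):            # feats: (batch, seconds, feat_dim)
        out, _ = self.gru(feats)         # one hidden state per second
        return self.head(out)            # sparse predictions at 1 Hz

def densify(sparse_pred, target_hz=6):
    """Linearly interpolate 1 Hz predictions onto a denser timeline."""
    seconds = sparse_pred.shape[0]
    t_sparse = np.arange(seconds)                            # 1 Hz time stamps
    t_dense = np.arange(0, seconds - 1 + 1e-9, 1.0 / target_hz)
    return np.stack(
        [np.interp(t_dense, t_sparse, sparse_pred[:, k])
         for k in range(sparse_pred.shape[1])],
        axis=1,
    )

# Example: 60 seconds of fused features -> dense per-frame predictions.
model = SparseEmotionGRU()
feats = torch.randn(1, 60, 2048)
sparse = model(feats).squeeze(0).detach().numpy()            # (60, 15) at 1 Hz
dense = densify(sparse, target_hz=6)                         # (355, 15) at ~6 Hz
```

Because the interpolated output varies smoothly between the 1 Hz anchor points, isolated fluctuations in individual sparse predictions have a limited effect on the dense curve, which is the robustness property noted above.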