Investments in movie production carry a high level of risk because movie revenues follow long-tailed, bimodal distributions. Accurate prediction of box-office revenue may mitigate this uncertainty and encourage investment. However, learning effective representations for actors, directors, and user-generated content-related keywords remains a challenging open problem. In this work, we investigate the effects of self-supervised pretraining and propose visual grounding of content keywords in objects from movie posters as a pretraining objective. Experiments on a large dataset of 35,794 movies demonstrate significant benefits of self-supervised training and visual grounding. In particular, visual grounding pretraining substantially improves learning on movies with content keywords and achieves 14.5% relative performance gains compared to a finetuned BERT model with an identical architecture.