Manual annotation of multimedia data is time-consuming and costly, while reliable automatic generation of semantic metadata remains a major challenge. We propose a framework to extract semantic metadata from automatically generated video captions. As metadata, we consider entities, the entities' properties, relations between entities, and the video category. We employ two state-of-the-art dense video captioning models, one based on a masked transformer (MT) and one based on parallel decoding (PVDC), to generate captions for videos of the ActivityNet Captions dataset. Our experiments show that entities, their properties, relations between entities, and the video category can be extracted from the generated captions. We observe that the quality of the extracted information is mainly influenced by the quality of event localization in the video and by the performance of event caption generation.
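The abstract does not specify how entities, properties, and relations are obtained from the generated captions. As a minimal, hypothetical sketch (not the authors' method), the following Python snippet illustrates one common way to derive such metadata from a single caption using spaCy's dependency parse; the model name and dependency labels are assumptions for the sake of the example.

```python
# Hypothetical illustration: extract entities, properties (adjectival
# modifiers), and simple subject-verb-object relations from one caption.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed small English pipeline

caption = "A young woman is riding a brown horse on the beach."
doc = nlp(caption)

# Entities: heads of the noun chunks in the caption
entities = [chunk.root.text for chunk in doc.noun_chunks]

# Properties: adjectival modifiers attached to each entity head
properties = {tok.head.text: tok.text for tok in doc if tok.dep_ == "amod"}

# Relations: subject-verb-object triples read off the dependency parse
relations = []
for tok in doc:
    if tok.dep_ == "nsubj":
        verb = tok.head
        for obj in (c for c in verb.children if c.dep_ in ("dobj", "obj")):
            relations.append((tok.text, verb.lemma_, obj.text))

print(entities)    # ['woman', 'horse', 'beach']
print(properties)  # {'woman': 'young', 'horse': 'brown'}
print(relations)   # [('woman', 'ride', 'horse')]
```

In practice, such per-caption extractions would be aggregated over all event captions of a video, and the video category could be predicted separately (e.g., by a text classifier over the concatenated captions); both steps are assumptions here rather than details stated in the abstract.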