Physical computing infrastructure, data gathering, and algorithms have recently seen significant advances in extracting information from images and videos. Progress has been especially strong in image captioning and video captioning. However, most advances in video captioning still concern short videos. In this research, we caption longer videos using only keyframes, a small subset of the total video frames. Instead of processing thousands of frames, only a few are processed, depending on the number of keyframes. This creates a trade-off between the number of frames computed and the speed of the captioning process; our approach allows the user to specify the balance between execution time and accuracy. In addition, we argue that linking images, videos, and natural language offers many practical benefits and immediate applications. From a modeling perspective, instead of designing and staging explicit algorithms to process videos and generate captions in complex processing pipelines, our contribution lies in designing hybrid deep learning architectures that apply to long videos by captioning their keyframes. We regard the technology and methodology developed here as steps toward the applications discussed in this research.
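As a minimal illustration of the idea, the sketch below selects keyframes by comparing each frame against the last selected one and keeping only frames that differ enough, with a user-supplied cap on the number of keyframes standing in for the time/accuracy trade-off. This is a hypothetical sketch, not the paper's actual keyframe-extraction method; the function name, `diff_threshold` parameter, and the mean-absolute-difference criterion are all assumptions for illustration.

```python
import numpy as np

def select_keyframes(frames, max_keyframes, diff_threshold=10.0):
    """Pick a small subset of frame indices whose content differs from the
    last selected keyframe by more than `diff_threshold` (mean absolute
    pixel difference). `max_keyframes` caps how many frames are kept,
    letting the caller trade captioning accuracy for speed.
    Illustrative only -- not the method used in the paper."""
    if not frames:
        return []
    keyframes = [0]  # always keep the first frame
    last = frames[0].astype(np.float32)
    for i in range(1, len(frames)):
        if len(keyframes) >= max_keyframes:
            break
        cur = frames[i].astype(np.float32)
        if np.mean(np.abs(cur - last)) > diff_threshold:
            keyframes.append(i)
            last = cur
    return keyframes

# Synthetic "video": three static scenes of 10 identical frames each.
scene_a = [np.zeros((4, 4), dtype=np.uint8)] * 10
scene_b = [np.full((4, 4), 100, dtype=np.uint8)] * 10
scene_c = [np.full((4, 4), 200, dtype=np.uint8)] * 10
video = scene_a + scene_b + scene_c

print(select_keyframes(video, max_keyframes=5))  # → [0, 10, 20]
```

Only the three scene-change frames are kept, so a downstream captioning model would process 3 frames instead of 30; raising `max_keyframes` or lowering `diff_threshold` shifts the trade-off toward accuracy at the cost of more computation.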