Modern streaming services increasingly label videos based on their visual or audio content. This is typically supported by AI and ML technologies that allow users to search by keywords and video descriptions using natural language. Prior research has provided a number of successful speech-to-text solutions for human speech; this article instead investigates possible solutions for retrieving sound events based on a natural language query and estimates how effective and accurate they are. In this study, we focus specifically on the YamNet, AlexNet, and ResNet-50 pre-trained models, which automatically classify audio samples into a number of predefined classes using their respective mel-spectrograms. The predefined classes can represent sounds associated with actions within a video fragment. Two tests are conducted to evaluate the performance of the models on two separate problems: audio classification and interval retrieval based on a natural language query. The results show that the benchmarked models are comparable in performance, with YamNet slightly outperforming the other two. YamNet classified single fixed-size audio samples with 92.7% accuracy and 68.75% precision, while its average accuracy on interval retrieval was 71.62% with a precision of 41.95%. The investigated method may be embedded into an automated event-marking architecture for streaming services.
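The abstract does not include code, but a minimal sketch of the interval-retrieval task might look like the following, assuming the publicly available TensorFlow Hub release of YamNet (which accepts a raw 16 kHz mono waveform and computes mel-spectrogram features internally). The `retrieve_intervals` helper, the substring-based query matching, and the 0.3 score threshold are illustrative assumptions for this sketch, not the authors' method.

```python
import csv
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained YamNet model from TensorFlow Hub.
yamnet = hub.load('https://tfhub.dev/google/yamnet/1')

# Load the 521 AudioSet class names shipped with the model.
class_map = yamnet.class_map_path().numpy().decode('utf-8')
with tf.io.gfile.GFile(class_map) as f:
    class_names = [row['display_name'] for row in csv.DictReader(f)]

def retrieve_intervals(waveform, query, threshold=0.3):
    """Return (start_s, end_s) intervals whose class scores match the query.

    waveform: mono float32 audio at 16 kHz, values in [-1.0, 1.0].
    query: a natural-language keyword, matched naively against class names.
    """
    scores, _embeddings, _spectrogram = yamnet(waveform)
    scores = scores.numpy()  # shape: (num_frames, 521)
    # YamNet scores one 0.96 s window every 0.48 s.
    hop_s, win_s = 0.48, 0.96
    # Illustrative query-to-class mapping: substring match on class names.
    matching = [i for i, name in enumerate(class_names)
                if query.lower() in name.lower()]
    intervals = []
    for frame, frame_scores in enumerate(scores):
        if max((frame_scores[i] for i in matching), default=0.0) >= threshold:
            start = frame * hop_s
            intervals.append((start, start + win_s))
    return intervals

# Example: look for "dog" sounds in one second of placeholder audio.
audio = np.zeros(16000, dtype=np.float32)
print(retrieve_intervals(audio, 'dog'))
```

In a production pipeline, the substring match would be replaced by a more robust mapping from free-form queries to the predefined classes, and overlapping frame hits would typically be merged into contiguous intervals.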