Video captioning aims to generate natural language sentences that accurately describe a given video. Existing methods improve generation by exploring richer visual representations in the encoding phase or by strengthening the decoder. However, the long-tailed problem hinders these efforts for low-frequency tokens, which rarely occur but carry critical semantics and play a vital role in detailed generation. In this paper, we introduce a novel Refined Semantic enhancement method towards Frequency Diffusion (RSFD), a captioning model that constantly perceives the linguistic representation of infrequent tokens. Concretely, a Frequency-Aware Diffusion (FAD) module is proposed to comprehend the semantics of low-frequency tokens and break through generation limitations. In this way, the caption is refined by promoting the absorption of tokens with insufficient occurrence. Based on FAD, we design a Divergent Semantic Supervisor (DSS) module to compensate for the information loss of high-frequency tokens brought by the diffusion process, where the semantics of low-frequency tokens is further emphasized to alleviate the long-tailed problem. Extensive experiments indicate that RSFD outperforms state-of-the-art methods on two benchmark datasets, MSR-VTT and MSVD, demonstrating that enhancing the semantics of low-frequency tokens yields competitive generation quality. Code is available at https://github.com/lzp870/RSFD.
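To make the long-tailed problem concrete, the following minimal sketch counts token occurrences over a toy caption set and flags the rare ones. It is only an illustration of what "low-frequency tokens" means, not the FAD or DSS module: the whitespace tokenization, the function names `token_frequencies` and `low_frequency_tokens`, the `max_count` threshold, and the example captions are all assumptions introduced here.

```python
from collections import Counter


def token_frequencies(captions):
    """Count occurrences of each lowercased, whitespace-separated token."""
    counts = Counter()
    for caption in captions:
        counts.update(caption.lower().split())
    return counts


def low_frequency_tokens(captions, max_count=1):
    """Return tokens that occur at most `max_count` times.

    `max_count` is a hypothetical threshold chosen for illustration.
    """
    counts = token_frequencies(captions)
    return {token for token, n in counts.items() if n <= max_count}


# Toy captions (invented for this example, not taken from MSR-VTT or MSVD).
captions = [
    "a man is playing a guitar",
    "a man is singing",
    "a woman is playing a violin",
]

rare = low_frequency_tokens(captions, max_count=1)
```

In this toy set, function words such as "a" and "is" dominate the counts, while content words such as "guitar" and "violin" appear once each; these rare content words are the kind of tokens that carry the detailed semantics.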