ASK: Adaptive Self-improving Knowledge Framework for Audio Text Retrieval

The dominant paradigm for Audio-Text Retrieval (ATR) relies on mini-batch-based contrastive learning. This process, however, is inherently limited by what we formalize as the Gradient Locality Bottleneck (GLB), which structurally prevents models from leveraging out-of-batch knowledge and thus impairs fine-grained and long-tail learning. While external knowledge-enhanced methods can alleviate the GLB, we identify a critical, unaddressed side effect: the Representation-Drift Mismatch (RDM), where a static knowledge base becomes progressively misaligned with the evolving model, turning guidance into noise. To address this dual challenge, we propose the Adaptive Self-improving Knowledge (ASK) framework, a model-agnostic, plug-and-play solution. ASK breaks the GLB via multi-grained knowledge injection, systematically mitigates RDM through dynamic knowledge refinement, and introduces a novel adaptive reliability weighting scheme to ensure that consistent knowledge contributes to optimization. Experiments on two benchmark datasets show state-of-the-art performance, supporting the efficacy of the proposed ASK framework.
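
To make the batch-local objective concrete, the following is a minimal sketch of a standard symmetric InfoNCE loss over one mini-batch, written in pure Python. It is a generic illustration of mini-batch contrastive learning, not the ASK method itself; the function names and the temperature value of 0.07 are assumptions chosen for the example. The similarity matrix is built only from the items passed in, so any audio clip or caption outside the batch cannot influence the loss or its gradient, which is the situation the GLB describes.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def log_softmax_row(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def batch_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over one mini-batch (hypothetical helper).

    Pair i is (audio_emb[i], text_emb[i]); every other in-batch item
    acts as a negative. Nothing outside the batch enters the loss.
    """
    a = [normalize(v) for v in audio_emb]
    t = [normalize(v) for v in text_emb]
    n = len(a)
    # n x n cosine-similarity logits, restricted to in-batch items.
    sim = [[dot(a[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Audio-to-text direction: each row is one audio query.
    a2t = -sum(log_softmax_row(sim[i])[i] for i in range(n)) / n
    # Text-to-audio direction: transpose so each row is one text query.
    sim_t = [[sim[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax_row(sim_t[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]
    text = [[1.0, 0.0], [0.0, 1.0]]
    print(batch_contrastive_loss(audio, text))
```

Because the loss is a function of the batch alone, an out-of-batch caption that closely matches an in-batch audio clip contributes nothing to the update, regardless of how informative it would be as a negative or as auxiliary knowledge.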