The dominant paradigm for Audio-Text Retrieval (ATR) relies on mini-batch contrastive learning. This process, however, is inherently limited by what we formalize as the Gradient Locality Bottleneck (GLB), which structurally prevents models from leveraging out-of-batch knowledge and thus impairs fine-grained and long-tail learning. While external knowledge-enhanced methods can alleviate the GLB, we identify a critical, unaddressed side effect: the Representation-Drift Mismatch (RDM), in which a static knowledge base becomes progressively misaligned with the evolving model, turning guidance into noise. To address this dual challenge, we propose the Adaptive Self-improving Knowledge (ASK) framework, a model-agnostic, plug-and-play solution. ASK breaks the GLB via multi-grained knowledge injection, systematically mitigates RDM through dynamic knowledge refinement, and introduces a novel adaptive reliability weighting scheme to ensure that only consistent knowledge contributes to optimization. Experimental results on two benchmark datasets demonstrate superior, state-of-the-art performance, confirming the efficacy of the proposed ASK framework.