The HLTCOE Evaluation team participated in TREC VQA's Answer Generation (AG) task, for which we developed a listwise learning framework that aims to improve semantic precision and ranking consistency in answer generation. Given a video-question pair, a base multimodal model first generates multiple candidate answers, which are then reranked using a model trained with a novel Masked Pointer Cross-Entropy Loss with Rank Weights. This objective integrates pointer-based candidate selection, rank-dependent weighting, and masked cross-entropy under vocabulary restriction, enabling stable and interpretable listwise optimization. By bridging generative modeling with discriminative ranking, our method produces coherent, fine-grained answer lists. Experiments reveal consistent gains in accuracy and ranking stability, especially for questions requiring temporal reasoning and semantic disambiguation.
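The abstract names the three ingredients of the training objective but does not give its formula, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that at each rank position the model emits one score per candidate (the "pointer" over a vocabulary restricted to candidate indices), that candidates already placed are masked out of the softmax, and that each position's cross-entropy is scaled by a rank-dependent weight. The function names (`masked_log_softmax`, `rank_weights`, `masked_pointer_ce`) and the DCG-style discount `1 / log2(t + 2)` are hypothetical choices made for illustration.

```python
import math


def masked_log_softmax(scores, mask):
    """Log-softmax over entries where mask is True; masked entries get -inf."""
    valid = [s for s, m in zip(scores, mask) if m]
    if not valid:
        raise ValueError("mask excludes every candidate")
    peak = max(valid)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(s - peak) for s in valid))
    return [s - log_z if m else float("-inf") for s, m in zip(scores, mask)]


def rank_weights(n):
    """Assumed DCG-style discount: earlier rank positions weigh more."""
    return [1.0 / math.log2(t + 2) for t in range(n)]


def masked_pointer_ce(step_scores, target_order, weights=None):
    """Rank-weighted pointer cross-entropy for one candidate list.

    step_scores[t][i] is the score for placing candidate i at rank t.
    target_order[t] is the index of the candidate that belongs at rank t.
    """
    n = len(target_order)
    weights = rank_weights(n) if weights is None else weights
    available = [True] * len(step_scores[0])  # restricted pointer vocabulary
    total = 0.0
    for t, target in enumerate(target_order):
        if not available[target]:
            raise ValueError("target_order repeats a candidate")
        log_probs = masked_log_softmax(step_scores[t], available)
        total += weights[t] * -log_probs[target]
        available[target] = False  # a placed candidate cannot be chosen again
    return total / sum(weights[:n])


# Three candidates, gold order: candidate 2 first, then 0, then 1.
uniform = [[0.0, 0.0, 0.0]] * 3
confident = [[0.0, 0.0, 5.0], [5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
loss_uniform = masked_pointer_ce(uniform, [2, 0, 1])
loss_confident = masked_pointer_ce(confident, [2, 0, 1])
```

Under these assumptions, scores that favor the gold ordering (`confident`) yield a lower loss than uninformative scores (`uniform`), and the final position contributes nothing because only one candidate remains unmasked.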