Visual understanding requires interpreting both natural scenes and the textual information that appears within them, motivating tasks such as Visual Question Answering (VQA). However, current VQA benchmarks overlook scenarios with visually embedded questions, whereas advanced agents should be able to read the question directly from the image, as humans do, without a separate text input. We introduce Visual-only Question Answering (VoQA), in which both the scene and the question appear within a single image, requiring models to perceive and reason purely through vision. This setting supports more realistic visual understanding and interaction in scenarios where questions or instructions are embedded directly in the visual scene. Evaluations under pure visual-only zero-shot, prompt-guided, and OCR-assisted settings show that current models exhibit a clear performance drop compared to traditional VQA. To address this, we investigate question-alignment fine-tuning strategies designed to guide models toward interpreting the visually rendered question before reasoning. Leveraging the VoQA dataset together with these strategies yields robust vision-only reasoning while preserving cross-task generalization to traditional VQA, reflecting the complementary visual and textual reasoning capabilities fostered through VoQA training. The code and data are publicly available.
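To make the single-image setting concrete, the sketch below shows one way a VoQA-style sample could be assembled by rendering the question text into the same image as the scene. The layout (a white band appended below the scene) and the use of PIL are illustrative assumptions, not the paper's construction pipeline.

```python
from PIL import Image, ImageDraw, ImageFont

def compose_voqa_sample(scene_path: str, question: str, out_path: str,
                        band_height: int = 60) -> None:
    """Render the question into the scene image, producing a single
    visual-only input (one possible layout: a text band below the scene)."""
    scene = Image.open(scene_path).convert("RGB")
    w, h = scene.size

    # Create a canvas tall enough for the scene plus the text band.
    canvas = Image.new("RGB", (w, h + band_height), "white")
    canvas.paste(scene, (0, 0))

    # Draw the question inside the band using PIL's default bitmap font.
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()
    draw.text((10, h + band_height // 4), question, fill="black", font=font)

    canvas.save(out_path)

# Example usage (hypothetical file names):
# compose_voqa_sample("scene.jpg", "What color is the car?", "voqa_sample.jpg")
```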