Neural language models (LMs) are arguably less data-efficient than humans; why does this gap arise? In this study, we hypothesize that the gap stems from learners' access to modalities beyond text, specifically vision. We conducted two complementary experiments (one with noisy, realistic data and one with simplified, artificial data) to probe the advantage of vision for the syntactic generalization of LMs. Our results show that vision accelerated proper linguistic generalization in the simplified, artificial setting, but LMs struggled in the noisy, realistic setting. These mixed results suggest several possibilities: for example, vision can potentially boost language acquisition, but learners may need additional visual/linguistic prior knowledge to robustly exploit raw images for efficient language acquisition.