We present a large, multilingual study into how vision constrains linguistic choice, covering four languages and five linguistic properties, such as verb transitivity or use of numerals. We propose a novel method that leverages existing corpora of images with captions written by native speakers, and apply it to nine corpora, comprising 600k images and 3M captions. We study the relation between visual input and linguistic choices by training classifiers to predict the probability of expressing a property from raw images, and find evidence supporting the claim that linguistic properties are constrained by visual context across languages. We complement this investigation with a corpus study, taking the test case of numerals. Specifically, we use existing annotations (number or type of objects) to investigate the effect of different visual conditions on the use of numeral expressions in captions, and show that similar patterns emerge across languages. Our methods and findings both confirm and extend existing research in the cognitive literature. We additionally discuss possible applications for language generation.
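The core method described above, training classifiers to predict the probability that a linguistic property is expressed given an image, can be sketched as a binary probabilistic classifier over image features. The following is a minimal illustration, not the authors' implementation: the feature vectors, labels, and model choice (logistic regression over placeholder embeddings) are all assumptions standing in for real image representations and caption-derived annotations.

```python
# Hedged sketch (not the paper's code): predict whether a caption for an
# image expresses a linguistic property (e.g., contains a numeral) as a
# probability, from image feature vectors. Synthetic data stands in for
# real image embeddings and caption-derived binary labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder: 200 "images" as 64-dim feature vectors (e.g., CNN embeddings).
X = rng.normal(size=(200, 64))
# Placeholder labels: 1 if a caption for the image expresses the property.
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Fit on a training split; held-out images get a predicted probability
# of the property being expressed, conditioned on visual input alone.
clf = LogisticRegression(max_iter=1000).fit(X[:150], y[:150])
probs = clf.predict_proba(X[150:])[:, 1]

print(probs.shape)
```

In this setup, a classifier performing above chance on held-out images is evidence that the visual context constrains the linguistic choice, which is the logic of the study's main experiment.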