In this paper, we investigate whether artificial agents can develop a shared language in an ecological setting where communication relies on a sensory-motor channel. To this end, we introduce the Graphical Referential Game (GREG), in which a speaker must produce a graphical utterance to name a visual referent object while a listener must select the corresponding object among distractor referents, given the delivered message. The utterances are drawing images produced using dynamical motor primitives combined with a sketching library. To tackle GREG, we present CURVES: a multimodal contrastive deep learning mechanism that represents the energy (alignment) between named referents and utterances, and generates utterances through gradient ascent on the learned energy landscape. We demonstrate that CURVES not only succeeds at solving GREG but also enables agents to self-organize a language that generalizes to feature compositions never seen during training. In addition to evaluating the communication performance of our approach, we also explore the structure of the emerging language. Specifically, we show that the resulting language forms a coherent lexicon shared between agents and that basic compositional rules on the graphical productions cannot explain the compositional generalization.
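To make the generation mechanism concrete, the following is a minimal toy sketch (not the paper's implementation) of the two ingredients the abstract names: an energy function scoring the alignment between a referent embedding and an utterance vector, and utterance generation by gradient ascent on that energy. The bilinear form `E(r, u) = rᵀWu` and all names (`W`, `energy`, `generate_utterance`) are illustrative assumptions; in CURVES the energy is a learned multimodal network and utterances are drawing images.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))  # learned by contrastive training in the paper; random here

def energy(r, u):
    """Toy alignment (energy) between referent embedding r and utterance u."""
    return r @ W @ u

def generate_utterance(r, steps=200, lr=0.1):
    """Generate an utterance by gradient ascent on E(r, .) from a random start."""
    u = rng.normal(size=4)
    for _ in range(steps):
        grad = W.T @ r                 # dE/du for the bilinear energy
        u = u + lr * grad              # ascend the energy landscape
        u = u / np.linalg.norm(u)      # keep the utterance on the unit sphere
    return u

r = rng.normal(size=4)                 # embedding of the referent to name
u = generate_utterance(r)              # utterance maximizing alignment with r
```

For this toy energy the ascent converges to the unit vector aligned with `Wᵀr`, i.e. the utterance that maximizes the alignment with the named referent; the paper's setting replaces this closed-form optimum with gradient ascent through a deep multimodal network over drawing images.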