Human infants learn by exploring their environment, interacting with objects, and casually listening to and repeating utterances, which is analogous to unsupervised learning. Only occasionally does a learning infant receive a matching verbal description of an action it is performing, which is similar to supervised learning. Such a learning mechanism can be mimicked with deep learning. We model this weakly supervised learning paradigm with our Paired Gated Autoencoders (PGAE) model, which combines an action autoencoder and a language autoencoder. After observing a performance drop when the proportion of supervised training is reduced, we introduce the Paired Transformed Autoencoders (PTAE) model, which uses Transformer-based crossmodal attention. PTAE achieves significantly higher accuracy in language-to-action and action-to-language translation, particularly in the realistic but difficult case when only a few supervised training samples are available. We also test whether the trained model behaves realistically with conflicting multimodal input. In accordance with the concept of incongruence in psychology, conflict deteriorates the model output: conflicting action input has a more severe impact than conflicting language input, and more conflicting features lead to greater interference. PTAE can be trained on mostly unlabelled data where labelled data is scarce, and it behaves plausibly when tested with incongruent input.
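The abstract does not specify the internals of PTAE's crossmodal attention, so the following is only a generic illustration of the underlying mechanism: queries are derived from one modality (here, language) and keys/values from the other (action), letting each language step attend over action steps. All names, dimensions, and the use of NumPy are hypothetical and not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def crossmodal_attention(language, action, d_k=8, seed=0):
    """Generic crossmodal attention sketch (hypothetical, not PTAE's exact design):
    queries come from the language sequence, keys/values from the action sequence,
    so each language step forms a weighted summary of the action steps."""
    rng = np.random.default_rng(seed)
    d_lang, d_act = language.shape[1], action.shape[1]
    # Randomly initialised projection matrices stand in for learned weights.
    W_q = rng.standard_normal((d_lang, d_k))
    W_k = rng.standard_normal((d_act, d_k))
    W_v = rng.standard_normal((d_act, d_k))
    Q, K, V = language @ W_q, action @ W_k, action @ W_v
    # Scaled dot-product attention: (T_lang, T_act) weights over action steps.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights
```

Swapping the roles of the two inputs gives the action-to-language direction; in a trained model the projection matrices would of course be learned rather than random.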