Imagine a robot is shown new concepts visually together with spoken tags, e.g. "milk", "eggs", "butter". After seeing one paired audio-visual example per class, it is shown a new set of unseen instances of these objects and asked to pick the "milk". Without receiving any hard labels, could it learn to match the new continuous speech input to the correct visual instance? Unimodal one-shot learning, where one labelled example in a single modality is given per class, has been studied before; this example instead motivates multimodal one-shot learning. Our main contribution is to formally define this task and to propose several baseline and advanced models. We use a dataset of paired spoken and visual digits to specifically investigate recent advances in Siamese convolutional neural networks. In 11-way cross-modal matching, our best Siamese model achieves twice the accuracy of a nearest neighbour model that uses pixel distance over images and dynamic time warping over speech.
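To make the nearest neighbour baseline concrete, the following is a minimal sketch (not the authors' code) of cross-modal one-shot matching chained through the support set: the spoken query is matched to a support speech example by dynamic time warping (DTW), and that example's paired image is then matched to the candidate images by pixel distance. Function names, feature shapes, and data structures here are illustrative assumptions.

```python
import numpy as np

def dtw_distance(a, b):
    """Alignment cost between two feature sequences a (Ta, D) and b (Tb, D)."""
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return cost[Ta, Tb]

def pixel_distance(x, y):
    """Euclidean distance between two flattened images."""
    return np.linalg.norm(x.ravel() - y.ravel())

def cross_modal_match(query_speech, support_set, matching_images):
    """One-shot cross-modal matching via the support set.

    query_speech    : (T, D) speech feature sequence (e.g. MFCCs).
    support_set     : list of (speech_features, image) pairs, one per class.
    matching_images : list of candidate images; returns the index of the one
                      predicted to depict the same class as the spoken query.
    """
    # 1. Find the support speech example closest to the query under DTW.
    speech_dists = [dtw_distance(query_speech, s) for s, _ in support_set]
    support_image = support_set[int(np.argmin(speech_dists))][1]
    # 2. Pick the candidate image nearest to that support image in pixel space.
    image_dists = [pixel_distance(support_image, m) for m in matching_images]
    return int(np.argmin(image_dists))
```

The Siamese models discussed in the paper replace these fixed distances with learned embedding spaces, but the chained matching structure stays the same.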