We propose PIGLeT: a model that learns physical commonsense knowledge through interaction, and then uses this knowledge to ground language. We factorize PIGLeT into a physical dynamics model and a separate language model. Our dynamics model learns not just what objects are but also what they do: glass cups break when thrown; plastic ones don't. We then use it as the interface to our language model, giving us a unified model of linguistic form and grounded meaning. PIGLeT can read a sentence, simulate neurally what might happen next, and then communicate that result through a literal symbolic representation or natural language. Experimental results show that our model effectively learns world dynamics, along with how to communicate them. It is able to correctly forecast "what happens next" given an English sentence over 80% of the time, outperforming a 100x larger text-to-text approach by over 10%. Likewise, its natural language summaries of physical interactions are judged by humans as more accurate than LM alternatives. We present comprehensive analysis showing room for future work.