Grounding spatial relations expressed in natural language for object placement must contend with ambiguity and compositionality in the instructions. To address these issues, we introduce ParaGon, a PARsing And visual GrOuNding framework for language-conditioned object placement. ParaGon parses language instructions into relations between objects and grounds those objects in visual scenes. A particle-based graph neural network (GNN) then performs relational reasoning over the grounded objects to generate placements. ParaGon encodes the entire procedure in neural networks trained end to end, avoiding the need for annotated parsing and object-grounding labels. Our approach thus integrates parsing-based methods into a probabilistic, data-driven framework: it learns compositional instructions in a data-efficient and generalizable way, is robust to noisy language input, and adapts to the uncertainty of ambiguous instructions.
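To make the parse-ground-reason pipeline above concrete, the following is a minimal, self-contained sketch in PyTorch. It is not the authors' implementation: the module name ParaGonSketch, the role queries, and all dimensions are illustrative assumptions, and the particle refinement is reduced to a single step for brevity.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the pipeline: soft parsing of the instruction,
# soft grounding of the referred objects, and particle-based placement
# refinement conditioned on the grounded relation context.

class ParaGonSketch(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, num_particles=16):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # "Soft parsing": learned role queries attend over instruction tokens
        # to extract the placed object and the reference object (no parse labels).
        self.parse_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.role_queries = nn.Parameter(torch.randn(2, dim))  # [placed, reference]
        # Soft grounding: match each role query against visual object features.
        self.visual_proj = nn.Linear(dim, dim)
        # Particles: candidate 2D placement offsets refined from relation context.
        self.particles = nn.Parameter(torch.randn(num_particles, 2))
        self.refine = nn.Sequential(
            nn.Linear(dim * 2 + 2, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, tokens, object_feats):
        # tokens: (B, T) token ids; object_feats: (B, N, dim) per-object features.
        B = tokens.shape[0]
        tok = self.embed(tokens)                                   # (B, T, dim)
        q = self.role_queries.unsqueeze(0).expand(B, -1, -1)       # (B, 2, dim)
        roles, _ = self.parse_attn(q, tok, tok)                    # (B, 2, dim)
        # Soft grounding: attention of each parsed role over scene objects.
        scores = roles @ self.visual_proj(object_feats).transpose(1, 2)
        grounded = scores.softmax(-1) @ object_feats               # (B, 2, dim)
        ctx = grounded.flatten(1)                                  # (B, 2*dim)
        # Refine each particle given the grounded relation context.
        p = self.particles.unsqueeze(0).expand(B, -1, -1)          # (B, P, 2)
        inp = torch.cat([ctx.unsqueeze(1).expand(-1, p.shape[1], -1), p], -1)
        return p + self.refine(inp)                                # particle placements

# Usage with dummy inputs; training would supervise the particles with
# observed placement positions, end to end.
model = ParaGonSketch()
placements = model(torch.randint(0, 1000, (2, 12)), torch.randn(2, 5, 64))
print(placements.shape)  # torch.Size([2, 16, 2])
```

The particles serve as samples from the placement distribution, which is how an ambiguous instruction can map to multiple plausible placements rather than a single point estimate.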