Recent research shows that synthetic data, used as a source of supervision, helps pretrained language models (PLMs) transfer to new target tasks and domains. However, this idea is less explored for spatial language. We provide two new data resources for multiple spatial language processing tasks. The first dataset is synthesized for transfer learning on spatial question answering (SQA) and spatial role labeling (SpRL). Compared to previous SQA datasets, it includes a larger variety of spatial relation types and spatial expressions. Our data generation process is easily extendable with new spatial expression lexicons. The second is a real-world SQA dataset with human-generated questions, built on an existing corpus with SpRL annotations. This dataset can be used to evaluate spatial language processing models in realistic situations. We show that pretraining with automatically generated data significantly improves the SOTA results on several SQA and SpRL benchmarks, particularly when the training data in the target domain is small.