Being able to parse code-switched (CS) utterances, such as Spanish+English or Hindi+English, is essential to democratizing task-oriented semantic parsing systems for certain locales. In this work, we focus on Spanglish (Spanish+English) and release CSTOP, a dataset of 5,800 CS utterances with their semantic parses. We examine the CS generalizability of various cross-lingual (XL) models and demonstrate the advantage of pre-trained XL language models when data for only one language is available. We therefore focus on improving pre-trained models for the case where only an English corpus, alongside either zero or a few CS training instances, is available. We propose two data augmentation methods for the zero-shot and the few-shot settings, respectively: fine-tuning with translate-and-align, and augmenting via a generation model followed by match-and-filter. Combining the few-shot setting with these improvements closes two thirds of the initial 30-point accuracy gap between the zero-shot and full-data settings.
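For concreteness, below is a minimal sketch of the translate-and-align idea: translate the English utterance, word-align it with the translation, and project the slot spans across the alignment. The translate() and align() helpers are hypothetical stubs (e.g., backed by an MT system and a word aligner), and swapping the projected spans back into the English sentence to form a synthetic CS utterance is one plausible use of the alignment, not necessarily the paper's exact recipe.

```python
# A minimal sketch of translate-and-align augmentation, assuming
# hypothetical translate() and align() helpers; this is an illustration,
# not the authors' implementation.
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # [start, end) token span

def translate(tokens: List[str]) -> List[str]:
    """Hypothetical stub: machine-translate English tokens into Spanish."""
    raise NotImplementedError

def align(src: List[str], tgt: List[str]) -> List[Tuple[int, int]]:
    """Hypothetical stub: return (src_idx, tgt_idx) word-alignment pairs."""
    raise NotImplementedError

def project_slots(slots: Dict[str, Span],
                  alignment: List[Tuple[int, int]]) -> Dict[str, Span]:
    """Project each slot span onto the translation via the word alignment,
    keeping the min/max aligned target indices as the projected span."""
    projected = {}
    for name, (start, end) in slots.items():
        tgt_idxs = [t for s, t in alignment if start <= s < end]
        if tgt_idxs:  # drop slots whose tokens have no aligned target words
            projected[name] = (min(tgt_idxs), max(tgt_idxs) + 1)
    return projected

def make_cs_utterance(tokens: List[str], slots: Dict[str, Span]) -> List[str]:
    """Build one synthetic code-switched utterance by swapping the
    translated slot spans into the English sentence."""
    tgt = translate(tokens)
    tgt_slots = project_slots(slots, align(tokens, tgt))
    mixed = list(tokens)
    # Replace spans right-to-left so earlier indices stay valid when a
    # replacement span has a different length than the original.
    for name, (s, e) in sorted(slots.items(), key=lambda kv: -kv[1][0]):
        if name in tgt_slots:
            ts, te = tgt_slots[name]
            mixed[s:e] = tgt[ts:te]
    return mixed
```

Projecting spans through the alignment (rather than re-annotating the translation) is what lets the original semantic parse be reused unchanged for fine-tuning on the synthetic data.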