The grammatical analysis of texts in any human language typically involves a number of basic processing tasks, such as tokenization, morphological tagging, and dependency parsing. State-of-the-art systems can achieve high accuracy on these tasks for languages with large datasets, but yield poor results for languages such as Tagalog, which have little to no annotated data. To address this issue for Tagalog, we investigate the use of auxiliary data sources for creating task-specific models in the absence of annotated Tagalog data. We also explore the use of word embeddings and data augmentation to improve performance when only a small amount of annotated Tagalog data is available. We show that these zero-shot and few-shot approaches yield substantial improvements in the grammatical analysis of both in-domain and out-of-domain Tagalog text compared to state-of-the-art supervised baselines.