The grammatical analysis of texts in any written language typically involves a number of basic processing tasks, such as tokenization, morphological tagging, and dependency parsing. State-of-the-art systems can achieve high accuracy on these tasks for languages with large datasets, but yield poor results for languages that have little to no annotated data. To address this issue for the Tagalog language, we investigate the use of alternative language resources for creating task-specific models in the absence of dependency-annotated Tagalog data. We also explore the use of word embeddings and data augmentation to improve performance when only a small amount of annotated Tagalog data is available. We show that these zero-shot and few-shot approaches yield substantial improvements on grammatical analysis of both in-domain and out-of-domain Tagalog text compared to state-of-the-art supervised baselines.
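To make the first of the processing tasks above concrete, here is a minimal sketch of naive rule-based tokenization, the kind of baseline step that precedes morphological tagging and dependency parsing. The regular expression and the sample Tagalog sentence are illustrative assumptions, not part of the systems evaluated in this work.

```python
import re

def tokenize(text: str) -> list[str]:
    # Naive tokenizer: split into maximal runs of word characters,
    # and treat each remaining non-space character (punctuation) as its own token.
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

# Illustrative Tagalog sentence ("I ate an apple.")
tokens = tokenize("Kumain ako ng mansanas.")
# tokens == ["Kumain", "ako", "ng", "mansanas", "."]
```

Real pipelines replace this with learned, language-aware tokenizers, which is precisely where annotated data (or the alternative resources discussed here) matters.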