A lack of large-scale human-annotated data has hampered hierarchical discourse parsing for Chinese. In this paper, we present GCDT, the largest hierarchical discourse treebank for Mandarin Chinese in the framework of Rhetorical Structure Theory (RST). GCDT covers over 60K tokens across five genres of freely available text and uses the same relation inventory as contemporary RST treebanks for English. We also report parsing experiments on this dataset, including state-of-the-art (SOTA) scores for Chinese RST parsing and for RST parsing on the English GUM dataset, obtained with cross-lingual training on Chinese and English using multilingual embeddings.