We present a corpus professionally annotated for grammatical error correction (GEC) and fluency edits in the Ukrainian language. To the best of our knowledge, this is the first GEC corpus for the Ukrainian language. We collected texts with errors (20,715 sentences) from a diverse pool of contributors, including both native and non-native speakers. The data cover a wide variety of writing domains, from text chats and essays to formal writing. Professional proofreaders corrected and annotated the corpus for errors relating to fluency, grammar, punctuation, and spelling. This corpus can be used for developing and evaluating GEC systems in Ukrainian. More generally, it can be used for researching multilingual and low-resource NLP, morphologically rich languages, document-level GEC, and fluency correction. The corpus is publicly available at https://github.com/grammarly/ua-gec