We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC), with the aim of contributing to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) covers four domains, with error distributions ranging from high-error-density essays written by non-native speakers to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including Transformer-based ones, setting a strong baseline for future research. Finally, we meta-evaluate common GEC metrics against human judgements on our data. We make the new Czech GEC corpus publicly available under the CC BY-SA 4.0 license at http://hdl.handle.net/11234/1-4639 .