The SIGMORPHON 2022 shared task on morpheme segmentation challenged systems to decompose a word into a sequence of morphemes and covered most types of morphology: compounds, derivations, and inflections. Subtask 1, word-level morpheme segmentation, covered 5 million words in 9 languages (Czech, English, Spanish, Hungarian, French, Italian, Russian, Latin, Mongolian) and received 13 system submissions from 7 teams. The best system averaged a 97.29% F1 score across all languages, ranging from English (93.84%) to Latin (99.38%). Subtask 2, sentence-level morpheme segmentation, covered 18,735 sentences in 3 languages (Czech, English, Mongolian) and received 10 system submissions from 3 teams. The best systems outperformed all three state-of-the-art subword tokenization methods (BPE, ULM, Morfessor2) by 30.71% absolute. To facilitate error analysis and support future studies of any kind, we released all system predictions, the evaluation script, and all gold standard datasets.
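As an illustration of the kind of score reported above, the following is a minimal sketch of a morpheme-level F1 computation. It is not the released evaluation script: the function name `morpheme_prf`, the micro-averaging over all words, and the multiset-overlap matching of predicted against gold morphemes are all assumptions made for the example.

```python
from collections import Counter


def morpheme_prf(gold, pred):
    """Micro-averaged precision, recall, and F1 over morphemes.

    gold, pred: parallel lists of segmentations, each a list of morpheme strings.
    Assumption: a predicted morpheme counts as correct if it also occurs in the
    gold segmentation of the same word (multiset overlap, position ignored).
    """
    true_pos = n_pred = n_gold = 0
    for g, p in zip(gold, pred):
        overlap = Counter(g) & Counter(p)  # multiset intersection
        true_pos += sum(overlap.values())
        n_pred += len(p)
        n_gold += len(g)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data, not taken from the shared task datasets.
gold = [["un", "believ", "able"], ["cat", "s"]]
pred = [["un", "believable"], ["cat", "s"]]
precision, recall, f1 = morpheme_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.4f}")
```

On the toy data, three of the four predicted morphemes match the five gold morphemes, so precision is 0.75 and recall is 0.60. The official script may differ in how it averages and matches morphemes, so the released version should be consulted for exact numbers.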