We introduce the Korean Language Understanding Evaluation (KLUE) benchmark. KLUE is a collection of eight Korean natural language understanding (NLU) tasks: Topic Classification, Semantic Textual Similarity, Natural Language Inference, Named Entity Recognition, Relation Extraction, Dependency Parsing, Machine Reading Comprehension, and Dialogue State Tracking. We build all of the tasks from scratch from diverse source corpora while respecting copyrights, so that anyone can access them without restrictions. With ethical considerations in mind, we carefully design annotation protocols. Along with the benchmark tasks and data, we provide suitable evaluation metrics and fine-tuning recipes for pretrained language models (PLMs) for each task. We furthermore release two PLMs, KLUE-BERT and KLUE-RoBERTa, to help reproduce the baseline models on KLUE and thereby facilitate future research. We make a few interesting observations from preliminary experiments on the proposed benchmark suite, already demonstrating its usefulness. First, we find that KLUE-RoBERTa-large outperforms the other baselines, including multilingual PLMs and existing open-source Korean PLMs. Second, we see minimal degradation in performance even when we replace personally identifiable information in the pretraining corpus, suggesting that privacy and NLU capability are not at odds with each other. Lastly, we find that BPE tokenization combined with morpheme-level pre-tokenization is effective for tasks involving morpheme-level tagging, detection, and generation. In addition to accelerating Korean NLP research, our comprehensive documentation of the creation of KLUE will facilitate building similar resources for other languages in the future. KLUE is available at https://klue-benchmark.com/.
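To make the last observation concrete, the following is a minimal, self-contained sketch of the two-stage tokenization idea: pre-tokenize the input at morpheme boundaries, then apply BPE merges within each pre-token so merges never cross a morpheme boundary. The morpheme splitter below is a hypothetical stand-in (it just splits on spaces; real Korean pipelines use a morphological analyzer such as MeCab-ko), and the merge list is a toy example, not a trained vocabulary.

```python
# Sketch of morpheme-aware BPE tokenization. Assumptions: the morpheme
# splitter is a placeholder, and `merges` is a hand-written toy merge list.

def pretokenize_morphemes(text):
    """Stand-in morpheme splitter: splits on spaces.
    A real analyzer would emit morphemes, e.g. '먹었다' -> ['먹', '었', '다']."""
    return text.split()

def bpe_segment(pretoken, merges):
    """Greedily apply an ordered list of BPE merges to one pre-token."""
    symbols = list(pretoken)
    for a, b in merges:
        i, out = 0, []
        while i < len(symbols):
            # Merge an adjacent (a, b) pair into a single symbol.
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

def tokenize(text, merges):
    # Because BPE runs inside each pre-token, subword boundaries stay
    # aligned with morpheme boundaries -- useful for morpheme-level tagging.
    tokens = []
    for pretoken in pretokenize_morphemes(text):
        tokens.extend(bpe_segment(pretoken, merges))
    return tokens

merges = [("l", "o"), ("lo", "w")]
print(tokenize("low lower", merges))  # ['low', 'low', 'e', 'r']
```

The key design point is that the pre-tokenization step constrains the BPE merges, so tags assigned per morpheme can always be mapped onto whole subword tokens.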