English-based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmarks on English datasets are sufficient to demonstrate the performance of new models and methods, researchers still need to train and validate models on Korean-based datasets to produce a technology or product suitable for Korean language processing. This paper introduces 15 popular Korean-based NLP datasets with summarized details such as volume, license, repositories, and other research results inspired by the datasets. I also provide detailed instructions with samples or statistics for each dataset. The main characteristics of the datasets are presented in a single table to give researchers a rapid summary.