A Survey on Awesome Korean NLP Datasets

English-based datasets are commonly available from Kaggle, GitHub, or recently published papers. Although benchmark tests on English datasets are sufficient to demonstrate the performance of new models and methods, researchers still need to train and validate models on Korean-based datasets to produce technology or products suitable for Korean processing. This paper introduces 15 popular Korean-based NLP datasets with summarized details such as volume, license, repositories, and other research results inspired by the datasets. I also provide detailed instructions with samples or statistics of the datasets. The main characteristics of the datasets are presented in a single table to give researchers a quick summary.


