Open Korean Corpora: A Practical Report

Korean is often referred to as a low-resource language in the research community. While this claim is partially true, the perception also arises because the available resources are inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through a list of current open datasets for different types of tasks. We then propose a direction for how open-source dataset construction and releases should be done for less-resourced languages to promote research.


