Korean is often referred to as a low-resource language in the research community. While this claim is partially true, the perception also persists because the available resources are inadequately advertised and curated. This work curates and reviews a list of Korean corpora, first describing institution-level resource development, then iterating through a list of current open datasets for different types of tasks. We then propose a direction for how open-source datasets should be constructed and released for less-resourced languages in order to promote research.