Training data for machine learning models can come from many different sources, some of dubious quality. For resource-rich languages like English, so much data is available that we can afford to throw out the dubious data. For low-resource languages, where far less data is available, we cannot necessarily afford to throw out the dubious data, lest we end up with a training set that is too small to train a model. In this study, we examine the effects of text normalization and dataset quality for a set of low-resource languages of Africa: Afrikaans, Amharic, Hausa, Igbo, Malagasy, Somali, Swahili, and Zulu. We describe the text normalizer we built in the Pynini framework, a Python library for finite state transducers, and our experiments in training language models for African languages using the Natural Language Toolkit (NLTK), an open-source Python library for NLP.
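To illustrate the two toolkits named above, here is a minimal sketch, assuming Pynini and NLTK are installed. The single rewrite rule and the toy Swahili corpus are invented for illustration; they are not the paper's actual normalizer grammar or training data.

```python
# A minimal sketch (not the paper's normalizer or models): one Pynini
# rewrite rule for punctuation normalization, then a trigram MLE language
# model trained with NLTK. The example sentences are hypothetical.
import pynini
from pynini.lib import byte
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# sigma_star accepts any byte string; cdrewrite applies the rewrite
# (curly apostrophe -> ASCII apostrophe) everywhere, with empty contexts.
sigma_star = pynini.closure(byte.BYTE)
normalizer = pynini.cdrewrite(pynini.cross("’", "'"), "", "", sigma_star)

def normalize(text: str) -> str:
    """Compose the input with the rule and return the best output string."""
    return pynini.shortestpath(pynini.compose(text, normalizer)).string()

# Hypothetical tokenized Swahili sentences standing in for a cleaned corpus.
sentences = [normalize(s).split() for s in ("habari za asubuhi", "habari za jioni")]

order = 3
train, vocab = padded_everygram_pipeline(order, sentences)
lm = MLE(order)
lm.fit(train, vocab)
print(lm.score("za", ["habari"]))  # P(za | habari) = 1.0 on this toy corpus
```

In practice a normalizer would union many such rules (quote and punctuation variants, digit and whitespace cleanup, script-specific mappings) before composing them with the input, and the language models would be trained on full corpora rather than a two-sentence toy set.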