Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation


Data augmentation · Randomized methods · Speech recognition · Computer science · Machine translation

April 4, 2023


Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu

From arXiv; accepted to LoResMT 2023

Data sparsity is a major problem hindering the development of code-switching (CS) NLP systems. In this paper, we investigate data augmentation techniques for synthesizing dialectal Arabic-English CS text. We perform lexical replacements using word-aligned parallel corpora, where CS points are either randomly chosen or learnt using a sequence-to-sequence model. We compare these approaches against dictionary-based replacements. We assess the quality of the generated sentences through human evaluation and evaluate the effectiveness of data augmentation on machine translation (MT), automatic speech recognition (ASR), and speech translation (ST) tasks. Results show that using a predictive model yields more natural CS sentences than the random approach, as reported in human judgements. In the downstream tasks, despite the random approach generating more data, both approaches perform equally well (outperforming dictionary-based replacements). Overall, data augmentation achieves a 34% improvement in perplexity, a 5.2% relative improvement in WER on the ASR task, +4.0-5.1 BLEU points on the MT task, and +2.1-2.2 BLEU points on the ST task over a baseline trained on the available data without augmentation.
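The random variant of alignment-based lexical replacement described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `augment_with_alignments`, the `switch_prob` parameter, the `(source_index, target_index)` alignment format, and the placeholder tokens are all assumptions made for illustration. The learned variant, where a sequence-to-sequence model predicts the CS points, is not shown.

```python
import random


def augment_with_alignments(src_tokens, tgt_tokens, alignments,
                            switch_prob=0.3, rng=None):
    """Synthesize a code-switched sentence by lexical replacement.

    src_tokens: tokens of the source-language sentence (hypothetical input).
    tgt_tokens: tokens of its parallel target-language sentence.
    alignments: iterable of (src_index, tgt_index) word-alignment pairs.
    switch_prob: assumed probability of switching at each aligned word.
    """
    rng = rng or random.Random()

    # Group the aligned target positions under each source position.
    src_to_tgt = {}
    for src_idx, tgt_idx in alignments:
        src_to_tgt.setdefault(src_idx, []).append(tgt_idx)

    output = []
    for i, token in enumerate(src_tokens):
        targets = src_to_tgt.get(i)
        # Only aligned words are candidates; each one is switched at random.
        if targets and rng.random() < switch_prob:
            output.extend(tgt_tokens[t] for t in sorted(targets))
        else:
            output.append(token)
    return output


# Placeholder tokens stand in for an Arabic sentence and its English translation.
src = ["ar0", "ar1", "ar2", "ar3"]
tgt = ["en0", "en1", "en2"]
align = [(0, 0), (1, 1), (3, 2)]  # ar2 has no aligned English word

print(augment_with_alignments(src, tgt, align, rng=random.Random(13)))
```

Passing a seeded `random.Random` makes the sampled switch points reproducible. Setting `switch_prob=1.0` replaces every aligned word, while unaligned words such as `ar2` are always kept.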

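The abstract reports some gains as relative percentages (perplexity, WER) and others as absolute BLEU points. The difference can be made concrete with a small sketch. The helper names and the example scores are hypothetical and are not taken from the paper; only the distinction between relative and absolute change is illustrated.

```python
def relative_improvement(baseline, augmented, lower_is_better=True):
    """Relative change, in percent of the baseline score.

    Use lower_is_better=True for error-style metrics such as WER or perplexity.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    delta = baseline - augmented if lower_is_better else augmented - baseline
    return 100.0 * delta / baseline


def absolute_improvement(baseline, augmented):
    """Absolute change in score points (e.g. BLEU, where higher is better)."""
    return augmented - baseline


# Hypothetical scores, chosen only to show the arithmetic.
print(relative_improvement(40.0, 38.0))  # WER 40.0 -> 38.0 is a 5.0% relative gain
print(absolute_improvement(20.0, 24.5))  # BLEU 20.0 -> 24.5 is +4.5 points
```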
