Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation - 专知论文 (Zhuanzhi Papers)

Perplexity · Data Augmentation · Performer · MoDELS · Computer Science

May 25, 2022

Investigating Lexical Replacements for Arabic-English Code-Switched Data Augmentation

Injy Hamed, Nizar Habash, Slim Abdennadher, Ngoc Thang Vu

Code-switching (CS) poses several challenges for NLP tasks, and data sparsity is a main problem hindering the development of CS NLP systems. In this paper, we investigate data augmentation techniques for synthesizing Dialectal Arabic-English CS text. We perform lexical replacements using parallel corpora and alignments, where CS points are either chosen randomly or learnt using a sequence-to-sequence model. We evaluate the effectiveness of data augmentation on language modeling (LM), machine translation (MT), and automatic speech recognition (ASR) tasks. Results show that with 1-1 alignments, trained predictive models produce more natural CS sentences, as reflected in perplexity. By relying on grow-diag-final alignments, we then identify aligned segments and perform replacements accordingly. Replacing segments instead of words greatly improves the quality of the synthesized data. With this improvement, the random-based approach outperforms trained predictive models on all extrinsic tasks. Our best models achieve a 33.6% improvement in perplexity, +3.2-5.6 BLEU points on the MT task, and a 7% relative improvement in WER on the ASR task. We also contribute to filling the gap in resources by collecting and publishing the first Arabic-English CS-English parallel corpus.
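The random word-level replacement described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `augment_with_replacements`, the `switch_prob` parameter, the example sentence pair, and the choice of Arabic as the base sentence receiving English words are all assumptions made for illustration. The sketch assumes 1-1 word alignments are already available as `(source_index, target_index)` pairs, for example from an external word aligner.

```python
import random


def augment_with_replacements(src_tokens, tgt_tokens, alignment,
                              switch_prob=0.3, seed=None):
    """Synthesize a code-switched sentence by random lexical replacement.

    src_tokens  -- tokens of the base sentence (assumed here to be Arabic)
    tgt_tokens  -- tokens of its parallel translation (assumed English)
    alignment   -- (src_index, tgt_index) pairs, assumed to be 1-1
    switch_prob -- hypothetical knob: chance of switching each aligned word
    """
    rng = random.Random(seed)
    src_to_tgt = dict(alignment)  # 1-1 alignment: each source index maps once
    output = []
    for i, token in enumerate(src_tokens):
        # An aligned word becomes a CS point with probability switch_prob;
        # unaligned words always stay in the base language.
        if i in src_to_tgt and rng.random() < switch_prob:
            output.append(tgt_tokens[src_to_tgt[i]])
        else:
            output.append(token)
    return output


# Illustrative sentence pair (not taken from the paper's corpus).
arabic = ["أنا", "بحب", "القهوة"]
english = ["I", "love", "coffee"]
one_to_one = [(0, 0), (1, 1), (2, 2)]

# switch_prob=1.0 replaces every aligned word; 0.0 replaces none.
print(augment_with_replacements(arabic, english, one_to_one, switch_prob=1.0))
# -> ['I', 'love', 'coffee']
print(augment_with_replacements(arabic, english, one_to_one, switch_prob=0.5, seed=1))
```

The learnt variant mentioned in the abstract would replace the random draw with a model's prediction of which positions are CS points, and the segment-based variant would map spans of indices rather than single words; neither is shown here.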
