How to Split: the Effect of Word Segmentation on Gender Bias in Speech Translation


May 28, 2021


Marco Gaido, Beatrice Savoldi, Luisa Bentivogli, Matteo Negri, Marco Turchi

From arXiv; accepted in Findings of ACL 2021

Having recognized gender bias as a major issue affecting current translation technologies, researchers have primarily attempted to mitigate it by working on the data front. However, whether algorithmic aspects contribute to exacerbating unwanted outputs remains so far under-investigated. In this work, we extend the analysis of gender bias in automatic translation to a seemingly neutral yet critical component: word segmentation. Can segmenting methods influence the ability to translate gender? Do certain segmentation approaches penalize the representation of feminine linguistic markings? We address these questions by comparing five existing segmentation strategies on the target side of speech translation systems. Our results on two language pairs (English-Italian/French) show that state-of-the-art sub-word splitting (BPE) comes at the cost of higher gender bias. In light of this finding, we propose a combined approach that preserves BPE's overall translation quality, while leveraging the higher ability of character-based segmentation to properly translate gender.
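To make the contrast between character-based and BPE segmentation concrete, the following is a minimal sketch, not the authors' setup. It assumes a tiny hypothetical Italian word list with invented frequencies, and it implements a simplified BPE learner (no end-of-word marker, no real training corpus). The function names `char_segment`, `learn_bpe`, `apply_merge`, and `bpe_segment` are illustrative only.

```python
from collections import Counter


def char_segment(word):
    """Character-level segmentation: every character is its own token."""
    return list(word)


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` BPE merges from a {word: frequency} dict."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        vocab = {tuple(apply_merge(list(s), best)): f for s, f in vocab.items()}
    return merges


def bpe_segment(word, merges):
    """Segment a word by applying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical frequencies (assumption): masculine forms are more frequent.
toy_freqs = {"amico": 10, "amici": 6, "amica": 2, "bambino": 8, "bambina": 2}
toy_merges = learn_bpe(toy_freqs, num_merges=6)

for w in ("amico", "amica"):
    print(w, "char:", char_segment(w), "bpe:", bpe_segment(w, toy_merges))
```

Printing the two segmentations side by side shows where each method places the gender-marking suffix. With character segmentation the final vowel is always a separate token, whereas with BPE the split depends on which pairs happened to be frequent in the training data.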



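Related content

Speech translation

Direct speech translation between languages by computer, which helps people from different language backgrounds communicate, has become a research focus in countries around the world. Unlike ordinary text translation, speech translation must integrate three technologies (speech recognition, machine translation, and speech synthesis), which makes it highly challenging.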

