Reading Report | WORD TRANSLATION WITHOUT PARALLEL DATA


December 30, 2017 · 科技创新与创业 · 尹伊淳 (Yin Yichun)

Under review as a conference paper at ICLR 2018

Link: https://arxiv.org/pdf/1710.04087.pdf

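Abstract
The most effective approaches to learning cross-lingual word embeddings mostly rely on bilingual dictionaries or parallel corpora, and unsupervised approaches have so far given unsatisfactory results. This paper proposes an unsupervised method for learning cross-lingual word embeddings that, using only monolingual word embeddings, matches the performance of supervised methods. Experiments show that the method also works well on pairs of very different languages, such as English-Russian and English-Chinese.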

Model

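Two interesting observations about word embeddings motivate the model: (1) Continuous low-dimensional word embedding spaces have similar structure across languages, so a linear mapping can project one language into another. (2) The hubness problem: some words (hubs) are nearest neighbors of many other words, while other words lie away from the center and are not the nearest neighbor of any word. The paper designs its model around these two observations.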

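The paper first uses a domain-adversarial approach to learn the parameters of the linear mapping between the two languages. The discriminator is trained to tell whether a vector comes from the target language or is a mapped source-language vector, by minimizing a classification loss.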
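For reference, the discriminator loss can be written roughly as below. This is a sketch in assumed notation: $x_1,\dots,x_n$ are source embeddings, $y_1,\dots,y_m$ are target embeddings, $W$ is the linear mapping, $\theta_D$ are the discriminator parameters, and $P_{\theta_D}(\text{source}=1 \mid z)$ is the probability the discriminator assigns to $z$ being a mapped source vector.

```latex
\mathcal{L}_D(\theta_D \mid W) =
  -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}\big(\text{source}=1 \mid W x_i\big)
  -\frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}\big(\text{source}=0 \mid y_i\big)
```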

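The mapping, in turn, is trained to confuse the discriminator as much as possible, so that it cannot tell which of the two sources a vector comes from.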
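The mapping loss can be sketched as the same cross-entropy with the labels flipped. The notation is assumed for illustration: $W$ is the linear mapping, $x_i$ ($i=1,\dots,n$) and $y_i$ ($i=1,\dots,m$) are source and target embeddings, and $P_{\theta_D}(\text{source}=1 \mid z)$ is the discriminator's probability that $z$ is a mapped source vector.

```latex
\mathcal{L}_W(W \mid \theta_D) =
  -\frac{1}{n}\sum_{i=1}^{n} \log P_{\theta_D}\big(\text{source}=0 \mid W x_i\big)
  -\frac{1}{m}\sum_{i=1}^{m} \log P_{\theta_D}\big(\text{source}=1 \mid y_i\big)
```

Minimizing this loss over $W$ pushes mapped source vectors toward regions the discriminator labels as target, while the discriminator is updated in alternation with its own loss.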

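Because the vector representations of rare words are unstable and add noise to training, only high-frequency words are used for training. Self-training is then applied to further improve accuracy: the learned W is used to automatically label aligned word pairs, and these pairs are then used for further training. To produce the word pairs, the paper proposes a new similarity measure, CSLS (cross-domain similarity local scaling).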
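The measure can be written roughly as follows. The notation here is assumed for illustration: $W x_s$ is a mapped source vector, $y_t$ is a target vector, $\mathcal{N}_T(W x_s)$ is the set of the $K$ nearest target neighbors of $W x_s$, and $\mathcal{N}_S(y_t)$ is the set of the $K$ nearest mapped source neighbors of $y_t$.

```latex
\mathrm{CSLS}(W x_s, y_t) = 2\cos(W x_s, y_t) - r_T(W x_s) - r_S(y_t),
\qquad
r_T(W x_s) = \frac{1}{K} \sum_{y \in \mathcal{N}_T(W x_s)} \cos(W x_s, y),
\qquad
r_S(y_t) = \frac{1}{K} \sum_{W x \in \mathcal{N}_S(y_t)} \cos(W x, y_t)
```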

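To alleviate the hubness problem, the second and third terms of the measure penalize hub words and relieve the penalty on isolated words, respectively.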
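A minimal NumPy sketch of this kind of hubness-corrected scoring is shown below. It assumes the source vectors have already been mapped by W and that no vector is all zeros. The function names `csls_scores` and `mutual_nearest_pairs`, the default `k=10`, and the use of mutual nearest neighbors to pick pairs are illustrative assumptions, not taken from the released code.

```python
import numpy as np


def csls_scores(mapped_src, tgt, k=10):
    """Return an (n_src, n_tgt) matrix of hubness-corrected similarities.

    mapped_src: (n_src, d) source vectors after applying the mapping W.
    tgt:        (n_tgt, d) target vectors.
    """
    # Normalize rows so that dot products are cosine similarities.
    a = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    b = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = a @ b.T

    k_t = min(k, cos.shape[1])  # neighbors available on the target side
    k_s = min(k, cos.shape[0])  # neighbors available on the source side

    # Mean cosine of each mapped source word to its k nearest target words.
    r_t = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    # Mean cosine of each target word to its k nearest mapped source words.
    r_s = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)

    # Words sitting in dense neighborhoods (hubs) get a larger penalty.
    return 2 * cos - r_t[:, None] - r_s[None, :]


def mutual_nearest_pairs(scores):
    """Keep (source, target) index pairs that are each other's best match."""
    fwd = scores.argmax(axis=1)  # best target for each source word
    bwd = scores.argmax(axis=0)  # best source for each target word
    return [(s, int(t)) for s, t in enumerate(fwd) if bwd[t] == s]
```

The pairs returned by `mutual_nearest_pairs` play the role of the automatically labeled word pairs used in the self-training step.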

Experiments

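The experiments cover three tasks: word translation, cross-lingual word similarity, and sentence translation retrieval. For word translation, the paper builds a dataset of 100K word pairs for evaluation.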
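The report does not say which metric is used for word translation. A common choice for this task is precision@k, assumed here: the fraction of source words for which at least one gold translation appears among the k highest-scoring target words. The function name and the dictionary format below are hypothetical.

```python
import numpy as np


def precision_at_k(scores, gold, k=1):
    """Fraction of source words with a gold translation in the top k.

    scores: (n_src, n_tgt) similarity matrix (cosine, CSLS, ...).
    gold:   dict mapping a source index to a set of valid target indices.
    """
    hits = 0
    for src, answers in gold.items():
        # Indices of the k highest-scoring target words for this source word.
        top = np.argsort(-scores[src])[:k]
        if answers & set(top.tolist()):
            hits += 1
    return hits / len(gold)
```

A source word may have several valid translations, which is why `gold` maps each index to a set rather than a single target.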

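The experiments show that the proposed unsupervised method performs on par with, or slightly better than, supervised methods on all three tasks.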

Thoughts

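The paper's motivation is very clear, and the method has two ideas that stand out: learning the linear mapping adversarially, and a new similarity measure for handling the hubness problem.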

Open-source code is available, and it could be worth using to alleviate the representation problem of low-resource word embeddings.

Author: 尹伊淳 (Yin Yichun), a PhD student at Peking University whose research area is natural language processing.
