A key stumbling block for neural cross-language information retrieval (CLIR) systems has been the paucity of training data. The appearance of the MS MARCO monolingual training set led to significant advances in the state of the art in neural monolingual retrieval. By translating the MS MARCO documents into other languages using machine translation, this resource has been made useful to the CLIR community. Yet such translation suffers from a number of problems. While MS MARCO is a large resource, it is of fixed size; its genre and domain of discourse are fixed; and the translated documents are not written as a native speaker of the language would write them, but rather in translationese. To address these problems, we introduce the JH-POLO CLIR training set creation methodology. The approach begins by selecting a pair of non-English passages. A generative large language model is then used to produce an English query for which the first passage is relevant and the second passage is not relevant. By repeating this process, collections of arbitrary size can be created in the style of MS MARCO, but using naturally occurring documents in any desired genre and domain of discourse. This paper describes the methodology in detail, shows its use in creating new CLIR training sets, and describes experiments using the newly created training data.
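To make the passage-pair step concrete, the sketch below outlines one way a single JH-POLO-style training example might be generated. It is a minimal illustration, not the paper's actual implementation: the `llm` callable stands in for whatever generative large language model is used, and the prompt wording and the `TrainingTriple` container are assumptions introduced here for clarity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TrainingTriple:
    """One MS MARCO-style CLIR training example: an English query paired with
    a relevant and a non-relevant non-English passage."""
    query: str      # English query produced by the generative LLM
    positive: str   # non-English passage the query should retrieve
    negative: str   # non-English passage the query should not retrieve

def build_triple(pos_passage: str, neg_passage: str,
                 llm: Callable[[str], str]) -> TrainingTriple:
    """Given a pair of non-English passages, ask a generative LLM for an
    English query answered by the first passage but not by the second.
    The prompt text here is a hypothetical example."""
    prompt = (
        "Write an English search query that is answered by Passage A "
        "but NOT by Passage B.\n\n"
        f"Passage A:\n{pos_passage}\n\n"
        f"Passage B:\n{neg_passage}\n\n"
        "Query:"
    )
    query = llm(prompt).strip()
    return TrainingTriple(query=query, positive=pos_passage, negative=neg_passage)
```

Repeating this loop over passage pairs drawn from any collection yields a training set of arbitrary size in the chosen genre and domain.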