Koala: An Index for Quantifying Overlaps with Pre-training Corpora - Zhuanzhi Papers


Koala · Corpus data · Corpus · Pre-training · Metrics

26 March 2023

Koala: An Index for Quantifying Overlaps with Pre-training Corpora


Thuy-Trang Vu, Xuanli He, Gholamreza Haffari, Ehsan Shareghi

From arXiv. Available at: https://koala-index.erc.monash.edu/

In recent years, more attention has been paid to probing the role of pre-training data in the downstream behaviour of Large Language Models (LLMs). Despite its importance, there is no public tool that supports such analysis of pre-training corpora at large scale. To help research in this space, we launch Koala, a searchable index over large pre-training corpora built on compressed suffix arrays, which offer a highly efficient compression rate and search support. In its first release we index the public portion of the OPT 175B pre-training data. Koala provides a framework for forensic analysis of current and future benchmarks, as well as for assessing the degree of memorization in LLM output. Koala is available for public use at https://koala-index.erc.monash.edu/.
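To make the idea of a searchable suffix-array index more concrete, the following is a minimal sketch of how occurrence counts and n-gram overlap could be queried. It is an illustration only, not Koala's implementation: it uses a plain (uncompressed) suffix array built by naive sorting, whereas Koala relies on compressed suffix arrays, and the function names `build_suffix_array`, `count_occurrences`, and `ngram_overlap` as well as whitespace tokenization are assumptions made for this example.

```python
def build_suffix_array(tokens):
    """Return the start positions of all suffixes of `tokens`, sorted lexicographically.

    Naive construction by sorting whole suffixes; suitable for a toy corpus only.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, suffix_array, query, upper):
    """Binary search over the suffix array.

    Returns the first rank whose suffix, truncated to len(query) tokens,
    is >= query (upper=False) or > query (upper=True).
    """
    m = len(query)
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = tokens[start:start + m]
        if prefix < query or (upper and prefix == query):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(tokens, suffix_array, query):
    """Count how many times the token sequence `query` occurs in `tokens`."""
    if not query:
        return 0
    first = _bound(tokens, suffix_array, query, upper=False)
    last = _bound(tokens, suffix_array, query, upper=True)
    return last - first


def ngram_overlap(tokens, suffix_array, text_tokens, n):
    """Fraction of the n-grams in `text_tokens` that occur at least once in `tokens`."""
    ngrams = [text_tokens[i:i + n] for i in range(len(text_tokens) - n + 1)]
    if not ngrams:
        return 0.0
    hits = sum(1 for g in ngrams if count_occurrences(tokens, suffix_array, g) > 0)
    return hits / len(ngrams)


if __name__ == "__main__":
    # Hypothetical toy "pre-training corpus" and "benchmark" sentence.
    corpus = "the cat sat on the mat and the cat slept".split()
    index = build_suffix_array(corpus)
    print(count_occurrences(corpus, index, ["the", "cat"]))
    print(ngram_overlap(corpus, index, "the cat sat on a mat".split(), 2))
```

In this sketch each query costs a logarithmic number of comparisons in the corpus length, which is what makes exact-match lookups over a large corpus feasible. A compressed suffix array would serve the same kind of query while storing the index in far less space.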

