In recent years, increasing attention has been devoted to probing the role of pre-training data in the downstream behaviour of Large Language Models (LLMs). Despite its importance, there is no public tool that supports such analysis of pre-training corpora at large scale. To help research in this space, we launch Koala, a searchable index over large pre-training corpora built on compressed suffix arrays, offering a highly efficient compression rate and fast search support. In its first release, we index the publicly available portion of the OPT 175B pre-training data. Koala provides a framework for performing forensic analysis of current and future benchmarks, as well as for assessing the degree of memorization in LLM outputs. Koala is available for public use at https://koala-index.erc.monash.edu/.
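As background for how a suffix-array index supports exact-match queries, the sketch below builds a plain (uncompressed) suffix array over a toy string and counts query occurrences by binary search. It is a minimal illustration only; the function names are our own, and Koala's actual index relies on compressed suffix arrays to scale to corpora of this size.

```python
# Minimal illustration of suffix-array-based substring counting.
# This is a toy, uncompressed suffix array; Koala itself uses compressed
# suffix arrays to index large-scale pre-training corpora.

def build_suffix_array(text: str) -> list[int]:
    """Starting offsets of all suffixes of `text`, sorted lexicographically."""
    # O(n^2 log n) construction; sufficient for a small demonstration.
    return sorted(range(len(text)), key=lambda i: text[i:])

def count_occurrences(text: str, sa: list[int], query: str) -> int:
    """Count exact occurrences of `query` in `text` via binary search on `sa`."""
    m = len(query)

    def search(strict: bool) -> int:
        # Leftmost suffix whose m-character prefix is > query (strict)
        # or >= query (not strict); the difference is the match count.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < query or (strict and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return search(strict=True) - search(strict=False)

corpus = "the cat sat on the mat"
sa = build_suffix_array(corpus)
print(count_occurrences(corpus, sa, "the"))  # 2 occurrences
print(count_occurrences(corpus, sa, "dog"))  # 0 occurrences
```

A compressed suffix array follows the same query logic but stores the index in space close to that of the compressed text, which is what makes indexing corpora of this scale feasible.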