Foundation models are trained on increasingly immense and opaque datasets. Even as these models become central to building AI systems, it can be difficult to answer a straightforward question: has the model already encountered a given example during training? We therefore propose widespread adoption of Data Portraits: artifacts that record training data and allow for downstream inspection. First we outline the properties of such an artifact and discuss how existing solutions can be used to increase transparency. We then propose and implement a solution based on data sketching, stressing fast and space-efficient querying. Using our tool, we document a popular large language modeling corpus (the Pile) and show that our solution enables answering questions about test-set leakage and model plagiarism. Our tool is lightweight and fast, costing only 3% of the dataset size in overhead. We release a demo of our tools at dataportraits.org and call on dataset and model creators to release Data Portraits as a complement to current documentation practices.
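To make the sketching idea concrete, below is a minimal Python sketch of a Bloom-filter membership index over fixed-width character n-grams, the general style of data structure the abstract alludes to. Everything here is an illustrative assumption, not the released implementation: the class name `BloomSketch`, the 50-character gram width, the stride, and the hashing scheme are all hypothetical choices made for this example.

```python
import hashlib
import math


class BloomSketch:
    """Illustrative Bloom filter over fixed-width character n-grams.

    A hedged sketch of a membership index for training data: false
    positives are possible, false negatives are not. Not the official
    dataportraits.org implementation; names and parameters are assumed.
    """

    def __init__(self, capacity, error_rate=0.01, width=50, stride=50):
        # Size the bit array and hash count with standard Bloom filter math:
        # m = -n ln(p) / (ln 2)^2, k = (m / n) ln 2.
        self.num_bits = max(8, int(-capacity * math.log(error_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)
        self.width = width    # length of each character n-gram
        self.stride = stride  # step between indexed n-grams

    def _positions(self, gram):
        # Derive k bit positions via double hashing of one SHA-256 digest.
        digest = hashlib.sha256(gram.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add_document(self, text):
        # Record every stride-spaced n-gram of a training document.
        for start in range(0, max(1, len(text) - self.width + 1), self.stride):
            for pos in self._positions(text[start:start + self.width]):
                self.bits[pos // 8] |= 1 << (pos % 8)

    def contains(self, gram):
        # True means "probably seen during training."
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(gram))

    def overlap(self, text):
        # Fraction of a query's n-grams that hit the sketch:
        # a rough signal for test-set leakage or verbatim copying.
        grams = [text[i:i + self.width] for i in range(len(text) - self.width + 1)]
        if not grams:
            return 0.0
        return sum(self.contains(g) for g in grams) / len(grams)
```

Under these assumptions, a corpus is indexed once with `add_document`, and a benchmark example is checked with `overlap`; an overlap near 1.0 suggests the example appeared verbatim in training data, while near 0.0 suggests it did not. The bit array, not the raw text, is what would be distributed, which is why such a sketch can cost only a few percent of the dataset size.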