One concern with the rise of large language models lies with their potential for significant harm, particularly from pretraining on biased, obscene, copyrighted, and private information. Emerging ethical approaches have attempted to filter pretraining material, but such approaches have been ad hoc and failed to take context into account. We offer an approach to filtering grounded in law, which has directly addressed the tradeoffs in filtering material. First, we gather and make available the Pile of Law, a 256GB (and growing) dataset of open-source English-language legal and administrative data, covering court opinions, contracts, administrative rules, and legislative records. Pretraining on the Pile of Law may help with legal tasks that hold promise for improving access to justice. Second, we distill the legal norms that governments have developed to constrain the inclusion of toxic or private content into actionable lessons for researchers, and discuss how our dataset reflects these norms. Third, we show how the Pile of Law offers researchers the opportunity to learn such filtering rules directly from the data, providing an exciting new research direction in model-based processing.