We present the call for papers for the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus. This shared task is intended for participants with an interest in small-scale language modeling, human language acquisition, low-resource NLP, and cognitive modeling. In partnership with CoNLL and CMCL, we provide a platform for approaches to pretraining with a limited-size corpus sourced from data inspired by the input to children. The task has three tracks, two of which restrict the training data to pre-released datasets of 10M and 100M words and are dedicated to explorations of approaches such as architectural variations, self-supervised objectives, or curriculum learning. The final track only restricts the amount of text used, allowing innovation in the choice of the data, its domain, and even its modality (i.e., data from sources other than text is welcome). We will release a shared evaluation pipeline which scores models on a variety of benchmarks and tasks, including targeted syntactic evaluations and natural language understanding.