This paper presents the first publicly available version of the Carolina Corpus and discusses its future directions. Carolina is a large open corpus of Brazilian Portuguese texts, currently under construction, built with a web-as-corpus methodology enhanced with provenance, typology, versioning, and text integrality. The corpus is intended to serve both as a reliable source for research in Linguistics and as an important resource for Computer Science research on language models, helping to remove Portuguese from the set of low-resource languages. Here we present the methodology used to construct the corpus, compare it with other existing methodologies, and describe the corpus's current state: Carolina's first public version has $653,322,577$ tokens, distributed over $7$ broad types. Each text is annotated with several metadata categories in its header, which we developed using TEI annotation standards. We also present ongoing derivative works and invite NLP researchers to contribute their own.