Carolina: a General Corpus of Contemporary Brazilian Portuguese with Provenance, Typology and Versioning Information


Tags: corpus · corpus data · typology · annotation (programming) · low-resource

March 28, 2023


Maria Clara Ramos Morales Crespo, Maria Lina de Souza Jeannine Rocha, Mariana Lourenço Sturzeneker, Felipe Ribas Serras, Guilherme Lamartine de Mello, Aline Silva Costa, Mayara Feliciano Palma, Renata Morais Mesquita, Raquel de Paula Guets, Mariana Marques da Silva, Marcelo Finger, Maria Clara Paixão de Sousa, Cristiane Namiuti, Vanessa Martins do Monte

From arXiv; 14 pages, 3 figures, 1 appendix

This paper presents the first publicly available version of the Carolina Corpus and discusses its future directions. Carolina is a large open corpus of Brazilian Portuguese texts, under construction using a web-as-corpus methodology enhanced with provenance, typology, versioning, and text integrality. The corpus is intended both as a reliable source for research in Linguistics and as an important resource for Computer Science research on language models, contributing towards removing Portuguese from the set of low-resource languages. Here we present the methodology used to construct the corpus, comparing it with other existing methodologies, as well as the current state of the corpus: Carolina's first public version has 653,322,577 tokens, distributed over 7 broad types. Each text is annotated with several different metadata categories in its header, which we developed using TEI annotation standards. We also present ongoing derivative works and invite NLP researchers to contribute their own.
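As a rough illustration of what header-level metadata in a TEI-encoded text can look like to a downstream user, the sketch below reads a title, a provenance pointer, and a type label from a TEI header using Python's standard library. The sample document, its element layout, the URL, and the `EXAMPLE_TYPE` label are all hypothetical and are not taken from the Carolina header schema; only the TEI namespace and the element names (`teiHeader`, `fileDesc`, `titleStmt`, `sourceDesc`, `profileDesc`, `textClass`, `catRef`) come from the TEI standard itself. The function name `read_header` is likewise an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the TEI standard for TEI P5 documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample document: layout and values are illustrative only
# and do not reproduce the actual Carolina header schema.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Exemplo de texto</title></titleStmt>
      <publicationStmt><p>Illustrative sample</p></publicationStmt>
      <sourceDesc>
        <bibl><ptr target="https://example.org/texto"/></bibl>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <textClass><catRef target="#EXAMPLE_TYPE"/></textClass>
    </profileDesc>
  </teiHeader>
  <text><body><p>Olá, mundo.</p></body></text>
</TEI>"""


def read_header(xml_text):
    """Return title, source URL, type label, and body text of one TEI text."""
    root = ET.fromstring(xml_text)
    header = root.find("tei:teiHeader", TEI_NS)

    title = header.findtext(
        "tei:fileDesc/tei:titleStmt/tei:title", namespaces=TEI_NS
    )
    # Provenance: where the text was collected from (hypothetical placement).
    ptr = header.find("tei:fileDesc/tei:sourceDesc/tei:bibl/tei:ptr", TEI_NS)
    # Typology: a pointer to a category definition (hypothetical placement).
    cat = header.find("tei:profileDesc/tei:textClass/tei:catRef", TEI_NS)
    body = root.findtext("tei:text/tei:body/tei:p", namespaces=TEI_NS)

    return {
        "title": title,
        "source": ptr.get("target") if ptr is not None else None,
        "type": cat.get("target").lstrip("#") if cat is not None else None,
        "body": body,
    }


if __name__ == "__main__":
    print(read_header(SAMPLE))
```

Keeping the metadata in the header and the text in the body, as TEI does, lets a language-model pipeline read only the body while a linguistic study filters texts by header fields first.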

