On consistency scores in text data with an implementation in R - Zhuanzhi (专知) Papers


January 13, 2021

On consistency scores in text data with an implementation in R


Ke-Li Chiu, Rohan Alexander

from arxiv, 13 pages, 0 figures

In this paper, we introduce a reproducible cleaning process for the text extracted from PDFs using n-gram models. Our approach compares the originally extracted text with the text generated from, or expected by, these models using earlier text as stimulus. To guide this process, we introduce the notion of a consistency score, which refers to the proportion of text that is expected by the model. This is used to monitor changes during the cleaning process, and across different corpuses. We illustrate our process on text from the book Jane Eyre and introduce both a Shiny application and an R package to make our process easier for others to adopt.
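The idea of a consistency score can be sketched in a few lines. The following is a minimal illustration in Python, not the authors' R package or Shiny application: the function names (`train_ngram`, `consistency_score`), the choice of trigrams, whitespace tokenization, and the rule that a word counts as "expected" if the model has seen it after the same preceding words are all assumptions made for this example. The sample sentence is the opening line of Jane Eyre with one simulated extraction error.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def train_ngram(tokens, n=3):
    """Map each (n-1)-word context to the set of words seen after it."""
    model = defaultdict(set)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        model[context].add(tokens[i + n - 1])
    return model


def consistency_score(tokens, model, n=3):
    """Proportion of words that the model expects given the preceding n-1 words.

    Assumption: a word is "expected" if it appears in the model's set of
    continuations for the context formed by the earlier text (the stimulus).
    """
    checked = 0
    expected = 0
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        actual = tokens[i + n - 1]
        checked += 1
        # .get avoids inserting empty entries into the defaultdict
        if actual in model.get(context, ()):
            expected += 1
    return expected / checked if checked else 0.0


# Reference text used to build the model (hypothetical training data).
reference = "there was no possibility of taking a walk that day"
model = train_ngram(tokenize(reference))

# Extracted text with a simulated OCR error ("possibi1ity").
extracted = "there was no possibi1ity of taking a walk that day"

clean_score = consistency_score(tokenize(reference), model)
dirty_score = consistency_score(tokenize(extracted), model)
```

In this toy case the single garbled word lowers the score for three positions: the one where it is the predicted word and the two where it is part of the context. Tracking such a score before and after each cleaning step is one way to monitor whether the step helped.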

