An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection - 专知 (Zhuanzhi) Papers

January 9, 2021

An Unsupervised Normalization Algorithm for Noisy Text: A Case Study for Information Retrieval and Stance Detection

Anurag Roy, Shalmoli Ghosh, Kripabandhu Ghosh, Saptarshi Ghosh

From arXiv. To appear in the ACM Journal of Data and Information Quality. Implementation available at https://github.com/ranarag/UnsupClean

A large fraction of the textual data available today contains various types of 'noise', such as OCR noise in digitized documents and noise due to the informal writing style of users on microblogging sites. To enable tasks such as search/retrieval and classification over all the available data, we need robust algorithms for text normalization, i.e., for cleaning different kinds of noise in the text. There have been several efforts towards cleaning or normalizing noisy text; however, many of the existing text normalization methods are supervised and require language-dependent resources or large amounts of training data that are difficult to obtain. We propose an unsupervised algorithm for text normalization that needs no training data or human intervention. The proposed algorithm is applicable to text in different languages, and can handle both machine-generated and human-generated noise. Experiments over several standard datasets show that text normalization through the proposed algorithm enables better retrieval and stance detection than several baseline text normalization methods. An implementation of our algorithm can be found at https://github.com/ranarag/UnsupClean.
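To make the idea of unsupervised normalization concrete, here is a minimal toy sketch in Python. It is not the authors' algorithm (the abstract does not describe its internals; see the repository above for the actual implementation). It assumes only that noisy variants of a word are rarer than the clean form and lexically close to it, and it uses nothing but the corpus itself: no labels, dictionaries, or training data. The function names `build_normalizer` and `normalize` and the threshold `min_ratio` are hypothetical, chosen for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def build_normalizer(tokens, min_ratio=0.8):
    """Map each word to a more frequent, lexically similar word in the corpus.

    Toy heuristic for illustration only; this is NOT the UnsupClean algorithm.
    """
    counts = Counter(tokens)
    # Visit words from most to least frequent (ties broken alphabetically),
    # so that frequent spellings become cluster representatives first.
    vocab = sorted(counts, key=lambda w: (-counts[w], w))
    representatives = []
    mapping = {}
    for word in vocab:
        target = None
        for rep in representatives:
            # String similarity in [0, 1]; take the first representative
            # that is similar enough.
            if SequenceMatcher(None, word, rep).ratio() >= min_ratio:
                target = rep
                break
        if target is None:
            # No similar representative exists: the word represents itself.
            representatives.append(word)
            target = word
        mapping[word] = target
    return mapping


def normalize(text, mapping):
    """Replace every whitespace-separated token by its representative."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())


corpus = (
    "the government said the govrnment plan helps "
    "the goverment and the government agreed"
).split()
mapping = build_normalizer(corpus)
print(normalize("the govrnment plan", mapping))  # the government plan
```

A pure string-similarity heuristic like this one will merge distinct words that happen to be spelled alike, which is one reason a practical method would need additional evidence beyond surface form.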
