Evaluating Various Tokenizers for Arabic Text Classification - 专知 Papers

Tokenizer · Text classification · Performer · Dataset · Vocabulary

June 14, 2021

Evaluating Various Tokenizers for Arabic Text Classification

Zaid Alyafeai, Maged S. Al-shaibani, Mustafa Ghaleb, Irfan Ahmad

The first step in any NLP pipeline is learning word vector representations. However, given a large text corpus, representing every word is inefficient. Many tokenization algorithms have emerged in the literature to tackle this problem by creating subwords, which in turn limits the vocabulary size for any text corpus. However, such algorithms are mostly language-agnostic and lack a proper way of capturing meaningful tokens, and evaluating them in practice is difficult. In this paper, we introduce three new tokenization algorithms for Arabic and compare them to three other baselines using unsupervised evaluations. In addition, we compare all six algorithms on three tasks: sentiment analysis, news classification, and poetry classification. Our experiments show that the performance of such tokenization algorithms depends on the size of the dataset, the type of task, and the amount of morphology that exists in the dataset.
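To make the idea of "creating subwords" to limit vocabulary size concrete, here is a minimal sketch of byte-pair-encoding-style merging, one well-known language-agnostic subword technique. The abstract does not name the algorithms it proposes or the baselines it compares, so this is only an illustration under that assumption and not the paper's method; the function names `learn_bpe`, `merge_pair`, and `tokenize` are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE, assumption-level sketch)."""
    # Each word starts as a tuple of single characters; Counter keeps word frequencies.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Split `word` into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = learn_bpe(["abab", "abab", "abc"], num_merges=2)
print(merges)                    # [('a', 'b'), ('ab', 'ab')]
print(tokenize("abc", merges))   # ['ab', 'c']
print(tokenize("cab", merges))   # ['c', 'ab']
```

The number of merges caps how many subword units are added on top of the base characters, which is how this family of methods bounds vocabulary size. Because the merges are driven purely by frequency, nothing ties the resulting units to morphemes, which is the limitation the abstract points to for a morphologically rich language such as Arabic.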
