极端多标签数据分批抽样 (Stratified Sampling for Extreme Multi-Label Data) - 专知论文

会员服务 ·

0

分层采样 · XPath · 划分 · Performer · binary ·

2021 年 3 月 5 日

Stratified Sampling for Extreme Multi-Label Data

翻译：极端多标签数据分批抽样

Maximillian Merrillees,Lan Du

Extreme multi-label classification (XML) is becoming increasingly relevant in the era of big data. Yet, there is no method for effectively generating stratified partitions of XML datasets. Instead, researchers typically rely on provided test-train splits that, 1) aren't always representative of the entire dataset, and 2) are missing many of the labels. This can lead to poor generalization ability and unreliable performance estimates, as has been established in the binary and multi-class settings. As such, this paper presents a new and simple algorithm that can efficiently generate stratified partitions of XML datasets with millions of unique labels. We also examine the label distributions of prevailing benchmark splits, and investigate the issues that arise from using unrepresentative subsets of data for model development. The results highlight the difficulty of stratifying XML data, and demonstrate the importance of using stratified partitions for training and evaluation.

翻译：在大数据时代,极端多标签分类(XML)正在变得日益重要。然而,没有有效生成XML数据集分层分割的方法。相反,研究人员通常依赖提供的测试-培训分解,其中1(1)并非总能代表整个数据集,2)缺少许多标签。这可能导致二进制和多级设置中确立的概括化能力差和性能估计不可靠。因此,本文件提出了一个新的、简单的算法,可以有效地生成具有数百万个独特标签的XML数据集分层分割。我们还检查了当前基准分解的标签分布情况,并调查了在模型开发中使用不具有代表性的一组数据所产生的问题。结果突出表明了压缩XML数据的困难,并展示了在培训和评估中使用分层分割分割分区的重要性。

0

相关内容

分层采样

【2021新书】模式、预测与行动：机器学习的故事，308页pdf

【2021新书】模式、预测与行动：机器学习的故事，308页pdf

专知会员服务

84+阅读 · 2021年2月15日

【经典书】用Python学数据科学(Data Science from Scratch)，464页pdf

【经典书】用Python学数据科学(Data Science from Scratch)，464页pdf

专知会员服务

43+阅读 · 2021年2月13日

【AAAI2021】对比聚类，Contrastive Clustering

【AAAI2021】对比聚类，Contrastive Clustering

专知会员服务

78+阅读 · 2021年1月30日

多标签学习的新趋势（2020 Survey）

多标签学习的新趋势（2020 Survey）

专知会员服务

43+阅读 · 2020年12月6日

零样本文本分类，Zero-Shot Learning for Text Classification

零样本文本分类，Zero-Shot Learning for Text Classification

专知会员服务

97+阅读 · 2020年5月31日

图解FixMatch的半监督学习，The Illustrated FixMatch for Semi-Supervised Learning

图解FixMatch的半监督学习，The Illustrated FixMatch for Semi-Supervised Learning

专知会员服务

26+阅读 · 2020年4月2日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

【ECML-PKDD 2019】基于种子样本的Web数据抽取（Web Data Extraction with Seed Samples）

【ECML-PKDD 2019】基于种子样本的Web数据抽取（Web Data Extraction with Seed Samples）

专知会员服务

8+阅读 · 2019年12月3日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

深度卷积神经网络中的降采样

深度卷积神经网络中的降采样

极市平台

12+阅读 · 2019年5月24日

IEEE | DSC 2019诚邀稿件 (EI检索)

IEEE | DSC 2019诚邀稿件 (EI检索)

Call4Papers

10+阅读 · 2019年2月25日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

Disentangled的假设的探讨

Disentangled的假设的探讨

CreateAMind

9+阅读 · 2018年12月10日

【SIGIR2018】五篇对抗训练文章

【SIGIR2018】五篇对抗训练文章

专知

12+阅读 · 2018年7月9日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

读书报告 | Deep Learning for Extreme Multi-label Text Classification

读书报告 | Deep Learning for Extreme Multi-label Text Classification

科技创新与创业

48+阅读 · 2018年1月10日

【推荐】用Tensorflow理解LSTM

【推荐】用Tensorflow理解LSTM

机器学习研究会

36+阅读 · 2017年9月11日

Auto-Encoding GAN

Auto-Encoding GAN

CreateAMind

7+阅读 · 2017年8月4日

Adaptive Methods for Real-World Domain Generalization

Arxiv

13+阅读 · 2021年3月29日

Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data

Arxiv

9+阅读 · 2021年2月8日

Does Data Augmentation Benefit from Split BatchNorms

Does Data Augmentation Benefit from Split BatchNorms

Arxiv

3+阅读 · 2020年10月15日

Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks

Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks

Arxiv

13+阅读 · 2020年6月24日

A Baseline for Few-Shot Image Classification

Arxiv

7+阅读 · 2020年3月1日

X-BERT: eXtreme Multi-label Text Classification with BERT

X-BERT: eXtreme Multi-label Text Classification with BERT

Arxiv

12+阅读 · 2019年7月4日

LNEMLC: Label Network Embeddings for Multi-Label Classification

Arxiv

3+阅读 · 2019年1月1日

AceKG: A Large-scale Knowledge Graph for Academic Data Mining

AceKG: A Large-scale Knowledge Graph for Academic Data Mining

Arxiv

6+阅读 · 2018年8月7日

Classification with Fairness Constraints: A Meta-Algorithm with Provable Guarantees

Classification with Fairness Constraints: A Meta-Algorithm with Provable Guarantees

Arxiv

3+阅读 · 2018年8月2日

Subset Labeled LDA for Large-Scale Multi-Label Classification

Arxiv

3+阅读 · 2017年9月16日

VIP会员

文章信息

相关主题

相关VIP内容

【2021新书】模式、预测与行动：机器学习的故事，308页pdf

【2021新书】模式、预测与行动：机器学习的故事，308页pdf

专知会员服务

84+阅读 · 2021年2月15日

【经典书】用Python学数据科学(Data Science from Scratch)，464页pdf

【经典书】用Python学数据科学(Data Science from Scratch)，464页pdf

专知会员服务

43+阅读 · 2021年2月13日

【AAAI2021】对比聚类，Contrastive Clustering

【AAAI2021】对比聚类，Contrastive Clustering

专知会员服务

78+阅读 · 2021年1月30日

多标签学习的新趋势（2020 Survey）

多标签学习的新趋势（2020 Survey）

专知会员服务

43+阅读 · 2020年12月6日

零样本文本分类，Zero-Shot Learning for Text Classification

零样本文本分类，Zero-Shot Learning for Text Classification

专知会员服务

97+阅读 · 2020年5月31日

图解FixMatch的半监督学习，The Illustrated FixMatch for Semi-Supervised Learning

图解FixMatch的半监督学习，The Illustrated FixMatch for Semi-Supervised Learning

专知会员服务

26+阅读 · 2020年4月2日

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

图像分类技巧集，17页ppt《Bag of Tricks for Image Classification》

专知会员服务

95+阅读 · 2020年3月12日

【ECML-PKDD 2019】基于种子样本的Web数据抽取（Web Data Extraction with Seed Samples）

【ECML-PKDD 2019】基于种子样本的Web数据抽取（Web Data Extraction with Seed Samples）

专知会员服务

8+阅读 · 2019年12月3日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

Deep Learning Based Detection and Correction of Cardiac MR Motion Artefacts During Reconstruction for High-Quality Segmentation

专知会员服务

59+阅读 · 2019年10月17日

热门VIP内容

开通专知VIP会员享更多权益服务

《美陆军特种作战条令》最新102页

《洛克希德SR-71“黑鸟”侦察机动力系统》21页slides

美空军作战实验室通过人工智能和指挥控制技术创新推进杀伤链

《指挥控制能力分析方法论》最新报告

相关资讯

深度卷积神经网络中的降采样

深度卷积神经网络中的降采样

极市平台

12+阅读 · 2019年5月24日

IEEE | DSC 2019诚邀稿件 (EI检索)

IEEE | DSC 2019诚邀稿件 (EI检索)

Call4Papers

10+阅读 · 2019年2月25日

Unsupervised Learning via Meta-Learning

Unsupervised Learning via Meta-Learning

CreateAMind

43+阅读 · 2019年1月3日

A Technical Overview of AI & ML in 2018 & Trends for 2019

A Technical Overview of AI & ML in 2018 & Trends for 2019

待字闺中

18+阅读 · 2018年12月24日

Disentangled的假设的探讨

Disentangled的假设的探讨

CreateAMind

9+阅读 · 2018年12月10日

【SIGIR2018】五篇对抗训练文章

【SIGIR2018】五篇对抗训练文章

专知

12+阅读 · 2018年7月9日

Hierarchical Disentangled Representations

Hierarchical Disentangled Representations

CreateAMind

4+阅读 · 2018年4月15日

读书报告 | Deep Learning for Extreme Multi-label Text Classification

读书报告 | Deep Learning for Extreme Multi-label Text Classification

科技创新与创业

48+阅读 · 2018年1月10日

【推荐】用Tensorflow理解LSTM

【推荐】用Tensorflow理解LSTM

机器学习研究会

36+阅读 · 2017年9月11日

Auto-Encoding GAN

Auto-Encoding GAN

CreateAMind

7+阅读 · 2017年8月4日

相关论文

Adaptive Methods for Real-World Domain Generalization

Arxiv

13+阅读 · 2021年3月29日

Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data

Arxiv

9+阅读 · 2021年2月8日

Does Data Augmentation Benefit from Split BatchNorms

Does Data Augmentation Benefit from Split BatchNorms

Arxiv

3+阅读 · 2020年10月15日

Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks

Minimal Variance Sampling with Provable Guarantees for Fast Training of Graph Neural Networks

Arxiv

13+阅读 · 2020年6月24日

A Baseline for Few-Shot Image Classification

Arxiv

7+阅读 · 2020年3月1日

X-BERT: eXtreme Multi-label Text Classification with BERT

X-BERT: eXtreme Multi-label Text Classification with BERT

Arxiv

12+阅读 · 2019年7月4日

LNEMLC: Label Network Embeddings for Multi-Label Classification

Arxiv

3+阅读 · 2019年1月1日

AceKG: A Large-scale Knowledge Graph for Academic Data Mining

AceKG: A Large-scale Knowledge Graph for Academic Data Mining

Arxiv

6+阅读 · 2018年8月7日

Classification with Fairness Constraints: A Meta-Algorithm with Provable Guarantees

Classification with Fairness Constraints: A Meta-Algorithm with Provable Guarantees

Arxiv

3+阅读 · 2018年8月2日

Subset Labeled LDA for Large-Scale Multi-Label Classification

Arxiv

3+阅读 · 2017年9月16日

微信扫码咨询专知VIP会员