Trees, Forests, Chickens, and Eggs: When and Why to Prune Trees in a Random Forest - 专知论文


Random forest · Regularization term · Pruning · Continuity · SimPLe

March 30, 2021

Trees, Forests, Chickens, and Eggs: When and Why to Prune Trees in a Random Forest


Siyu Zhou, Lucas Mentch

Due to their long-standing reputation as excellent off-the-shelf predictors, random forests remain a go-to model for applied statisticians and data scientists. Despite their widespread use, however, little was known until recently about their inner workings or about which aspects of the procedure drive their success. Very recently, two competing hypotheses have emerged: one based on interpolation and the other based on regularization. This work argues in favor of the latter by using the regularization framework to reexamine the decades-old question of whether individual trees in an ensemble ought to be pruned. Although the default constructions of random forests in most popular software packages use trees grown to near full depth, here we provide strong evidence that tree depth should be seen as a natural form of regularization across the entire procedure. In particular, our work suggests that random forests with shallow trees are advantageous when the signal-to-noise ratio in the data is low. In building up this argument, we also critique the newly popular notion of "double descent" in random forests by drawing parallels to U-statistics and arguing that the noticeable jumps in random forest accuracy are the result of simple averaging rather than interpolation.
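The claim that tree depth acts as a regularizer can be explored with a toy experiment. The following is a minimal sketch and not the authors' experimental setup: it assumes a one-dimensional sine signal with Gaussian noise, bagged regression trees without feature subsampling, and only the Python standard library. All function names (`fit_tree`, `predict_tree`, `forest_test_mse`) are hypothetical and chosen for illustration.

```python
import math
import random
import statistics


def fit_tree(xs, ys, max_depth):
    """Greedy 1-D regression tree. max_depth=None grows the tree to full depth."""
    if max_depth == 0 or len(xs) < 2:
        return ("leaf", statistics.fmean(ys))
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    sx = [xs[i] for i in order]
    sy = [ys[i] for i in order]
    n, total = len(sx), sum(sy)
    best_score, best_k, left_sum = None, None, 0.0
    for k in range(1, n):
        left_sum += sy[k - 1]
        if sx[k - 1] == sx[k]:
            continue  # cannot split between identical x values
        # Maximizing this score is equivalent to minimizing the summed squared error.
        score = left_sum ** 2 / k + (total - left_sum) ** 2 / (n - k)
        if best_score is None or score > best_score:
            best_score, best_k = score, k
    if best_k is None:  # all x values identical: no valid split
        return ("leaf", statistics.fmean(sy))
    threshold = (sx[best_k - 1] + sx[best_k]) / 2
    child_depth = None if max_depth is None else max_depth - 1
    return ("node", threshold,
            fit_tree(sx[:best_k], sy[:best_k], child_depth),
            fit_tree(sx[best_k:], sy[best_k:], child_depth))


def predict_tree(tree, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while tree[0] == "node":
        tree = tree[2] if x <= tree[1] else tree[3]
    return tree[1]


def forest_test_mse(noise_sd, max_depth, seed=0, n_train=200, n_test=200, n_trees=25):
    """Train a bagged forest on noisy data; return MSE against the noiseless signal."""
    rng = random.Random(seed)

    def signal(x):
        return math.sin(2 * math.pi * x)  # assumed toy signal

    xs = [rng.random() for _ in range(n_train)]
    ys = [signal(x) + rng.gauss(0.0, noise_sd) for x in xs]
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n_train) for _ in range(n_train)]  # bootstrap sample
        forest.append(fit_tree([xs[i] for i in idx], [ys[i] for i in idx], max_depth))
    test_xs = [rng.random() for _ in range(n_test)]
    errors = []
    for x in test_xs:
        pred = statistics.fmean([predict_tree(t, x) for t in forest])
        errors.append((pred - signal(x)) ** 2)
    return statistics.fmean(errors)


if __name__ == "__main__":
    # Larger noise_sd means a lower signal-to-noise ratio.
    for noise_sd in (0.1, 3.0):
        for depth in (2, None):
            mse = forest_test_mse(noise_sd, depth)
            print(f"noise_sd={noise_sd}, max_depth={depth}: test MSE = {mse:.4f}")
```

Comparing the printed errors for `max_depth=2` against `max_depth=None` at each noise level gives a rough sense of how depth trades bias for variance. The sketch evaluates against the noiseless signal so that irreducible noise does not mask the difference between the two settings.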



Random forest

A random forest is a classifier that trains multiple trees on the samples and combines them to make predictions.
