Random forests (RF) and deep networks (DN) are two of the most popular machine learning methods in the current scientific literature, and they yield differing levels of performance on different data modalities. We wish to further explore and establish the conditions and domains in which each approach excels, particularly with respect to sample size and feature dimension. To this end, we tested the performance of both approaches across tabular, image, and audio settings using varying model parameters and architectures. Our focus is on datasets with at most 10,000 samples, which represent a large fraction of scientific and biomedical datasets. In general, we found that RF excels on tabular data and on structured data (image and audio) with small sample sizes, whereas DN performs better on structured data with larger sample sizes. Although we plan to continue updating this technical report in the coming months, we believe the current preliminary results may be of interest to others.
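The kind of comparison described above can be sketched as a sweep over training-set sizes. This is a minimal illustration, not the setup used in the report. It assumes scikit-learn is available, uses synthetic tabular data from `make_classification`, and stands in a small multilayer perceptron (`MLPClassifier`) for a deep network. The function name `compare_rf_dn`, the sample sizes, and all hyperparameters are hypothetical choices made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier


def compare_rf_dn(sample_sizes=(100, 1000, 10000), n_features=20, seed=0):
    """Return {n_train: {"RF": accuracy, "DN": accuracy}} on a synthetic task."""
    # Assumption: synthetic tabular data, with a fixed held-out test set
    # so that every training size is scored on the same samples.
    n_test = 2000
    X, y = make_classification(
        n_samples=max(sample_sizes) + n_test,
        n_features=n_features,
        n_informative=n_features // 2,
        random_state=seed,
    )
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=n_test, stratify=y, random_state=seed
    )

    results = {}
    for n in sample_sizes:
        # Smaller training sets are nested inside the larger ones.
        X_train, y_train = X_pool[:n], y_pool[:n]
        models = {
            # Hyperparameters below are illustrative, not tuned.
            "RF": RandomForestClassifier(n_estimators=100, random_state=seed),
            "DN": MLPClassifier(
                hidden_layer_sizes=(64, 64), max_iter=300, random_state=seed
            ),
        }
        results[n] = {
            name: float(model.fit(X_train, y_train).score(X_test, y_test))
            for name, model in models.items()
        }
    return results


if __name__ == "__main__":
    for n_train, accuracies in compare_rf_dn().items():
        print(n_train, accuracies)
```

The MLP may emit a convergence warning at the chosen `max_iter`. A real comparison would tune both models, repeat over several seeds, and use the actual tabular, image, and audio datasets rather than synthetic features.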