When learning from very small data sets, the resulting models can make many mistakes. For example, consider learning predictors for open source project health. The training data for this task may be very small (e.g., five years of data, collected monthly, yields just 60 rows of training data). Using this data, prior work produced learned predictors with unacceptably large errors. We show that these high error rates can be tamed by better configuration of the control parameters of the machine learners. For example, we present here a {\em landscape analytics} method (called SNEAK) that (a)~clusters the data to find the general landscape of the hyperparameters; then (b)~explores a few representatives from each part of that landscape. SNEAK is both faster and more effective than prior state-of-the-art hyperparameter optimization algorithms (FLASH, HYPEROPT, OPTUNA, and differential evolution). More importantly, the configurations found by SNEAK had far less error than those found by other methods. We conjecture that SNEAK works so well because it finds the most informative regions of the hyperparameter space, then jumps to those regions. Other methods (which do not reflect over the landscape) can waste time exploring less informative options. From this, we draw the following conclusions. Firstly, for predicting open source project health, we recommend landscape analytics (e.g., SNEAK). Secondly, and more generally, when learning from very small data sets, we recommend using hyperparameter optimization (e.g., SNEAK) to select the control parameters of the learners. Due to its speed and implementation simplicity, we suggest SNEAK might also be useful in other ``data-light'' SE domains. To assist other researchers in repeating, improving, or even refuting our results, all our scripts and data are available on GitHub at https://github.com/zxcv123456qwe/niSneak
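The two-step procedure described above (cluster the candidate configurations to map the landscape, then evaluate only a few representatives) can be sketched as follows. This is a minimal illustrative sketch, not the authors' SNEAK implementation: plain k-means stands in for the clustering step, and `toy_loss` is a hypothetical objective introduced purely for demonstration.

```python
import random

def toy_loss(cfg):
    # Hypothetical objective over a normalized config (x, y) in [0,1]^2;
    # the (fictional) best setting sits near (0.3, 0.7).
    x, y = cfg
    return (x - 0.3) ** 2 + (y - 0.7) ** 2

def dist2(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def centroid(pts):
    return tuple(sum(p[i] for p in pts) / len(pts) for i in range(len(pts[0])))

def landscape_search(n_candidates=200, k=8, seed=1):
    rng = random.Random(seed)
    cands = [(rng.random(), rng.random()) for _ in range(n_candidates)]
    # (a) cluster candidate configurations to map the general landscape
    # (plain k-means here; the abstract does not specify SNEAK's clusterer)
    centers = rng.sample(cands, k)
    groups = [[] for _ in range(k)]
    for _ in range(15):
        groups = [[] for _ in range(k)]
        for p in cands:
            groups[min(range(k), key=lambda i: dist2(p, centers[i]))].append(p)
        centers = [centroid(g) if g else centers[i] for i, g in enumerate(groups)]
    # (b) evaluate only one representative (the centroid) per cluster,
    # instead of paying for the objective on every candidate
    reps = [c for c, g in zip(centers, groups) if g]
    return min(reps, key=toy_loss)

best = landscape_search()
```

The key cost saving is in step (b): the (possibly expensive) objective is called `k` times rather than once per candidate, which is why landscape-style methods can be faster than exhaustive or purely sequential optimizers.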