One truism of deep learning is that the automatic feature engineering performed in the first layers of these networks excuses data scientists from tedious manual feature engineering before running DL. For the specific case of deep learning for defect prediction, we show that this truism is false. Specifically, when we preprocess data with a novel oversampling technique called fuzzy sampling, as part of a larger pipeline called GHOST (Goal-oriented Hyper-parameter Optimization for Scalable Training), we do significantly better than the prior DL state of the art on 14/20 defect data sets. Our approach also yields these state-of-the-art results with significantly faster deep learners. These results present a cogent case for applying oversampling before deep learning on software defect prediction datasets.