A vital step in solving classification or regression problems is to apply feature engineering and variable selection to the data before feeding it into a model. One of the most popular feature engineering methods is to discretize a continuous variable at a set of cutting points, a procedure referred to as binning. Good cutting points are important for improving a model's performance, because a well-chosen binning can suppress noisy variation within the range of a continuous variable while preserving useful level information through an ordered encoding. However, to the best of our knowledge, cutting points are mostly selected via the researcher's domain knowledge or via naive methods such as equal-width or equal-frequency cutting. In this paper we propose an end-to-end supervised cutting point selection method based on group and fused lasso, which also performs automatic variable selection. We name our method \textbf{ABM} (automatic binning machine). We first cut each variable's range into a fine grid of bins and then train the model with group lasso and group fused lasso regularization applied to the successive bins. The method integrates feature engineering, variable selection, and model training simultaneously. Moreover, it is flexible enough to be plugged into any loss-function-based model, including deep neural networks. We have implemented the method in R and released the source code; a Python version will follow shortly.
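To make the construction concrete, one plausible form of the penalized objective is sketched below; the notation (bin indicators $z_{ij}$, per-variable bin coefficients $\beta_j$, bin counts $K_j$, and tuning parameters $\lambda_1$, $\lambda_2$) is introduced here only for illustration and may differ from the notation used in the body of the paper, and the fused term is written as a simple total-variation penalty over adjacent scalar bin coefficients.
\[
\min_{\beta_0,\,\beta_1,\dots,\beta_p}\;
\frac{1}{n}\sum_{i=1}^{n} \ell\!\left(y_i,\;\beta_0 + \sum_{j=1}^{p} z_{ij}^{\top}\beta_j\right)
\;+\; \lambda_1 \sum_{j=1}^{p} \sqrt{K_j}\,\lVert \beta_j \rVert_2
\;+\; \lambda_2 \sum_{j=1}^{p} \sum_{k=2}^{K_j} \bigl|\beta_{j,k} - \beta_{j,k-1}\bigr|,
\]
where variable $j$ is pre-cut into $K_j$ fine bins and $z_{ij} \in \{0,1\}^{K_j}$ is the one-hot bin indicator of observation $i$. Under this sketch, the group lasso term can set an entire coefficient vector $\beta_j$ to zero, removing the variable, while the fused term drives adjacent bin coefficients to be equal, so that neighboring fine bins merge and the surviving change points act as learned cutting points.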