Machine learning (ML) offers powerful methods for detecting and modeling associations, often in data with large feature spaces and complex relationships. Many useful tools and packages (e.g., scikit-learn) have been developed to make the various elements of data handling, processing, modeling, and interpretation accessible. However, it is not trivial for most investigators to assemble these elements into a rigorous, replicable, unbiased, and effective data analysis pipeline. Automated machine learning (AutoML) seeks to address these issues by simplifying the process of ML analysis for all. Here, we introduce STREAMLINE, a simple, transparent, end-to-end AutoML pipeline designed as a framework to easily conduct rigorous ML modeling and analysis (limited initially to binary classification). STREAMLINE is specifically designed to compare performance between datasets, ML algorithms, and other AutoML tools. It is unique among AutoML tools in offering a fully transparent and consistent baseline of comparison using a carefully designed series of pipeline elements, including: (1) exploratory analysis, (2) basic data cleaning, (3) cross-validation partitioning, (4) data scaling and imputation, (5) filter-based feature importance estimation, (6) collective feature selection, (7) ML modeling with Optuna hyperparameter optimization across 15 established algorithms (including the less well-known genetic programming and rule-based ML), (8) evaluation across 16 classification metrics, (9) model feature importance estimation, (10) statistical significance comparisons, and (11) automated export of all results, plots, a PDF summary report, and models that can be easily applied to replication data.
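To make elements (3) and (4) concrete, the following minimal Python sketch (standard library only) partitions samples into stratified folds and then learns imputation and scaling parameters from each training partition alone, which is one plausible reason for placing partitioning before scaling and imputation. It is not STREAMLINE's implementation: the function names (`stratified_folds`, `fit_impute_scale`, `apply_impute_scale`, `cross_validation_splits`), the choice of median imputation with standard scaling, and the single-feature data layout are all assumptions made for illustration.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each row index to one of k folds, keeping class ratios similar."""
    rng = random.Random(seed)  # fixed seed so the partitioning is repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so each fold gets a near-equal share of the class.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return [sorted(fold) for fold in folds]


def fit_impute_scale(train_values):
    """Learn the imputation median and scaling mean/std from training values only."""
    observed = [v for v in train_values if v is not None]
    median = statistics.median(observed)
    filled = [median if v is None else v for v in train_values]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled) or 1.0  # avoid dividing by zero on constant features
    return median, mean, std


def apply_impute_scale(values, params):
    """Impute missing values (None) and standardize using previously learned parameters."""
    median, mean, std = params
    return [((median if v is None else v) - mean) / std for v in values]


def cross_validation_splits(feature, labels, k=3, seed=0):
    """Yield (train_idx, test_idx, train_x, test_x) with per-fold preprocessing."""
    folds = stratified_folds(labels, k, seed)
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        # Parameters come from the training partition; the test partition is only transformed.
        params = fit_impute_scale([feature[i] for i in train_idx])
        train_x = apply_impute_scale([feature[i] for i in train_idx], params)
        test_x = apply_impute_scale([feature[i] for i in test_idx], params)
        yield train_idx, test_idx, train_x, test_x


if __name__ == "__main__":
    # Hypothetical toy data: one feature with two missing values, balanced binary labels.
    feature = [1.0, None, 3.0, 4.0, 5.0, None, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0]
    labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    for train_idx, test_idx, train_x, test_x in cross_validation_splits(feature, labels):
        print(len(train_idx), len(test_idx))
```

Fitting the preprocessing parameters inside each fold keeps information from the held-out partition out of training, so performance estimates are less likely to be optimistically biased. A real pipeline would apply the same idea per feature across a full data table.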