PMLB (Penn Machine Learning Benchmark) is an open-source data repository containing a curated collection of datasets for evaluating and comparing machine learning (ML) algorithms. Compiled from a broad range of existing ML benchmark collections, PMLB synthesizes and standardizes hundreds of publicly available datasets from diverse sources such as the UCI ML repository and OpenML, enabling systematic assessment of different ML methods. These datasets cover a range of applications, from binary and multi-class classification to regression problems, with combinations of categorical and continuous features. PMLB has a Python interface (pmlb) and an R interface (pmlbr), each with detailed documentation, that let the user access cleaned and formatted datasets with a single function call. PMLB also provides a comprehensive description of each dataset and advanced functions for exploring the dataset space, allowing for a smoother user experience and easier handling of data. The resource is designed to facilitate open-source contributions in the form of new datasets as well as improvements to curation.
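As a rough illustration of the single-function-call access described above, the sketch below wraps `fetch_data` from the Python `pmlb` package and reports a few basic statistics for one dataset. The helper `summarize_dataset` and its `fetch` parameter are hypothetical names introduced here for illustration; they are not part of PMLB. The `pmlb` import is deferred so that the helper can also be exercised with a stand-in loader when the package or a network connection is unavailable.

```python
def summarize_dataset(name, fetch=None):
    """Load one dataset by name and report basic shape statistics.

    `fetch` is a hypothetical hook: it defaults to pmlb.fetch_data, but any
    callable with the same (name, return_X_y=True) -> (X, y) contract works.
    """
    if fetch is None:
        # Imported lazily so this helper can be defined without pmlb installed.
        from pmlb import fetch_data as fetch

    # With return_X_y=True the loader returns the feature matrix and target.
    X, y = fetch(name, return_X_y=True)

    return {
        "name": name,
        "n_samples": len(X),
        "n_features": len(X[0]) if len(X) else 0,
        # Number of distinct target values; only meaningful for classification.
        "n_distinct_targets": len(set(y)),
    }
```

With `pmlb` installed, calling `summarize_dataset(name)` with a name taken from `pmlb.classification_dataset_names` would download (or read from cache) that dataset and summarize it; the package also exposes `regression_dataset_names` for the regression problems.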