Evaluating new techniques on realistic datasets plays a crucial role in the development of ML research and its broader adoption by practitioners. In recent years, there has been a significant increase in publicly available unstructured data resources for computer vision and NLP tasks. However, tabular data, which is prevalent in many high-stakes domains, has been lagging behind. To bridge this gap, we present Bank Account Fraud (BAF), the first publicly available, privacy-preserving, large-scale, realistic suite of tabular datasets. The suite was generated by applying state-of-the-art tabular data generation techniques to an anonymized, real-world bank account opening fraud detection dataset. This setting carries a set of challenges that are commonplace in real-world applications, including temporal dynamics and significant class imbalance. Additionally, to allow practitioners to stress-test both the performance and the fairness of ML methods, each dataset variant of BAF contains specific types of data bias. With this resource, we aim to provide the research community with a more realistic, complete, and robust test bed for evaluating novel and existing methods.