Jupyter notebooks are widely used for machine learning (ML) prototyping. Yet few debugging tools are designed for ML code in notebooks, partly due to the lack of benchmarks. We introduce JunoBench, the first benchmark dataset of real-world crashes in Python-based ML notebooks. JunoBench includes 111 curated, reproducible crashes with verified fixes from public Kaggle notebooks, covering popular ML libraries (e.g., TensorFlow/Keras, PyTorch, Scikit-learn) as well as notebook-specific out-of-order execution errors. A unified execution environment reproduces all crashes reliably, ensuring reproducibility and ease of use. By providing realistic crashes, their resolutions, richly annotated labels of crash characteristics, and natural-language diagnostic annotations, JunoBench facilitates research on bug detection, localization, diagnosis, and repair in notebook-based ML development.