In this paper, we present DendroMap, a novel approach to interactively exploring large-scale image datasets for machine learning. Machine learning practitioners often explore image datasets by generating a grid of images or by projecting high-dimensional representations of images into 2-D using dimensionality reduction techniques (e.g., t-SNE). However, neither approach scales effectively to large datasets, because images are poorly organized and interactions are insufficiently supported. To address these challenges, we develop DendroMap by adapting Treemaps, a well-known visualization technique. DendroMap organizes images effectively by extracting hierarchical cluster structures from high-dimensional representations of images. It enables users to make sense of the overall distributions of datasets and to interactively zoom into specific areas of interest at multiple levels of abstraction. Our case studies with widely used image datasets for deep learning demonstrate that users can discover insights about datasets and trained models by examining the diversity of images, identifying underperforming subgroups, and analyzing classification errors. We conducted a user study that evaluated the effectiveness of DendroMap in grouping and searching tasks by comparing it with a gridified version of t-SNE, and found that participants preferred DendroMap over the compared method.
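The core idea above, extracting a cluster hierarchy from high-dimensional representations and laying it out as a treemap, can be illustrated with a minimal sketch. It is not DendroMap's implementation: the abstract does not name a clustering or layout algorithm, so average-linkage agglomerative clustering and a simple slice-and-dice layout are assumed here, random vectors stand in for image representations, and the names `agglomerate` and `treemap` are hypothetical.

```python
import math
import random


def agglomerate(points):
    """Average-linkage agglomerative clustering (an assumed choice).

    Returns a nested binary tree of dicts with 'members' (point indices)
    and 'children' (empty for leaves).
    """
    clusters = [{"members": [i], "children": []} for i in range(len(points))]

    def dist(a, b):
        # Mean pairwise Euclidean distance between two clusters.
        total = sum(math.dist(points[i], points[j])
                    for i in a["members"] for j in b["members"])
        return total / (len(a["members"]) * len(b["members"]))

    while len(clusters) > 1:
        # Find and merge the closest pair of clusters.
        _, i, j = min((dist(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        merged = {"members": clusters[i]["members"] + clusters[j]["members"],
                  "children": [clusters[i], clusters[j]]}
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


def treemap(node, x, y, w, h, depth=0, max_depth=2):
    """Slice-and-dice layout (an assumed choice).

    Cuts the tree at max_depth and returns (members, x, y, w, h) tuples,
    with each rectangle's area proportional to its cluster size.
    """
    if depth == max_depth or not node["children"]:
        return [(node["members"], x, y, w, h)]
    rects, offset = [], 0.0
    total = len(node["members"])
    for child in node["children"]:
        frac = len(child["members"]) / total
        if depth % 2 == 0:   # even depth: split along the x axis
            rects += treemap(child, x + offset * w, y, frac * w, h,
                             depth + 1, max_depth)
        else:                # odd depth: split along the y axis
            rects += treemap(child, x, y + offset * h, w, frac * h,
                             depth + 1, max_depth)
        offset += frac
    return rects


# Stand-in data: 12 random 8-D vectors around three centers (not real images).
rng = random.Random(0)
points = [[c + rng.gauss(0, 0.1) for _ in range(8)]
          for c in (0.0, 5.0, 10.0) for _ in range(4)]
tree = agglomerate(points)
for members, x, y, w, h in treemap(tree, 0.0, 0.0, 1.0, 1.0, max_depth=2):
    print(sorted(members), round(x, 2), round(y, 2), round(w, 2), round(h, 2))
```

Raising `max_depth` cuts the hierarchy lower and yields more, smaller rectangles, which loosely corresponds to viewing the dataset at a finer level of abstraction. The pairwise merge search here is far too slow for large datasets; it is only meant to make the structure concrete.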