In this paper, we present DendroMap, a novel approach to interactively exploring large-scale image datasets for machine learning (ML). ML practitioners often explore image datasets by generating a grid of images or by projecting high-dimensional representations of images into 2-D using dimensionality reduction techniques (e.g., t-SNE). However, neither approach scales effectively to large datasets, because both organize images poorly and offer limited support for interaction. To address these challenges, we develop DendroMap by adapting Treemaps, a well-known visualization technique. DendroMap organizes images by extracting hierarchical cluster structures from high-dimensional representations of images. It enables users to make sense of the overall distribution of a dataset and to interactively zoom into specific areas of interest at multiple levels of abstraction. Our case studies with widely used image datasets for deep learning demonstrate that users can discover insights about datasets and trained models by examining the diversity of images, identifying underperforming subgroups, and analyzing classification errors. We conducted a user study that evaluates the effectiveness of DendroMap in grouping and searching tasks by comparing it with a gridified version of t-SNE, and found that participants preferred DendroMap. DendroMap is available at https://div-lab.github.io/dendromap/.
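The core idea described above, extracting a cluster hierarchy from image representations and laying it out as a treemap, can be illustrated with a minimal sketch. This is not DendroMap's implementation: the abstract does not specify the clustering method or layout algorithm, so the sketch assumes average-linkage agglomerative clustering and a simple slice-and-dice layout, and the names `build_hierarchy`, `leaf_count`, and `layout` are hypothetical.

```python
import math


def build_hierarchy(points):
    """Agglomerative clustering (average linkage, an assumption) over vectors.

    Returns a nested tuple tree whose leaves are indices into `points`.
    """
    # Each cluster is a pair: (subtree, list of member indices).
    clusters = [(i, [i]) for i in range(len(points))]

    def avg_dist(a, b):
        total = sum(math.dist(points[i], points[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the closest pair of clusters and merge them.
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = avg_dist(clusters[x][1], clusters[y][1])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        merged = (
            (clusters[x][0], clusters[y][0]),
            clusters[x][1] + clusters[y][1],
        )
        rest = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters = rest + [merged]
    return clusters[0][0]


def leaf_count(tree):
    """Number of images (leaves) under a node of the hierarchy."""
    if isinstance(tree, int):
        return 1
    return sum(leaf_count(child) for child in tree)


def layout(tree, x, y, w, h, depth=0, out=None):
    """Slice-and-dice treemap layout: maps leaf index -> (x, y, w, h).

    Each child receives area proportional to its number of leaves; the
    split direction alternates with depth.
    """
    if out is None:
        out = {}
    if isinstance(tree, int):
        out[tree] = (x, y, w, h)
        return out
    total = leaf_count(tree)
    offset = 0.0
    for child in tree:
        frac = leaf_count(child) / total
        if depth % 2 == 0:  # split along the horizontal axis
            layout(child, x + offset * w, y, w * frac, h, depth + 1, out)
        else:  # split along the vertical axis
            layout(child, x, y + offset * h, w, h * frac, depth + 1, out)
        offset += frac
    return out


# Toy stand-in for image embeddings: two well-separated groups.
embeddings = [(0, 0), (0, 1), (10, 10), (10, 11), (10, 12)]
tree = build_hierarchy(embeddings)
rects = layout(tree, 0.0, 0.0, 1.0, 1.0)
```

In this sketch, zooming into a cluster corresponds to calling `layout` again on the chosen subtree with the full canvas as its rectangle. The naive pairwise search is cubic in the number of points and is only meant for illustration; a large dataset would need a faster clustering routine.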