Machine Learning (ML) has been widely used in Natural Language Processing (NLP) applications. A fundamental assumption in ML is that training data and real-world data follow a similar distribution. However, a deployed ML model may suffer from out-of-distribution (OOD) issues when the distribution of real-world data shifts. Though many algorithms have been proposed to detect OOD data in text corpora, ML developers still lack interactive tool support. In this work, we propose DeepLens, an interactive system that helps users detect and explore OOD issues in massive text corpora. Users can efficiently explore different OOD types in DeepLens with the help of a text clustering method. Users can also dig into a specific text by inspecting salient words highlighted through neuron activation analysis. In a within-subjects user study with 24 participants, participants using DeepLens accurately found nearly twice as many types of OOD issues, with 22% more confidence, compared with a variant of DeepLens that has no interaction or visualization support.
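To make the idea of detecting OOD texts and grouping them into types concrete, the following is a minimal sketch and not the DeepLens implementation. It assumes a simple bag-of-words representation, scores each new text by its dissimilarity to the closest training text, and groups the flagged texts greedily. The function names (`bow`, `cosine`, `ood_score`, `flag_and_group`) and both thresholds are hypothetical choices made for illustration; the system described above relies on a dedicated OOD detector, a text clustering method, and neuron activation analysis instead.

```python
import math
from collections import Counter


def bow(text):
    """Lowercased bag-of-words vector as a Counter (illustrative featurizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ood_score(text, train_vecs):
    """1 minus the highest similarity to any training text (higher = more unusual)."""
    vec = bow(text)
    return 1.0 - max((cosine(vec, t) for t in train_vecs), default=0.0)


def flag_and_group(train_texts, new_texts, ood_threshold=0.8, group_threshold=0.3):
    """Flag texts scoring above ood_threshold, then greedily group similar ones.

    Both thresholds are assumed values for this sketch, not tuned settings.
    """
    train_vecs = [bow(t) for t in train_texts]
    flagged = [t for t in new_texts if ood_score(t, train_vecs) > ood_threshold]
    groups = []  # each group is a list of texts; its first member acts as the seed
    for text in flagged:
        for group in groups:
            if cosine(bow(text), bow(group[0])) >= group_threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return groups


if __name__ == "__main__":
    train = ["the movie was great", "the film was terrible", "great acting and plot"]
    new = [
        "the movie was terrible",
        "stock prices fell sharply today",
        "stock prices rose today",
        "def foo(): return 1",
    ]
    for i, group in enumerate(flag_and_group(train, new)):
        print(f"candidate OOD type {i}: {group}")
```

In this toy run the in-distribution review is not flagged, while the finance sentences and the code snippet end up in separate groups, which loosely mirrors the notion of distinct OOD types. A realistic pipeline would likely replace the bag-of-words vectors with model embeddings and the greedy grouping with a proper clustering algorithm.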