Human observers engage in selective information uptake when classifying visual patterns. The same is true of deep neural networks, which currently constitute the best-performing artificial vision systems. Our goal is to examine the congruence, or lack thereof, in the information-gathering strategies of the two systems. We operationalize our investigation as a character recognition task. We use eye tracking to assay the spatial distribution of information hotspots for humans via fixation maps, and an activation mapping technique to obtain analogous distributions for deep networks via visualization maps. Qualitative comparison between visualization maps and fixation maps reveals an interesting correlate of congruence. For correctly classified characters, the deep learning model attends to regions of the character similar to those on which humans fixate. On the other hand, when the regions of focus differ between humans and deep nets, the characters are typically misclassified by the latter. Hence, we propose to use the visual fixation maps obtained from the eye-tracking experiment as a supervisory input to align the model's focus with the relevant character regions. We find that such supervision improves the model's performance significantly and does not require any additional parameters. This approach has the potential to find applications in diverse domains, such as medical analysis and surveillance, in which explainability helps to determine system fidelity.
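As an illustration of how a fixation map might act as a supervisory signal, the following is a minimal sketch, not the authors' method: the abstract does not state the form of the loss. It assumes that both maps are non-negative 2-D grids of the same shape, that alignment is measured by a KL divergence between the normalized maps, and that the penalty is added to the classification loss with a hand-chosen weight. The names `normalize`, `alignment_penalty`, `total_loss`, and `weight` are hypothetical. The penalty itself introduces no trainable parameters.

```python
import math


def normalize(grid, eps=1e-12):
    """Scale a non-negative 2-D map so that its entries sum to (about) 1."""
    total = sum(sum(row) for row in grid)
    return [[value / (total + eps) for value in row] for row in grid]


def alignment_penalty(activation_map, fixation_map, eps=1e-12):
    """KL(fixation || activation) between the two normalized maps.

    Assumption: KL divergence is only one possible choice of alignment measure.
    The value is near zero when the model's map matches the human fixation map
    and grows as the model puts little mass where humans looked.
    """
    p = normalize(fixation_map)
    q = normalize(activation_map)
    kl = 0.0
    for p_row, q_row in zip(p, q):
        for p_val, q_val in zip(p_row, q_row):
            if p_val > 0:  # cells with no fixation mass contribute nothing
                kl += p_val * math.log(p_val / (q_val + eps))
    return kl


def total_loss(class_loss, activation_map, fixation_map, weight=1.0):
    """Classification loss plus the weighted alignment penalty.

    `weight` is a hypothetical hyperparameter, not a learned parameter.
    """
    return class_loss + weight * alignment_penalty(activation_map, fixation_map)


if __name__ == "__main__":
    fixation = [[0.0, 1.0], [0.0, 3.0]]    # toy human fixation map
    aligned = [[0.0, 2.0], [0.0, 6.0]]     # same spatial pattern, different scale
    misaligned = [[5.0, 0.0], [1.0, 0.0]]  # model looks at the opposite column
    print(total_loss(0.5, aligned, fixation))
    print(total_loss(0.5, misaligned, fixation))
```

In a real training loop the activation map would come from the network and the penalty would be computed with a differentiable tensor library; plain Python lists are used here only to keep the sketch self-contained.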