Few images on the Web receive alt-text descriptions that would make them accessible to blind and low-vision (BLV) users. Image-based NLG systems have progressed to the point where they can begin to address this persistent societal problem, but these systems will not be fully successful unless we evaluate them on metrics that guide their development correctly. Here, we argue against current referenceless metrics -- those that don't rely on human-generated ground-truth descriptions -- on the grounds that they do not align with the needs of BLV users. The fundamental shortcoming of these metrics is that they do not take context into account, whereas contextual information is highly valued by BLV users. To substantiate these claims, we present a study with BLV participants who rated descriptions along a variety of dimensions. An in-depth analysis reveals that the lack of context-awareness makes current referenceless metrics inadequate for advancing image accessibility. As a proof of concept, we provide a contextual version of the referenceless metric CLIPScore that begins to address the disconnect with the BLV data. An accessible HTML version of this paper is available at https://elisakreiss.github.io/contextual-description-evaluation/paper/reflessmetrics.html
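To make the proof of concept concrete, below is a minimal, hypothetical Python sketch of what a context-aware CLIPScore could look like. It assumes the contextual variant linearly interpolates the standard image-description CLIPScore with a CLIP text-encoder similarity between the description and the surrounding page context; the mixing weight `alpha`, the helper names, and the combination rule are illustrative assumptions, not the paper's actual formulation.

```python
# A minimal, hypothetical sketch of a context-aware CLIPScore.
# Assumption: the contextual variant mixes (1) the standard referenceless
# CLIPScore between description and image with (2) a CLIP text-encoder
# similarity between the description and the page context. The weight
# `alpha` is illustrative, not a fitted value from the paper.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

MODEL_NAME = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_NAME)
processor = CLIPProcessor.from_pretrained(MODEL_NAME)

def clipscore(description: str, image: Image.Image) -> float:
    """Standard referenceless CLIPScore: 2.5 * max(cos(text, image), 0)."""
    inputs = processor(text=[description], images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    # Re-normalizing is a harmless no-op if the embeddings are already unit-norm.
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return 2.5 * max((text_emb @ image_emb.T).item(), 0.0)

def contextual_clipscore(description: str, image: Image.Image,
                         context: str, alpha: float = 0.5) -> float:
    """Hypothetical contextual variant: rewards descriptions that are both
    grounded in the image and compatible with the surrounding text."""
    inputs = processor(text=[description, context],
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        text_embs = model.get_text_features(**inputs)
    text_embs = text_embs / text_embs.norm(dim=-1, keepdim=True)
    context_sim = 2.5 * max((text_embs[0] @ text_embs[1]).item(), 0.0)
    return alpha * clipscore(description, image) + (1 - alpha) * context_sim
```

Under this sketch, a description that matches the image pixels but ignores the page context is penalized relative to one that serves both, which is the direction of change the BLV ratings in the paper motivate.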