BERTScore has become a widely adopted metric for evaluating semantic similarity between natural language sentences. However, we identify a critical limitation: BERTScore exhibits low sensitivity to numerical variation, a significant weakness in finance, where numerical precision directly affects meaning (e.g., distinguishing a 2% gain from a 20% loss). We introduce FinNuE, a diagnostic dataset constructed with controlled numerical perturbations across earnings calls, regulatory filings, social media, and news articles. Using FinNuE, we demonstrate that BERTScore fails to distinguish semantically critical numerical differences, often assigning high similarity scores to financially divergent text pairs. Our findings reveal fundamental limitations of embedding-based metrics for finance and motivate numerically-aware evaluation frameworks for financial NLP.