Large Language Models (LLMs) are increasingly employed in applications that require processing information from heterogeneous formats, including texts, tables, infoboxes, and knowledge graphs. However, systematic biases toward particular formats may undermine LLMs' ability to integrate heterogeneous data impartially, potentially resulting in reasoning errors and increased risks in downstream tasks. It remains unclear whether such biases are systematic, which data-level factors drive them, and what internal mechanisms underlie their emergence. In this paper, we present the first comprehensive study of format bias in LLMs through a three-stage empirical analysis. The first stage explores the presence and direction of bias across a diverse range of LLMs. The second stage examines how key data-level factors influence these biases. The third stage analyzes how format bias emerges within LLMs' attention patterns and evaluates the effectiveness of a lightweight intervention. Our results show that format bias is consistent across model families, is driven by information richness, structure quality, and representation type, and is closely associated with attention imbalance within the LLMs. Based on these investigations, we identify three future research directions to reduce format bias: enhancing data pre-processing through format repair and normalization, introducing inference-time interventions such as attention re-weighting, and developing format-balanced training corpora. These directions can support the design of more robust and fair heterogeneous data processing systems.