Large Language Models are increasingly popular in genomics due to their potential to decode complex biological sequences. Hence, researchers require a standardized benchmark to evaluate the capabilities of DNA Language Models (DNA LMs). However, evaluating DNA LMs is a complex task that sits at the intersection of genomics' domain-specific challenges and machine learning methodology, where seemingly minor implementation details can significantly compromise benchmark validity. We demonstrate this through BEND (Benchmarking DNA Language Models), where hardware-dependent hyperparameters (the number of data-loading workers and buffer sizes) create spurious performance variations of up to 4% for identical models. The problem stems from inadequate data shuffling interacting with domain-specific data characteristics. Experiments with three DNA language models (HyenaDNA, DNABERT-2, ResNet-LM) show that these artifacts affect both absolute performance and relative model rankings. We propose a simple solution: pre-shuffling data before storage eliminates hardware dependencies while maintaining efficiency. This work highlights how standard ML practices can interact unexpectedly with domain-specific data characteristics, with broader implications for benchmark design in specialized domains.
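The proposed fix can be illustrated with a minimal Python sketch: apply one global permutation to the dataset before it is written to disk, so that any downstream buffer-based streaming shuffle only reorders data that is already well mixed. The file names, NumPy array format, and seed below are illustrative assumptions, not the exact BEND pipeline.

```python
# Sketch of pre-shuffling before storage (assumed NumPy layout; paths are hypothetical).
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed for a reproducible ordering

# Load the pre-computed inputs and labels (hypothetical file names).
embeddings = np.load("embeddings_unshuffled.npy")
labels = np.load("labels_unshuffled.npy")

# One global permutation over all examples, applied identically to inputs and labels.
perm = rng.permutation(len(labels))
np.save("embeddings_shuffled.npy", embeddings[perm])
np.save("labels_shuffled.npy", labels[perm])

# A streaming loader can now read these files sequentially; worker count and
# shuffle-buffer size no longer determine which examples land in which batch.
```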