Japanese language models for medical text classification face challenges with the complex vocabulary and linguistic structures of radiology reports. This study compared three Japanese models--BERT Base, JMedRoBERTa, and ModernBERT--for multi-label classification of 18 chest CT findings. All models were fine-tuned under identical conditions on the CT-RATE-JPN dataset. ModernBERT showed clear efficiency advantages, producing substantially fewer tokens and achieving faster training and inference than the other models while maintaining comparable performance on the internal test set (exact match accuracy: 74.7% vs. 72.7% for BERT Base). To assess generalizability, we additionally constructed RR-Findings, an external dataset of 243 naturally written Japanese radiology reports annotated with the same schema. Under this domain-shifted setting, performance differences became pronounced: BERT Base outperformed both JMedRoBERTa and ModernBERT, with ModernBERT showing the largest decline in exact match accuracy. Differences in average precision were smaller, indicating that ModernBERT retained reasonable ranking ability despite degraded calibration. Overall, ModernBERT offers substantial computational efficiency and strong in-domain performance but remains sensitive to real-world linguistic variability. These results highlight the need for more diverse natural-language training data and domain-specific calibration strategies to improve robustness when deploying modern transformer models in heterogeneous clinical environments.
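To make the shared fine-tuning setup concrete, the sketch below shows one way the 18-finding multi-label head could be wired up with Hugging Face Transformers. The checkpoint name, example sentence, and threshold are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch of multi-label fine-tuning with Hugging Face Transformers.
# Checkpoint is a placeholder (swap in JMedRoBERTa or a Japanese ModernBERT);
# the Japanese BERT tokenizer additionally requires fugashi and unidic-lite.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "tohoku-nlp/bert-base-japanese-v3"  # illustrative choice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=18,                              # one logit per CT finding
    problem_type="multi_label_classification",  # switches loss to BCEWithLogitsLoss
)

report = "右肺上葉に結節影を認める。"           # example sentence, not from the dataset
labels = torch.zeros(1, 18)                     # multi-hot target vector (float for BCE)
labels[0, 0] = 1.0

inputs = tokenizer(report, return_tensors="pt", truncation=True)
outputs = model(**inputs, labels=labels)        # loss = BCE over the 18 labels
probs = torch.sigmoid(outputs.logits)           # independent per-finding probabilities
```

Because each finding gets an independent sigmoid, a report can carry any subset of the 18 labels, which is what the exact-match metric then scores jointly.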
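The two headline metrics can likewise be made concrete. Below is a minimal sketch, assuming y_true is a binary label matrix and y_score a matrix of predicted probabilities, both of shape (n_reports, 18); the random data and the fixed 0.5 threshold are illustrative only.

```python
# Sketch of exact match accuracy and macro-averaged average precision (AP)
# for a multi-label task with 18 findings. Array names are hypothetical.
import numpy as np
from sklearn.metrics import average_precision_score

def exact_match_accuracy(y_true, y_pred):
    """Fraction of reports whose full 18-label vector is predicted correctly."""
    return np.mean(np.all(y_true == y_pred, axis=1))

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(243, 18))                  # e.g., an RR-Findings-sized set
y_score = np.clip(y_true * 0.7 + rng.random((243, 18)) * 0.5, 0, 1)
y_pred = (y_score >= 0.5).astype(int)                        # fixed 0.5 decision threshold

print("exact match accuracy:", exact_match_accuracy(y_true, y_pred))
# Macro AP summarizes per-label ranking quality independently of any threshold,
# which is why AP can stay relatively stable even when calibration degrades.
print("macro average precision:", average_precision_score(y_true, y_score, average="macro"))
```

This distinction mirrors the abstract's finding: a model whose probabilities drift under domain shift loses exact-match accuracy at a fixed threshold, while its average precision, a threshold-free ranking measure, can remain comparatively intact.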