Column Type Annotation (CTA) is a fundamental step toward schema alignment and semantic understanding of tabular data. Existing encoder-only language models achieve high accuracy when fine-tuned on labeled columns, but their applicability is limited to in-domain settings, as distribution shifts in tables or label spaces require costly retraining from scratch. Recent work has explored prompting generative large language models (LLMs) by framing CTA as a multiple-choice task, but these approaches face two key challenges: (1) model performance is highly sensitive to subtle changes in prompt wording and structure, and (2) annotation F1 scores remain modest. A natural extension is to fine-tune the LLMs themselves; however, full fine-tuning incurs prohibitive computational costs at this scale and does not eliminate prompt sensitivity. In this paper, we present a parameter-efficient framework for CTA that trains models on prompt-augmented data via Low-Rank Adaptation (LoRA). Our approach mitigates sensitivity to prompt variations while drastically reducing the number of trainable parameters, achieving robust performance across datasets and templates. Experimental results on recent benchmarks demonstrate that models fine-tuned with our prompt augmentation strategy maintain stable performance across diverse prompt patterns at inference time and yield higher weighted F1 scores than models fine-tuned on a single prompt template. These results highlight the effectiveness of parameter-efficient training and augmentation strategies in building practical, adaptable CTA systems.
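The abstract does not fix implementation details, so the following is only a minimal sketch of how prompt-augmented LoRA fine-tuning for multiple-choice CTA might be set up with the HuggingFace PEFT library. The base model name, the template strings, the `augment` helper, and the LoRA hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal sketch: prompt-augmented LoRA fine-tuning for multiple-choice CTA.
# Model name, templates, and hyperparameters are illustrative assumptions,
# not the paper's actual setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # hypothetical base model

# Several paraphrased templates for the same annotation task; training on
# all of them (rather than on one fixed template) is the augmentation step
# that reduces sensitivity to prompt wording at inference time.
TEMPLATES = [
    "Column values: {values}\nChoose the column type from: {choices}\nAnswer:",
    "Given the cells {values}, which of these types fits best? {choices}\nType:",
    "Select one label in {choices} for a column containing {values}.\nLabel:",
]

def augment(column_values: list[str], choices: list[str]) -> list[str]:
    """Render one labeled column through every prompt template."""
    values = ", ".join(column_values)
    opts = "; ".join(choices)
    return [t.format(values=values, choices=opts) for t in TEMPLATES]

model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# LoRA: freeze the base weights and train only small low-rank adapters.
lora_cfg = LoraConfig(
    r=8,                                  # rank of the update matrices
    lora_alpha=16,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically well under 1% of the model

prompts = augment(["Paris", "Tokyo", "Berlin"], ["city", "country", "person"])
# Each rendered prompt, paired with the gold label ("city"), becomes one
# training example; a standard causal-LM Trainer loop consumes them from here.
```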