Despite recent advances in tabular language model research, real-world applications remain challenging. In industry, tables are abundant in spreadsheets, but acquiring a substantial number of labels is expensive, since only experts can annotate the often highly technical and domain-specific tables. Active learning could potentially reduce labeling costs; however, so far there is no work on active learning in conjunction with tabular language models. In this paper, we investigate different acquisition functions in a real-world industrial tabular language model use case for sub-cell named entity recognition. Our results show that cell-level acquisition functions with built-in diversity can significantly reduce the labeling effort, while enforced table diversity is detrimental. We further identify open fundamental questions concerning computational efficiency and the perspective of human annotators.