Natural Language Processing (NLP) is one of the core techniques in AI software. As AI is applied to more and more domains, how to efficiently develop high-quality domain-specific language models has become a critical question in AI software engineering. Existing development processes for domain-specific language models mostly focus on learning a domain-specific pre-trained language model (PLM); when the task-specific language model is then trained on top of the PLM, usually only a direct (and often unsatisfactory) fine-tuning strategy is adopted. By enhancing the task-specific training procedure with domain knowledge graphs, we propose KnowledgeDA, a unified and low-code domain language model development service. Given domain-specific task texts supplied by a user, KnowledgeDA can automatically generate a domain-specific language model in three steps: (i) localize domain knowledge entities in the texts via an embedding-similarity approach; (ii) generate augmented samples by retrieving replaceable domain entity pairs from two views, the knowledge graph and the training data; (iii) select high-quality augmented samples for fine-tuning via confidence-based assessment. We implement a prototype of KnowledgeDA to learn language models for two domains, healthcare and software development. Experiments on five domain-specific NLP tasks verify the effectiveness and generalizability of KnowledgeDA. (Code is publicly available at https://github.com/RuiqingDing/KnowledgeDA.)
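
The three steps named in the abstract can be illustrated with a minimal sketch. This is not the KnowledgeDA implementation: character-trigram vectors stand in for learned embeddings, a hand-written `replaceable` dictionary stands in for entity pairs retrieved from the knowledge graph and the training data, and the confidence function and keep rule are supplied by the caller. All function names, parameters, and the 0.6 threshold are hypothetical.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Toy stand-in for an embedding: a bag of character n-grams."""
    padded = f"#{text.lower()}#"
    return Counter(padded[i:i + n] for i in range(max(1, len(padded) - n + 1)))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def localize_entities(tokens, kg_entities, threshold=0.6):
    """Step (i): link each token to its most similar KG entity.

    The threshold is an assumed value, not one taken from the paper.
    """
    if not kg_entities:
        return []
    matches = []
    for idx, token in enumerate(tokens):
        token_vec = char_ngrams(token)
        best, score = max(
            ((e, cosine(token_vec, char_ngrams(e))) for e in kg_entities),
            key=lambda pair: pair[1],
        )
        if score >= threshold:
            matches.append((idx, best))
    return matches


def augment(tokens, matches, replaceable):
    """Step (ii): build one new sample per replaceable entity partner."""
    samples = []
    for idx, entity in matches:
        for alternative in replaceable.get(entity, []):
            samples.append(tokens[:idx] + [alternative] + tokens[idx + 1:])
    return samples


def select(samples, label, confidence_fn, keep):
    """Step (iii): keep samples whose model confidence passes `keep`.

    Both callables are assumptions; the abstract does not state which
    confidence range counts as high quality.
    """
    return [s for s in samples if keep(confidence_fn(s, label))]
```

A caller would chain the functions: tokenize a task text, pass the tokens and the knowledge graph's entity names to `localize_entities`, feed the matches and a replacement table to `augment`, then filter with `select` using the confidence of a model fine-tuned on the original data. Leaving the keep rule as a parameter avoids committing to whether low-confidence or high-confidence samples are retained.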