Text structuralization is an important field of natural language processing (NLP) that consists of information extraction (IE) and structure formalization. However, current studies of text structuralization suffer from a shortage of manually annotated, high-quality datasets across domains and languages, whose construction requires specialized professional knowledge. In addition, most IE methods are designed for a specific type of structured data, e.g., entities, relations, or events, making them hard to generalize to other types. In this work, we propose a simple and efficient approach to instruct a large language model (LLM) to extract a variety of structures from text. More concretely, we add a prefix instruction and a suffix instruction, indicating the desired IE task and structure type respectively, before feeding the text into the LLM. Experiments on two LLMs show that this approach enables language models to perform comparably to other state-of-the-art methods on datasets spanning a variety of languages and knowledge domains, and that it generalizes to other IE sub-tasks by changing the content of the instructions. Another benefit of our approach is that it can help researchers build datasets in low-resource and domain-specific scenarios, e.g., finance and law, at low cost.
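The prefix/suffix prompting scheme described above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function name, instruction wording, and example text are all hypothetical.

```python
def build_structuralization_prompt(text: str,
                                   task_instruction: str,
                                   structure_instruction: str) -> str:
    """Wrap the input text with a prefix instruction (the IE task)
    and a suffix instruction (the desired structure type) before
    feeding it to an LLM."""
    return f"{task_instruction}\n{text}\n{structure_instruction}"

# Hypothetical usage: entity extraction with a JSON-formatted output.
prompt = build_structuralization_prompt(
    text="Barack Obama was born in Honolulu.",
    task_instruction="Extract all named entities from the following text:",
    structure_instruction="Return the result as a JSON list of (entity, type) pairs.",
)
print(prompt)
```

Switching to another IE sub-task (e.g., relation or event extraction) would, under this scheme, only require changing the two instruction strings, leaving the model and pipeline untouched.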