Text structuralization is one of the important fields of natural language processing (NLP), consisting of information extraction (IE) and structure formalization. However, current studies of text structuralization suffer from a shortage of manually annotated high-quality datasets across domains and languages, the construction of which requires specialized professional knowledge. In addition, most IE methods are designed for a specific type of structured data, e.g., entities, relations, or events, making them hard to generalize to other types. In this work, we propose a simple and efficient approach that instructs a large language model (LLM) to extract a variety of structures from text. More concretely, we add a prefix instruction and a suffix instruction to indicate the desired IE task and structure type, respectively, before feeding the text into the LLM. Experiments on two LLMs show that this approach enables language models to perform comparably to other state-of-the-art methods on datasets covering a variety of languages and knowledge domains, and that it generalizes to other IE sub-tasks simply by changing the instruction content. Another benefit of our approach is that it can help researchers build datasets in low-resource and domain-specific scenarios, e.g., finance and law, at low cost.
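The prefix/suffix instruction scheme can be illustrated with a short sketch. The Python snippet below is hypothetical and not the authors' implementation: the helper name `build_structuralization_prompt` and the exact instruction wording are our own assumptions; it only shows how a task instruction and a structure-type instruction might be wrapped around the input text before it is passed to an LLM.

```python
# Minimal sketch (assumption, not the paper's code): wrap an input text with a
# prefix instruction naming the IE task and a suffix instruction naming the
# desired output structure, then hand the resulting prompt to an LLM.

def build_structuralization_prompt(text: str, task: str, structure: str) -> str:
    """Compose prefix instruction + text + suffix instruction for one IE example."""
    prefix = f"Task: {task}. Extract the requested information from the text below.\n"
    suffix = f"\nReturn the result as {structure}."
    return prefix + text + suffix


if __name__ == "__main__":
    # Hypothetical usage: relation extraction returning (head, relation, tail) triples.
    prompt = build_structuralization_prompt(
        text="Marie Curie was born in Warsaw and worked at the University of Paris.",
        task="relation extraction",
        structure="a list of (head entity, relation, tail entity) triples",
    )
    print(prompt)  # The composed prompt would then be sent to the chosen LLM.
```

Switching to another IE sub-task (e.g., named entity recognition or event extraction) would, under this sketch, only require changing the `task` and `structure` arguments, which mirrors the claim that the approach generalizes by changing the instruction content.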