This paper presents a pipeline integrating fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient domain-specific text summarization and tagging. The authors address the challenge posed by rapidly evolving sub-cultural languages and slang, which complicate automated information extraction and law-enforcement monitoring. Leveraging the LLaMA Factory framework, the study fine-tunes LLMs on both general-purpose and custom domain-specific datasets, particularly in the political and security domains. The models are evaluated with BLEU and ROUGE metrics, demonstrating that instruction fine-tuning significantly improves summarization and tagging accuracy, especially on specialized corpora. Notably, the LLaMA3-8B-Instruct model, despite its initial limitations in Chinese comprehension, outperforms its Chinese-trained counterpart after domain-specific fine-tuning, suggesting that underlying reasoning capabilities can transfer across languages. The pipeline produces concise summaries and structured entity tags, enabling rapid document categorization and distribution. The approach proves scalable and adaptable for real-time applications, supporting efficient information management and the ongoing need to capture emerging language trends. The integration of LLMs and NER offers a robust solution for transforming unstructured text into actionable insights, which is crucial for modern knowledge management and security operations.
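To illustrate the ROUGE evaluation mentioned above, the following is a minimal ROUGE-1 F1 computation (unigram overlap between a reference and a candidate summary) in pure Python. This is a sketch for intuition only; the example sentences are hypothetical, and the paper's actual evaluation presumably uses standard toolkits covering ROUGE-L and BLEU as well.

```python
from collections import Counter

def rouge1_f1(reference: str, candidate: str) -> float:
    """ROUGE-1 F1: harmonic mean of unigram precision and recall.

    Precision = overlapping unigrams / candidate length;
    Recall    = overlapping unigrams / reference length.
    """
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical reference summary vs. a shorter model output.
reference = "fine tuned models improve domain specific summarization"
candidate = "fine tuned models improve summarization"
print(round(rouge1_f1(reference, candidate), 3))  # → 0.833
```

A higher score after instruction fine-tuning would indicate, as the abstract reports, closer lexical agreement between generated and reference summaries; in practice an n-gram metric like this is complemented by BLEU's multi-n-gram precision and brevity penalty.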