TagGPT: Large Language Models are Zero-shot Multimodal Taggers

Tags are pivotal in facilitating the effective distribution of multimedia content in applications of the contemporary Internet era, such as search engines and recommendation systems. Recently, large language models (LLMs) have demonstrated impressive capabilities across a wide range of tasks. In this work, we propose TagGPT, a fully automated system capable of tag extraction and multimodal tagging in a completely zero-shot fashion. Our core insight is that, through elaborate prompt engineering, LLMs are able to extract and reason about proper tags given textual clues of multimodal data, e.g., OCR, ASR, and titles. Specifically, to automatically build a high-quality tag set that reflects user intent and interests for a specific application, TagGPT predicts large-scale candidate tags from a series of raw data by prompting LLMs, then filters the candidates by frequency and semantics. Given a new entity that needs tagging for distribution, TagGPT offers two alternative options for zero-shot tagging: a generative method with late semantic matching against the tag set, and a selective method with early matching in prompts. Notably, TagGPT provides a system-level solution based on a modular framework equipped with a pre-trained LLM (GPT-3.5 here) and a sentence embedding model (SimCSE here), each of which can be seamlessly replaced with a more advanced one. TagGPT is applicable to various modalities of data in modern social media and shows strong generalization ability across a wide range of applications. We evaluate TagGPT on publicly available datasets, i.e., Kuaishou and Food.com, and demonstrate its effectiveness compared to existing hashtags and off-the-shelf taggers. Project page: https://github.com/TencentARC/TagGPT.
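The three stages named in the abstract (tag set construction, generative tagging with late matching, selective tagging with early matching) can be illustrated with a minimal sketch. This is not the authors' implementation: every function name, prompt string, and threshold below is hypothetical, `embed` is a toy character-bigram stand-in for a sentence embedding model such as SimCSE, and `llm` is any callable that takes a prompt and returns a list of tag strings (in practice, a wrapper around an LLM such as GPT-3.5). The selective branch reflects one plausible reading of "early matching in prompts": shortlist tags by similarity first, then ask the LLM to choose among them.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence embedding model: character-bigram counts."""
    t = text.lower()
    return Counter(t[i:i + 2] for i in range(len(t) - 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_tag_set(candidates, min_freq=2, dedup_threshold=0.8):
    """Filter LLM-predicted candidate tags by frequency, then by semantics.

    Thresholds are hypothetical. Rare candidates are dropped, and a candidate
    too similar to an already kept (more frequent) tag is treated as a duplicate.
    """
    counts = Counter(c.strip().lower() for c in candidates)
    tag_set = []
    for tag, n in counts.most_common():
        if n < min_freq:
            continue
        if all(cosine(embed(tag), embed(kept)) < dedup_threshold for kept in tag_set):
            tag_set.append(tag)
    return tag_set


def generative_tagging(llm, text, tag_set, top_k=1):
    """Late matching: the LLM generates free-form tags, which are then
    mapped to their nearest neighbours in the tag set."""
    free_tags = llm(f"Suggest tags for the following content:\n{text}")
    matched = []
    for free_tag in free_tags:
        ranked = sorted(
            tag_set,
            key=lambda t: cosine(embed(free_tag), embed(t)),
            reverse=True,
        )
        for tag in ranked[:top_k]:
            if tag not in matched:
                matched.append(tag)
    return matched


def selective_tagging(llm, text, tag_set, shortlist_size=3):
    """Early matching: shortlist tags by similarity to the content, put the
    shortlist in the prompt, and keep only answers that are on the shortlist."""
    shortlist = sorted(
        tag_set,
        key=lambda t: cosine(embed(text), embed(t)),
        reverse=True,
    )[:shortlist_size]
    chosen = llm(f"Content:\n{text}\nChoose suitable tags from: {', '.join(shortlist)}")
    return [t for t in chosen if t in shortlist]
```

In this sketch, the generative path always returns tags from the tag set because matching happens after generation, while the selective path guards against tags the LLM invents by discarding anything not on the shortlist. Swapping in a real embedding model or LLM only requires replacing `embed` and the `llm` callable, which mirrors the modularity the abstract describes.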
