This paper introduces a retrieval-augmented framework for automatic fashion caption and hashtag generation that combines multi-garment detection, attribute reasoning, and Large Language Model (LLM) prompting. The system aims to produce visually grounded, descriptive, and stylistically engaging text for fashion imagery, overcoming the limitations of end-to-end captioners, which often struggle with attribute fidelity and domain generalization. The pipeline combines a YOLO-based detector for multi-garment localization, k-means clustering for dominant-color extraction, and a CLIP-FAISS retrieval module that infers fabric and gender attributes from a structured product index. These attributes, together with retrieved style examples, form a factual evidence pack that guides an LLM to generate human-like captions and contextually rich hashtags. A fine-tuned BLIP model serves as a supervised baseline for comparison. Experimental results show that the YOLO detector achieves a mean Average Precision (mAP@0.5) of 0.71 across nine garment categories. The RAG-LLM pipeline generates expressive, attribute-aligned captions and reaches a mean attribute coverage of 0.80, with full coverage at the 50% threshold in hashtag generation, whereas BLIP attains higher lexical overlap but weaker generalization. The retrieval-augmented approach exhibits stronger factual grounding, less hallucination, and strong potential for scalable deployment across diverse clothing domains. These results establish retrieval-augmented generation as an effective and interpretable paradigm for automated, visually grounded fashion content generation.
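The dominant-color step described above can be sketched with a minimal k-means (Lloyd's algorithm) over an image's RGB pixels. This is an illustrative NumPy-only implementation, not the paper's code; the function name `dominant_colors` and its parameters are assumptions.

```python
import numpy as np

def dominant_colors(pixels: np.ndarray, k: int = 3, iters: int = 20, seed: int = 0):
    """Return k dominant RGB colors from an (N, 3) float pixel array via k-means.

    Illustrative sketch: initializes centers from random pixels, then alternates
    assignment and mean-update steps (Lloyd's algorithm).
    """
    rng = np.random.default_rng(seed)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest center (Euclidean distance in RGB space).
        dists = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as its cluster mean (keep old center if cluster empties).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    # Order clusters by size so the first returned color is the most dominant.
    counts = np.bincount(labels, minlength=k)
    return centers[np.argsort(-counts)]
```

In a full pipeline the pixels would come from each YOLO-detected garment crop, so each garment gets its own dominant palette rather than one palette for the whole image.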
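The CLIP-FAISS attribute-inference step amounts to nearest-neighbor search over embedded product entries followed by a vote over their attributes. The sketch below emulates that search in plain NumPy (inner product over L2-normalized vectors, which is what a FAISS `IndexFlatIP` computes); the index contents, `INDEX_ATTRS`, and `infer_attributes` are hypothetical stand-ins, and real CLIP embeddings would replace the toy vectors.

```python
import numpy as np

# Hypothetical structured product index: each row of the embedding matrix is
# paired with (fabric, gender) attributes for that product entry.
INDEX_ATTRS = [("cotton", "women"), ("denim", "men"), ("silk", "women"), ("wool", "men")]

def infer_attributes(query: np.ndarray, index_emb: np.ndarray, top_k: int = 3):
    """Majority-vote (fabric, gender) from the top_k most similar index entries.

    Cosine similarity is computed as an inner product over L2-normalized
    vectors, mirroring FAISS exact search with IndexFlatIP.
    """
    q = query / np.linalg.norm(query)
    emb = index_emb / np.linalg.norm(index_emb, axis=1, keepdims=True)
    sims = emb @ q                       # cosine similarity to every index entry
    top = np.argsort(-sims)[:top_k]      # indices of the top_k nearest neighbors
    fabrics = [INDEX_ATTRS[i][0] for i in top]
    genders = [INDEX_ATTRS[i][1] for i in top]
    vote = lambda xs: max(set(xs), key=xs.count)  # most frequent label wins
    return vote(fabrics), vote(genders)
```

Voting over several neighbors rather than trusting the single nearest match is a common way to reduce the impact of one noisy embedding.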
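The "evidence pack" that grounds the LLM can be pictured as a structured prompt listing detected garments, extracted colors, and retrieved attributes alongside retrieved style examples. The format below is purely illustrative (the paper does not specify its prompt template), and `build_prompt` is a hypothetical helper.

```python
def build_prompt(evidence: dict, style_examples: list) -> str:
    """Assemble a grounded prompt from a factual evidence pack.

    Illustrative sketch: the instruction pins the LLM to the listed attributes,
    which is the mechanism the retrieval-augmented approach uses to curb
    hallucination.
    """
    lines = [
        "You are a fashion copywriter. Mention only the attributes listed below.",
        "Evidence:",
    ]
    for key, value in evidence.items():
        lines.append(f"- {key}: {value}")
    lines.append("Style examples (tone only, do not copy facts):")
    lines.extend(f"- {ex}" for ex in style_examples)
    lines.append("Write one caption and five hashtags grounded in the evidence.")
    return "\n".join(lines)
```

Keeping the factual fields and the style examples in separate sections lets the prompt state explicitly that the examples supply tone, not facts.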