Autoregressive Styled Text Image Generation, but Make it Reliable

Generating faithful and readable styled text images, especially in Styled Handwritten Text Generation (HTG), is an open problem with applications in graphic design, document understanding, and image editing. Much of the research on this task develops strategies that reproduce the stylistic characteristics of a given writer, and the recently proposed Autoregressive Transformer paradigm for HTG has achieved promising results in style fidelity and generalization. However, this method requires additional inputs, lacks a proper stop mechanism, and can fall into repetition loops that generate visual artifacts. In this work, we rethink the autoregressive formulation by framing HTG as a multimodal prompt-conditioned generation task, and we tackle the content controllability issues by introducing special textual input tokens that align better with the visual ones. Moreover, we devise a Classifier-Free-Guidance-based strategy for our autoregressive model. Through extensive experimental validation, we demonstrate that our approach, dubbed Eruku, requires fewer inputs than previous solutions, generalizes better to unseen styles, and follows the textual prompt more faithfully, improving content adherence.
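
The abstract does not spell out how Classifier-Free Guidance is applied in the autoregressive model, so the following is only a minimal sketch of the standard CFG formulation applied to next-token logits, not Eruku's actual strategy. It assumes the model is queried twice per step, once with the text prompt (conditional) and once without it (unconditional), and that the two logit vectors are blended before a token is picked. The function names `cfg_logits`, `softmax`, and `greedy_next_token`, as well as the default guidance scale, are hypothetical.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Blend two logit vectors with the standard CFG rule.

    guided = uncond + scale * (cond - uncond)
    scale = 0 returns the unconditional logits, scale = 1 the conditional ones,
    and scale > 1 pushes further toward the conditional prediction.
    """
    if len(cond) != len(uncond):
        raise ValueError("conditional and unconditional logits must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_next_token(cond, uncond, scale=1.5):
    """Return the index of the most likely token under the guided distribution.

    The default scale is an arbitrary illustrative value (assumption).
    """
    probs = softmax(cfg_logits(cond, uncond, scale))
    return max(range(len(probs)), key=probs.__getitem__)
```

In this sketch, raising `scale` increases the weight of whatever the text prompt changes in the prediction, which is the usual intuition for why guidance can improve adherence to the prompt.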
