Human communication combines speech with expressive nonverbal cues such as hand gestures that serve manifold communicative functions. Yet current generative approaches to gesture synthesis are largely restricted to simple, repetitive beat gestures that accompany the rhythm of speech but convey no semantic meaning. This paper tackles a core challenge in co-speech gesture synthesis: generating iconic or deictic gestures that are semantically coherent with a verbal utterance. Such gestures cannot be derived from language input alone, which inherently lacks the visual meaning that gestures often carry autonomously. We therefore introduce a zero-shot system that generates gestures from a given language input while additionally being informed by imagistic input, without manual annotation or human intervention. Our method integrates an image analysis pipeline that extracts key object properties such as shape, symmetry, and alignment, together with a semantic matching module that links these visual details to the spoken text. An inverse kinematics engine then synthesizes iconic and deictic gestures and combines them with co-generated natural beat gestures for coherent multimodal communication. A comprehensive user study demonstrates the effectiveness of our approach: in scenarios where speech alone was ambiguous, gestures generated by our system significantly improved participants' ability to identify object properties, confirming their interpretability and communicative value. While challenges remain in representing complex shapes, our results highlight the importance of context-aware semantic gestures for creating expressive and collaborative virtual agents or avatars, marking a substantial step towards efficient, robust, embodied human-agent interaction. More information and example videos are available here: https://review-anon-io.github.io/ImaGGen.github.io/
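To make the described pipeline concrete, the following is a minimal illustrative sketch in Python of the three stages named above (image analysis, semantic matching, and gesture triggering). All names here (ObjectProperties, analyze_image, PROPERTY_LEXICON, match_semantics) are hypothetical assumptions for illustration, not the authors' actual implementation or API.

```python
"""Illustrative sketch (NOT the paper's code) of the described pipeline:
image analysis -> semantic matching -> gesture triggering for an IK engine.
All names and heuristics are assumptions made for illustration only."""

from dataclasses import dataclass


@dataclass
class ObjectProperties:
    shape: str        # e.g. "round", "rectangular"
    symmetric: bool   # whether bilateral symmetry was detected in the image
    axis: str         # dominant alignment, e.g. "horizontal"


def analyze_image(image) -> ObjectProperties:
    # Placeholder for stage 1: the paper's pipeline extracts shape,
    # symmetry, and alignment from the image; here we return a fixed example.
    return ObjectProperties(shape="round", symmetric=True, axis="horizontal")


# Hypothetical lexicon linking spoken words to the visual property they evoke.
PROPERTY_LEXICON = {
    "round": "shape", "circular": "shape",
    "wide": "axis", "flat": "axis",
    "even": "symmetric", "balanced": "symmetric",
}


def match_semantics(words: list[str]) -> dict[int, str]:
    """Stage 2: return {word_index: property_name} anchors marking where an
    iconic or deictic gesture should be scheduled alongside beat gestures."""
    return {i: PROPERTY_LEXICON[w] for i, w in enumerate(words)
            if w in PROPERTY_LEXICON}


if __name__ == "__main__":
    props = analyze_image(image=None)
    words = "the bowl is round and sits flat on the table".split()
    for i, prop in match_semantics(words).items():
        # Stage 3 (stubbed): an inverse kinematics engine would trace the
        # matched property here, e.g. outlining a circle with both hands.
        print(f"word {i} ({words[i]!r}) -> iconic gesture for {prop}: "
              f"{getattr(props, prop)}")
```

In a full system, the printed anchors would instead parameterize the inverse kinematics engine, which blends the resulting iconic or deictic stroke with the co-generated beat gestures at the matched word's timestamp.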