GLM-4.1V-Thinking and GLM-4.5V: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

GLM-V Team, :,Wenyi Hong,Wenmeng Yu,Xiaotao Gu,Guo Wang,Guobing Gan,Haomiao Tang,Jiale Cheng,Ji Qi,Junhui Ji,Lihang Pan,Shuaiqi Duan,Weihan Wang,Yan Wang,Yean Cheng,Zehai He,Zhe Su,Zhen Yang,Ziyang Pan,Aohan Zeng,Baoxu Wang,Bin Chen,Boyan Shi,Changyu Pang,Chenhui Zhang,Da Yin,Fan Yang,Guoqing Chen,Jiazheng Xu,Jiale Zhu,Jiali Chen,Jing Chen,Jinhao Chen,Jinghao Lin,Jinjiang Wang,Junjie Chen,Leqi Lei,Letian Gong,Leyi Pan,Mingdao Liu,Mingde Xu,Mingzhi Zhang,Qinkai Zheng,Sheng Yang,Shi Zhong,Shiyu Huang,Shuyuan Zhao,Siyan Xue,Shangqin Tu,Shengbiao Meng,Tianshu Zhang,Tianwei Luo,Tianxiang Hao,Tianyu Tong,Wenkai Li,Wei Jia,Xiao Liu,Xiaohan Zhang,Xin Lyu,Xinyue Fan,Xuancheng Huang,Yanling Wang,Yadong Xue,Yanfeng Wang,Yanzi Wang,Yifan An,Yifan Du,Yiming Shi,Yiheng Huang,Yilin Niu,Yuan Wang,Yuanchang Yue,Yuchen Li,Yutao Zhang,Yuting Wang,Yu Wang,Yuxuan Zhang,Zhao Xue,Zhenyu Hou,Zhengxiao Du,Zihan Wang,Peng Zhang,Debing Liu,Bin Xu,Juanzi Li,Minlie Huang,Yuxiao Dong,Jie Tang

We present GLM-4.1V-Thinking and GLM-4.5V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive-achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. Code, models and more information are released at https://github.com/zai-org/GLM-V.

翻译：暂无翻译

相关内容

MoDELS

关注 43

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】基于元内存传输的跨域少镜头语义分割，Remember the Difference: Cross-Domain Few-Shot Semantic Segmentation via Meta-Memory Transfer

专知会员服务

13+阅读 · 2022年3月12日

Linux导论，Introduction to Linux，96页ppt

专知会员服务

81+阅读 · 2020年7月26日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日