This report provides an architecture-led analysis of two modern vision-language models (VLMs), Qwen2.5-VL-7B-Instruct and Llama-4-Scout-17B-16E-Instruct, and explains how their architectural properties map to a practical video-to-artifact pipeline implemented in the BodyLanguageDetection repository [1]. The system samples video frames, prompts a VLM to detect visible people and generate pixel-space bounding boxes with prompt-conditioned attributes (emotion by default), validates the output structure against a predefined schema, and optionally renders an annotated video. We first summarize the shared multimodal foundation (visual tokenization, Transformer attention, and instruction following), then describe each architecture at a level sufficient to justify engineering choices without speculating about undisclosed internals. Finally, we connect model behavior to system constraints: structured outputs can be syntactically valid yet semantically incorrect, schema validation checks structure rather than geometric correctness, person identifiers are frame-local under the current prompting contract, and interactive single-frame analysis returns free-form text rather than schema-enforced JSON. These distinctions are critical for writing defensible claims, designing robust interfaces, and planning evaluation.
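To make the "structurally valid but possibly semantically incorrect" distinction concrete, the sketch below validates a single frame's model output against a JSON Schema. It is a minimal illustration, not the repository's actual contract: the field names (`person_id`, `bbox`, `emotion`) and the use of the `jsonschema` library are assumptions. Note that a detection whose box falls outside the image, or whose emotion label does not match the frame, still passes this check.

```python
"""Minimal sketch: structural validation of a VLM's per-frame JSON output.

Assumption: the field names (person_id, bbox, emotion) and the use of the
jsonschema library are illustrative; the repository's actual schema may differ.
"""
import json
from jsonschema import Draft7Validator

# A per-frame schema: each detection carries a frame-local person_id,
# a pixel-space [x1, y1, x2, y2] box, and a prompt-conditioned attribute.
FRAME_SCHEMA = {
    "type": "object",
    "properties": {
        "people": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "person_id": {"type": "integer"},
                    "bbox": {
                        "type": "array",
                        "items": {"type": "number"},
                        "minItems": 4,
                        "maxItems": 4,
                    },
                    "emotion": {"type": "string"},
                },
                "required": ["person_id", "bbox", "emotion"],
            },
        }
    },
    "required": ["people"],
}


def validate_frame(raw_model_output: str) -> dict:
    """Parse and structurally validate one frame's model output.

    Passing this check only guarantees well-formed JSON of the expected
    shape; it does NOT verify that boxes lie inside the image, that they
    enclose a person, or that the emotion label is accurate.
    """
    data = json.loads(raw_model_output)           # raises on malformed JSON
    Draft7Validator(FRAME_SCHEMA).validate(data)  # raises on schema violations
    return data


if __name__ == "__main__":
    # Syntactically valid and schema-conformant, yet semantically wrong:
    # the box coordinates are negative and the label may not match the frame.
    sample = '{"people": [{"person_id": 1, "bbox": [-10, -5, 0, 0], "emotion": "happy"}]}'
    print(validate_frame(sample))
```

Because the check is purely structural, geometric sanity checks (box bounds, ordering of corners) and any cross-frame identity reconciliation would have to be layered on separately, which is exactly the evaluation gap the report highlights.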