Vision-Language-Action (VLA) models are increasingly used for end-to-end driving due to their world knowledge and reasoning ability. Most prior work, however, inserts textual chains-of-thought (CoT) as intermediate steps tailored to the current scene. Such symbolic compressions can blur spatio-temporal relations and discard fine visual cues, creating a cross-modal gap between perception and planning. We propose FSDrive, a visual spatio-temporal CoT framework that enables VLAs to think in images. The model first acts as a world model to generate a unified future frame that overlays coarse but physically-plausible priors-future lane dividers and 3D boxes-on the predicted future image. This unified frame serves as the visual CoT, capturing both spatial structure and temporal evolution. The same VLA then functions as an inverse-dynamics model, planning trajectories from current observations and the visual CoT. To equip VLAs with image generation while preserving understanding, we introduce a unified pre-training paradigm that expands the vocabulary to include visual tokens and jointly optimizes VQA (for semantics) and future-frame prediction (for dynamics). A progressive easy-to-hard scheme first predicts lane/box priors to enforce physical constraints, then completes full future frames for fine details. On nuScenes and NAVSIM, FSDrive improves trajectory accuracy and reduces collisions under both ST-P3 and UniAD metrics, and attains competitive FID for future-frame generation despite using lightweight autoregression. It also advances scene understanding on DriveLM. Together, these results indicate that visual CoT narrows the cross-modal gap and yields safer, more anticipatory planning. Code is available at https://github.com/MIV-XJTU/FSDrive.
翻译:暂无翻译