Video prediction is plagued by a fundamental trilemma: high resolution, perceptual quality, and real-time speed rarely coexist, hindering its use in latency-critical applications. This challenge is most acute for autonomous UAVs in dense urban environments, where foreseeing events from high-resolution imagery is non-negotiable for safety. Existing methods, which rely on iterative generation (diffusion or autoregressive models) or quadratic-complexity attention, fail to meet these stringent demands on edge hardware. To break this long-standing trade-off, we introduce RAPTOR, a video prediction architecture that achieves real-time, high-resolution performance. RAPTOR's single-pass design avoids the error accumulation and latency of iterative approaches. Its core innovation is Efficient Video Attention (EVA), a novel translator module that factorizes spatiotemporal modeling. Instead of processing flattened spacetime tokens at $O((ST)^2)$ or $O(ST)$ complexity, EVA alternates operations along the spatial ($S$) and temporal ($T$) axes. This factorization reduces time complexity to $O(S + T)$ and memory complexity to $O(\max(S, T))$, enabling global context modeling at $512^2$ resolution and beyond while operating directly on dense feature maps with a patch-free design. Complementing the architecture is a three-stage training curriculum that progressively refines predictions from coarse structure to sharp, temporally coherent detail. Experiments show that RAPTOR is the first predictor to exceed 30 FPS on a Jetson AGX Orin for $512^2$ video, setting a new state of the art on UAVid, KTH, and a custom high-resolution dataset in PSNR, SSIM, and LPIPS. Critically, RAPTOR raises the mission success rate in a real-world UAV navigation task by 18%, paving the way for safer, more anticipatory embodied agents.
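To make the axis factorization concrete, below is a minimal PyTorch sketch of an axis-alternating attention block in the spirit of EVA. The class name, the pre-norm residual layout, and the use of `nn.MultiheadAttention` are illustrative assumptions rather than the paper's implementation. Note also that plain softmax attention along each axis costs $O(T \cdot S^2 + S \cdot T^2)$; reaching the stated $O(S + T)$ time and $O(\max(S, T))$ memory would additionally require a linear-time per-axis token mixer (e.g., kernelized linear attention), which the abstract does not specify.

```python
# A minimal sketch of axis-factorized spatiotemporal attention.
# All names here are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn


class AxisFactorizedAttention(nn.Module):
    """Alternates attention along the spatial (S) and temporal (T) axes
    instead of attending over flattened S*T spacetime tokens."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, S, C) -- dense per-frame features, no patchification.
        B, T, S, C = x.shape

        # Spatial pass: each frame attends over its own S locations.
        xs = self.norm1(x).reshape(B * T, S, C)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(B, T, S, C)

        # Temporal pass: each spatial location attends over its T steps.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(B * S, T, C)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(B, S, T, C).permute(0, 2, 1, 3)
        return x


if __name__ == "__main__":
    block = AxisFactorizedAttention(dim=64)
    feats = torch.randn(2, 8, 16 * 16, 64)  # B=2, T=8, S=256, C=64
    print(block(feats).shape)  # torch.Size([2, 8, 256, 64])
```

The key structural point the sketch illustrates is that no attention matrix ever spans the full $S \times T$ token set: the spatial pass batches over time and the temporal pass batches over space, so each pass touches only one axis at a time.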