Traditional autonomous driving pipelines decouple camera design from downstream perception, relying on fixed optics and handcrafted ISPs that prioritize human-viewable imagery over machine semantics. This separation discards information during demosaicing, denoising, and quantization, while forcing models to adapt to sensor artifacts. We present a task-driven co-design framework that unifies optics, sensor modeling, and a lightweight semantic segmentation network in a single end-to-end RAW-to-task pipeline. Building on DeepLens [19], our system integrates realistic cellphone-scale lens models, learnable color filter arrays, Poisson-Gaussian noise processes, and quantization, all optimized directly for the segmentation objective. Evaluations on KITTI-360 show consistent mIoU improvements over fixed pipelines, with optics modeling and CFA learning providing the largest gains, especially for thin or low-light-sensitive classes. Importantly, these robustness gains are achieved with a compact ~1M-parameter model running at ~28 FPS, demonstrating suitability for edge deployment. Visual and quantitative analyses further show how co-designed sensors adapt acquisition to semantic structure, sharpening boundaries and maintaining accuracy under blur, noise, and low bit depth. Together, these findings establish full-stack co-optimization of optics, sensors, and networks as a principled path toward efficient, reliable, and deployable perception in autonomous systems.
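To make the pipeline composition concrete, the following is a minimal sketch of how such an end-to-end RAW-to-task system could be wired together: a learnable CFA, Poisson-Gaussian noise injection, quantization with a straight-through gradient, and a small segmentation head, all trained jointly. All module names, shapes, and hyperparameters here are illustrative assumptions, not the paper's actual implementation, and the differentiable lens model is stubbed out with an identity.

```python
# Illustrative sketch only: components and parameters are assumptions,
# not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LearnableCFA(nn.Module):
    """Learnable 2x2 color filter array: per-site softmax over RGB."""
    def __init__(self, tile=2):
        super().__init__()
        self.tile = tile
        self.logits = nn.Parameter(torch.randn(3, tile, tile) * 0.1)

    def forward(self, rgb):                      # rgb: (B, 3, H, W)
        B, _, H, W = rgb.shape
        weights = F.softmax(self.logits, dim=0)  # (3, tile, tile)
        weights = weights.repeat(1, H // self.tile, W // self.tile)
        # Single-channel RAW mosaic: (B, 1, H, W)
        return (rgb * weights.unsqueeze(0)).sum(1, keepdim=True)


def poisson_gaussian_noise(x, a=0.01, b=1e-4):
    """Heteroscedastic Gaussian approximation: variance = a * signal + b."""
    sigma = torch.sqrt(a * x.clamp(min=0) + b)
    return x + sigma * torch.randn_like(x)


def quantize_ste(x, bits=8):
    """Uniform quantization with a straight-through gradient estimator."""
    levels = 2 ** bits - 1
    q = torch.round(x.clamp(0, 1) * levels) / levels
    return x + (q - x).detach()


class TinySegNet(nn.Module):
    """Placeholder for the compact (~1M-parameter) segmentation network."""
    def __init__(self, num_classes=19):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, num_classes, 1),
        )

    def forward(self, raw):
        return self.net(raw)


class RawToTaskPipeline(nn.Module):
    """Optics -> CFA -> noise -> quantization -> segmentation, trained jointly.
    A differentiable lens/PSF model (e.g. DeepLens-style) would replace the
    identity `optics` stand-in below."""
    def __init__(self, num_classes=19, bits=8):
        super().__init__()
        self.optics = nn.Identity()  # stand-in for a differentiable lens model
        self.cfa = LearnableCFA()
        self.seg = TinySegNet(num_classes)
        self.bits = bits

    def forward(self, scene_rgb):
        x = self.optics(scene_rgb)
        raw = self.cfa(x)
        raw = poisson_gaussian_noise(raw)
        raw = quantize_ste(raw, self.bits)
        return self.seg(raw)


# Toy training step: the segmentation loss back-propagates through the
# sensor model into the CFA (and, in the full system, the lens parameters).
model = RawToTaskPipeline()
scene = torch.rand(2, 3, 64, 64)
labels = torch.randint(0, 19, (2, 64, 64))
loss = F.cross_entropy(model(scene), labels)
loss.backward()
```

The key design point this sketch illustrates is that every acquisition stage is differentiable (or given a straight-through gradient), so a single task loss can shape the sensor as well as the network.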