gpu_ext: Extensible OS Policies for GPUs via eBPF

Performance in modern GPU-centric systems increasingly depends on resource management policies, including memory placement, scheduling, and observability. However, uniform policies typically yield suboptimal performance across diverse workloads. Existing approaches present a tradeoff: user-space runtimes provide programmability and flexibility but lack cross-tenant visibility and fine-grained control of hardware resources, while modifications to the OS kernel introduce significant complexity and safety risks. To address this, we argue that the GPU driver and device layer should provide an extensible OS interface for policy enforcement. Although eBPF is a promising basis for such an interface, applying existing host-side eBPF directly is insufficient because it lacks visibility into and control over critical device-side events, and embedding policy code directly into GPU kernels could compromise safety and efficiency. We propose gpu_ext, an eBPF-based runtime that treats the GPU driver and device as a programmable OS subsystem. gpu_ext extends GPU drivers by exposing safe programmable hooks and introduces a device-side eBPF runtime capable of executing verified policy logic within GPU kernels, enabling coherent and transparent policies. Evaluation across realistic workloads, including inference, training, and vector search, demonstrates that gpu_ext improves throughput by up to 4.8x and reduces tail latency by up to 2x, with low overhead and without modifying or restarting applications.
