Modern distributed data processing systems struggle to balance performance, maintainability, and developer productivity when integrating machine learning at scale. These challenges intensify in large collaborative environments due to high communication overhead and coordination complexity. We present a "Declarative Data Pipeline" (DDP) architecture that addresses these challenges while processing billions of records efficiently. Our modular framework seamlessly integrates machine learning within Apache Spark using logical computation units called Pipes, departing from traditional microservice approaches. By establishing clear component boundaries and standardized interfaces, we achieve modularity and optimization without sacrificing maintainability. Enterprise case studies demonstrate substantial improvements: 50% better development efficiency, collaboration efforts compressed from weeks to days, 500x scalability improvement, and 10x throughput gains.
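To make the "Pipe" idea concrete, here is a minimal illustrative sketch of a declaratively composed pipeline of logical computation units. The names (`Pipe`, `Pipeline`, `run`) are assumptions for illustration, not the paper's actual API, and plain Python lists of dicts stand in for Spark DataFrames:

```python
# Hypothetical sketch of the DDP "Pipe" abstraction described in the abstract.
# Each Pipe is a self-contained logical computation unit with a clear boundary,
# and a Pipeline composes Pipes declaratively. Lists of dicts stand in for
# Spark DataFrames; all names here are illustrative assumptions.

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]

@dataclass
class Pipe:
    """One logical computation unit with a standardized interface."""
    name: str
    transform: Callable[[List[Record]], List[Record]]

    def run(self, records: List[Record]) -> List[Record]:
        return self.transform(records)

@dataclass
class Pipeline:
    """A declaratively composed sequence of Pipes."""
    pipes: List[Pipe] = field(default_factory=list)

    def add(self, pipe: Pipe) -> "Pipeline":
        self.pipes.append(pipe)
        return self

    def run(self, records: List[Record]) -> List[Record]:
        for pipe in self.pipes:
            records = pipe.run(records)
        return records

# Example: a cleaning Pipe followed by a mock ML-scoring Pipe.
clean = Pipe("clean", lambda rs: [r for r in rs if r.get("text")])
score = Pipe("score", lambda rs: [{**r, "score": len(r["text"]) / 10.0} for r in rs])

pipeline = Pipeline().add(clean).add(score)
result = pipeline.run([{"text": "hello spark"}, {"text": ""}])
```

Because each Pipe's boundary and interface are explicit, individual units can be developed, tested, and optimized independently, which is the modularity property the abstract claims for large collaborative teams.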