The infrastructure powering IBM's Gen AI model development

Talia Gershon, Seetharami Seelam, Brian Belgodere, Milton Bonilla, Lan Hoang, Danny Barnett, I-Hsin Chung, Apoorve Mohan, Ming-Hung Chen, Lixiang Luo, Robert Walkup, Constantinos Evangelinos, Shweta Salaria, Marc Dombrowa, Yoonho Park, Apo Kayi, Liran Schour, Alim Alim, Ali Sydney, Pavlos Maniotis, Laurent Schares, Bernard Metzler, Bengi Karacali-Akyamac, Sophia Wen, Tatsuhiro Chiba, Sunyanan Choochotkaew, Takeshi Yoshimura, Claudia Misale, Tonia Elengikal, Kevin O Connor, Zhuoran Liu, Richard Molina, Lars Schneidenbach, James Caden, Christopher Laibinis, Carlos Fonseca, Vasily Tarasov, Swaminathan Sundararaman, Frank Schmuck, Scott Guthridge, Jeremy Cohn, Marc Eshel, Paul Muench, Runyu Liu, William Pointer, Drew Wyskida, Bob Krull, Ray Rose, Brent Wolfe, William Cornejo, John Walter, Colm Malone, Clifford Perucci, Frank Franco, Nigel Hinds, Bob Calio, Pavel Druyan, Robert Kilduff, John Kienle, Connor McStay, Andrew Figueroa, Matthew Connolly, Edie Fost, Gina Roma, Jake Fonseca, Ido Levy, Michele Payne, Ryan Schenkel, Amir Malki, Lion Schneider, Aniruddha Narkhede, Shekeba Moshref, Alexandra Kisin, Olga Dodin, Bill Rippon, Henry Wrieth, John Ganci, Johnny Colino, Donna Habeger-Rose, Rakesh Pandey, Aditya Gidh, Aditya Gaur, Dennis Patterson, Samsuddin Salmani, Rambilas Varma, Rumana Rumana, Shubham Sharma, Aditya Gaur, Mayank Mishra, Rameswar Panda, Aditya Prasad, Matt Stallone, Gaoyuan Zhang, Yikang Shen, David Cox, Ruchir Puri, Dakshi Agrawal, Drew Thorstensen, Joel Belog, Brent Tang, Saurabh Kumar Gupta, Amitabha Biswas, Anup Maheshwari, Eran Gampel, Jason Van Patten, Matthew Runion, Sai Kaki, Yigal Bogin, Brian Reitz, Steve Pritko, Shahan Najam, Surya Nambala, Radhika Chirra, Rick Welp, Frank DiMitri, Felipe Telles, Amilcar Arvelo, King Chu, Ed Seminaro, Andrew Schram, Felix Eickhoff, William Hanson, Eric Mckeever, Dinakaran Joseph, Piyush Chaudhary, Piyush Shivam, Puneet Chaudhary, Wesley Jones, Robert Guthrie, Chris Bostic, Rezaul Islam, Steve Duersch, Wayne Sawdon, John Lewars, Matthew Klos, Michael Spriggs, Bill McMillan, George Gao, Ashish Kamra, Gaurav Singh, Marc Curry, Tushar Katarki, Joe Talerico, Zenghui Shi, Sai Sindhur Malleni, Erwan Gallen

Source: arXiv. Corresponding authors: Talia Gershon, Seetharami Seelam, Brian Belgodere, Milton Bonilla

AI infrastructure plays a key role in the speed and cost-competitiveness of developing and deploying advanced AI models. The current demand for powerful AI infrastructure for model training is driven by the emergence of generative AI and foundation models, where thousands of GPUs must sometimes cooperate on a single training job for the model to be trained in a reasonable time. Delivering efficient, high-performing AI training requires an end-to-end solution that combines hardware, software, and holistic telemetry to serve multiple types of AI workloads. In this report, we describe IBM's hybrid cloud infrastructure that powers our generative AI model development. This infrastructure includes (1) Vela: an AI-optimized supercomputing capability directly integrated into the IBM Cloud, delivering scalable, dynamic, multi-tenant, and geographically distributed infrastructure for large-scale model training and other AI workflow steps, and (2) Blue Vela: a large-scale, purpose-built, on-premises hosting environment optimized to support our largest and most ambitious AI model training tasks. Vela gives IBM the dual benefit of high performance for internal use and the flexibility to adapt to an evolving commercial landscape. Blue Vela enables rapid development of our largest and most ambitious models and future-proofs us against the evolving model landscape in the industry. Taken together, they allow IBM to innovate rapidly in the development of both AI models and commercial offerings.
