Tuesday November 12, 2024 11:50am - 12:15pm MST
Cloud native takes on new meaning in the AI and HPC domains. What does cloud native mean when your software is tightly coupled to hardware? When capacity is fixed, which assumptions start to break down? How can you flex GPUs between batch training workloads and inference? Join us for a case study demonstrating how a small team scaled ML infrastructure from a single cloud to multiple clusters across 4 cloud providers in under 6 months. We’ll share unique multi-cloud challenges we uncovered around supercomputing infrastructure, cross-cloud networking, capacity & quota management, batch workloads, FinOps, and observability. We will particularly highlight our experience using Kueue to manage fixed capacity across clouds and where Kubernetes still falls short for HPC workloads. Leave with a solid understanding of what it takes for an infrastructure team to support the lifecycle of a cloud native foundation model.
Speakers
Autumn Moulder

Director of Infrastructure & Security, Cohere
Autumn is the Director of Infrastructure & Security at Cohere. She’s been with the company since September 2022, scaling teams & tools. Prior to buying into the startup life, she spent 3 years in financial services and 14 years at a large non-profit. Her passion is helping innovative...
Salt Palace | Level 1 | Grand Ballroom A