Tuesday November 12, 2024 10:40am - 11:05am MST
Developed by the Kubernetes community in collaboration with the ecosystem, Kueue augments Kubernetes and Cluster Autoscaler to provide an end-to-end batch system. Kueue implements job queueing: it decides when jobs should wait and when they should start or be preempted, based on quotas and a hierarchy for sharing resources among teams. An exciting addition in the v0.7 release is fair sharing, designed to support large ML platforms serving multiple teams. Kueue allows platforms to model their teams and achieve high resource utilization, while sharing cost and providing equitable access to unused resources. Teams can always reclaim their guaranteed quotas via preemption. The Kueue v0.7 and Kubernetes v1.31 releases also include performance optimizations to achieve high throughput. In this talk, you will learn about the challenges faced during the design and implementation of fair sharing and preemption, about running this system in production, and about the plans to support complex hierarchies.
Speakers
Aldo Culquicondor

Sr. Software Engineer, Google
Aldo is a Senior Software Engineer at Google. He works on Kubernetes and Google Kubernetes Engine, where he contributes to kube-scheduler, the Job API and other features to support batch, AI/ML and HPC workloads. He is currently a TL at SIG Scheduling and an active member of WG Batch...
Rajat Phull

Engineering Manager, Apple
Rajat Phull is an Engineering Manager at Apple. He works on the Machine Learning Platform team, focusing on GPU resource management and ML training orchestration at scale using Kubernetes.
Salt Palace | Level 1 | Grand Ballroom A