Name: Multitenancy and Fairness at Scale with Kueue: A Case Study - Aldo Culquicondor, Google & Rajat Phull, Apple
Start: 2024-11-12T10:40:00-0700
End: 2024-11-12T11:05:00-0700

Tuesday November 12, 2024 10:40am - 11:05am MST

Salt Palace | Level 1 | Grand Ballroom A

Developed by the Kubernetes community in collaboration with the ecosystem, Kueue augments Kubernetes and ClusterAutoscaler to provide an end-to-end batch system. Kueue implements job queueing, deciding when jobs should wait and when they should start or be preempted, based on quotas and a hierarchy for sharing resources among teams. An exciting addition in the v0.7 release is fair sharing, designed to support large ML platforms serving multiple teams. Kueue allows platforms to model their teams and achieve high resource utilization while sharing costs and providing equitable access to unused resources. Teams can always reclaim their guaranteed quotas via preemption. The Kueue v0.7 and Kubernetes v1.31 releases also include performance optimizations to achieve high throughput. In this talk, you will learn about the challenges faced during the design and implementation of fair sharing and preemption, about this system running in production, and about the plans to support complex hierarchies.

Speakers

Aldo Culquicondor

Sr. Software Engineer, Google

Aldo is a Senior Software Engineer at Google. He works on Kubernetes and Google Kubernetes Engine, where he contributes to kube-scheduler, the Job API and other features to support batch, AI/ML and HPC workloads. He is currently a TL at SIG Scheduling and an active member of WG Batch...

Rajat Phull

Engineering Manager, Apple

Rajat Phull is an Engineering Manager at Apple. He works on the Machine Learning Platform team, focusing on GPU resource management and ML training orchestration at scale using Kubernetes.

Kubecon AI Day Kueue Fair Sharing.pptx pdf

Cloud Native + Kubernetes AI Day, Best practices for ML Infrastructure

Content Experience Level: Intermediate
Event + Breaks: Cloud Native + Kubernetes AI Day

CNCF-hosted Co-located Events North America 2024
