Tuesday November 12, 2024 2:05pm - 2:30pm MST
As LLMs rapidly evolve, K8s' topology management cannot meet their performance demands in several respects: 1. On new-generation high-density processors, NUMA-level affinity is no longer sufficient to guarantee inference performance. 2. The performance bottleneck has shifted from computation to networking, yet K8s does not consider the topology of heterogeneous resources such as GPUs and RDMA NICs.

In this talk, He will introduce how ByteDance significantly improves LLM workload performance by enhancing topology-aware scheduling: 1. For nodes with high-density processors, enforce die-level affinity and anti-affinity between memory-bandwidth-intensive pods (sketched below). 2. For pods within a training job, enforce inter-RDMA affinity at the ToR level to avoid switch congestion. 3. For inference workloads, enforce GPU-RDMA affinity at the PCIe-switch level to enable GPUDirect RDMA for accelerated communication. 4. Achieve job-level topology affinity on top of the K8s scheduler, which operates at the pod level.
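As a rough illustration of item 1 only (the abstract does not include code, and this is not Katalyst's or ByteDance's actual implementation), the following self-contained Go sketch picks a die for a pod: it requires enough free CPUs on the die, steers memory-bandwidth-intensive pods away from dies that already host other bandwidth-heavy pods, and otherwise bin-packs. The Die type and pickDie helper are hypothetical names invented for this sketch.

```go
// Toy sketch of die-level placement with anti-affinity between
// memory-bandwidth-intensive pods. Illustrative only.
package main

import "fmt"

// Die models one die on a high-density processor (hypothetical type).
type Die struct {
	ID            int
	FreeCPUs      int
	BandwidthHogs int // bandwidth-intensive pods already placed on this die
}

// pickDie returns the index of a die with enough free CPUs. For
// bandwidth-intensive pods it prefers the die hosting the fewest other
// bandwidth-intensive pods (anti-affinity); for other pods it prefers the
// fullest feasible die, keeping whole dies free for later.
func pickDie(dies []Die, cpusNeeded int, bandwidthIntensive bool) (int, bool) {
	best, found := -1, false
	for i, d := range dies {
		if d.FreeCPUs < cpusNeeded {
			continue
		}
		if !found {
			best, found = i, true
			continue
		}
		if bandwidthIntensive {
			if d.BandwidthHogs < dies[best].BandwidthHogs {
				best = i
			}
		} else if d.FreeCPUs < dies[best].FreeCPUs {
			best = i
		}
	}
	return best, found
}

func main() {
	dies := []Die{
		{ID: 0, FreeCPUs: 8, BandwidthHogs: 2},
		{ID: 1, FreeCPUs: 16, BandwidthHogs: 0},
	}
	if i, ok := pickDie(dies, 4, true); ok {
		fmt.Printf("placing bandwidth-intensive pod on die %d\n", dies[i].ID)
	}
}
```

In a real system this decision would sit in a node-level resource manager or scheduler plugin and would also weigh NUMA distance and measured memory-bandwidth headroom; the sketch only shows the affinity/anti-affinity shape of the policy.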
Speakers
He Cao

Senior Software Engineer, ByteDance
He Cao is a senior software engineer on the Cloud Native team at ByteDance, a maintainer of Katalyst and KubeZoo, and a member of Istio. He has 5+ years of experience in the cloud native area. Since joining ByteDance, he has designed and implemented several critical systems for VKE...
Salt Palace | Level 1 | Grand Ballroom A