Name: Breaking the 1.5MB Barrier: Running Large Metaflow Flows with Argo for AI/ML Workloads - Saurabh Garg, Outerbounds
Start: 2024-11-12T11:15:00-0700
End: 2024-11-12T11:40:00-0700

Tuesday November 12, 2024 11:15am - 11:40am MST

Salt Palace | Level 2 | 254 B

Managing large-scale batch workflows efficiently is critical for AI/ML workloads. Data preparation for training or fine-tuning models can involve a large number of steps, which makes these pipelines a natural fit for Argo Workflows. However, Argo is constrained by etcd's 1.5MB object size limit, which restricts its ability to run truly large-scale workflows. This talk will delve into the intricacies of this limitation and its impact on AI/ML workflows. We will illustrate with examples how it has been a non-deterministic and frustrating bottleneck for users. To address this challenge, Argo introduced a feature that circumvents the etcd object size restriction. By offloading the bulk of the workflow status to an RDBMS and storing only a reference in etcd, Argo maintains its scaling capabilities while still adhering to Kubernetes' limitations. This talk will provide a comprehensive guide to configuring and using the Argo offloading feature on AWS with Aurora Postgres RDS and EKS.
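
The offloading idea in the abstract can be illustrated with a small sketch. This is not Argo's implementation or configuration: it is a minimal Python model, assuming an in-memory SQLite database as a stand-in for the RDBMS and a plain dict as a stand-in for the workflow object kept in etcd. All names (`make_workflow`, `offload`, `hydrate`, `offloaded_version`, the `node_status` table) are hypothetical and chosen for illustration only.

```python
import hashlib
import json
import sqlite3

# Approximate object size limit quoted in the abstract (assumed to mean bytes).
ETCD_LIMIT_BYTES = 1_500_000


def encoded_size(obj: dict) -> int:
    """Size of the object when serialized as compact JSON."""
    return len(json.dumps(obj, separators=(",", ":")).encode("utf-8"))


def make_workflow(num_nodes: int) -> dict:
    """Build a toy workflow whose status grows with the number of steps."""
    nodes = {
        f"step-{i:05d}": {"phase": "Succeeded", "message": "x" * 80}
        for i in range(num_nodes)
    }
    return {"name": "demo-flow", "status": {"phase": "Running", "nodes": nodes}}


def open_store() -> sqlite3.Connection:
    """SQLite stands in for the relational database in this sketch."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE node_status ("
        "workflow TEXT, version TEXT, nodes TEXT, "
        "PRIMARY KEY (workflow, version))"
    )
    return db


def offload(db: sqlite3.Connection, workflow: dict) -> dict:
    """Move the bulky node status into the database; keep only a reference."""
    payload = json.dumps(workflow["status"]["nodes"], sort_keys=True)
    version = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    db.execute(
        "INSERT OR REPLACE INTO node_status VALUES (?, ?, ?)",
        (workflow["name"], version, payload),
    )
    slim_status = {k: v for k, v in workflow["status"].items() if k != "nodes"}
    slim_status["offloaded_version"] = version
    return {"name": workflow["name"], "status": slim_status}


def hydrate(db: sqlite3.Connection, slim: dict) -> dict:
    """Rebuild the full workflow object from the stored reference."""
    row = db.execute(
        "SELECT nodes FROM node_status WHERE workflow = ? AND version = ?",
        (slim["name"], slim["status"]["offloaded_version"]),
    ).fetchone()
    status = {k: v for k, v in slim["status"].items() if k != "offloaded_version"}
    status["nodes"] = json.loads(row[0])
    return {"name": slim["name"], "status": status}
```

In this model, a workflow with tens of thousands of steps serializes to well over the limit, while the slimmed object that carries only the reference stays at a few hundred bytes regardless of step count, and `hydrate` recovers the full status on demand.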

Speakers

Saurabh Garg

Senior Software Engineer, Outerbounds, Inc.

Argocon Breaking the 1.5MB Barrier pdf

ArgoCon, Data Processing

Content Experience Level Advanced
Event + Breaks ArgoCon

CNCF-hosted Co-located Events North America 2024
