Tuesday November 12, 2024 11:15am - 11:40am MST
Managing large-scale batch workflows efficiently is critical for AI/ML workloads. Data preparation for training or fine-tuning models can involve a large number of steps, which makes it an excellent fit for Argo workflows. However, Argo runs into etcd's 1.5MB object size limit, which restricts its ability to run truly large-scale workflows. This talk will delve into the intricacies of this limitation and its impact on AI/ML workflows, illustrating with examples how it has been a non-deterministic and frustrating bottleneck for users. To address this challenge, Argo introduced a feature that circumvents the etcd object size restriction: by offloading the bulk of the workflow status to an RDBMS and storing only a reference in etcd, Argo retains its ability to scale while still adhering to Kubernetes' limitations. The talk will provide a comprehensive guide to configuring and using the Argo offloading feature on AWS with Aurora Postgres RDS and EKS.
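The offloading idea in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Argo's implementation: an in-memory SQLite database stands in for the RDBMS, a plain dict stands in for the object stored in etcd, and every name here (`offload`, `hydrate`, `offloaded_version`, the `node_status` table) is hypothetical.

```python
import hashlib
import json
import sqlite3

# Object size ceiling cited in the abstract (approximate, for illustration).
ETCD_LIMIT_BYTES = 1_500_000


def make_store():
    """Create an in-memory SQLite table standing in for the RDBMS."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE node_status (version TEXT PRIMARY KEY, nodes TEXT NOT NULL)"
    )
    return db


def size(obj):
    """Serialized size in bytes, a rough proxy for the stored object size."""
    return len(json.dumps(obj).encode())


def build_workflow(n_nodes):
    """Build a hypothetical workflow object whose status grows with node count."""
    nodes = {
        f"step-{i}": {"phase": "Succeeded", "message": "x" * 100}
        for i in range(n_nodes)
    }
    return {"metadata": {"name": "big"}, "status": {"phase": "Succeeded", "nodes": nodes}}


def offload(db, workflow):
    """Store the bulky node status in the database and keep only a reference."""
    nodes_json = json.dumps(workflow["status"]["nodes"], sort_keys=True)
    version = hashlib.sha256(nodes_json.encode()).hexdigest()[:16]
    db.execute("INSERT OR REPLACE INTO node_status VALUES (?, ?)", (version, nodes_json))
    db.commit()
    slim = dict(workflow)
    slim["status"] = {k: v for k, v in workflow["status"].items() if k != "nodes"}
    slim["status"]["offloaded_version"] = version  # hypothetical field name
    return slim


def hydrate(db, slim):
    """Rebuild the full workflow object by following the stored reference."""
    version = slim["status"]["offloaded_version"]
    row = db.execute(
        "SELECT nodes FROM node_status WHERE version = ?", (version,)
    ).fetchone()
    full = dict(slim)
    full["status"] = {k: v for k, v in slim["status"].items() if k != "offloaded_version"}
    full["status"]["nodes"] = json.loads(row[0])
    return full
```

With a workflow of 20,000 nodes, `size(build_workflow(20000))` exceeds `ETCD_LIMIT_BYTES`, while the object returned by `offload` stays small regardless of node count, and `hydrate` recovers the original status from the database.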
Speakers
Saurabh Garg

Senior Software Engineer, Outerbounds, Inc.
Salt Palace | Level 2 | 254 B
  ArgoCon, Data Processing
