Data is the lifeblood of machine learning. Yet our system infrastructure for managing and preprocessing training data in ML jobs lags behind the vast advancements in hardware accelerators, software frameworks, and algorithms that optimize model training computations. The input data pipeline in an ML job is responsible for extracting data from storage, transforming data on-the-fly, and loading data to a training node (typically a GPU or TPU). As hardware accelerators continue to provide more FLOPS, feeding data at a sufficient rate to saturate accelerators is increasingly challenging. The high cost of accelerators compared to their CPU hosts makes it particularly important to ensure that they operate at high utilization. Hence, the input pipeline is critical to the end-to-end throughput and cost of ML jobs. In this talk, we will discuss the characteristics of real ML input pipelines from production workloads, which have led to the trend of disaggregating input data processing from model training. I will present recent open-source systems such as tf.data service and Cachew, which leverage a disaggregated system architecture to scale out and optimize data processing within and across jobs. These systems alleviate input bottlenecks and dramatically improve the training time and cost of ML jobs.
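The extract/transform/load structure described above can be illustrated with a minimal, framework-free sketch. This is not the tf.data service or Cachew implementation; it is a hypothetical stand-in that shows the core idea of overlapping input preprocessing (on the CPU) with training steps (on the accelerator) by prefetching transformed examples through a bounded buffer.

```python
import queue
import threading

def prefetching_pipeline(examples, transform, buffer_size=4):
    """Sketch of an ML input pipeline: a background thread extracts and
    transforms examples while the consumer (standing in for the
    accelerator) trains, so preprocessing overlaps with model compute."""
    buf = queue.Queue(maxsize=buffer_size)  # bounded prefetch buffer
    _DONE = object()                        # sentinel marking end of data

    def producer():
        for ex in examples:        # extract from the (simulated) source
            buf.put(transform(ex)) # transform on-the-fly
        buf.put(_DONE)

    threading.Thread(target=producer, daemon=True).start()
    while True:                    # load: the training loop pulls ready items
        item = buf.get()
        if item is _DONE:
            break
        yield item

# Doubling each value stands in for real preprocessing work.
batches = list(prefetching_pipeline(range(5), lambda x: x * 2))
# batches == [0, 2, 4, 6, 8]
```

When the producer runs on separate machines rather than a background thread on the same host, this becomes the disaggregated architecture the talk describes: preprocessing workers can then be scaled out independently of the (expensive) training nodes.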
Ana Klimovic is an Assistant Professor in the Systems Group of the Computer Science Department at ETH Zurich. Her research interests span operating systems, computer architecture, and their intersection with machine learning. Ana's work focuses on computer system design for large-scale applications such as cloud computing services, data analytics, and machine learning. Before joining ETH in August 2020, Ana was a Research Scientist at Google Brain and completed her Ph.D. in Electrical Engineering at Stanford University.