Parallelism in ML Training
Modern ML training requires splitting work across clusters of GPUs, both to process data faster and to fit model state into GPU memory. In ML systems, “parallelism” can mean several different things. Data and model parallelism are forms of inter-device parallelism: they distribute work across devices. Getting good performance also depends on intra-device parallelism: overlapping compute and communication operators on each GPU. This post gives an overview of both kinds of parallelism, then describes my work on abstractions for tuning inter- and intra-device parallelism strategies. This work is currently under submission to NSDI.