This event is hosted by SF Big Analytics Group. https://www.meetup.com/SF-Big-Analytics
Recent years have witnessed an exponential growth of the model scale in recommendation/Ads/search—from Google’s 2016 model with 1 billion parameters to the latest Facebook’s model with 12 trillion parameters. Significant quality boost has come with each jump of the model capacity, which makes people believe the era of 100 trillion parameters is around the corner. To prepare the exponential growth of the model size, an efficient distributed training system is in urgent need. However, the training of such huge models is challenging even within industrial scale data centers.
In this talk, I will introduce Persia -- an open training system developed by my team -- to resolve this challenge by careful co-design of both the optimization algorithm and the distributed system architecture. Persia admits nearly linear speedup properties while scaling the number of workers and the model size. Beside the capability of training 100 trillion parameters, it also shows a clear advantage in efficiency over other open sourced engines.