Over the last three years, supercomputers have become increasingly popular at hyper-scale companies. Amazon built an HPC cloud, Google released its first 100-petaFLOP supercomputer (the TPU Pod), and Facebook made a submission to the Top500 supercomputer list. Why do they like supercomputers? Because the computation of deep learning is very expensive: even with 16 TPUs, BERT training takes more than 3 days. A supercomputer, on the other hand, can perform 10^17 floating-point operations per second. So why can't we just use a supercomputer and finish training deep neural networks in a very short time? The reason is that deep learning does not have enough parallelism to make full use of the millions of processors in a typical modern supercomputer. There are two directions for parallelizing deep learning. Model parallelism is very limited. For data parallelism, current optimizers cannot scale to thousands of processors because large-batch training tends to converge to sharp minima that generalize poorly. In this talk, I will introduce the LARS (Layer-wise Adaptive Rate Scaling) and LAMB (Layer-wise Adaptive Moments optimizer for Batch training) optimizers, which expose more parallelism for deep learning. They not only make deep learning systems scale well, but also achieve higher accuracy. Since 2017, all the ImageNet training-speed world records have been set with LARS. LARS was added to MLPerf, the industry benchmark for fast deep learning. Google used LAMB to reduce BERT training time from 3 days to 76 minutes, and to achieve new state-of-the-art results on the GLUE, RACE, and SQuAD benchmarks. The approaches in this talk are used in state-of-the-art distributed systems at Google, Intel, NVIDIA, Sony, Tencent, and others.
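To make the layer-wise idea concrete, here is a minimal NumPy sketch of a LARS-style update step. It is an illustration, not the reference implementation: the function name, hyperparameter values, and the omission of momentum (which the full LARS algorithm uses) are simplifications for clarity. The key idea is that each layer's learning rate is scaled by a per-layer "trust ratio" of the weight norm to the gradient norm, so no single layer takes a destabilizingly large step when the batch size grows.

```python
import numpy as np

def lars_update(weights, grads, base_lr=0.01, trust_coef=0.001, weight_decay=0.0005):
    """One illustrative LARS-style step over a list of per-layer parameter arrays.

    For each layer, the global learning rate is scaled by the trust ratio
    trust_coef * ||w|| / ||g||, so layers whose gradients are large relative
    to their weights take proportionally smaller steps. Momentum, which the
    full LARS algorithm includes, is omitted here for brevity.
    """
    new_weights = []
    for w, g in zip(weights, grads):
        g = g + weight_decay * w  # fold L2 regularization into the gradient
        w_norm = np.linalg.norm(w)
        g_norm = np.linalg.norm(g)
        # Per-layer trust ratio; fall back to 1.0 for degenerate norms.
        if w_norm > 0 and g_norm > 0:
            trust_ratio = trust_coef * w_norm / g_norm
        else:
            trust_ratio = 1.0
        new_weights.append(w - base_lr * trust_ratio * g)
    return new_weights
```

LAMB applies the same trust-ratio normalization on top of Adam-style moment estimates rather than the raw gradient, which is what allows it to scale BERT pre-training to very large batches.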
Yang You is a PhD candidate in the Computer Science Division at UC Berkeley, advised by Prof. James Demmel. His research interests include parallel/distributed algorithms, high-performance computing (HPC), and machine learning. The focus of his current research is scaling up deep neural network training on distributed systems and supercomputers. In 2017, his team broke the world record for ImageNet training speed, which was covered by outlets such as NSF, ScienceDaily, Science NewsLine, and i-programmer. His algorithms are used at many hyper-scale companies, including Google, Facebook, Tencent, and Sony, and his research was mentioned in a Google production release. As first author, he won the IPDPS'15 Best Paper Award (0.8%), the ICPP'18 Best Paper Award (0.3%), and an ICDM'19 Best Paper Candidate distinction, and was an SC'19 Best Student Paper Finalist. His ICPP'18 paper is the most cited paper among all papers published at major HPC conferences (HPDC, ICS, ICPP, IPDPS, PPoPP, SC, etc.) since 2018.
He is an ACM/IEEE George Michael HPC Fellow and a Siebel Scholar.