TY - JOUR
T1 - Stochastic Gradient Descent for Nonconvex Learning without Bounded Gradient Assumptions
AU - Lei, Yunwen
AU - Hu, Ting
AU - Li, Guiying
AU - Tang, Ke
N1 - Funding Information:
This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFB1003102, in part by the National Natural Science Foundation of China under Grant 11571078, Grant 11671307, Grant 61672478, and Grant 61806091, in part by the Program for University Key Laboratory of Guangdong Province under Grant 2017KSYS008, in part by the Program for Guangdong Introducing Innovative and Entrepreneurial Teams under Grant 2017ZT07X386, and in part by the Alexander von Humboldt Foundation. (Corresponding author: Guiying Li.)
Publisher Copyright:
© 2019 IEEE.
PY - 2020/10
Y1 - 2020/10
N2 - Stochastic gradient descent (SGD) is a popular and efficient method with wide applications in training deep neural networks and other nonconvex models. While the behavior of SGD is well understood in the convex learning setting, the existing theoretical results for SGD applied to nonconvex objective functions are far from mature. For example, existing results require imposing a nontrivial assumption of uniform boundedness of gradients for all iterates encountered in the learning process, which is hard to verify in practical implementations. In this article, we establish a rigorous theoretical foundation for SGD in nonconvex learning by showing that this boundedness assumption can be removed without affecting convergence rates, while also relaxing the standard smoothness assumption to Hölder continuity of gradients. In particular, we establish sufficient conditions for almost sure convergence as well as optimal convergence rates for SGD applied to both general nonconvex and gradient-dominated objective functions. Linear convergence is further derived in the case of zero variance.
AB - Stochastic gradient descent (SGD) is a popular and efficient method with wide applications in training deep neural networks and other nonconvex models. While the behavior of SGD is well understood in the convex learning setting, the existing theoretical results for SGD applied to nonconvex objective functions are far from mature. For example, existing results require imposing a nontrivial assumption of uniform boundedness of gradients for all iterates encountered in the learning process, which is hard to verify in practical implementations. In this article, we establish a rigorous theoretical foundation for SGD in nonconvex learning by showing that this boundedness assumption can be removed without affecting convergence rates, while also relaxing the standard smoothness assumption to Hölder continuity of gradients. In particular, we establish sufficient conditions for almost sure convergence as well as optimal convergence rates for SGD applied to both general nonconvex and gradient-dominated objective functions. Linear convergence is further derived in the case of zero variance.
KW - Learning theory
KW - nonconvex optimization
KW - Polyak–Łojasiewicz condition
KW - stochastic gradient descent (SGD)
UR - http://www.scopus.com/inward/record.url?scp=85092680343&partnerID=8YFLogxK
U2 - 10.1109/TNNLS.2019.2952219
DO - 10.1109/TNNLS.2019.2952219
M3 - Journal article
C2 - 31831449
AN - SCOPUS:85092680343
SN - 2162-237X
VL - 31
SP - 4394
EP - 4400
JO - IEEE Transactions on Neural Networks and Learning Systems
JF - IEEE Transactions on Neural Networks and Learning Systems
IS - 10
ER -