TY - JOUR
T1 - Generalization Performance of Multi-pass Stochastic Gradient Descent with Convex Loss Functions
AU - Lei, Yunwen
AU - Hu, Ting
AU - Tang, Ke
N1 - Funding Information:
We thank the anonymous reviewers and the editor for their constructive comments. This work is supported in part by the National Natural Science Foundation of China (Grant Nos. 61806091, 12071356) and the MOE University Scientific-Technological Innovation Plan Program. Yunwen Lei also acknowledges support from the Alexander von Humboldt Foundation. Parts of the results in this paper were presented at Advances in Neural Information Processing Systems 31 (2018), 1526–1536. The corresponding author is Ting Hu.
Publisher Copyright:
© 2021 Yunwen Lei, Ting Hu and Ke Tang.
PY - 2021/1
Y1 - 2021/1
N2 - Stochastic gradient descent (SGD) has become the method of choice to tackle large-scale datasets due to its low computational cost and good practical performance. Learning rate analysis, either capacity-independent or capacity-dependent, provides a unifying viewpoint to study the computational and statistical properties of SGD, as well as the implicit regularization by tuning the number of passes. Existing capacity-independent learning rates require a nontrivial bounded subgradient assumption and a smoothness assumption to be optimal. Furthermore, existing capacity-dependent learning rates are only established for the specific least squares loss with a special structure. In this paper, we provide both optimal capacity-independent and capacity-dependent learning rates for SGD with general convex loss functions. Our results require neither bounded subgradient assumptions nor smoothness assumptions, and are stated with high probability. We achieve this improvement by a refined estimate on the norm of SGD iterates based on a careful martingale analysis and concentration inequalities on empirical processes.
KW - Generalization bound
KW - Learning theory
KW - Stochastic gradient descent
UR - http://www.scopus.com/inward/record.url?scp=85105877079&partnerID=8YFLogxK
M3 - Journal article
AN - SCOPUS:85105877079
SN - 1532-4435
VL - 22
JO - Journal of Machine Learning Research
JF - Journal of Machine Learning Research
ER -