
Control Batch Size and Learning Rate to Generalize Well: Theoretical and Empirical Evidence

Fengxiang He, Tongliang Liu, Dacheng Tao

UBTECH Sydney AI Centre, School of Computer Science, Faculty of Engineering
The University of Sydney, Darlington, NSW 2008, Australia
{fengxiang.he, tongliang.liu, dacheng.tao}@sydney.edu.au

Abstract

Deep neural networks have received dramatic success based on the optimization method of stochastic gradient descent (SGD). However, it is still not clear how to tune hyper-parameters, especially batch size and learning rate, to ensure good generalization. This paper reports both theoretical and empirical evidence of a training strategy that we should control the ratio of batch size to learning rate not too large to achieve a good generalization ability. Specifically, we prove a PAC-Bayes generalization bound for neural networks trained by SGD, which has a positive correlation with the ratio of batch size to learning rate. This correlation builds the theoretical foundation of the training strategy. Furthermore, we conduct a large-scale experiment to verify the correlation and training strategy. We trained 1,600 models based on architectures ResNet-110 and VGG-19 with datasets CIFAR-10 and CIFAR-100 while strictly controlling unrelated variables. Accuracies on the test sets are collected for the evaluation. Spearman's rank-order correlation coefficients and the corresponding p values on 164 groups of the collected data demonstrate that the correlation is statistically significant, which fully supports the training strategy.

1 Introduction

The recent decade saw dramatic success of deep neural networks [9] based on the optimization method of stochastic gradient descent (SGD) [2, 32]. It is an interesting and important problem how to tune the hyper-parameters of SGD to make neural networks generalize well. Some works have addressed strategies for tuning hyper-parameters [5, 10, 14, 15] and the generalization ability of SGD [4, 11, 19, 26, 27]. However, there still lacks solid evidence for training strategies regarding the hyper-parameters of neural networks.

In this paper, we present both theoretical and empirical evidence for a training strategy for deep neural networks: when employing SGD to train deep neural networks, we should control the batch size not too large and the learning rate not too small, in order to make the networks generalize well.

This strategy gives a guide to tune the hyper-parameters that helps neural networks achieve good test performance when the training error has been small. It is derived from the following property: the generalization ability of deep neural networks has a negative correlation with the ratio of batch size to learning rate.

As regards the theoretical evidence, we prove a novel PAC-Bayes [24, 25] upper bound for the generalization error of deep neural networks trained by SGD. The proposed generalization bound has a positive correlation with the ratio of batch size to learning rate.


where all $(X_i, Y_i)$ constitute the training sample T.

Equivalently, the empirical risk is the average loss on the training sample,
$$\hat{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} l(F_\theta(X_i), Y_i).$$
The empirical risk $\hat{R}$ is the error of the algorithm on the training data, while the expected risk $R$ is the expectation of the error on test data or unseen data. Therefore, the difference between them is an informative index to express the generalization ability of the algorithm, which is called the generalization error. The upper bound of the generalization error (usually called the generalization bound) expresses how large the generalization error can possibly be. Therefore, the generalization bound is also an important index to show the generalization ability of an algorithm.

Stochastic gradient descent. To optimize the expected risk (eq. 1), a natural tool is gradient descent (GD). Specifically, the gradient of eq. (1) in terms of the parameter $\theta$ and the corresponding update equation are defined as follows,
$$g(\theta(t)) \triangleq \nabla_{\theta(t)} R(\theta(t)) = \nabla_{\theta(t)} \mathbb{E}_{(X,Y)}\, l(F_{\theta(t)}(X), Y), \qquad (5)$$
$$\theta(t+1) = \theta(t) - \eta\, g(\theta(t)), \qquad (6)$$
where $\theta(t)$ is the parameter at iteration $t$ and $\eta > 0$ is the learning rate.

Stochastic gradient descent (SGD) estimates the gradient $g(\theta)$ from mini-batches of the training sample set. Let $S$ be the indices of a mini-batch, in which all indices are independently and identically (i.i.d.) drawn from $\{1, 2, \ldots, N\}$, where $N$ is the training sample size. Then, similar to the gradient, the iteration of SGD on the mini-batch $S$ is defined as follows,
$$\hat{g}_S(\theta(t)) = \nabla_{\theta(t)} \hat{R}(\theta(t)) = \frac{1}{|S|} \sum_{n \in S} \nabla_{\theta(t)} l(F_{\theta(t)}(X_n), Y_n), \qquad (7)$$
$$\theta(t+1) = \theta(t) - \eta\, \hat{g}_S(\theta(t)), \qquad (8)$$
where $\hat{R}(\theta) = \frac{1}{|S|} \sum_{n \in S} l(F_\theta(X_n), Y_n)$ is the empirical risk on the mini-batch and $|S|$ is the cardinality of the set $S$. For brevity, we rewrite $l(F_\theta(X_n), Y_n) = l_n(\theta)$ in the rest of this paper.

Also, suppose that in step $i$, the distribution of the parameter is $Q_i$, the initial distribution is $Q_0$, and the convergent distribution is $Q$. Then SGD is used to find $Q$ from $Q_0$ through a series of $Q_i$.
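As a concrete illustration of eqs. (7)-(8), the following NumPy sketch runs the mini-batch SGD iteration on a toy quadratic loss; the per-example loss $l_n(\theta) = \frac{1}{2}\|\theta - X_n\|^2$ and all constants are hypothetical stand-ins, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 10                      # training-sample size and parameter dimension
X = rng.normal(size=(N, d))          # toy data; l_n(theta) = 0.5 * ||theta - X_n||^2

def grad_ln(theta, n):
    """Per-example gradient of the toy quadratic loss l_n."""
    return theta - X[n]

def sgd_step(theta, lr, batch_size):
    """One SGD iteration: eq. (7) forms g_hat_S, eq. (8) applies theta <- theta - eta * g_hat_S."""
    S = rng.choice(N, size=batch_size, replace=True)   # indices drawn i.i.d. from {1, ..., N}
    g_hat = np.mean([grad_ln(theta, n) for n in S], axis=0)
    return theta - lr * g_hat

theta = rng.normal(size=d)
for t in range(2000):
    theta = sgd_step(theta, lr=0.1, batch_size=16)

# For this loss the empirical minimizer is the sample mean of X.
print("distance to empirical minimizer:", np.linalg.norm(theta - X.mean(axis=0)))
```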

3 Theoretical Evidence

In this section, we explore and develop the theoretical foundations for the training strategy. The main ingredient is a PAC-Bayes generalization bound for deep neural networks based on the optimization method SGD. The generalization bound has a positive correlation with the ratio of batch size to learning rate. This correlation suggests the presented training strategy.

3.1 A Generalization Bound for SGD

Apparently, both $l_n(\theta)$ and $\hat{R}(\theta)$ are unbiased estimations of the expected risk $R(\theta)$, while $\nabla l_n(\theta)$ and $\hat{g}_S(\theta)$ are both unbiased estimations of the gradient $g(\theta) = \nabla R(\theta)$:
$$\mathbb{E}[l_n(\theta)] = \mathbb{E}\big[\hat{R}(\theta)\big] = R(\theta), \qquad (9)$$
$$\mathbb{E}[\nabla l_n(\theta)] = \mathbb{E}[\hat{g}_S(\theta)] = g(\theta) = \nabla R(\theta), \qquad (10)$$
where the expectations are in terms of the corresponding examples $(X, Y)$.

An assumption (see, e.g., [7, 23]) for the estimations is that all the gradients $\{\nabla l_n(\theta)\}$ calculated from individual data points are i.i.d. drawn from a Gaussian distribution centred at $g(\theta) = \nabla R(\theta)$:
$$\nabla l_n(\theta) \sim \mathcal{N}(g(\theta), C), \qquad (11)$$
where $C$ is the covariance matrix and is a constant matrix for all $\theta$. As covariance matrices are (semi) positive-definite, for brevity, we suppose that $C$ can be factorized as $C = BB^\top$. This assumption can be justified by the central limit theorem when the sample size $N$ is large enough compared with the batch size $|S|$. Considering deep learning is usually utilized to process large-scale data, this assumption approximately holds in real-life cases.

Therefore, the stochastic gradient is also drawn from a Gaussian distribution centred at $g(\theta)$:
$$\hat{g}_S(\theta) = \frac{1}{|S|} \sum_{n \in S} \nabla l_n(\theta) \sim \mathcal{N}\Big(g(\theta), \frac{1}{|S|} C\Big). \qquad (12)$$
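The $1/|S|$ scaling of the noise covariance in eq. (12) can be checked numerically: averaging $|S|$ i.i.d. per-example gradients shrinks the empirical covariance by roughly a factor of $|S|$. The sketch below reuses the toy quadratic loss from the previous example; it illustrates the assumption only and is not tied to the paper's data.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 5000, 5
X = rng.normal(size=(N, d))
theta = np.zeros(d)

grads = theta - X                              # per-example gradients at the current theta
C = np.cov(grads, rowvar=False)                # empirical covariance of individual gradients

def minibatch_grad(batch_size):
    """Average of batch_size per-example gradients drawn i.i.d. with replacement."""
    S = rng.choice(N, size=batch_size, replace=True)
    return grads[S].mean(axis=0)

for S_size in (1, 4, 16, 64):
    samples = np.stack([minibatch_grad(S_size) for _ in range(20000)])
    cov_hat = np.cov(samples, rowvar=False)
    ratio = np.trace(cov_hat) / np.trace(C)
    print(f"|S| = {S_size:3d}  tr(Cov(g_S)) / tr(C) = {ratio:.3f}  (prediction 1/|S| = {1.0 / S_size:.3f})")
```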

SGD uses the stochastic gradient $\hat{g}_S(\theta)$ to iteratively update the parameter $\theta$ in order to minimize the function $R(\theta)$:
$$\Delta\theta(t) = \theta(t+1) - \theta(t) = -\eta\, \hat{g}_S(\theta(t)) = -\eta\, g(\theta) + \frac{\eta}{\sqrt{|S|}}\, B W, \quad W \sim \mathcal{N}(0, I). \qquad (13)$$
In this paper, we only consider the case that the batch size $|S|$ and the learning rate $\eta$ are constant. Eq. (13) expresses a stochastic process which is well known as the Ornstein-Uhlenbeck process [33].

Furthermore, we assume that the loss function in the local region around the minimum is convex and 2-order differentiable:
$$R(\theta) = \frac{1}{2}\, \theta^\top A\, \theta, \qquad (14)$$
where $A$ is the Hessian matrix around the minimum and is a (semi) positive-definite matrix. This assumption has been primarily demonstrated by empirical works (see [18, p. 1, Figures 1(a) and 1(b), and p. 6, Figures 4(a) and 4(b)]). Without loss of generality, we assume that the global minimum of the objective function $R(\theta)$ is 0 and is achieved at $\theta = 0$. General cases can be obtained by translation operations, which would not change the geometry of the objective function and the corresponding generalization ability.

From the results on the Ornstein-Uhlenbeck process, eq. (13) has an analytic stationary distribution:
$$q(\theta) = M \exp\Big(-\frac{1}{2}\, \theta^\top \Sigma^{-1} \theta\Big), \qquad (15)$$
where $M$ is the normalizer [8].

Estimating SGD as a continuous-time stochastic process dates back to works by [17, 21]. For a detailed justification, please refer to a recent work [see 23, pp. 6-8, Section 3.2].

We then obtain a generalization bound for SGD as follows.

Theorem 1. For any positive real $\delta \in (0, 1)$, with probability at least $1 - \delta$ over a training sample set of size $N$, we have the following inequality for the distribution $Q$ of the output hypothesis function of SGD:
$$R(Q) \le \hat{R}(Q) + \sqrt{\frac{\frac{\eta}{|S|}\,\mathrm{tr}(C A^{-1}) - 2\log(\det(\Sigma)) - 2d + 4\log\frac{1}{\delta} + 4\log N + 8}{8N - 4}}, \qquad (16)$$
and
$$\Sigma A + A \Sigma = \frac{\eta}{|S|}\, C, \qquad (17)$$
where $A$ is the Hessian matrix of the loss function around the local minimum, $B$ is from the covariance matrix of the gradients calculated by single sample points, and $d$ is the dimension of the parameter $\theta$ (network size).

The proof for this generalization bound has two parts: (1) utilize results from stochastic differential equations (SDEs) to find the stationary solution of the latent Ornstein-Uhlenbeck process (eq. 13), which expresses the iterative update of SGD; and (2) adapt the PAC-Bayes framework to obtain the generalization bound based on the stationary distribution. A detailed proof is omitted here and is given in Appendix B.1.
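Eq. (17) characterizes the stationary covariance $\Sigma$ of the process in eq. (13). As a sanity check, the sketch below simulates the discrete update on a toy quadratic objective and compares the empirical stationary covariance against $(\eta/|S|) C$. The relation comes from the continuous-time approximation, so the discrete simulation matches it only approximately for small learning rates; $A$, $B$, $\eta$, and $|S|$ below are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d, lr, batch = 3, 0.01, 8

# Toy positive-definite Hessian A and gradient-noise covariance C = B B^T.
A = np.diag([1.0, 2.0, 4.0])
B = rng.normal(size=(d, d)) * 0.5
C = B @ B.T

theta = np.zeros(d)
samples = []
for t in range(200000):
    noise = B @ rng.normal(size=d) / np.sqrt(batch)   # N(0, C / |S|) noise on the gradient
    theta = theta - lr * (A @ theta + noise)           # eq. (13) with g(theta) = A theta
    if t > 20000:                                      # discard burn-in before the chain mixes
        samples.append(theta)

Sigma = np.cov(np.array(samples), rowvar=False)        # empirical stationary covariance
lhs = Sigma @ A + A @ Sigma
rhs = (lr / batch) * C
print("relative error of Sigma A + A Sigma = (eta/|S|) C :",
      np.linalg.norm(lhs - rhs) / np.linalg.norm(rhs))
```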

3.2 A Special Case of the Generalization Bound

In this subsection, we study a special case with two more assumptions for further understanding the influence of the gradient fluctuation on our proposed generalization bound.

Assumption 1 can be translated as saying that both the local geometry around the global minima and the stationary distribution are homogeneous in every dimension of the parameter space. Similar assumptions are also used by a recent work [14]. This assumption indicates that the product $\Sigma A$ of the matrices $A$ and $\Sigma$ is also symmetric. Based on Assumption 1, we can further get the following theorem.

Theorem 2. When Assumption 1 holds, under all the conditions of Theorem 1, the stationary distribution of SGD has the following generalization bound,
$$R(Q) \le \hat{R}(Q) + \sqrt{\frac{\frac{\eta}{2|S|}\,\mathrm{tr}(C A^{-1}) + d\log\frac{2|S|}{\eta} - \log(\det(C A^{-1})) - d + 2\log\frac{1}{\delta} + 2\log N + 4}{4N - 2}}. \qquad (18)$$
A detailed proof is omitted here and is given in Appendix B.2 in the supplementary materials.

Intuitively, our generalization bound links the generalization ability of the deep neural networks trained by SGD with three factors:

Local geometry around the minimum. The determinant of the Hessian matrix $A$ expresses the local geometry of the objective function around the local minimum. Specifically, the magnitude of $\det(A)$ expresses the sharpness of the local minima. Many works suggest that sharp local minima relate to poor generalization ability [15, 10].

Gradient fluctuation. The covariance matrix $C$ (or equivalently the matrix $B$) expresses the fluctuation of the estimation of the gradient from individual data points, which is the source of gradient noise. A recent intuition for the advantage of SGD is that it introduces noise into the gradient, so that it can jump out of bad local minima.

Hyper-parameters. The batch size $|S|$ and the learning rate $\eta$ adjust the fluctuation of the gradient. Specifically, under the following assumption, our generalization bound has a positive correlation with the ratio of batch size to learning rate.

Assumption 2. The network size is large enough:
$$d > \frac{\eta\, \mathrm{tr}(C A^{-1})}{2|S|}, \qquad (19)$$
where $d$ is the number of the parameters, $C$ expresses the magnitude of individual gradient noise, $A$ is the Hessian matrix around the global minima, $\eta$ is the learning rate, and $|S|$ is the batch size.

This assumption is justified by the fact that the network sizes of neural networks are usually extremely large. This property is also called overparametrization [6, 3, 1]. We can obtain the following corollary by combining Theorem 2 and Assumption 2.

Corollary 1. When all conditions of Theorem 2 and Assumption 2 hold, the generalization bound of the network has a positive correlation with the ratio of batch size to learning rate.

The proof is omitted from the main text and given in Appendix B.3. Corollary 1 reveals the negative correlation between the generalization ability and the ratio. This property further derives the training strategy that we should control the ratio not too large to achieve good generalization when training deep neural networks using SGD.
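To see Corollary 1 concretely, one can plug a toy diagonal $A$ and $C$ into the right-hand side of the reconstructed bound (18) and sweep the ratio $|S|/\eta$; the complexity term increases monotonically in the ratio whenever Assumption 2 holds. All dimensions, matrices, and constants below are hypothetical placeholders chosen only to satisfy Assumption 2, not values from the paper.

```python
import numpy as np

d, N, delta = 1000, 50000, 0.01
A_diag = np.linspace(0.5, 5.0, d)        # toy positive-definite Hessian (diagonal)
C_diag = np.linspace(0.2, 0.8, d)        # toy gradient-noise covariance (diagonal)
tr_CAinv = np.sum(C_diag / A_diag)
logdet_CAinv = np.sum(np.log(C_diag / A_diag))

def bound_term(ratio):
    """Complexity term of eq. (18) as a function of ratio = |S| / eta."""
    inner = (tr_CAinv / (2.0 * ratio)            # (eta / 2|S|) tr(C A^{-1})
             + d * np.log(2.0 * ratio)           # d log(2|S| / eta)
             - logdet_CAinv - d
             + 2 * np.log(1.0 / delta) + 2 * np.log(N) + 4)
    return np.sqrt(inner / (4 * N - 2))

# Under Assumption 2 (d > tr(C A^{-1}) / (2 * ratio)), the term grows with the ratio.
for r in (16 / 0.1, 64 / 0.1, 256 / 0.1, 256 / 0.01):
    print(f"|S|/eta = {r:8.1f}  ->  bound term = {bound_term(r):.4f}")
```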

4 Empirical Evidence

To evaluate the training strategy from the empirical aspect, we conduct extensive systematic experiments to investigate the influence of the batch size and learning rate on the generalization ability of deep neural networks trained by SGD. To deliver rigorous results, our experiments strictly control all unrelated variables. The empirical results show that there is a statistically significant negative correlation between the generalization ability of the networks and the ratio of the batch size to the learning rate, which builds a solid empirical foundation for the training strategy.

Figure 2: Curves of test accuracy against (a) batch size and (b) learning rate. The four rows are respectively for (1) ResNet-110 trained on CIFAR-10, (2) ResNet-110 trained on CIFAR-100, (3) VGG-19 trained on CIFAR-10, and (4) VGG-19 trained on CIFAR-100. Each curve is based on 20 networks.

4.1 Implementation Details

To guarantee that the empirical results generally apply to any case, our experiments are conducted based on two popular architectures, ResNet-110 [12, 13] and VGG-19 [28], on two standard datasets, CIFAR-10 and CIFAR-100 [16], which can be downloaded from https://www.cs.toronto.edu/~kriz/cifar.html. The separations of the training sets and the test sets are the same as the official version.

We trained 1,600 models with 20 batch sizes, S_BS = {16, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 272, 288, 304, 320}, and 20 learning rates, S_LR = {0.01, 0.02, 0.03, 0.04, ..., 0.20}. Accelerating techniques of SGD, such as momentum, are disabled. Also, both batch size and learning rate are constant in our experiments. Every model with a specific pair of batch size and learning rate is trained for 200 epochs. The test accuracies of all 200 epochs are collected for analysis. We select the highest accuracy on the test set to express the generalization ability of each model, since the training error is almost the same across all models (they are all nearly 0).

The collected data is then utilized to investigate three correlations: (1) the correlation between the generalization ability of networks and the batch size, (2) the correlation between the generalization ability of networks and the learning rate, and (3) the correlation between the generalization ability of networks and the ratio of batch size to learning rate, where the first two are preparations for the final one. Specifically, we calculate the Spearman's rank-order correlation coefficients (SCCs) and the corresponding p values of 164 groups of the collected data to investigate the statistical significance of the correlations. Almost all results demonstrate the correlations are statistically significant (p < 0.005)¹. The p values of the correlation between the test accuracy and the ratio are all lower than 10^-180 (see Table 3).

The architectures of our models are similar to a popular implementation of ResNet-110 and VGG-19². Additionally, our experiments are conducted on a computing cluster with GPUs of NVIDIA Tesla V100 16GB and CPUs of Intel Xeon Gold 6140 CPU @ 2.30GHz.

¹ The definition of "statistically significant" has various versions, such as p < 0.05 and p < 0.01. This paper uses a more rigorous one (p < 0.005).
² See Wei Yang, https://github.com/bearpaw/pytorch-classification, 2017.
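A condensed PyTorch sketch of the experimental protocol described in this subsection: a grid over the 20 batch sizes and 20 learning rates, plain SGD with momentum disabled, a constant learning rate, 200 epochs per configuration, and the highest test accuracy recorded for each model. `build_model` and `get_cifar_loaders` are hypothetical helpers standing in for the ResNet-110/VGG-19 construction and the CIFAR data pipeline, which are not specified here.

```python
import torch
import torch.nn.functional as F

BATCH_SIZES = list(range(16, 321, 16))                        # S_BS = {16, 32, ..., 320}
LEARNING_RATES = [round(0.01 * k, 2) for k in range(1, 21)]   # S_LR = {0.01, ..., 0.20}
EPOCHS = 200
device = "cuda" if torch.cuda.is_available() else "cpu"

def evaluate(model, loader):
    """Top-1 accuracy on a test loader."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            x, y = x.to(device), y.to(device)
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

results = {}
for bs in BATCH_SIZES:
    for lr in LEARNING_RATES:
        model = build_model().to(device)                       # hypothetical: ResNet-110 or VGG-19
        train_loader, test_loader = get_cifar_loaders(batch_size=bs)  # hypothetical data helper
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.0)  # momentum disabled
        best_acc = 0.0
        for epoch in range(EPOCHS):
            model.train()
            for x, y in train_loader:
                x, y = x.to(device), y.to(device)
                opt.zero_grad()
                F.cross_entropy(model(x), y).backward()
                opt.step()
            best_acc = max(best_acc, evaluate(model, test_loader))  # keep highest test accuracy
        results[(bs, lr)] = best_acc
```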

4.2 Empirical Results on the Correlation

Correlation between generalization ability and batch size. When the learning rate is fixed as an element of S_LR, we train ResNet-110 and VGG-19 on CIFAR-10 and CIFAR-100 with the 20 batch sizes of S_BS. The plots of test accuracy to batch size are illustrated in Figure 2a. We list 1/4 of all plots due to space limitation. The rest of the plots are in the supplementary materials. We then calculate the SCCs and the p values as in Table 1, where bold p values refer to the statistically significant observations, while underlined ones refer to those not significant (as well as in Table 2). The results clearly show that there is a statistically significant negative correlation between generalization ability and batch size.

Table 1: SCC and p values of batch size to test accuracy for different learning rates (LR).

LR      ResNet-110 / CIFAR-10     ResNet-110 / CIFAR-100    VGG-19 / CIFAR-10         VGG-19 / CIFAR-100
        SCC      p                SCC      p                SCC      p                SCC      p
0.01    -0.96    2.6×10^-11       -0.92    5.6×10^-8        -0.98    3.7×10^-14       -0.99    7.1×10^-18
0.02    -0.96    1.2×10^-11       -0.94    1.5×10^-9        -0.99    3.6×10^-17       -0.99    7.1×10^-18
0.03    -0.96    3.4×10^-11       -0.99    1.5×10^-16       -0.99    7.1×10^-18       -1.00    1.9×10^-21
0.04    -0.98    1.8×10^-14       -0.98    7.1×10^-14       -0.99    9.6×10^-19       -0.99    3.6×10^-17
0.05    -0.98    3.7×10^-14       -0.98    1.3×10^-13       -0.99    7.1×10^-18       -0.99    1.4×10^-15
0.06    -0.96    1.8×10^-11       -0.97    6.7×10^-13       -1.00    1.9×10^-21       -0.99    1.4×10^-15
0.07    -0.98    5.9×10^-15       -0.94    5.0×10^-10       -0.98    8.3×10^-15       -0.97    1.7×10^-12
0.08    -0.97    1.7×10^-12       -0.97    1.7×10^-12       -0.98    2.4×10^-13       -0.97    1.7×10^-12
0.09    -0.97    4.0×10^-13       -0.98    3.7×10^-14       -0.98    1.8×10^-14       -0.96    1.2×10^-11
0.10    -0.97    1.9×10^-12       -0.96    8.7×10^-12       -0.98    8.3×10^-15       -0.93    2.2×10^-9
0.11    -0.97    1.1×10^-12       -0.98    1.3×10^-13       -0.99    2.2×10^-16       -0.93    2.7×10^-9
0.12    -0.97    4.4×10^-12       -0.96    2.5×10^-11       -0.98    7.1×10^-13       -0.90    7.0×10^-8
0.13    -0.94    1.5×10^-9        -0.98    1.3×10^-13       -0.97    1.7×10^-12       -0.89    1.2×10^-7
0.14    -0.97    2.6×10^-12       -0.91    3.1×10^-8        -0.97    6.7×10^-13       -0.86    1.1×10^-6
0.15    -0.96    4.6×10^-11       -0.98    1.3×10^-13       -0.95    8.3×10^-11       -0.79    3.1×10^-5
0.16    -0.95    3.1×10^-10       …        …                …        …                …        …
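The statistical analysis in this section reduces to Spearman rank-order tests on the collected accuracies. A minimal sketch with `scipy.stats.spearmanr` is given below; the `results` dictionary is assumed to have the form produced by the hypothetical training sketch in Section 4.1 and stands in for the paper's collected data.

```python
from scipy.stats import spearmanr

# `results` maps (batch_size, learning_rate) -> best test accuracy.

def scc_accuracy_vs_ratio(results):
    """SCC and p value between test accuracy and the ratio |S| / eta."""
    keys = list(results)
    ratios = [bs / lr for (bs, lr) in keys]
    accs = [results[k] for k in keys]
    scc, p = spearmanr(ratios, accs)
    return scc, p

def scc_accuracy_vs_batch_size(results, lr_fixed):
    """One row of Table 1: fix the learning rate, correlate accuracy with batch size."""
    pairs = sorted((bs, acc) for (bs, lr), acc in results.items() if lr == lr_fixed)
    scc, p = spearmanr([bs for bs, _ in pairs], [acc for _, acc in pairs])
    return scc, p

# Example usage with a hypothetical results dictionary:
# results = {(16, 0.01): 0.91, (32, 0.01): 0.92, ...}
# print(scc_accuracy_vs_batch_size(results, lr_fixed=0.01))
```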