AdderNet: Do We Really Need Multiplications in Deep Learning?
Hanting Chen 1,2*, Yunhe Wang 2*, Chunjing Xu 2†, Boxin Shi 3,4, Chao Xu 1, Qi Tian 2, Chang Xu 5

1 Key Lab of Machine Perception (MOE), Dept. of Machine Intelligence, Peking University. 2 Noah's Ark Lab, Huawei Technologies. 3 NELVT, Dept. of CS, Peking University. 4 Peng Cheng Laboratory. 5 School of Computer Science, Faculty of Engineering, The University of Sydney.

{yunhe.wang, xuchunjing, tian.qi1}@huawei.com

Abstract
Compared with cheap addition operations, multiplication operations are of much higher computational complexity. The widely-used convolutions in deep neural networks are exactly cross-correlations that measure the similarity between input features and convolution filters, which involves massive multiplications between float values. In this paper, we present adder networks (AdderNets) to trade these massive multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs. In AdderNets, we take the ℓ1-norm distance between filters and input features as the output response. The influence of this new similarity measure on the optimization of the neural network has been thoroughly analyzed. To achieve better performance, we develop a special back-propagation approach for AdderNets by investigating the full-precision gradient. We then propose an adaptive learning rate strategy to enhance the training procedure of AdderNets according to the magnitude of each neuron's gradient. As a result, the proposed AdderNets can achieve 74.9% Top-1 accuracy and 91.7% Top-5 accuracy using ResNet-50 on the ImageNet dataset without any multiplication in the convolutional layers. The code is publicly available at: https://github.com/huawei-noah/AdderNet.

1. Introduction
Given the advent of Graphics Processing Units (GPUs), deep convolutional neural networks (CNNs) with billions of floating-point multiplications can receive speed-ups and make important strides in a large variety of computer vision tasks, e.g. image classification [26, 17], object detection [23], segmentation [19], and human face verification [32]. However, the high power consumption of these high-end GPU cards (e.g. 250W+ for the GeForce RTX 2080 Ti) has blocked modern deep learning systems from being deployed on mobile devices, e.g. smartphones, cameras, and watches. Existing GPU cards are far from svelte and cannot be easily mounted on mobile devices. Though the GPU itself only takes up a small part of the card, much other hardware is needed for support, e.g. memory chips, power circuitry, voltage regulators, and other controller chips. It is therefore necessary to study efficient deep neural networks that can run with affordable computation resources on mobile devices.

*Equal contribution. †Corresponding author.

Addition, subtraction, multiplication, and division are the four most basic operations in mathematics. It is widely known that multiplication is slower than addition, but most of the computations in deep neural networks are multiplications between float-valued weights and float-valued activations during forward inference. There are thus many papers on how to trade multiplications for additions to speed up deep learning. The seminal work [5] proposed BinaryConnect to force the network weights to be binary (e.g. -1 or 1), so that many multiply-accumulate operations can be replaced by simple accumulations. After that, Hubara et al. [15] proposed BNNs, which binarized not only weights but also activations in convolutional neural networks at run-time. Moreover, Rastegari et al. [22] introduced scale factors to approximate convolutions using binary operations and outperformed [15, 22] by large margins. Zhou et al. [39] utilized low bit-width gradients to accelerate the training of binarized networks. Cai et al. [4] proposed a half-wave Gaussian quantizer for forward approximation, which achieved much closer performance to full-precision networks. Though binarizing the filters of deep neural networks significantly reduces the computation cost, the original recognition accuracy often cannot be preserved. In addition, the training procedure of binary networks is not stable and usually requires a slower convergence speed with a small learning rate.

Figure 1. Visualization of features in AdderNets and CNNs: (a) features in AdderNets; (b) features in CNNs. Features of CNNs in different classes are divided by their angles. In contrast, features of AdderNets tend to be clustered towards different class centers, since AdderNets use the ℓ1-norm to distinguish different classes. The visualization results suggest that the ℓ1-distance can serve as a similarity measure between the filter and the input feature in deep neural networks.

Convolutions in classical CNNs are actually cross-correlations that measure the similarity of two inputs. Researchers and developers are used to taking convolution as a default operation for extracting features from visual data, and introduce various methods to accelerate the convolution, even at the risk of sacrificing network capability. But there is hardly any attempt to replace convolution with another, more efficient similarity measure that involves only additions. In fact, additions are of much lower computational complexity than multiplications. Thus, we are motivated to investigate the feasibility of replacing multiplications by additions in convolutional neural networks.

In this paper, we propose adder networks that maximize the use of addition while abandoning convolution operations. Given a series of small templates as "filters" in the neural network, the ℓ1-distance can be an efficient measure to summarize the absolute differences between the input signal and the template, as shown in Figure 1. Since subtraction can be easily implemented through addition with complement code, the ℓ1-distance is a hardware-friendly measure that involves only additions, and naturally becomes an efficient alternative to convolution for constructing neural networks. An improved back-propagation scheme with regularized gradients is designed to ensure sufficient updates of the templates and better network convergence. The proposed AdderNets are deployed on several benchmarks, and experimental results demonstrate that AdderNets can achieve recognition accuracy comparable to conventional CNNs.

This paper is organized as follows. Section 2 investigates related works on network compression. Section 3 proposes adder networks, which replace the multiplications in conventional convolution filters with additions. Section 4 evaluates the proposed AdderNets on various benchmark datasets and models, and Section 5 concludes this paper.
2. Related Works
To reduce the computational complexity of convolutional neural networks, a number of works have been proposed to eliminate useless calculations.

Pruning-based methods aim to remove redundant weights to compress and accelerate the original network. Denton et al. [6] decomposed the weight matrices of fully-connected layers into simple calculations by exploiting singular value decomposition (SVD). Han et al. [9] proposed discarding subtle weights in pre-trained deep networks to omit their original calculations without affecting the performance. Wang et al. [31] further converted convolution filters into the DCT frequency domain and eliminated more floating-point multiplications. In addition, Hu et al. [13] discarded redundant filters with less impact to directly reduce the computations brought by these filters. Luo et al. [21] discarded redundant filters according to the reconstruction error. Hu et al. [14] proposed Robust Dynamic Inference Networks (RDI-Nets), which allow each input to adaptively choose one of multiple output layers to output its prediction. Wang et al. [29] proposed an E2-Training method, which can train deep neural networks with over 80% energy savings.

Instead of directly reducing the computational complexity of a pre-trained heavy neural network, many works focused on designing novel blocks or operations to replace the conventional convolution filters. Howard et al. [12] designed MobileNet, which decomposes the conventional convolution filters into point-wise and depth-wise convolution filters with much fewer FLOPs. Zhang et al. [38] combined group convolutions [35] and a channel shuffle operation to build efficient neural networks with fewer computations. Wu et al. [34] presented a parameter-free "shift" operation with zero FLOPs and zero parameters to replace conventional filters and largely reduce the computational and storage cost of CNNs. Wang et al. [30] developed versatile convolution filters to generate more useful features using fewer calculations and parameters. Xu et al. [16] proposed perturbative neural networks to replace convolution, instead computing the response as a weighted linear combination of non-linearly activated, additive-noise-perturbed inputs. Han et al. [8] proposed GhostNet to generate more features from cheap operations and achieve state-of-the-art performance on lightweight architectures.

Besides eliminating redundant weights or filters in deep convolutional neural networks, Hinton et al. [11] proposed the knowledge distillation (KD) scheme, which transfers useful information from a heavy teacher network to a portable student network by minimizing the Kullback-Leibler divergence between their outputs. Besides mimicking the final outputs of the teacher networks, Romero et al. [25] exploited a hint layer to distill the information in the features of the teacher network into the student network. You et al. [37] utilized multiple teachers to guide the training of the student network and achieve better performance. Yim et al. [36] regarded the relationship between features from two layers in the teacher network as a novel kind of knowledge and introduced the FSP (Flow of Solution Procedure) matrix to transfer this kind of information to the student network.

Nevertheless, the networks compressed using these algorithms still contain massive multiplications, which cost enormous computation resources. Subtractions and additions are of much lower computational complexity than multiplications; however, they have not been widely investigated in deep neural networks, especially in the widely used convolutional networks. Therefore, we propose to minimize the number of multiplications in deep neural networks by replacing them with subtractions or additions.

3. Networks without Multiplication
Consider a filter $F \in \mathbb{R}^{d \times d \times c_{in} \times c_{out}}$ in an intermediate layer of the deep neural network, where the kernel size is $d$, the input channel is $c_{in}$, and the output channel is $c_{out}$. The input feature is defined as $X \in \mathbb{R}^{H \times W \times c_{in}}$, where $H$ and $W$ are the height and width of the feature, respectively. The output feature $Y$ indicates the similarity between the filter and the input feature:

$$Y(m,n,t) = \sum_{i=0}^{d} \sum_{j=0}^{d} \sum_{k=0}^{c_{in}} S\big(X(m+i, n+j, k),\, F(i, j, k, t)\big), \quad (1)$$

where $S(\cdot,\cdot)$ is a pre-defined similarity measure. If cross-correlation is taken as the metric of distance, i.e. $S(x,y) = x \times y$, Eq. (1) becomes the convolution operation. Eq. (1) also implies the calculation of a fully-connected layer when $d = 1$. In fact, there are many other metrics for measuring the distance between the filter and the input feature. However, most of these metrics involve multiplications, which bring in more computational cost than additions.

3.1. Adder Networks
We are therefore interested in deploying distance metrics that maximize the use of additions. The ℓ1 distance calculates the sum of the absolute differences of two points' vector representations, and contains no multiplication. Hence, by calculating the ℓ1 distance between the filter and the input feature, Eq. (1) can be reformulated as:

$$Y(m,n,t) = -\sum_{i=0}^{d} \sum_{j=0}^{d} \sum_{k=0}^{c_{in}} \big|X(m+i, n+j, k) - F(i, j, k, t)\big|. \quad (2)$$

Addition is the major operation in the ℓ1 distance measure, since subtraction can be easily reduced to addition by using complement code. With the help of the ℓ1 distance, the similarity between the filters and features can be efficiently computed.

Although both the ℓ1 distance in Eq. (2) and the cross-correlation in Eq. (1) can measure the similarity between filters and inputs, there are some differences in their outputs. The output of a convolution filter, as a weighted summation of values in the input feature map, can be positive or negative, but the output of an adder filter is always negative. Hence, we resort to batch normalization for help: the output of adder layers is normalized to an appropriate range, and all the activation functions used in conventional CNNs can then be used in the proposed AdderNets. Although the batch normalization layer involves multiplications, its computational cost is significantly lower than that of the convolutional layers and can be omitted. Considering a convolutional layer with a filter $F \in \mathbb{R}^{d \times d \times c_{in} \times c_{out}}$, an input $X \in \mathbb{R}^{H \times W \times c_{in}}$, and an output $Y \in \mathbb{R}^{H' \times W' \times c_{out}}$, the computation complexities of convolution and batch normalization are $\mathcal{O}(d^2 c_{in} c_{out} HW)$ and $\mathcal{O}(c_{out} H'W')$, respectively. In practice, given an input channel number $c_{in} = 512$ and a kernel size $d = 3$ in ResNet [10], we have $d^2 c_{in} c_{out} HW / (c_{out} H'W') \approx 4608$. Since batch normalization layers have been widely used in state-of-the-art convolutional neural networks, we can simply upgrade these networks into AdderNets by replacing their convolutional layers with adder layers to speed up inference and reduce the energy cost.

Intuitively, Eq. (1) has a connection with template matching [3] in computer vision, which aims to find the parts of an image that match the template. $F$ in Eq. (1) actually works as a template, and we calculate its matching scores with different regions of the input feature $X$. Since various metrics can be utilized in template matching, it is natural that the ℓ1 distance can be utilized to replace the cross-correlation in Eq. (1). Note that Wang et al. [28] also discussed different metrics in deep networks. However, they focused on achieving high performance by employing complex metrics, while we focus on the ℓ1 distance to minimize the energy consumption.

3.2. Optimization
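Before deriving gradients, the two similarity measures of Eqs. (1) and (2) can be illustrated side by side. The following is a minimal NumPy sketch, not the paper's implementation: the naive loops, stride 1, no padding, and the tensor shapes are assumptions made for clarity.

```python
import numpy as np

def conv_forward(X, F):
    # Eq. (1) with S(x, y) = x * y: plain cross-correlation (convolution layer).
    # X: (H, W, c_in), F: (d, d, c_in, c_out); stride 1, no padding (assumed).
    d, _, c_in, c_out = F.shape
    H, W = X.shape[0] - d + 1, X.shape[1] - d + 1
    Y = np.zeros((H, W, c_out))
    for m in range(H):
        for n in range(W):
            for t in range(c_out):
                Y[m, n, t] = np.sum(X[m:m+d, n:n+d, :] * F[:, :, :, t])
    return Y

def adder_forward(X, F):
    # Eq. (2): negative l1 distance between each patch and each filter --
    # only subtractions, absolute values, and additions are required.
    d, _, c_in, c_out = F.shape
    H, W = X.shape[0] - d + 1, X.shape[1] - d + 1
    Y = np.zeros((H, W, c_out))
    for m in range(H):
        for n in range(W):
            for t in range(c_out):
                Y[m, n, t] = -np.sum(np.abs(X[m:m+d, n:n+d, :] - F[:, :, :, t]))
    return Y

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6, 3))
F = rng.standard_normal((3, 3, 3, 4))
Y = adder_forward(X, F)
print(Y.shape)         # (4, 4, 4)
print((Y <= 0).all())  # True: adder outputs are never positive
```

The last print illustrates the point made in Section 3.1: the adder filter's output is always non-positive, which is why batch normalization is needed before the usual activation functions.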
Neural networks utilize back-propagation to compute the gradients of filters and stochastic gradient descent to update the parameters. In CNNs, the partial derivative of the output features $Y$ with respect to the filters $F$ is calculated as:

$$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = X(m+i, n+j, k), \quad (3)$$

where $i \in [m, m+d]$ and $j \in [n, n+d]$. To achieve a better update of the parameters, it is necessary to derive informative gradients for SGD. In AdderNets, the partial derivative of $Y$ with respect to the filters $F$ is:

$$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = \mathrm{sgn}\big(X(m+i, n+j, k) - F(i,j,k,t)\big), \quad (4)$$

where $\mathrm{sgn}(\cdot)$ denotes the sign function, so the value of the gradient can only take +1, 0, or -1. Considering instead the derivative of the ℓ2-norm:

$$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = X(m+i, n+j, k) - F(i,j,k,t). \quad (5)$$

Eq. (4) can therefore lead to a signSGD [2] update of the ℓ2-norm. However, signSGD almost never takes the direction of steepest descent, and the direction only gets worse as dimensionality grows [1]. It is unsuitable to optimize neural networks with a huge number of parameters using signSGD. Therefore, we propose using Eq. (5) to update the gradients in our AdderNets. The convergence behavior of these two kinds of gradients will be further investigated in the supplementary material. By utilizing the full-precision gradient, the filters can be updated precisely.

Besides the gradient of the filters, the gradient of the input features $X$ is also important for the update of the parameters. Therefore, we also use the full-precision gradient (Eq. (5)) to calculate the gradient of $X$. However, the magnitude of the full-precision gradient may be larger than +1 or -1. Denote the filters and inputs in layer $i$ as $F_i$ and $X_i$. Different from $\partial Y / \partial F_i$, which only affects the gradient of $F_i$ itself, a change of $\partial Y / \partial X_i$ would influence the gradient not only in layer $i$ but also in the layers before layer $i$, according to the gradient chain rule. If we use the full-precision gradient instead of the sign gradient of $\partial Y / \partial X$ for each layer, the magnitude of the gradient in the layers before this layer would be increased, and the discrepancy brought by using the full-precision gradient would be magnified. To this end, we clip the gradient of $X$ to $[-1, 1]$ to prevent gradients from exploding. The partial derivative of the output features $Y$ with respect to the input features $X$ is then calculated as:

$$\frac{\partial Y(m,n,t)}{\partial X(m+i, n+j, k)} = \mathrm{HT}\big(F(i,j,k,t) - X(m+i, n+j, k)\big), \quad (6)$$

where $\mathrm{HT}(\cdot)$ denotes the HardTanh function:

$$\mathrm{HT}(x) = \begin{cases} x & \text{if } -1 < x < 1, \\ 1 & \text{if } x > 1, \\ -1 & \text{if } x < -1. \end{cases} \quad (7)$$

3.3. Adaptive Learning Rate Scaling
Assuming that the filters and input features are independent and identically distributed following normal distributions, the variance of the output of a conventional CNN can be roughly estimated as:

$$\mathrm{Var}[Y_{CNN}] = \sum_{i=0}^{d} \sum_{j=0}^{d} \sum_{k=0}^{c_{in}} \mathrm{Var}[X \times F] = d^2 c_{in}\, \mathrm{Var}[X]\, \mathrm{Var}[F]. \quad (8)$$

If the variance of the weights is $\mathrm{Var}[F] = \frac{1}{d^2 c_{in}}$, the variance of the output is consistent with that of the input, which is beneficial for the information flow in the neural network. In contrast, for AdderNets, the variance of the output can be approximated as:

$$\mathrm{Var}[Y_{AdderNet}] = \sum_{i=0}^{d} \sum_{j=0}^{d} \sum_{k=0}^{c_{in}} \mathrm{Var}[|X - F|] = \sqrt{\frac{\pi}{2}}\, d^2 c_{in}\, \big(\mathrm{Var}[X] + \mathrm{Var}[F]\big), \quad (9)$$

when $F$ and $X$ follow normal distributions. In practice, the variance of the weights $\mathrm{Var}[F]$ is usually very small [7], e.g. $10^{-3}$ or $10^{-4}$ in an ordinary CNN. Hence, compared with multiplying $\mathrm{Var}[X]$ by a small value in Eq. (8), the addition operation in Eq. (9) tends to bring in a much larger variance of outputs in AdderNets.

We next proceed to show the influence of this larger variance of outputs on the update of AdderNets. To promote the effectiveness of activation functions, we introduce batch normalization after each adder layer. Given an input $x$ over a mini-batch $B = \{x_1, \cdots, x_m\}$, the batch normalization layer can be denoted as:

$$y = \gamma \frac{x - \mu_B}{\sigma_B} + \beta, \quad (10)$$

where $\mu_B$ and $\sigma_B$ are the mean and standard deviation over the mini-batch, and $\gamma$ and $\beta$ are learnable parameters.
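The qualitative claim of Eqs. (8)-(9) — that with a small filter variance the adder output variance dwarfs the convolution output variance — can be checked with a quick Monte-Carlo sketch. This is an illustration, not the paper's derivation; the sample count and the value $\mathrm{Var}[F] = 10^{-3}$ (one of the magnitudes the text mentions) are assumptions.

```python
import numpy as np

# Monte-Carlo comparison of output variances for one conv output vs. one
# adder output, with d = 3, c_in = 512 as in the ResNet example of the text.
rng = np.random.default_rng(2)
d, c_in = 3, 512
N = 2000
X = rng.normal(0.0, 1.0, size=(N, d * d * c_in))          # Var[X] = 1
F = rng.normal(0.0, np.sqrt(1e-3), size=(d * d * c_in,))  # Var[F] = 1e-3

y_cnn = (X * F).sum(axis=1)         # cross-correlation response, Eq. (1)
y_add = -np.abs(X - F).sum(axis=1)  # adder response, Eq. (2)

print(y_cnn.var())  # ~ d^2 * c_in * Var[X] * Var[F] ~ 4.6, per Eq. (8)
print(y_add.var())  # orders of magnitude larger, per Eq. (9)
```

The gap in the two printed variances is what motivates the adaptive learning rate scaling: after batch normalization, the gradients flowing into adder layers become correspondingly small.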
Algorithm 1 The feed-forward and back-propagation of adder neural networks.

Input: An initialized adder network N, its training set X with the corresponding labels Y, the global learning rate γ, and the hyper-parameter η.
1: repeat
2: Randomly select a batch {(x, y)} from X and Y;
3: Employ the AdderNet N on the mini-batch: x → N(x);
4: Calculate ∂Y/∂F and ∂Y/∂X for adder filters using Eq. (5) and Eq. (6);
5: Exploit the chain rule to generate the gradient of the parameters in N;
6: Calculate the adaptive learning rate αl for each adder
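The parameter-update step of Algorithm 1 can be sketched as follows. Note that the exact formula for the adaptive rate αl is cut off in this copy of the paper, so the normalization used below — αl = η·√k / ‖g‖₂, with k the number of elements in the layer — is an assumed instantiation of "adaptive learning rate according to the magnitude of each neuron's gradient", not the paper's verified equation.

```python
import numpy as np

def adder_train_step(F_layers, grads, global_lr, eta):
    # One update following Algorithm 1 (sketch). alpha_l = eta * sqrt(k) / ||g||_2
    # is an ASSUMED adaptive-rate formula (the original is truncated here): it
    # rescales each layer's tiny adder-layer gradient to a controlled step size.
    new_layers = []
    for F_l, g_l in zip(F_layers, grads):
        k = g_l.size
        alpha_l = eta * np.sqrt(k) / (np.linalg.norm(g_l) + 1e-12)
        new_layers.append(F_l - global_lr * alpha_l * g_l)
    return new_layers

rng = np.random.default_rng(3)
F_layers = [rng.standard_normal((3, 3, 8, 16)), rng.standard_normal((3, 3, 16, 32))]
grads = [0.01 * rng.standard_normal(F.shape) for F in F_layers]  # small raw gradients
updated = adder_train_step(F_layers, grads, global_lr=0.1, eta=0.1)

# Despite the small raw gradients, every layer receives a step whose RMS
# magnitude is global_lr * eta, independent of the gradient's scale.
for F_l, U_l in zip(F_layers, updated):
    step = U_l - F_l
    print(np.linalg.norm(step) / np.sqrt(step.size))
```

The design point this illustrates is the one made in Section 3.3: because adder layers produce outputs with much larger variance, their gradients after batch normalization are much smaller than in CNNs, so a per-layer rescaling keeps filter updates effective.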
quotesdbs_dbs47.pdfusesText_47[PDF] multiplicité des critères pour rendre compte de la structure sociale
[PDF] Multiplié deux identités remarquables
[PDF] multiplier
[PDF] Multiplier des equations par (-1) et Diviser des equations par (-2)
[PDF] Multiplier des fractions
[PDF] multiplier des heures
[PDF] multiplier des nombres en écriture fractionnaire
[PDF] multiplier des nombres en écritures fractionnaires
[PDF] Multiplier des nombres positifs en écriture fractionnaire
[PDF] Multiplier des nombres positifs en écriture fractionnaire - 4eme
[PDF] Multiplier des nombres positifs en écriture fractionnaire - Maths
[PDF] multiplier deux fractions
[PDF] multiplier deux racines carrées identiques
[PDF] Multiplier et diviser des nombres relatifs en écriture fractionnaire