AdderNet: Do We Really Need Multiplications in Deep Learning?

Hanting Chen 1,2*, Yunhe Wang 2*, Chunjing Xu 2†, Boxin Shi 3,4, Chao Xu 1, Qi Tian 2, Chang Xu 5

1 Key Lab of Machine Perception (MOE), Dept. of Machine Intelligence, Peking University.
2 Noah's Ark Lab, Huawei Technologies. 3 NELVT, Dept. of CS, Peking University. 4 Peng Cheng Laboratory.
5 School of Computer Science, Faculty of Engineering, The University of Sydney.
{yunhe.wang, xuchunjing, tian.qi1}@huawei.com

* Equal contribution. † Corresponding author.

Abstract

Compared with cheap addition operations, multiplication operations are of much higher computational complexity. The widely-used convolutions in deep neural networks are exactly cross-correlations that measure the similarity between the input feature and the convolution filters, which involves massive multiplications between float values. In this paper, we present adder networks (AdderNets) to trade these massive multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs. In AdderNets, we take the ℓ1-norm distance between filters and input feature as the output response. The influence of this new similarity measure on the optimization of the neural network has been thoroughly analyzed. To achieve a better performance, we develop a special back-propagation approach for AdderNets by investigating the full-precision gradient. We then propose an adaptive learning rate strategy to enhance the training procedure of AdderNets according to the magnitude of each neuron's gradient. As a result, the proposed AdderNets can achieve 74.9% Top-1 accuracy and 91.7% Top-5 accuracy using ResNet-50 on the ImageNet dataset without any multiplication in the convolutional layers. The codes are publicly available at: https://github.com/huawei-noah/AdderNet.

1. Introduction

Given the advent of Graphics Processing Units (GPUs), deep convolutional neural networks (CNNs) with billions of floating-number multiplications could receive speed-ups and make important strides in a large variety of computer vision tasks, e.g. image classification [26, 17], object detection [23], segmentation [19], and human face verification [32]. However, the high power consumption of these high-end GPU cards (e.g. 250W+ for GeForce RTX 2080 Ti) has blocked modern deep learning systems from being deployed on mobile devices, e.g. smartphones, cameras, and watches. Existing GPU cards are far from svelte and cannot be easily mounted on mobile devices. Though the GPU itself only takes up a small part of the card, many other hardware components are needed for support, e.g. memory chips, power circuitry, voltage regulators and other controller chips. It is therefore necessary to study efficient deep neural networks that can run with affordable computation resources on mobile devices.

Addition, subtraction, multiplication and division are the four most basic operations in mathematics. It is widely known that multiplication is slower than addition, but most of the computations in deep neural networks are multiplications between float-valued weights and float-valued activations during the forward inference. There are thus many papers on how to trade multiplications for additions in order to speed up deep learning. The seminal work [5] proposed BinaryConnect to force the network weights to be binary (e.g. -1 or 1), so that many multiply-accumulate operations can be replaced by simple accumulations. After that, Hubara et al. [15] proposed BNNs, which binarized not only weights but also activations in convolutional neural networks at run-time. Moreover, Rastegari et al. [22] introduced scale factors to approximate convolutions using binary operations and outperformed [15, 22] by large margins. Zhou et al. [39] utilized low bit-width gradients to accelerate the training of binarized networks. Cai et al. [4] proposed a half-wave Gaussian quantizer for the forward approximation, which achieved much closer performance to full-precision networks.

Though binarizing filters of deep neural networks significantly reduces the computation cost, the original recognition accuracy often cannot be preserved. In addition, the training procedure of binary networks is not stable and usually suffers from a slower convergence speed with a small learning rate.
Figure 1. Visualization of features in AdderNets and CNNs: (a) features in AdderNets, (b) features in CNNs. Features of CNNs in different classes are divided by their angles. In contrast, features of AdderNets tend to be clustered towards different class centers, since AdderNets use the ℓ1-norm to distinguish different classes. The visualization results suggest that the ℓ1-distance can serve as a similarity measure between the filter and the input feature in deep neural networks.

Convolutions in classical CNNs are actually cross-correlations that measure the similarity of two inputs. Researchers and developers are used to taking convolution as a default operation to extract features from visual data, and introduce various methods to accelerate the convolution, even if there is a risk of sacrificing network capability. But there is hardly any attempt to replace convolution with another, more efficient similarity measure that only involves additions. In fact, additions are of much lower computational complexity than multiplications. Thus, we are motivated to investigate the feasibility of replacing multiplications by additions in convolutional neural networks.

In this paper, we propose adder networks that maximize the use of addition while abandoning convolution operations. Given a series of small templates as "filters" in the neural network, the ℓ1-distance could be an efficient measure to summarize absolute differences between the input signal and the template, as shown in Figure 1. Since subtraction can be easily reduced to addition by using complement code, the ℓ1-distance could be a hardware-friendly measure that only involves additions, and naturally becomes an efficient alternative to the convolution for constructing neural networks. An improved back-propagation scheme with regularized gradients is designed to ensure sufficient updates of the templates and a better network convergence. The proposed AdderNets are deployed on several benchmarks, and experimental results demonstrate that AdderNets can achieve comparable recognition accuracy to conventional CNNs.

This paper is organized as follows. Section 2 investigates related works on network compression. Section 3 proposes adder networks, which replace the multiplications in the conventional convolution filters with additions. Section 4 evaluates the proposed AdderNets on various benchmark datasets and models, and Section 5 concludes this paper.

2. Related Works

To reduce the computational complexity of convolutional neural networks, a number of works have been proposed for eliminating useless calculations.

Pruning based methods aim to remove redundant weights to compress and accelerate the original network. Denton et al. [6] decomposed weight matrices of fully-connected layers into simple calculations by exploiting singular value decomposition (SVD). Han et al. [9] proposed discarding subtle weights in pre-trained deep networks to omit their original calculations without affecting the performance. Wang et al. [31] further converted convolution filters into the DCT frequency domain and eliminated more floating-number multiplications. In addition, Hu et al. [13] discarded redundant filters with less impact to directly reduce the computations brought by these filters. Luo et al. [21] discarded redundant filters according to the reconstruction error. Hu et al. [14] proposed Robust Dynamic Inference Networks (RDI-Nets), which allow each input to adaptively choose one of the multiple output layers to output its prediction. Wang et al. [29] proposed an E2-Training method, which can train deep neural networks with over 80% energy savings.

Instead of directly reducing the computational complexity of a pre-trained heavy neural network, a lot of works focused on designing novel blocks or operations to replace the conventional convolution filters. Howard et al. [12] designed MobileNet, which decomposes the conventional convolution filters into point-wise and depth-wise convolution filters with much fewer FLOPs. Zhang et al. [38] combined group convolutions [35] and a channel shuffle operation to build efficient neural networks with fewer computations. Wu et al. [34] presented a parameter-free "shift" operation with zero FLOPs and zero parameters to replace conventional filters and largely reduce the computational and storage cost of CNNs. Wang et al. [30] developed versatile convolution filters to generate more useful features using fewer calculations and parameters. Xu et al. [16] proposed perturbative neural networks, which replace convolution and instead compute the response as a weighted linear combination of non-linearly activated additive noise perturbed inputs. Han et al. [8] proposed GhostNet to generate more features from cheap operations and achieve state-of-the-art performance on lightweight architectures.

Besides eliminating redundant weights or filters in deep convolutional neural networks, Hinton et al. [11] proposed the knowledge distillation (KD) scheme, which transfers useful information from a heavy teacher network to a portable student network by minimizing the Kullback-Leibler divergence between their outputs. Besides mimicking the final outputs of the teacher networks, Romero et al. [25] exploit a hint layer to distill the information in features of the teacher network to the student network. You et al. [37] utilized multiple teachers to guide the training of the student network and achieve better performance. Yim et al. [36] regarded the relationship between features from two layers in the teacher network as a novel kind of knowledge and introduced the FSP (Flow of Solution Procedure) matrix to transfer this kind of information to the student network.

Nevertheless, the compressed networks produced by these algorithms still contain massive multiplications, which cost enormous computation resources. Subtractions and additions are of much lower computational complexity when compared with multiplications. However, they have not been widely investigated in deep neural networks, especially in the widely used convolutional networks. Therefore, we propose to minimize the number of multiplications in deep neural networks by replacing them with subtractions or additions.

3. Networks without Multiplication

Consider a filter $F \in \mathbb{R}^{d\times d\times c_{in}\times c_{out}}$ in an intermediate layer of the deep neural network, where the kernel size is $d$, the input channel is $c_{in}$ and the output channel is $c_{out}$. The input feature is defined as $X \in \mathbb{R}^{H\times W\times c_{in}}$, where $H$ and $W$ are the height and width of the feature, respectively. The output feature $Y$ indicates the similarity between the filter and the input feature,

$Y(m,n,t) = \sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} S\big(X(m+i,n+j,k),\, F(i,j,k,t)\big),$   (1)

where $S(\cdot,\cdot)$ is a pre-defined similarity measure. If cross-correlation is taken as the metric of distance, i.e. $S(x,y) = x \times y$, Eq. (1) becomes the convolution operation. Eq. (1) can also describe the calculation of a fully-connected layer when $d = 1$. In fact, there are many other metrics to measure the distance between the filter and the input feature. However, most of these metrics involve multiplications, which bring in more computational cost than additions.
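To make the role of $S(\cdot,\cdot)$ concrete, the following sketch (a minimal NumPy illustration we add for clarity, not the authors' implementation; the function and variable names are ours) evaluates Eq. (1) for an arbitrary similarity measure. Plugging in S(x, y) = x·y recovers an ordinary convolution layer, while the ℓ1 alternative of Section 3.1 only needs subtractions and additions:

import numpy as np

def similarity_output(X, F, S):
    # X: (H, W, c_in) input feature, F: (d, d, c_in, c_out) filters,
    # S: element-wise similarity measure; stride 1, no padding.
    d, _, c_in, c_out = F.shape
    H, W, _ = X.shape
    Y = np.zeros((H - d + 1, W - d + 1, c_out))
    for m in range(H - d + 1):
        for n in range(W - d + 1):
            patch = X[m:m + d, n:n + d, :]              # the region compared with the filter
            for t in range(c_out):
                Y[m, n, t] = S(patch, F[..., t]).sum()  # accumulate over i, j, k as in Eq. (1)
    return Y

# S(x, y) = x * y turns Eq. (1) into cross-correlation, i.e. the usual convolution layer.
cross_correlation = lambda x, f: x * f
# The l1 measure of Section 3.1 needs no multiplications between features and filters.
l1_similarity = lambda x, f: -np.abs(x - f)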

3.1. Adder Networks

We are therefore interested in deploying distance metrics that maximize the use of additions. The ℓ1 distance calculates the sum of the absolute differences of two points' vector representations, which contains no multiplication. Hence, by calculating the ℓ1 distance between the filter and the input feature, Eq. (1) can be reformulated as:

$Y(m,n,t) = -\sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} |X(m+i,n+j,k) - F(i,j,k,t)|.$   (2)

Addition is the major operation in the ℓ1 distance measure, since subtraction can be easily reduced to addition by using complement code. With the help of the ℓ1 distance, the similarity between the filters and features can be efficiently computed.
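As a concrete illustration of Eq. (2), the sketch below computes the output of an adder layer by unfolding the input into patches and taking the negative ℓ1 distance to every filter. This is our own PyTorch-style rendering for clarity (the name adder2d_forward is ours, and float tensors are used), not the authors' optimized implementation; the hardware saving comes from replacing multiply-accumulate units with accumulations of absolute differences.

import torch
import torch.nn.functional as nnf

def adder2d_forward(x, weight, stride=1, padding=0):
    # x: (N, c_in, H, W) input feature, weight: (c_out, c_in, d, d) adder filters.
    n, c_in, h, w = x.shape
    c_out, _, d, _ = weight.shape
    h_out = (h + 2 * padding - d) // stride + 1
    w_out = (w + 2 * padding - d) // stride + 1
    # Each row of `cols` is one d*d*c_in patch of the input feature map.
    cols = nnf.unfold(x, d, padding=padding, stride=stride).transpose(1, 2)            # (N, L, d*d*c_in)
    filt = weight.view(c_out, -1).unsqueeze(0).expand(n, -1, -1).contiguous()          # (N, c_out, d*d*c_in)
    # Eq. (2): the response is the negative l1 distance between each patch and each filter.
    out = -torch.cdist(cols, filt, p=1)                                                # (N, L, c_out)
    return out.transpose(1, 2).reshape(n, c_out, h_out, w_out)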

Although both the ℓ1 distance in Eq. (2) and the cross-correlation in Eq. (1) can measure the similarity between filters and inputs, there are some differences in their outputs. The output of a convolution filter, as a weighted summation of values in the input feature map, can be positive or negative, but the output of an adder filter is always negative. Hence, we resort to batch normalization for help, so that the output of adder layers is normalized to an appropriate range and all the activation functions used in conventional CNNs can then be used in the proposed AdderNets. Although the batch normalization layer involves multiplications, its computational cost is significantly lower than that of the convolutional layers and can be omitted. Considering a convolutional layer with a filter $F \in \mathbb{R}^{d\times d\times c_{in}\times c_{out}}$, an input $X \in \mathbb{R}^{H\times W\times c_{in}}$ and an output $Y \in \mathbb{R}^{H'\times W'\times c_{out}}$, the computation complexities of convolution and batch normalization are $\mathcal{O}(d^2 c_{in} c_{out} H W)$ and $\mathcal{O}(c_{out} H' W')$, respectively. In practice, given an input channel number $c_{in}=512$ and a kernel size $d=3$ in ResNet [10], we have $\frac{d^2 c_{in} c_{out} H W}{c_{out} H' W'} \approx 4068$. Since the batch normalization layer has been widely used in state-of-the-art convolutional neural networks, we can simply upgrade these networks into AdderNets by replacing their convolutional layers with adder layers to speed up the inference and reduce the energy cost.

Intuitively, Eq. (1) has a connection with template matching [3] in computer vision, which aims to find the parts of an image that match the template. $F$ in Eq. (1) actually works as a template, and we calculate its matching scores with different regions of the input feature $X$. Since various metrics can be utilized in template matching, it is natural that the ℓ1 distance can be utilized to replace the cross-correlation in Eq. (1). Note that Wang et al. [28] also discussed different metrics in deep networks. However, they focused on achieving high performance by employing complex metrics, while we focus on the ℓ1 distance to minimize the energy consumption.

3.2. Optimization

Neural networks utilize back-propagation to compute the gradients of filters and stochastic gradient descent to update the parameters. In CNNs, the partial derivative of the output features $Y$ with respect to the filters $F$ is calculated as:

$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = X(m+i,n+j,k),$   (3)

where $i \in [m, m+d]$ and $j \in [n, n+d]$. To achieve a better update of the parameters, it is necessary to derive informative gradients for SGD. In AdderNets, the partial derivative of $Y$ with respect to the filters $F$ is:

$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = \mathrm{sgn}\big(X(m+i,n+j,k) - F(i,j,k,t)\big),$   (4)

where $\mathrm{sgn}(\cdot)$ denotes the sign function and the value of the gradient can only take +1, 0, or -1.

Consider the derivative of the ℓ2-norm:

$\frac{\partial Y(m,n,t)}{\partial F(i,j,k,t)} = X(m+i,n+j,k) - F(i,j,k,t).$   (5)

Eq. (4) can therefore lead to a signSGD [2] update of the ℓ2-norm. However, signSGD almost never takes the direction of steepest descent, and the direction only gets worse as dimensionality grows [1]. It is unsuitable to optimize neural networks with a huge number of parameters using signSGD. Therefore, we propose using Eq. (5) to update the gradients in our AdderNets. The convergence of taking these two kinds of gradients will be further investigated in the supplementary material. Therefore, by utilizing the full-precision gradient, the filters can be updated precisely.

Besides the gradient of the filters, the gradient of the input features $X$ is also important for the update of parameters. Therefore, we also use the full-precision gradient (Eq. (5)) to calculate the gradient of $X$. However, the magnitude of the full-precision gradient may be larger than +1 or -1.

Denote the filters and inputs in layer $i$ as $F_i$ and $X_i$. Different from $\frac{\partial Y}{\partial F_i}$, which only affects the gradient of $F_i$ itself, a change of $\frac{\partial Y}{\partial X_i}$ would influence the gradient not only in layer $i$ but also in the layers before layer $i$, according to the gradient chain rule. If we used the full-precision gradient instead of the sign gradient of $\frac{\partial Y}{\partial X}$ for each layer, the magnitude of the gradient in the layers before this layer would be increased, and the discrepancy brought by using the full-precision gradient would be magnified. To this end, we clip the gradient of $X$ to $[-1, 1]$ to prevent gradients from exploding. Then the partial derivative of the output features $Y$ with respect to the input features $X$ is calculated as:

$\frac{\partial Y(m,n,t)}{\partial X(m+i,n+j,k)} = \mathrm{HT}\big(F(i,j,k,t) - X(m+i,n+j,k)\big),$   (6)

where $\mathrm{HT}(\cdot)$ denotes the HardTanh function:

$\mathrm{HT}(x) = \begin{cases} x & \text{if } -1 < x < 1, \\ 1 & x > 1, \\ -1 & x < -1. \end{cases}$   (7)
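The modified gradients of Eqs. (5)-(7) can be packaged as a custom autograd function. The sketch below is our own simplified rendering, assuming a PyTorch setting and treating the d = 1, fully-connected case of Eq. (1) for brevity; it uses the full-precision difference X - F as the gradient with respect to the filters and the HardTanh-clipped difference for the gradient with respect to the input.

import torch

class AdderFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        # x: (N, c_in), weight: (c_out, c_in); Eq. (2) with d = 1: Y = -sum_k |x_k - F_k|
        ctx.save_for_backward(x, weight)
        return -(x.unsqueeze(1) - weight.unsqueeze(0)).abs().sum(dim=2)   # (N, c_out)

    @staticmethod
    def backward(ctx, grad_out):                       # grad_out: (N, c_out)
        x, weight = ctx.saved_tensors
        diff = x.unsqueeze(1) - weight.unsqueeze(0)    # X - F, shape (N, c_out, c_in)
        # Eq. (5): full-precision gradient w.r.t. the filters, X - F, instead of sgn(X - F).
        grad_weight = (grad_out.unsqueeze(2) * diff).sum(dim=0)                       # (c_out, c_in)
        # Eqs. (6)-(7): gradient w.r.t. the input, clipped to [-1, 1] by HardTanh.
        grad_x = (grad_out.unsqueeze(2) * torch.clamp(-diff, -1.0, 1.0)).sum(dim=1)   # (N, c_in)
        return grad_x, grad_weight

# Usage: y = AdderFunction.apply(x, weight); batch normalization and an activation follow.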

3.3. Adaptive Learning Rate Scaling

Assuming that the filters and the input features are independent and identically distributed following normal distributions, the variance of the output of a conventional CNN can be roughly estimated as:

$\mathrm{Var}[Y_{CNN}] = \sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} \mathrm{Var}[X \times F] = d^2 c_{in}\, \mathrm{Var}[X]\, \mathrm{Var}[F].$   (8)

If the variance of the weight is $\mathrm{Var}[F] = \frac{1}{d^2 c_{in}}$, the variance of the output would be consistent with that of the input, which is beneficial for the information flow in the neural network. In contrast, for AdderNets, the variance of the output can be approximated as:

$\mathrm{Var}[Y_{AdderNet}] = \sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{c_{in}} \mathrm{Var}[|X - F|] = \sqrt{\frac{\pi}{2}}\, d^2 c_{in}\, (\mathrm{Var}[X] + \mathrm{Var}[F]),$   (9)

when $F$ and $X$ follow normal distributions. In practice, the variance of the weights $\mathrm{Var}[F]$ is usually very small [7], e.g. $10^{-3}$ or $10^{-4}$ in an ordinary CNN. Hence, compared with multiplying $\mathrm{Var}[X]$ by a small value in Eq. (8), the addition operation in Eq. (9) tends to bring in a much larger variance of outputs in AdderNets.

We next proceed to show the influence of this larger variance of outputs on the update of AdderNets. To promote the effectiveness of activation functions, we introduce batch normalization after each adder layer. Given an input $x$ over a mini-batch $\mathcal{B} = \{x_1, \dots, x_m\}$, the batch normalization layer can be denoted as:

$y = \gamma \frac{x - \mu_{\mathcal{B}}}{\sigma_{\mathcal{B}}} + \beta,$   (10)

where $\gamma$ and $\beta$ are learnable parameters, and $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are the mean and standard deviation over the mini-batch.
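A quick numerical check of Eqs. (8) and (9) (a self-contained simulation we add for illustration, not an experiment from the paper): with Var[F] = 1/(d²c_in) and unit-variance inputs, the multiplication-based output stays near unit variance while the adder output's variance grows with d²c_in, which is why batch normalization (Eq. (10)) follows every adder layer.

import numpy as np

rng = np.random.default_rng(0)
d, c_in, trials = 3, 512, 10000
k = d * d * c_in                                           # number of accumulated terms per output

X = rng.normal(0.0, 1.0, size=(trials, k))                 # Var[X] = 1
F = rng.normal(0.0, np.sqrt(1.0 / k), size=(trials, k))    # Var[F] = 1 / (d^2 c_in)

y_cnn = (X * F).sum(axis=1)                                # Eq. (8): variance ~ d^2 c_in Var[X] Var[F] = 1
y_adder = -np.abs(X - F).sum(axis=1)                       # Eq. (9): variance larger by orders of magnitude

print(y_cnn.var(), y_adder.var())                          # roughly 1 vs. a value in the thousands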


Algorithm 1 The feed-forward and back-propagation of adder neural networks.

Input: An initialized adder network N, its training set X and the corresponding labels Y, the global learning rate γ and the hyper-parameter η.

1: repeat
2: Randomly select a batch {(x, y)} from X and Y;
3: Employ the AdderNet N on the mini-batch: x → N(x);
4: Calculate the gradients ∂Y/∂F and ∂Y/∂X for the adder filters using Eq. (5) and Eq. (6);
5: Exploit the chain rule to generate the gradient of the parameters in N;
6: Calculate the adaptive learning rate αl for each adder layer
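Algorithm 1 maps onto a short training loop, sketched below under stated assumptions: adder layers are assumed to be wrapped in modules flagged with a hypothetical is_adder attribute, and because the exact expression for the adaptive learning rate αl (and the remaining steps of the algorithm) is not reproduced in this excerpt, the scaling by η·√k / ||g||₂ used in step 6 is our assumption rather than the paper's stated formula.

import torch

def train_step(model, batch, labels, criterion, global_lr, eta):
    outputs = model(batch)                          # step 3: run the AdderNet on the mini-batch
    loss = criterion(outputs, labels)
    model.zero_grad()
    loss.backward()                                 # steps 4-5: Eq. (5)/(6) gradients + chain rule
    with torch.no_grad():
        for layer in model.modules():
            if getattr(layer, "is_adder", False):   # hypothetical flag marking adder layers
                g = layer.weight.grad
                # step 6 (assumed form): per-layer rate from the gradient magnitude and eta
                alpha = eta * g.numel() ** 0.5 / (g.norm() + 1e-12)
                layer.weight -= global_lr * alpha * g
            elif getattr(layer, "weight", None) is not None and layer.weight.grad is not None:
                layer.weight -= global_lr * layer.weight.grad   # plain SGD for non-adder layers
    return loss.item()

Bias terms and batch normalization statistics are handled as usual and omitted here for brevity.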
