COTR: Correspondence Transformer for Matching Across Images
Wei Jiang¹, Eduard Trulls², Jan Hosang², Andrea Tagliasacchi²,³, Kwang Moo Yi¹
¹University of British Columbia, ²Google Research, ³University of Toronto
Abstract
We propose a novel framework for finding correspondences in images based on a deep neural network that, given two images and a query point in one of them, finds its correspondence in the other. By doing so, one has the option to query only the points of interest and retrieve sparse correspondences, or to query all points in an image and obtain dense mappings. Importantly, in order to capture both local and global priors, and to let our model relate between image regions using the most relevant among said priors, we realize our network using a transformer. At inference time, we apply our correspondence network by recursively zooming in around the estimates, yielding a multiscale pipeline able to provide highly-accurate correspondences. Our method significantly outperforms the state of the art on both sparse and dense correspondence problems on multiple datasets and tasks, ranging from wide-baseline stereo to optical flow, without any retraining for a specific dataset. We commit to releasing data, code, and all the tools necessary to train from scratch and ensure reproducibility.

1. Introduction
Finding correspondences across pairs of images is a fundamental task in computer vision, with applications ranging from camera calibration [22,28] to optical flow [32,15], Structure from Motion (SfM) [56,28], visual localization [55,53,36], point tracking [35,68], and human pose estimation [43,20]. Traditionally, two fundamental research directions exist for this problem. One is to extract sets of sparse keypoints from both images and match them in order to minimize an alignment metric [33,55,28]. The other is to interpret correspondence as a dense process, where every pixel in the first image maps to a pixel in the second image [32,60,77,72].

The divide between sparse and dense emerged naturally from the applications they were devised for. Sparse methods have largely been used to recover a single global camera motion, such as in wide-baseline stereo, using geometrical constraints. They rely on local features [34,74,44,13] and further prune the putative correspondences formed with them in a separate stage with sampling-based robust matchers [18,3,12], or their learned counterparts [75,7,76,64,54]. Dense methods, by contrast, usually model small temporal changes, such as optical flow in video sequences, and rely on local smoothness [35,24]. Exploiting context in this manner allows them to find correspondences at arbitrary locations, including seemingly texture-less areas.

Figure 1. The Correspondence Transformer. (a) COTR formulates the correspondence problem as a functional mapping from point x to point x', conditional on two input images I and I'. (b) COTR is capable of sparse matching under different motion types, including camera motion, multi-object motion, and object-pose changes. (c) COTR generates a smooth correspondence map for stereo pairs: given (c.1, c.2) as input, (c.3) shows the predicted dense correspondence map (color-coded 'x' channel), and (c.4) warps (c.2) onto (c.1) with the predicted correspondences.

In this work, we present a solution that bridges this divide: a novel network architecture that can express both forms of prior knowledge, global and local, and learn them implicitly from data. To achieve this, we leverage the inductive bias that densely connected networks possess in representing smooth functions [1,4,48] and use a transformer [73,10,14] to automatically control the nature of priors and learn how to utilize them through its attention mechanism. For example, ground-truth optical flow typically does not change smoothly across object boundaries, and simple (attention-agnostic) densely connected networks would have challenges in modelling such a discontinuous correspondence map, whereas a transformer would not. Moreover, transformers allow encoding the relationship between different locations of the input data, making them a natural fit for correspondence problems.

Specifically, we express the problem of finding correspondences between images I and I' in functional form, as x' = F_Φ(x | I, I'), where F_Φ is our neural network architecture, parameterized by Φ, x indexes a query location in I, and x' indexes its corresponding location in I'; see Figure 1. Differently from sparse methods, COTR can match arbitrary query points via this functional mapping, predicting only as many matches as desired. Differently from dense methods, COTR learns smoothness implicitly and can deal with large camera motion effectively.
Our work is the first to apply transformers to obtain accurate correspondences. Our main technical contributions are:
• we propose a functional correspondence architecture that combines the strengths of dense and sparse methods;
• we show how to apply our method recursively at multiple scales during inference in order to compute highly-accurate correspondences;
• we demonstrate that COTR achieves state-of-the-art performance in both dense and sparse correspondence problems on multiple datasets and tasks, without retraining;
• we substantiate our design choices and show that the transformer is key to our approach by replacing it with a simpler model, based on a Multi-Layer Perceptron (MLP).
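The functional interface described above can be made concrete with a small sketch. Here `predict` is a hypothetical stand-in for the trained network F_Φ (a fixed affine warp rather than a learned transformer); it illustrates how the very same function serves both sparse queries and dense mappings.

```python
import numpy as np

def predict(x, img_a, img_b):
    """Hypothetical stand-in for F_Phi(x | I, I'): maps normalized
    query points x of shape (N, 2) in image I to points in image I'.
    Here: a fixed affine warp instead of a learned transformer."""
    A = np.array([[0.9, 0.05], [-0.05, 0.9]])
    t = np.array([0.05, 0.02])
    return x @ A.T + t

img_a = np.zeros((256, 256, 3))  # placeholder images
img_b = np.zeros((256, 256, 3))

# Sparse: query only the points of interest.
sparse_queries = np.array([[0.25, 0.25], [0.75, 0.5]])
sparse_matches = predict(sparse_queries, img_a, img_b)

# Dense: query every point of a grid with the very same function.
ys, xs = np.mgrid[0:16, 0:16] / 15.0
dense_queries = np.stack([xs.ravel(), ys.ravel()], axis=1)
dense_matches = predict(dense_queries, img_a, img_b)
```

The key design point is that sparsity versus density is purely a property of the query set, not of the model.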
2. Related works
We review the literature on both sparse and dense matching, as well as works that utilize transformers for vision.

Sparse methods. Sparse methods generally consist of three stages: keypoint detection, feature description, and feature matching. Seminal detectors include DoG [34] and FAST [51]. Popular patch descriptors range from hand-crafted [34,9] to learned [42,66,17] ones. Learned feature extractors became popular with the introduction of LIFT [74], with many follow-ups [13,44,16,49,5,71]. Local features are designed with sparsity in mind, but have also been applied densely in some cases [67,32]. Learned local features are trained with intermediate metrics, such as descriptor distance or number of matches. Feature matching is treated as a separate stage, where descriptors are matched, followed by heuristics such as the ratio test, and robust matchers, which are key to dealing with high outlier ratios. The latter are the focus of much research, whether hand-crafted, following RANSAC [18,12,3], consensus- or motion-based heuristics [11,31,6,37], or learned [75,7,76,64]. The current state of the art builds on attentional graph neural networks [54]. Note that while some of these theoretically allow feature extraction and matching to be trained end to end, this avenue remains largely unexplored. We show that our method, which does not divide the pipeline into multiple stages and is learned end-to-end, can outperform these sparse methods.

Dense methods. Dense methods aim to solve optical flow. This typically implies small displacements, such as the motion between consecutive video frames. The classical Lucas-Kanade method [35] solves for correspondences over local neighbourhoods, while Horn-Schunck [24] imposes global smoothness. More modern algorithms still rely on these principles, with different algorithmic choices [59], or focus on larger displacements [8]. Estimating dense correspondences under large baselines and drastic appearance changes was not explored until methods such as DeMoN [72] and SfMLearner [77] appeared, which recovered both depth and camera motion; however, their performance fell somewhat short of sparse methods [75]. Neighbourhood Consensus Networks [50] explored 4D correlations; while powerful, this limits the image sizes they can tackle. More recently, DGC-Net [38] applied CNNs in a coarse-to-fine approach, trained on synthetic transformations, GLU-Net [69] combined global and local correlation layers in a feature pyramid, and GOCor [70] improved the feature correlation layers to disambiguate repeated patterns. We show that we outperform DGC-Net, GLU-Net and GOCor over multiple datasets, while retaining our ability to query individual points.

Attention mechanisms. The attention mechanism enables a neural network to focus on part of the input. Hard attention was pioneered by Spatial Transformers [26], which introduced a powerful differentiable sampler, and was later improved in [27]. Soft attention was pioneered by transformers [73], which have since become the de-facto standard in natural language processing; their application to vision tasks is still in its early stages. Recently, DETR [10] used Transformers for object detection, whereas ViT [14] applied them to image recognition. Our method is the first application of transformers to image correspondence problems.¹

Functional methods using deep learning. While the idea existed already, e.g. to generate images [58], using neural networks in functional form has recently gained much traction. DeepSDF [45] uses deep networks as a function that returns the signed distance field value of a query point. These ideas were recently extended by [21] to establish correspondences between incomplete shapes. While not directly related to image correspondence, this research has shown that functional methods can achieve state-of-the-art performance.

¹ A concurrent relevant work for feature-less image matching was proposed shortly after our work became public [63].

3. Method
We first formalize our problem (Section 3.1), then detail our architecture (Section 3.2), its recursive use at inference time (Section 3.3), and our implementation (Section 3.4).

3.1. Problem formulation
Let x ∈ [0,1]² be the normalized coordinates of the query point in image I, for which we wish to find the corresponding point, x' ∈ [0,1]², in image I'. We frame the problem of learning to find correspondences as that of finding the best set of parameters Φ for a parametric function F_Φ(x | I, I') minimizing

\[
\operatorname*{argmin}_{\Phi} \; \mathbb{E}_{(x, x', I, I') \sim \mathcal{D}} \left[ \mathcal{L}_{\mathrm{corr}} + \mathcal{L}_{\mathrm{cycle}} \right], \tag{1}
\]
\[
\mathcal{L}_{\mathrm{corr}} = \left\| x' - F_{\Phi}(x \mid I, I') \right\|_2^2, \tag{2}
\]
\[
\mathcal{L}_{\mathrm{cycle}} = \left\| x - F_{\Phi}\!\left( F_{\Phi}(x \mid I, I') \mid I', I \right) \right\|_2^2, \tag{3}
\]

where D is the training dataset of ground-truth correspondences, L_corr measures the correspondence estimation errors, and L_cycle enforces correspondences to be cycle-consistent.

3.2. Network architecture
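As a concrete illustration of the objective in Eqs. (1)-(3), a minimal numpy sketch; `F` is a hypothetical stand-in for the network (a fixed offset, not a trained model), and in actual training these per-point losses are averaged over the dataset D.

```python
import numpy as np

def F(x, img_a, img_b):
    """Hypothetical stand-in for F_Phi(x | I, I')."""
    return np.clip(x + np.array([0.05, -0.02]), 0.0, 1.0)

def losses(x, x_gt, img_a, img_b):
    """L_corr (Eq. 2) and L_cycle (Eq. 3) for one correspondence."""
    x_pred = F(x, img_a, img_b)
    l_corr = np.sum((x_gt - x_pred) ** 2)   # squared L2 error vs ground truth
    x_back = F(x_pred, img_b, img_a)        # map the prediction back to I
    l_cycle = np.sum((x - x_back) ** 2)     # cycle-consistency error
    return l_corr, l_cycle

img_a = img_b = np.zeros((256, 256, 3))  # placeholder images
l_corr, l_cycle = losses(np.array([0.4, 0.4]), np.array([0.45, 0.38]),
                         img_a, img_b)
```

Note how the cycle loss reuses the same function with the image arguments swapped, exactly as in Eq. (3).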
We implement F_Φ with a transformer. Our architecture, inspired by [10,14], is illustrated in Figure 2. We first crop and resize the input into a 256×256 image, and convert it into a downsampled feature map of size 16×16×256 with a shared CNN backbone, E. We then concatenate the representations for the two corresponding images side by side, forming a feature map of size 16×32×256, to which we add positional encoding P (with N = 256 channels) of the coordinates ζ:

\[
c = \left[ E(I), E(I') \right] + P(\zeta), \tag{4}
\]

where [·] denotes concatenation along the spatial dimension, a subtly important detail novel to our architecture that we discuss in greater depth later on. We then feed the context feature map c to a transformer encoder T_E, and interpret its results with a transformer decoder T_D, along with the query point x, encoded by P, the positional encoder used to generate ζ. We finally process the output of the transformer decoder with a fully connected layer D to obtain our estimate for the corresponding point, x':

\[
x' = F_{\Phi}(x \mid I, I') = D\!\left( T_D\!\left( P(x), T_E(c) \right) \right). \tag{5}
\]

For architectural details of each component please refer to the supplementary material.

Figure 2. The COTR architecture. We first process each image with a (shared) backbone CNN E to produce feature maps of size 16×16, which we then concatenate together, and add positional encodings to form our context feature map. The results are fed into a transformer T, along with the query point(s) x. The output of the transformer is decoded by a multi-layer perceptron D into correspondence(s) x'.

Importance of context concatenation. Concatenation of the feature maps along the spatial dimension is critical, as it allows the transformer encoder T_E to relate between locations within the image (self-attention), and across images (cross-attention). Note that, to allow the encoder to distinguish between pixels in the two images, we employ a single positional encoding for the entire concatenated feature map; see Fig. 2. We concatenate along the spatial dimension rather than the channel dimension, as the latter would create artificial relationships between features coming from the same pixel locations in each image. Concatenation allows the features in each map to be treated in a way that is similar to words in a sentence [73]. The encoder then associates and relates them to discover which ones to attend to given their context, which is arguably a more natural way to find correspondences.

Linear positional encoding. We found it critical to use a linear increase in frequency for the positional encoding, as opposed to the commonly used log-linear strategy [73,10], which made our optimization unstable; see the supplementary material. Hence, for a given location x = [x, y] we write

\[
P(\mathbf{x}) = \left[ p_1(\mathbf{x}), p_2(\mathbf{x}), \ldots, p_{N/4}(\mathbf{x}) \right], \tag{6}
\]
\[
p_k(\mathbf{x}) = \left[ \sin(k \pi \mathbf{x}), \cos(k \pi \mathbf{x}) \right], \tag{7}
\]

where N = 256 is the number of channels of the feature map. Note that p_k generates four values, so that the output of the encoder P is of size N.

Querying multiple points. We have introduced our framework as a function operating on a single query point, x. However, as shown in Fig. 2, extending it to multiple query points is straightforward. We can simply input multiple queries at once, which the transformer decoder T_D and the decoder D will translate into multiple coordinates.
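Querying several points at once amounts to batching the encoded coordinates. A sketch (assuming numpy, with random features standing in for the CNN backbone E) of the side-by-side concatenation of Eq. (4) together with the linear positional encoding of Eqs. (6)-(7):

```python
import numpy as np

N = 256  # positional encoding channels

def pos_enc(coords):
    """Linear-frequency positional encoding, Eqs. (6)-(7):
    p_k(x) = [sin(k*pi*x), cos(k*pi*x)] over both coordinates,
    k = 1 .. N/4, giving N channels per point."""
    coords = np.asarray(coords, dtype=np.float64)      # (..., 2)
    k = np.arange(1, N // 4 + 1)                       # linear, not log-linear
    phase = np.pi * coords[..., None, :] * k[:, None]  # (..., N/4, 2)
    enc = np.concatenate([np.sin(phase), np.cos(phase)], axis=-1)
    return enc.reshape(*coords.shape[:-1], N)

# Context feature map: two 16x16x256 maps concatenated along the
# spatial (width) axis as in Eq. (4), with ONE positional encoding
# spanning the whole concatenated map.
feat_a = np.random.randn(16, 16, N)   # stand-in for E(I)
feat_b = np.random.randn(16, 16, N)   # stand-in for E(I')
ctx = np.concatenate([feat_a, feat_b], axis=1)  # (16, 32, 256)

ys, xs = np.mgrid[0:16, 0:32] / np.array([15.0, 31.0])[:, None, None]
ctx = ctx + pos_enc(np.stack([xs, ys], axis=-1))

# Multiple query points are just a batch of encoded coordinates.
queries = pos_enc(np.array([[0.1, 0.2], [0.7, 0.3]]))
```

Because the encoding is shared across the concatenated map, positions in the second image receive distinct codes from those in the first, which is what lets the encoder tell the two images apart.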
Importantly, while doing so, we disallow self-attention among the query points in order to ensure that they are solved independently.

Figure 3. Recursive COTR at inference time. We obtain accurate correspondences by applying our functional approach recursively, zooming into the results of the previous iteration, and running the same network on the pair of zoomed-in crops. We gradually focus on the correct correspondence, with greater accuracy.

3.3. Inference
We next discuss how to apply our functional approach at inference time in order to obtain accurate correspondences.

Inference with recursive zoom-in. Applying the powerful transformer attention mechanism to vision problems comes at a cost: it requires heavily downsampled feature maps, which in our case naturally translates to poorly localized correspondences; see Section 4.6. We address this by exploiting the functional nature of our approach, applying our network F_Φ recursively. As shown in Fig. 3, we iteratively zoom into a previously estimated correspondence, on both images, in order to obtain a refined estimate. There is a trade-off between compute and the number of zoom-in steps. We ablated this carefully on the validation data and settled on a zoom-in factor of two at each step, with four zoom-in steps. It is worth noting that multiscale refinement is common in many computer vision algorithms [32,15], but thanks to our functional correspondence model, realizing such a multiscale inference process is not only possible, but also straightforward to implement.

Compensating for scale differences. While matching images recursively, one must account for a potential mismatch in scale between images. We achieve this by making the scale of the patch to crop proportional to the commonly visible regions in each image, which we compute on the first step, using the whole images. To extract this region, we compute the cycle consistency error at the coarsest level, for every pixel, and threshold it at τ_visible = 5 pixels on the 256×256 image; see Fig. 4. In subsequent stages, the zoom-ins, we simply adjust the crop sizes over I and I' so that their relationship is proportional to the sum of valid pixels (the unmasked pixels in Fig. 4).

Figure 4. Estimating scale by finding co-visible regions. We show two images we wish to put in correspondence, and the estimated regions in common; image locations with a high cycle-consistency error are masked out.

Dealing with images of arbitrary size. Our network expects images of fixed 256×256 shape. To process images of arbitrary size, in the initial step we simply resize (i.e. stretch) them to 256×256, and estimate the initial correspondences. In subsequent zoom-ins, we crop square patches from the original image around the estimated points, of a size commensurate with the current zoom level, and resize them to 256×256. While this may seem a limitation on images with non-standard aspect ratios, our approach performs well on KITTI, whose images are extremely wide (3.3:1). Moreover, we present a strategy to tile detections in Section 4.4.

Discarding erroneous correspondences. What should we do when a point we query is occluded or outside the viewport in the other image? Similarly to our strategy to compensate
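The recursive zoom-in of Section 3.3 can be sketched as follows. `predict` and `crop` are hypothetical stand-ins (an identity correspondence and a simplified square crop with boundary handling elided); the structure of the loop, a zoom factor of two over four steps around the previous estimate, follows the text above.

```python
import numpy as np

def predict(x, patch_a, patch_b):
    """Hypothetical stand-in for F_Phi on a pair of (cropped,
    resized) patches; returns the match of normalized point x."""
    return np.asarray(x, dtype=np.float64)  # dummy: identity mapping

def crop(img, center, extent):
    """Square crop of normalized size `extent` around `center`
    (boundary handling elided in this sketch)."""
    h, w = img.shape[:2]
    half = int(extent * min(h, w) / 2)
    cy, cx = int(center[1] * h), int(center[0] * w)
    return img[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]

def refine(x, img_a, img_b, steps=4, zoom=2.0):
    """Recursive zoom-in: re-query the same network on progressively
    tighter crops around the previous estimate."""
    x = np.asarray(x, dtype=np.float64)
    x_prime = predict(x, img_a, img_b)  # coarse pass on full images
    extent = 1.0
    for _ in range(steps):
        extent /= zoom
        patch_a = crop(img_a, x, extent)
        patch_b = crop(img_b, x_prime, extent)
        # The query sits at the patch center; map the refined local
        # match back into global coordinates of img_b.
        local = predict(np.array([0.5, 0.5]), patch_a, patch_b)
        x_prime = x_prime + (local - 0.5) * extent
    return x_prime
```

In the actual pipeline each crop pair is resized back to 256×256 before re-querying, so every recursion level sees the network's native input size.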