
COTR: Correspondence Transformer for Matching Across Images

Wei Jiang¹, Eduard Trulls², Jan Hosang², Andrea Tagliasacchi²,³, Kwang Moo Yi¹

¹University of British Columbia, ²Google Research, ³University of Toronto

Abstract

We propose a novel framework for finding correspondences in images based on a deep neural network that, given two images and a query point in one of them, finds its correspondence in the other. By doing so, one has the option to query only the points of interest and retrieve sparse correspondences, or to query all points in an image and obtain dense mappings. Importantly, in order to capture both local and global priors, and to let our model relate between image regions using the most relevant among said priors, we realize our network using a transformer. At inference time, we apply our correspondence network by recursively zooming in around the estimates, yielding a multiscale pipeline able to provide highly-accurate correspondences. Our method significantly outperforms the state of the art on both sparse and dense correspondence problems on multiple datasets and tasks, ranging from wide-baseline stereo to optical flow, without any retraining for a specific dataset. We commit to releasing data, code, and all the tools necessary to train from scratch and ensure reproducibility.

1. Introduction

Finding correspondences across pairs of images is a fundamental task in computer vision, with applications ranging from camera calibration [22,28] to optical flow [32,15], Structure from Motion (SfM) [56,28], visual localization [55,53,36], point tracking [35,68], and human pose estimation [43,20]. Traditionally, two fundamental research directions exist for this problem. One is to extract sets of sparse keypoints from both images and match them in order to minimize an alignment metric [33,55,28]. The other is to interpret correspondence as a dense process, where every pixel in the first image maps to a pixel in the second image [32,60,77,72].

The divide between sparse and dense emerged naturally from the applications they were devised for. Sparse methods have largely been used to recover a single global camera motion, such as in wide-baseline stereo, using geometrical constraints. They rely on local features [34,74,44,13] and further prune the putative correspondences formed with them in a separate stage with sampling-based robust matchers [18,3,12], or their learned counterparts [75,7,76,64,54]. Dense methods, by contrast, usually model small temporal changes, such as optical flow in video sequences, and rely on local smoothness [35,24]. Exploiting context in this manner allows them to find correspondences at arbitrary locations, including seemingly texture-less areas.

Figure 1. The Correspondence Transformer. (a) COTR formulates the correspondence problem as a functional mapping from point x to point x', conditional on two input images I and I'. (b) COTR is capable of sparse matching under different motion types, including camera motion, multi-object motion, and object-pose changes. (c) COTR generates a smooth correspondence map for stereo pairs: given (c.1, c.2) as input, (c.3) shows the predicted dense correspondence map (color-coded 'x' channel), and (c.4) warps (c.2) onto (c.1) with the predicted correspondences.

In this work, we present a solution that bridges this divide: a novel network architecture that can express both forms of prior knowledge, global and local, and learn them implicitly from data. To achieve this, we leverage the inductive bias that densely connected networks possess in representing smooth functions [1,4,48] and use a transformer [73,10,14] to automatically control the nature of priors and learn how to utilize them through its attention mechanism. For example, ground-truth optical flow typically does not change smoothly across object boundaries, and simple (attention-agnostic) densely connected networks would have challenges in modelling such a discontinuous correspondence map, whereas a transformer would not. Moreover, transformers allow encoding the relationship between different locations of the input data, making them a natural fit for correspondence problems.

Specifically, we express the problem of finding correspondences between images I and I' in functional form, as x' = F_Φ(x | I, I'), where F_Φ is our neural network architecture, parameterized by Φ, x indexes a query location in I, and x' indexes its corresponding location in I'; see Figure 1. Differently from sparse methods, COTR can match arbitrary query points via this functional mapping, predicting only as many matches as desired. Differently from dense methods, COTR learns smoothness implicitly and can deal with large camera motion effectively.

Our work is the first to apply transformers to obtain accurate correspondences. Our main technical contributions are:
• we propose a functional correspondence architecture that combines the strengths of dense and sparse methods;
• we show how to apply our method recursively at multiple scales during inference in order to compute highly-accurate correspondences;
• we demonstrate that COTR achieves state-of-the-art performance in both dense and sparse correspondence problems on multiple datasets and tasks, without retraining;
• we substantiate our design choices and show that the transformer is key to our approach by replacing it with a simpler model, based on a Multi-Layer Perceptron (MLP).

2. Related works

We review the literature on both sparse and dense matching, as well as works that utilize transformers for vision.

Sparse methods. Sparse methods generally consist of three stages: keypoint detection, feature description, and feature matching. Seminal detectors include DoG [34] and FAST [51]. Popular patch descriptors range from hand-crafted [34,9] to learned [42,66,17] ones. Learned feature extractors became popular with the introduction of LIFT [74], with many follow-ups [13,44,16,49,5,71]. Local features are designed with sparsity in mind, but have also been applied densely in some cases [67,32]. Learned local features are trained with intermediate metrics, such as descriptor distance or number of matches.

Feature matching is treated as a separate stage, where descriptors are matched, followed by heuristics such as the ratio test, and robust matchers, which are key to dealing with high outlier ratios. The latter are the focus of much research, whether hand-crafted, following RANSAC [18,12,3], consensus- or motion-based heuristics [11,31,6,37], or learned [75,7,76,64]. The current state of the art builds on attentional graph neural networks [54]. Note that while some of these theoretically allow feature extraction and matching to be trained end to end, this avenue remains largely unexplored. We show that our method, which does not divide the pipeline into multiple stages and is learned end-to-end, can outperform these sparse methods.

Dense methods. Dense methods aim to solve optical flow. This typically implies small displacements, such as the motion between consecutive video frames. The classical Lucas-Kanade method [35] solves for correspondences over local neighbourhoods, while Horn-Schunck [24] imposes global smoothness. More modern algorithms still rely on these principles, with different algorithmic choices [59], or focus on larger displacements [8]. Estimating dense correspondences under large baselines and drastic appearance changes was not explored until methods such as DeMoN [72] and SfMLearner [77] appeared, which recovered both depth and camera motion; however, their performance fell somewhat short of sparse methods [75]. Neighbourhood Consensus Networks [50] explored 4D correlations, which, while powerful, limits the image size they can tackle. More recently, DGC-Net [38] applied CNNs in a coarse-to-fine approach, trained on synthetic transformations, GLU-Net [69] combined global and local correlation layers in a feature pyramid, and GOCor [70] improved the feature correlation layers to disambiguate repeated patterns. We show that we outperform DGC-Net, GLU-Net and GOCor over multiple datasets, while retaining our ability to query individual points.

Attention mechanisms. The attention mechanism enables a neural network to focus on part of the input. Hard attention was pioneered by Spatial Transformers [26], which introduced a powerful differentiable sampler, and was later improved in [27]. Soft attention was pioneered by transformers [73], which have since become the de-facto standard in natural language processing; their application to vision tasks is still in its early stages. Recently, DETR [10] used Transformers for object detection, whereas ViT [14] applied them to image recognition. Our method is the first application of transformers to image correspondence problems.¹

Functional methods using deep learning. While the idea existed already, e.g. to generate images [58], using neural networks in functional form has recently gained much traction. DeepSDF [45] uses deep networks as a function that returns the signed distance field value of a query point. These ideas were recently extended by [21] to establish correspondences between incomplete shapes. While not directly related to image correspondence, this research has shown that functional methods can achieve state-of-the-art performance.

¹ A concurrent relevant work for feature-less image matching was proposed shortly after our work became public [63].

3. Method

We first formalize our problem (Section 3.1), then detail our architecture (Section 3.2), its recursive use at inference time (Section 3.3), and our implementation (Section 3.4).

3.1. Problem formulation

Let x ∈ [0,1]² be the normalized coordinates of the query point in image I, for which we wish to find the corresponding point, x' ∈ [0,1]², in image I'. We frame the problem of learning to find correspondences as that of finding the best set of parameters Φ for a parametric function F_Φ(x | I, I') minimizing

    argmin_Φ  E_{(x, x', I, I') ~ D} [ L_corr + L_cycle ],    (1)

    L_corr = || x' - F_Φ(x | I, I') ||_2^2,    (2)

    L_cycle = || x - F_Φ( F_Φ(x | I, I') | I', I ) ||_2^2,    (3)

where D is the training dataset of ground-truth correspondences, L_corr measures the correspondence estimation errors, and L_cycle enforces correspondences to be cycle-consistent.
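To make the objective concrete, here is a minimal NumPy sketch of the two losses in Eqs. (2)-(3). The `model(x, img_a, img_b)` callable is a hypothetical stand-in for F_Φ; the interface is our assumption for illustration, not the paper's API.

```python
import numpy as np

def corr_loss(model, x, x_gt, img_a, img_b):
    # Eq. (2): squared L2 error between the predicted
    # correspondence and the ground truth x_gt.
    pred = model(x, img_a, img_b)
    return float(np.sum((x_gt - pred) ** 2))

def cycle_loss(model, x, img_a, img_b):
    # Eq. (3): map x into I', then map the estimate back into I;
    # a cycle-consistent model should return to the start point.
    forward = model(x, img_a, img_b)
    backward = model(forward, img_b, img_a)
    return float(np.sum((x - backward) ** 2))

# Toy check: for an identity "model" both losses vanish.
identity = lambda x, a, b: x
q = np.array([0.25, 0.75])
assert corr_loss(identity, q, q, None, None) == 0.0
assert cycle_loss(identity, q, None, None) == 0.0
```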

3.2. Network architecture

We implement F_Φ with a transformer. Our architecture, inspired by [10,14], is illustrated in Figure 2. We first crop and resize the input into a 256×256 image, and convert it into a downsampled feature map of size 16×16×256 with a shared CNN backbone, E. We then concatenate the representations for the two corresponding images side by side, forming a feature map of size 16×32×256, to which we add positional encoding P (with N=256 channels) of the coordinates ζ:

    c = [E(I), E(I')] + P(ζ),    (4)

where [·] denotes concatenation along the spatial dimension, a subtly important detail novel to our architecture that we discuss in greater depth later on. We then feed the context feature map c to a transformer encoder T_E, and interpret its results with a transformer decoder T_D, along with the query point x, encoded by P, the positional encoder used to generate ζ. We finally process the output of the transformer decoder with a fully connected layer D to obtain our estimate for the corresponding point, x':

    x' = F_Φ(x | I, I') = D(T_D(P(x), T_E(c))).    (5)

For architectural details of each component please refer to the supplementary material.

Figure 2. The COTR architecture. We first process each image with a (shared) backbone CNN E to produce feature maps of size 16×16, which we then concatenate together, and add positional encodings to form our context feature map. The results are fed into a transformer T, along with the query point(s) x. The output of the transformer is decoded by a multi-layer perceptron D into correspondence(s) x'.

Importance of context concatenation. Concatenation of the feature maps along the spatial dimension is critical, as it allows the transformer encoder T_E to relate between locations within the image (self-attention), and across images (cross-attention). Note that, to allow the encoder to distinguish between pixels in the two images, we employ a single positional encoding for the entire concatenated feature map; see Figure 2. We concatenate along the spatial dimension rather than the channel dimension, as the latter would create artificial relationships between features coming from the same pixel locations in each image. Concatenation allows the features in each map to be treated in a way that is similar to words in a sentence [73]. The encoder then associates and relates them to discover which ones to attend to given their context, which is arguably a more natural way to find correspondences.

Linear positional encoding. We found it critical to use a linear increase in frequency for the positional encoding, as opposed to the commonly used log-linear strategy [73,10], which made our optimization unstable; see supplementary material. Hence, for a given location x = [x, y] we write

    P(x) = [p_1(x), p_2(x), ..., p_{N/4}(x)],    (6)

    p_k(x) = [sin(k π x^T), cos(k π x^T)],    (7)

where N=256 is the number of channels of the feature map. Note that p_k generates four values, so that the output of the encoder P is of size N.

Querying multiple points. We have introduced our framework as a function operating on a single query point, x. However, as shown in Figure 2, extending it to multiple query points is straightforward. We can simply input multiple queries at once, which the transformer decoder T_D and the decoder D will translate into multiple coordinates. Importantly, while doing so, we disallow self-attention among the query points in order to ensure that they are solved independently.

Figure 3. Recursive COTR at inference time. We obtain accurate correspondences by applying our functional approach recursively, zooming into the results of the previous iteration, and running the same network on the pair of zoomed-in crops. We gradually focus on the correct correspondence, with greater accuracy.
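The tensor shapes around Eq. (4) can be traced with a quick NumPy sketch; random arrays stand in for the backbone E and the positional encoding P, and the shapes follow the 256×256-input configuration described above.

```python
import numpy as np

rng = np.random.default_rng(0)
feat_a = rng.standard_normal((16, 16, 256))  # E(I):  16x16x256
feat_b = rng.standard_normal((16, 16, 256))  # E(I'): 16x16x256

# Concatenate along the *spatial* (width) dimension, not channels,
# so each image contributes its own "tokens" to the sequence.
context = np.concatenate([feat_a, feat_b], axis=1)
assert context.shape == (16, 32, 256)

# A single positional encoding covers the whole concatenated map,
# which lets the encoder tell the two images apart.
pos = rng.standard_normal((16, 32, 256))
tokens = (context + pos).reshape(-1, 256)  # sequence for the encoder T_E
assert tokens.shape == (512, 256)
```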
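The linear-frequency encoding of Eqs. (6)-(7) can be sketched as follows; the function name and array layout are our choice for illustration.

```python
import numpy as np

def linear_pos_encoding(x, n_channels=256):
    # x: normalized coordinates [x, y] in [0, 1]^2.
    # Each p_k contributes sin and cos for both coordinates
    # (4 values), so k runs from 1 to n_channels // 4, with the
    # frequency growing *linearly* in k (not log-linearly).
    ks = np.arange(1, n_channels // 4 + 1)
    angles = np.pi * ks[:, None] * np.asarray(x)[None, :]  # (N/4, 2)
    p_k = np.concatenate([np.sin(angles), np.cos(angles)], axis=1)
    return p_k.reshape(-1)  # (N,)

enc = linear_pos_encoding([0.5, 0.25])
assert enc.shape == (256,)
```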

3.3. Inference

We next discuss how to apply our functional approach at inference time in order to obtain accurate correspondences.

Inference with recursive zoom-in. Applying the powerful transformer attention mechanism to vision problems comes at a cost: it requires heavily downsampled feature maps, which in our case naturally translates to poorly localized correspondences; see Section 4.6. We address this by exploiting the functional nature of our approach, applying our network F_Φ recursively. As shown in Figure 3, we iteratively zoom into a previously estimated correspondence, on both images, in order to obtain a refined estimate. There is a trade-off between compute and the number of zoom-in steps. We ablated this carefully on the validation data and settled on a zoom-in factor of two at each step, with four zoom-in steps. It is worth noting that multiscale refinement is common in many computer vision algorithms [32,15], but thanks to our functional correspondence model, realizing such a multiscale inference process is not only possible, but also straightforward to implement.

Compensating for scale differences. While matching images recursively, one must account for a potential mismatch in scale between images. We achieve this by making the scale of the patch to crop proportional to the commonly visible regions in each image, which we compute on the first step, using the whole images. To extract this region, we compute the cycle consistency error at the coarsest level, for every pixel, and threshold it at τ_visible = 5 pixels on the 256×256 image; see Figure 4. In subsequent stages, the zoom-ins, we simply adjust the crop sizes over I and I' so that their relationship is proportional to the sum of valid pixels (the unmasked pixels in Figure 4).

Figure 4. Estimating scale by finding co-visible regions. We show two images we wish to put in correspondence, and the estimated regions in common; image locations with a high cycle-consistency error are masked out.

Dealing with images of arbitrary size. Our network expects images of fixed 256×256 shape. To process images of arbitrary size, in the initial step we simply resize (i.e. stretch) them to 256×256, and estimate the initial correspondences. In subsequent zoom-ins, we crop square patches from the original image around the estimated points, of a size commensurate with the current zoom level, and resize them to 256×256. While this may seem a limitation on images with non-standard aspect ratios, our approach performs well on KITTI, whose images are extremely wide (3.3:1). Moreover, we present a strategy to tile detections in Section 4.4.
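The zoom-in loop of Section 3.3 can be sketched as follows. Here `estimate` is a hypothetical stand-in for running the network on crops of relative size `scale` centred on the current estimates, and the toy model below only illustrates that the residual error shrinks with the crop size.

```python
import numpy as np

def recursive_refine(x, estimate, zoom_factor=2.0, steps=4):
    # Initial estimate on the full (stretched) 256x256 images,
    # then four zoom-in steps, halving the crop each time.
    scale = 1.0
    x_prime = estimate(x, None, scale)
    for _ in range(steps):
        scale /= zoom_factor
        x_prime = estimate(x, x_prime, scale)
    return x_prime

# Toy "network" whose localization error is proportional to the
# crop scale: after 4 halvings the residual error is 1/16th.
gt = np.array([0.6, 0.4])
toy = lambda x, x_prev, scale: gt + scale * np.array([0.08, -0.08])
out = recursive_refine(np.array([0.3, 0.3]), toy)
assert np.allclose(out, gt + np.array([0.08, -0.08]) / 16)
```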

Discarding erroneous correspondences. What should we do when a point we query is occluded or outside the viewport in the other image? Similarly to our strategy to compensate
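The co-visibility test used for scale compensation, thresholding the per-pixel cycle-consistency error at τ_visible = 5 pixels, can be sketched as follows; array shapes and function names are illustrative.

```python
import numpy as np

def covisible_mask(cycle_error_px, tau=5.0):
    # A coarse-level pixel is treated as co-visible when its
    # cycle-consistency error is below tau (= 5 px on 256x256).
    return cycle_error_px < tau

def crop_scale_ratio(mask_a, mask_b):
    # Relative crop sizes over I and I', proportional to the
    # number of valid (co-visible) pixels in each image.
    return mask_a.sum() / max(mask_b.sum(), 1)

err = np.array([[0.5, 3.0], [7.2, 12.0]])
mask = covisible_mask(err)
assert mask.tolist() == [[True, True], [False, False]]
assert crop_scale_ratio(mask, mask) == 1.0
```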