[PDF] High-Resolution Image Synthesis and Semantic Manipulation - CVF Open Access


High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs

Ting-Chun Wang

1 NVIDIA Corporation  2 UC Berkeley

Figure 1: We propose a generative adversarial framework for synthesizing 2048×1024 images from semantic label maps (lower left corner in (a)). Compared to previous work [5] (cascaded refinement network), our results express more natural textures and details. (a) Synthesized result. (b) Application: change label types. We can change labels in the original label map to create new scenes, like replacing trees with buildings. (c) Application: edit object appearance. Our framework also allows the user to edit the appearance of individual objects in the scene, e.g. changing the color of a car or the texture of a road. Please visit our website for more side-by-side comparisons as well as interactive editing demos.

Abstract

We present a new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs). Conditional GANs have enabled a variety of applications, but the results are often limited to low resolution and still far from realistic. In this work, we generate 2048×1024 visually appealing results with a novel adversarial loss, as well as new multi-scale generator and discriminator architectures. Furthermore, we extend our framework to interactive visual manipulation with two additional features. First, we incorporate object instance segmentation information, which enables object manipulations such as removing/adding objects and changing the object category. Second, we propose a method to generate diverse results given the same input, allowing users to edit the object appearance interactively. Human opinion studies demonstrate that our method significantly outperforms existing methods, advancing both the quality and the resolution of deep image synthesis and editing.


1. Introduction

Photo-realistic image rendering using standard graphics techniques is involved, since geometry, materials, and light transport must be simulated explicitly. Although existing graphics algorithms excel at the task, building and editing virtual environments is expensive and time-consuming. That is because we have to model every aspect of the world explicitly. If we were able to render photo-realistic images using a model learned from data, we could turn the process of graphics rendering into a model learning and inference problem. Then, we could simplify the process of creating new virtual worlds by training models on new datasets. We could even make it easier to customize environments by allowing users to simply specify the overall semantic structure rather than modeling geometry, materials, or lighting.

In this paper, we discuss a new approach that produces high-resolution images from semantic label maps. This method has a wide range of applications. For example, we can use it to create synthetic training data for training visual recognition algorithms, since it is much easier to create semantic labels for desired scenarios than to generate training images. Using semantic segmentation methods, we can transform images into a semantic label domain, edit the objects in the label domain, and then transform them back to the image domain. This method also gives us new tools for changing the appearance of existing objects.

To synthesize images from semantic labels, one can use the pix2pix method, an image-to-image translation framework [21] which leverages generative adversarial networks (GANs) [16] in a conditional setting. Recently, Chen and Koltun [5] suggest that adversarial training might be unstable and prone to failure for high-resolution image generation tasks. Instead, they adopt a modified perceptual loss [11,13,22] to synthesize images, which are high-resolution but often lack fine details and realistic textures.

Here we address two main issues of the above state-of-the-art methods: (1) the difficulty of generating high-resolution images with GANs [21] and (2) the lack of details and realistic textures in the previous high-resolution results [5]. We show that through a new, robust adversarial learning objective together with new multi-scale generator and discriminator architectures, we can synthesize photo-realistic images at 2048×1024 resolution, which are more visually appealing than those computed by previous methods [5,21]. We first obtain our results with adversarial training only, without relying on any hand-crafted losses [43] or pre-trained networks (e.g. VGGNet [47]) for perceptual losses [11,22] (Figs. 7c, 9b). Then we show that adding perceptual losses from pre-trained networks [47] can slightly improve the results (Figs. 7d, 9c) if a pre-trained network is available. Both results outperform previous works substantially in terms of image quality.

Figure 2: Example results of using our framework for translating edges to high-resolution natural photos, using CelebA-HQ [26] and internet cat images.

Furthermore, to support interactive semantic manipulation, we extend our method in two directions. First, we use instance-level object segmentation information, which can separate different object instances within the same category. This enables flexible object manipulations, such as adding/removing objects and changing object types. Second, we propose a method to generate diverse results given the same input label map, allowing the user to edit the appearance of the same object interactively.

We compare against state-of-the-art visual synthesis systems [5,21], and show that our method outperforms these approaches regarding both quantitative evaluations and human perception studies. We also perform an ablation study regarding the training objectives and the importance of instance-level segmentation information. In addition to semantic manipulation, we test our method on edge2photo applications (Fig. 2), which shows the generalizability of our approach. Our code and data are available at our website. Please check out the full version of our paper at arXiv.

2. Related Work

Generative adversarial networks Generative adversarial networks (GANs) [16] aim to model the natural image distribution by forcing the generated samples to be indistinguishable from natural images. GANs enable a wide variety of applications such as image generation [1,41,60], representation learning [44], image manipulation [62], object detection [32], and video applications [37,50,52]. Various coarse-to-fine schemes [4] have been proposed [9,19,26,55] to synthesize larger images (e.g. 256×256) in an unconditional setting. Inspired by their successes, we propose a new coarse-to-fine generator and multi-scale discriminator architectures suitable for conditional image generation at a much higher resolution.

Image-to-image translation Many researchers have leveraged adversarial learning for image-to-image translation [21], whose goal is to translate an input image from one domain to another domain given input-output image pairs as training data. Compared to the L1 loss, which often leads to blurry images [21,22], the adversarial loss [16] has become a popular choice for many image-to-image tasks [10,24,25,31,40,45,53,58,64]. The reason is that


the discriminator can learn a trainable loss function and automatically adapt to the differences between the generated and real images in the target domain. For example, the recent pix2pix framework [21] used image-conditional GANs [38] for different applications, such as transforming Google maps to satellite views and generating cats from user sketches. Various methods have also been proposed to learn an image-to-image translation in the absence of training pairs [2,33,34,46,49,51,54,63].

Recently, Chen and Koltun [5] suggest that it might be hard for conditional GANs to generate high-resolution images due to the training instability and optimization issues. They instead use an objective based on a perceptual loss [11,13,22] and produce the first model that can synthesize 2048×1024 images. The generated results are high-resolution but often lack fine details and realistic textures. Our method is motivated by their success. We show that using our new objective function as well as novel multi-scale generators and discriminators, we not only largely stabilize the training of conditional GANs on high-resolution images, but also achieve significantly better results compared to Chen and Koltun [5]. Side-by-side comparisons clearly show our advantage (Figs. 1, 7, 8, 9).

Deep visual manipulation Recently, deep neural networks have obtained promising results in various image processing tasks, such as style transfer [13], inpainting [40], colorization [56], and restoration [14]. However, most of these works lack an interface for users to adjust the current result or explore the output space. To address this issue, Zhu et al. [62] developed an optimization method for editing the object appearance based on the priors learned by GANs. Recent works [21,45,57] also provide user interfaces for creating novel imagery from low-level cues such as color and sketch. All of the prior works report results on low-resolution images. Our system shares the same spirit as this past work, but we focus on object-level semantic editing, allowing users to interact with the entire scene and manipulate individual objects in the image. As a result, users can quickly create a novel scene with minimal effort. Our interface is inspired by prior data-driven graphics systems [6,23,28], but our system allows more flexible manipulations and produces high-res results in real-time.

3. Instance-Level Image Synthesis

We propose a conditional adversarial framework for generating high-resolution photo-realistic images from semantic label maps. We first review our baseline model pix2pix (Sec. 3.1). We then describe how we increase the photo-realism and resolution of the results with our improved objective function and network design (Sec. 3.2). Next, we use additional instance-level object semantic information to further improve the image quality (Sec. 3.3). Finally, we introduce an instance-level feature embedding scheme to better handle the multi-modal nature of image synthesis, which enables interactive object editing (Sec. 3.4).

3.1. The pix2pix Baseline

The pix2pix method [21] is a conditional GAN framework for image-to-image translation. It consists of a generator G and a discriminator D. For our task, the objective of the generator G is to translate semantic label maps to realistic-looking images, while the discriminator D aims to distinguish real images from translated ones. The framework operates in a supervised setting. In other words, the training dataset is given as a set of pairs of corresponding images {(s_i, x_i)}, where s_i is a semantic label map and x_i is a corresponding natural photo. Conditional GANs aim to model the conditional distribution of real images given the input semantic label maps via the following minimax game: min_G max_D L_GAN(G, D), where the objective function L_GAN(G, D) is given by

E_{(s,x)}[log D(s, x)] + E_s[log(1 − D(s, G(s)))].  (1)
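Objective (1) can be sketched numerically as follows, assuming a toy discriminator that maps logits to probabilities through a sigmoid; the function and variable names here are illustrative, not from the authors' code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gan_objective(d_real_logits, d_fake_logits):
    """Monte-Carlo estimate of Eq. (1):
    E[log D(s, x)] + E[log(1 - D(s, G(s)))].
    `d_real_logits` are discriminator scores on (label map, real photo)
    pairs, `d_fake_logits` on (label map, generated photo) pairs."""
    real_term = np.mean(np.log(sigmoid(d_real_logits)))
    fake_term = np.mean(np.log(1.0 - sigmoid(d_fake_logits)))
    return real_term + fake_term

# D maximizes this value; G minimizes it through the fake term.
rng = np.random.default_rng(0)
value = gan_objective(rng.normal(2.0, 1.0, 64),   # D fairly confident on real pairs
                      rng.normal(-2.0, 1.0, 64))  # D fairly confident on fakes
```

A confident, correct discriminator drives both log terms toward 0 (their maximum), while a discriminator that is fooled on every sample drives the objective strongly negative.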

The pix2pix method adopts U-Net [42] as the generator and a patch-based fully convolutional network [35] as the discriminator. The input to the discriminator is a channel-wise concatenation of the semantic label map and the corresponding image. The resolution of the generated images is up to 256×256. We tested directly applying the pix2pix framework to generate high-resolution images, but found the training unstable and the quality of generated images unsatisfactory. We therefore describe how we improve the pix2pix framework in the next subsection.
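The channel-wise concatenation forming the discriminator input can be sketched with plain arrays; the shapes and the number of label classes below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Illustrative shapes: a one-hot semantic label map with C_label classes
# and an RGB image, both H×W, in channel-first (C, H, W) layout.
H, W, C_label = 256, 256, 35
label_map = np.zeros((C_label, H, W), dtype=np.float32)
label_map[7] = 1.0                       # every pixel set to class 7, for brevity
image = np.random.rand(3, H, W).astype(np.float32)

# The discriminator sees the pair stacked along the channel axis, so it
# judges whether the image matches the label map rather than the image alone.
disc_input = np.concatenate([label_map, image], axis=0)
```

Stacking along channels keeps spatial alignment intact, which is what lets a patch-based discriminator penalize label/image mismatches locally.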

3.2. Improving Photorealism and Resolution

We improve the pix2pix framework by using a coarse-to-fine generator, a multi-scale discriminator architecture, and a robust adversarial learning objective function.

Coarse-to-fine generator We decompose the generator into two sub-networks: G1 and G2. We term G1 the global generator network and G2 the local enhancer network. The generator is then given by the tuple G = {G1, G2} as visualized in Fig. 3. The global generator network operates at a resolution of 1024×512, and the local enhancer network outputs an image with a resolution that is 4× the output size of the previous one (2× along each image dimension). For synthesizing images at an even higher resolution, additional local enhancer networks could be utilized. For example, the output image resolution of the generator G = {G1, G2} is 2048×1024, and the output image resolution of G = {G1, G2, G3} is 4096×2048.

Figure 3: Network architecture of our generator. We first train a residual network G1 on lower resolution images. Then, another residual network G2 is appended to G1 and the two networks are trained jointly on high resolution images. Specifically, the input to the residual blocks in G2 is the element-wise sum of the feature map from G2 and the last feature map from G1.

Our global generator is built on the architecture proposed by Johnson et al. [22], which has been proven successful for neural style transfer on images up to 512×512. It consists of 3 components: a convolutional front-end G1^(F), a set of residual blocks G1^(R) [18], and a transposed convolutional back-end G1^(B). A semantic label map of resolution 1024×512 is passed through the 3 components sequentially to output an image of resolution 1024×512.

The local enhancer network also consists of 3 components: a convolutional front-end G2^(F), a set of residual blocks G2^(R), and a transposed convolutional back-end G2^(B). The resolution of the input label map to G2 is 2048×1024. Different from the global generator network, the input to the residual block G2^(R) is the element-wise sum of two feature maps: the output feature map of G2^(F), and the last feature map of the back-end of the global generator network G1^(B).
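The resolution bookkeeping and the element-wise feature fusion described above can be sketched with plain arrays. The channel count and feature-map sizes below are illustrative assumptions; the real networks compute these maps with learned convolutions:

```python
import numpy as np

def output_resolution(base_wh, num_enhancers):
    """Each appended local enhancer doubles both image dimensions
    (4x the pixel count) over the previous stage's output."""
    w, h = base_wh
    return (w * 2**num_enhancers, h * 2**num_enhancers)

# G = {G1}: 1024×512; G = {G1, G2}: 2048×1024; G = {G1, G2, G3}: 4096×2048.
two_net = output_resolution((1024, 512), 1)
three_net = output_resolution((1024, 512), 2)

# Element-wise fusion feeding G2's residual blocks: the front-end feature
# map of G2 is summed with the last feature map of G1's back-end, so both
# must share the same (assumed) shape.
C, H, W = 64, 512, 1024
g2_front_end = np.random.rand(C, H, W).astype(np.float32)
g1_back_end = np.random.rand(C, H, W).astype(np.float32)
residual_input = g2_front_end + g1_back_end   # element-wise sum, same shape
```

The sum (rather than concatenation) lets G2 start from the globally coherent features of G1 and only learn the residual high-frequency detail.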