High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs

Ting-Chun Wang
1NVIDIA Corporation   2UC Berkeley
Figure 1: We propose a generative adversarial framework for synthesizing 2048×1024 images from semantic label maps (lower left corner in (a)). (a) Synthesized result: compared to previous work [5] (cascaded refinement network), our results express more natural textures and details. (b) Application: change label types. We can change labels in the original label map to create new scenes, like replacing trees with buildings. (c) Application: edit object appearance. Our framework also allows the user to edit the appearance of individual objects in the scene, e.g. changing the color of a car or the texture of a road. Please visit our website for more side-by-side comparisons as well as interactive editing demos.

Abstract
We present a new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs). Conditional GANs have enabled a variety of applications, but the results are often limited to low-resolution and still far from realistic. In this work, we generate 2048×1024 visually appealing results with a novel adversarial loss, as well as new multi-scale generator and discriminator architectures. Furthermore, we extend our framework to interactive visual manipulation with two additional features. First, we incorporate object instance segmentation information, which enables object manipulations such as removing/adding objects and changing the object category. Second, we propose a method to generate diverse results given the same input, allowing users to edit the object appearance interactively. Human opinion studies demonstrate that our method significantly outperforms existing methods, advancing both the quality and the resolution of deep image synthesis and editing.

1. Introduction
Photo-realistic image rendering using standard graphics techniques is involved, since geometry, materials, and light transport must be simulated explicitly. Although existing graphics algorithms excel at the task, building and editing virtual environments is expensive and time-consuming. That is because we have to model every aspect of the world explicitly. If we were able to render photo-realistic images using a model learned from data, we could turn the process of graphics rendering into a model learning and inference problem. Then, we could simplify the process of creating new virtual worlds by training models on new datasets. We could even make it easier to customize environments by allowing users to simply specify the overall semantic structure rather than modeling geometry, materials, or lighting.

In this paper, we discuss a new approach that produces high-resolution images from semantic label maps. This method has a wide range of applications. For example, we can use it to create synthetic training data for training visual recognition algorithms, since it is much easier to create semantic labels for desired scenarios than to generate training images. Using semantic segmentation methods, we can transform images into a semantic label domain, edit the objects in the label domain, and then transform them back to the image domain. This method also gives us new tools for changing the appearance of existing objects.

To synthesize images from semantic labels, one can use the pix2pix method, an image-to-image translation framework [21] which leverages generative adversarial networks (GANs) [16] in a conditional setting. Recently, Chen and Koltun [5] suggest that adversarial training might be unstable and prone to failure for high-resolution image generation tasks. Instead, they adopt a modified perceptual loss [11, 13, 22] to synthesize images, which are high-resolution but often lack fine details and realistic textures.

Here we address two main issues of the above state-of-the-art methods: (1) the difficulty of generating high-resolution images with GANs [21] and (2) the lack of details and realistic textures in the previous high-resolution results [5]. We show that through a new, robust adversarial learning objective together with new multi-scale generator and discriminator architectures, we can synthesize photo-realistic images at 2048×1024 resolution, which are more visually appealing than those computed by previous methods [5, 21]. We first obtain our results with adversarial training only, without relying on any hand-crafted losses [43] or pre-trained networks (e.g. VGGNet [47]) for perceptual losses [11, 22] (Figs. 7c, 9b). Then we show that adding perceptual losses from pre-trained networks [47] can slightly improve the results in some cases (Figs. 7d, 9c), if a pre-trained network is available. Both results outperform previous works substantially in terms of image quality.

Figure 2: Example results of using our framework for translating edges to high-resolution natural photos, using CelebA-HQ [26] and internet cat images.

Furthermore, to support interactive semantic manipulation, we extend our method in two directions. First, we use instance-level object segmentation information, which can separate different object instances within the same category. This enables flexible object manipulations, such as adding/removing objects and changing object types. Second, we propose a method to generate diverse results given the same input label map, allowing the user to edit the appearance of the same object interactively.

We compare against state-of-the-art visual synthesis systems [5, 21], and show that our method outperforms these approaches regarding both quantitative evaluations and human perception studies. We also perform an ablation study regarding the training objectives and the importance of instance-level segmentation information. In addition to semantic manipulation, we test our method on edge2photo applications (Fig. 2), which shows the generalizability of our approach. Our code and data are available at our website. Please check out the full version of our paper at arXiv.

2. Related Work
Generative adversarial networks (GANs) [16] aim to model the natural image distribution by forcing the generated samples to be indistinguishable from natural images. GANs enable a wide variety of applications such as image generation [1, 41, 60], representation learning [44], image manipulation [62], object detection [32], and video applications [37, 50, 52]. Various coarse-to-fine schemes [4] have been proposed [9, 19, 26, 55] to synthesize larger images (e.g. 256×256) in an unconditional setting. Inspired by their successes, we propose a new coarse-to-fine generator and multi-scale discriminator architectures suitable for conditional image generation at a much higher resolution.

Image-to-image translation  Many researchers leverage adversarial learning for image-to-image translation [21], whose goal is to translate an input image from one domain to another domain given input-output image pairs as training data. Compared to L1 loss, which often leads to blurry images [21, 22], the adversarial loss [16] has become a popular choice for many image-to-image tasks [10, 24, 25, 31, 40, 45, 53, 58, 64]. The reason is that the discriminator can learn a trainable loss function and automatically adapt to the differences between the generated and real images in the target domain. For example, the recent pix2pix framework [21] used image-conditional GANs [38] for different applications, such as transforming Google maps to satellite views and generating cats from user sketches. Various methods have also been proposed to learn an image-to-image translation in the absence of training pairs [2, 33, 34, 46, 49, 51, 54, 63].
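To make the trade-off above concrete, the two loss terms being compared can be written down in a few lines. This is a minimal NumPy sketch of the loss arithmetic only (the function names and toy logit inputs are our own illustration; real systems produce the logits with convolutional discriminators and optimize with a deep learning framework):

```python
import numpy as np

def sigmoid(z):
    """Map a raw discriminator logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def gan_losses(d_real_logit, d_fake_logit):
    """Losses for the standard GAN game: D maximizes
    E[log D(real)] + E[log(1 - D(fake))], while G (here in its common
    non-saturating form) maximizes E[log D(fake)]."""
    p_real = sigmoid(d_real_logit)
    p_fake = sigmoid(d_fake_logit)
    d_loss = -(np.log(p_real) + np.log(1.0 - p_fake)).mean()
    g_loss = -np.log(p_fake).mean()
    return d_loss, g_loss

def l1_loss(fake, real):
    """Pixel-wise L1 reconstruction term; used alone it tends to
    produce blurry images, which motivates the adversarial loss."""
    return np.abs(fake - real).mean()
```

A fixed L1 penalty treats every pixel error the same, whereas the adversarial term is a trainable critic that adapts to whatever artifacts the generator currently produces.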
Recently, Chen and Koltun [5] suggest that it might be hard for conditional GANs to generate high-resolution images due to the training instability and optimization issues. Instead, they employ a direct regression objective based on a perceptual loss [11, 13, 22] and produce the first model that can synthesize 2048×1024 images. The generated results are high-resolution but often lack fine details and realistic textures. Our method is motivated by their success. We show that using our new objective function as well as novel multi-scale generators and discriminators, we not only largely stabilize the training of conditional GANs on high-resolution images, but also achieve significantly better results compared to Chen and Koltun [5]. Side-by-side comparisons clearly show our advantage (Figs. 1, 7, 8, 9).
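A perceptual loss of the kind referenced above compares images in the feature space of a fixed pre-trained network (e.g. VGG [47]) rather than in pixel space. The sketch below assumes the per-layer features have already been extracted; the layer count and weights are placeholders, not the configuration of any cited method:

```python
import numpy as np

def perceptual_loss(feats_fake, feats_real, weights):
    """Weighted L1 distance between per-layer feature maps of a fixed
    pre-trained network, computed for the generated and real image.
    feats_fake / feats_real: lists of same-shaped feature arrays,
    one per chosen layer; weights: one scalar per layer."""
    return sum(w * np.abs(ff - fr).mean()
               for w, ff, fr in zip(weights, feats_fake, feats_real))
```

Because the comparison happens in learned feature space, matching coarse structure is rewarded even when exact pixel values differ, which is why such losses train stably but can wash out fine texture.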
Deep visual manipulation  Recently, deep neural networks have obtained promising results in various image processing tasks, such as style transfer [13], inpainting [40], colorization [56], and restoration [14]. However, most of these works lack an interface for users to adjust the current result or explore the output space. To address this issue, Zhu et al. [62] developed an optimization method for editing the object appearance based on the priors learned by GANs. Recent works [21, 45, 57] also provide user interfaces for creating novel imagery from low-level cues such as color and sketch. All of the prior works report results on low-resolution images. Our system shares the same spirit as this past work, but we focus on object-level semantic editing, allowing users to interact with the entire scene and manipulate individual objects in the image. As a result, users can quickly create a novel scene with minimal effort. Our interface is inspired by prior data-driven graphics systems [6, 23, 28]. But our system allows more flexible manipulations and produces high-res results in real-time.

3. Instance-Level Image Synthesis
We propose a conditional adversarial framework for generating high-resolution photo-realistic images from semantic label maps. We first review our baseline model pix2pix (Sec. 3.1). We then describe how we increase the photo-realism and resolution of the results with our improved objective function and network design (Sec. 3.2). Next, we use additional instance-level object semantic information to further improve the image quality (Sec. 3.3). Finally, we introduce an instance-level feature embedding scheme to better handle the multi-modal nature of image synthesis, which enables interactive object editing (Sec. 3.4).

3.1. The pix2pix Baseline
The pix2pix method [21] is a conditional GAN framework for image-to-image translation. It consists of a generator G and a discriminator D. For our task, the objective of the generator G is to translate semantic label maps to realistic-looking images, while the discriminator D aims to distinguish real images from the translated ones. The framework operates in a supervised setting. In other words, the training dataset is given as a set of pairs of corresponding images {(si, xi)}, where si is a semantic label map and xi is a corresponding natural photo. Conditional GANs aim to model the conditional distribution of real images given the input semantic label maps via the following minimax game: min_G max_D L_GAN(G, D), where the objective function L_GAN(G, D) is given by

E_(s,x)[log D(s, x)] + E_s[log(1 − D(s, G(s)))].    (1)

The pix2pix method adopts U-Net [42] as the generator and a patch-based fully convolutional network [35] as the discriminator. The input to the discriminator is a channel-wise concatenation of the semantic label map and the corresponding image. The resolution of the generated images is up to 256×256. We tested directly applying the pix2pix framework to generate high-resolution images, but found the training unstable and the quality of generated images unsatisfactory. We therefore describe how we improve the pix2pix framework in the next subsection.

3.2. Improving Photorealism and Resolution
We improve the pix2pix framework by using a coarse-to-fine generator, a multi-scale discriminator architecture, and a robust adversarial learning objective function.

Coarse-to-fine generator  We decompose the generator into two sub-networks: G1 and G2. We term G1 as the global generator network and G2 as the local enhancer network. The generator is then given by the tuple G = {G1, G2} as visualized in Fig. 3. The global generator network operates at a resolution of 1024×512, and the local enhancer network outputs an image with a resolution that is 4× the output size of the previous one (2× along each image dimension). For synthesizing images at an even higher resolution, additional local enhancer networks could be utilized. For example, the output image resolution of the generator G = {G1, G2} is 2048×1024, and the output image resolution of G = {G1, G2, G3} is 4096×2048.

Our global generator is built on the architecture proposed by Johnson et al. [22], which has been proven successful
for neural style transfer on images up to 512×512. It consists of 3 components: a convolutional front-end G1^(F), a set of residual blocks G1^(R) [

Figure 3: Network architecture of our generator. We first train a residual network G1 on lower resolution images. Then, another residual network G2 is appended to G1 and the two networks are trained jointly on high resolution images. Specifically, the input to the residual blocks in G2 is the element-wise sum of the feature map from G2 and the last feature map from G1.
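The generator composition described in Fig. 3 can be summarized procedurally. This is a NumPy sketch under our own naming; the real G1 and G2 are convolutional networks, and only the feature-fusion rule and the resolution bookkeeping are taken from the text:

```python
import numpy as np

def enhancer_input(g2_front_features, g1_last_features):
    """Per Fig. 3: the input to G2's residual blocks is the element-wise
    sum of the feature map produced by G2's own front-end and the last
    feature map of the global generator G1 (same spatial resolution)."""
    assert g2_front_features.shape == g1_last_features.shape
    return g2_front_features + g1_last_features

def output_resolution(base=(1024, 512), n_enhancers=1):
    """Each local enhancer doubles each image dimension, so
    G = {G1, G2} yields 2048x1024 and G = {G1, G2, G3} yields
    4096x2048 from a 1024x512 global generator."""
    w, h = base
    return (w * 2 ** n_enhancers, h * 2 ** n_enhancers)
```

The element-wise sum is what lets the enhancer integrate the global generator's coarse, scene-level information while adding high-frequency local detail on top of it.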