Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books

Yukun Zhu*1, Ryan Kiros*1, Richard Zemel1, Ruslan Salakhutdinov1, Raquel Urtasun1, Antonio Torralba2, Sanja Fidler1
1University of Toronto, 2Massachusetts Institute of Technology
Abstract
Books are a rich source of both fine-grained information, how a character, an object or a scene looks, as well as high-level semantics, what someone is thinking or feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.

1. Introduction
A truly intelligent machine needs to not only parse the surrounding 3D environment, but also understand why people take certain actions, what they will do next, what they could possibly be thinking, and even try to empathize with them. In this quest, language will play a crucial role in grounding visual information to high-level semantic concepts. Only a few words in a sentence may convey really rich semantic information. Language also represents a natural means of interaction between a naive user and our vision algorithms, which is particularly important for applications such as social robotics or assistive driving.

Combining images or videos with language has gotten significant attention in the past year, partly due to the creation of CoCo [20], Microsoft's large-scale captioned image dataset. The field has tackled a diverse set of tasks such as captioning [15, 13, 40, 39, 24], alignment [13, 17, 38], Q&A [22, 21], visual model learning from textual descriptions [9, 29], and semantic visual search with natural multi-sentence queries [19].

* Denotes equal contribution.

Figure 1: Shot from the movie Gone Girl, along with the subtitle, aligned with the book. We reason about the visual and dialog (text) alignment between the movie and a book.

Books provide us with very descriptive text that conveys both fine-grained visual details (how things look) and high-level semantics (what people think and feel, and how their states evolve through a story). This source of knowledge, however, does not come with associated visual information that would enable us to ground it with natural language. Grounding descriptions in books to vision would allow us to get textual explanations or stories about the visual world rather than the short captions available in current datasets. It could also provide us with a very large amount of data (with tens of thousands of books available online). In this paper, we exploit the fact that many books have
been turned into movies. Books and their movie releases share a lot of common knowledge while also being complementary in many ways. For instance, books provide detailed descriptions about the intentions and mental states of the characters, while movies are better at capturing visual aspects of the settings.

The first challenge we need to address, and the focus of this paper, is to align books with their movie releases in order to obtain rich descriptions for the visual content. We aim to align the two sources with two types of information: visual, where the goal is to link a movie shot to a book paragraph, and dialog, where we want to find correspondences between sentences in the movie's subtitle and sentences in the book (Fig. 1). We introduce a novel sentence similarity measure based on a neural sentence embedding trained on millions of sentences from a large corpus of books. On the visual side, we extend the neural image-sentence embeddings to the video domain and train the model on DVS descriptions of movie clips. Our approach combines different similarity measures and takes into account contextual information contained in the nearby shots and book sentences. Our final alignment model is formulated as an energy minimization problem that encourages the alignment to follow a similar timeline. To evaluate the book-movie alignment model we collected a dataset with 11 movie/book pairs annotated with 2,070 shot-to-sentence correspondences. We demonstrate good quantitative performance and show several qualitative examples that showcase the diversity of tasks our model can be used for. All our data and code are available at http://www.cs.utoronto.ca/~mbweb/.

The alignment model enables multiple applications. Imagine an app which allows the user to browse the book as the scenes unroll in the movie: perhaps its ending or acting are ambiguous, and one would like to query the book for answers. Vice-versa, while reading the book one might want to switch from text to video, particularly for the juicy scenes. We also show other applications of learning from movies and books, such as book retrieval (finding the book that goes with a movie and finding other similar books), and captioning CoCo images with story-like descriptions.

2. Related Work
Most effort in the domain of vision and language has been devoted to the problem of image captioning. Older work translated visual content into textual descriptions [7, 18]. Recently, several approaches based on RNNs emerged, generating captions via a learned joint image-text embedding [15, 13, 40, 24]. These approaches have also been extended to generate descriptions of short video clips [39]. In [27], the authors go beyond describing what is happening in an image and provide explanations about why something is happening. Related to ours is also work on image retrieval [11], which aims to find an image that best depicts a complex description.

For text-to-image alignment, [17, 8] find correspondences between nouns and pronouns in a caption and visual objects using several visual and textual potentials. Lin et al. [19] do so for videos. In [23], the authors align cooking videos with the recipes. Bojanowski et al. [2] localize actions from an ordered list of labels in video clips. In [13, 34], the authors use RNN embeddings to find the correspondences. [41] combines neural embeddings with soft attention in order to align the words to image regions.

Early work on movie-to-text alignment includes dynamic time warping for aligning movies to scripts with the help of subtitles [6, 5]. Sankar et al. [31] further developed a system which identified sets of visual and audio features to align movies with scripts. Such alignment has been exploited to provide weak labels for person naming tasks [6, 33, 28].
Closest to our work is [38], which aligns plot synopses to shots in TV series for story-based content retrieval. This work adopts a similarity function between sentences in plot synopses and shots based on person identities and keywords in subtitles. Our work differs from theirs in several important aspects. First, we tackle the more challenging problem of movie/book alignment. Unlike plot synopses, which closely follow the storyline of movies, books are more verbose and might vary in the storyline from their movie release. Furthermore, we use learned neural embeddings to compute the similarities rather than hand-designed similarity functions.
Parallel to our work, [37] aims to align scenes in movies to chapters in the book. However, their approach operates on a very coarse level (chapters), while ours does so on the sentence/paragraph level. Their dataset thus evaluates on 90 scene-chapter correspondences, while our dataset draws 1,800 shot-to-paragraph alignments. Furthermore, the approaches are inherently different. [37] matches the presence of characters in a scene to those in a chapter, as well as uses hand-crafted similarity measures between sentences in the subtitles and dialogs in the books, similarly to [38].

Rohrbach et al. [30] recently released the Movie Description dataset, which contains clips from movies, each time-stamped with a sentence from DVS (Descriptive Video Service). The dataset contains clips from over 100 movies and provides a great resource for captioning techniques. Our effort here is to align movies with books in order to obtain longer, richer and more high-level video descriptions.

We start by describing our new dataset, and then explain our proposed approach.

3. The MovieBook and BookCorpus Datasets
We collected two large datasets, one for movie/book alignment and one with a large number of books.

The MovieBook Dataset. Since no prior work or data exist on the problem of movie/book alignment, we collected a new dataset with 11 movies and corresponding books. For each movie we also have subtitles, which we parse into a set of time-stamped sentences. Note that no speaker information is provided in the subtitles. We parse each book into sentences and paragraphs.

Our annotators had the movie and a book opened side by side. They were asked to iterate between browsing the book and watching a few shots/scenes of the movie, and trying to find correspondences between them. In particular, they marked the exact time (in seconds) of correspondence in the movie and the matching line number in the book file, indicating the beginning of the matched sentence. On the video side, we assume that the match spans across a shot (a video unit with smooth camera motion). If the match was longer in duration, the annotator also indicated the ending time. Similarly for the book, if more sentences matched, the annotator indicated from which to which line a match occurred. Each alignment was tagged as a visual, dialog, or an audio match. Note that even for dialogs, the movie and book versions are semantically similar but not exactly the same. Thus deciding on what defines a match or not is also somewhat subjective and may slightly vary across our annotators. Altogether, the annotators spent 90 hours labeling.

Title | #sent. | #words | #unique words | avg. #words per sent. | max #words per sent. | #paragraphs | #shots | #sent. in subtitles | #dialog align. | #visual align.
No Country for Old Men | 8,050 | 69,824 | 1,704 | 10 | 68 | 3,189 | 1,348 | 889 | 223 | 47
Harry Potter and the Sorcerers Stone | 6,458 | 78,596 | 2,363 | 15 | 227 | 2,925 | 2,647 | 1,227 | 164 | 73
The Green Mile | 9,467 | 133,241 | 3,043 | 17 | 119 | 2,760 | 2,350 | 1,846 | 208 | 102
One Flew Over the Cuckoo's Nest | 7,103 | 112,978 | 2,949 | 19 | 192 | 2,236 | 1,671 | 1,553 | 64 | 25

Table 1: Statistics for our MovieBook Dataset with ground-truth for alignment between books and their movie releases.

# of books | # of sentences | # of words | # of unique words | mean # of words per sentence | median # of words per sentence
11,038 | 74,004,228 | 984,846,357 | 1,316,420 | 13 | 11

Table 2: Summary statistics of our BookCorpus dataset. We use this corpus to train the sentence embedding model.

Table 1 presents our dataset, while Fig. 6 shows a few ground-truth alignments. The number of sentences per book varies from 638 to 15,498, even though the movies are similar in duration. This indicates a huge diversity in descriptiveness across literature, and presents a challenge for matching. Sentences also vary in length, with those in Brokeback Mountain being twice as long as those in The Road. The longest sentence in American Psycho has 422 words and spans over a page in the book.

Aligning movies with books is challenging even for humans, mostly due to the scale of the data. Each movie is on average 2h long and has 1,800 shots, while a book has on average 7,750 sentences. Books also have different styles of writing, formatting and language, and may contain slang ("going" vs "goin'", or even "was" vs "'us"), etc. Table 1 shows that finding visual matches was particularly challenging. This is because descriptions in books can be either very short and hidden within longer paragraphs or even within a longer sentence, or very verbose, in which case they get obscured by the surrounding text and are hard to spot. Of course, how closely the movie follows the book is also up to the director, which can be seen through the number of alignments that our annotators found across the different movie/book pairs.

BookCorpus. In order to train our sentence similarity model we collected a corpus of 11,038 books from the web. These are free books written by yet unpublished authors. We only included books that had more than 20K words in order to filter out perhaps noisier shorter stories. The dataset has books in 16 different genres (e.g., Romance). Table 2 highlights the summary statistics of our corpus.
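The annotation format described above (a start time in seconds, an optional end time when the match spans more than one shot, a starting and optional ending line in the book file, and a match type) can be captured in a small record. The class and field names below are illustrative assumptions, not the schema actually used by the annotators.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AlignmentRecord:
    """One ground-truth correspondence between a movie and its book.

    Names are hypothetical; the paper specifies the content
    (times in seconds, line numbers in the book file, a match
    type) but not a concrete file format.
    """
    start_time_s: float          # start of the matched span in the movie
    end_time_s: Optional[float]  # set only when the match outlasts one shot
    start_line: int              # line in the book file where the match begins
    end_line: Optional[int]      # set only when several sentences match
    match_type: str              # "visual", "dialog", or "audio"

    def duration(self) -> Optional[float]:
        """Length of the matched movie span, if an end time was marked."""
        if self.end_time_s is None:
            return None
        return self.end_time_s - self.start_time_s

# Example: a dialog match covering 6.5 seconds and three book lines.
rec = AlignmentRecord(412.0, 418.5, 1045, 1047, "dialog")
```

A single-shot match would simply leave `end_time_s` and `end_line` unset.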
4. Aligning Books and Movies
Our approach aims to align a movie with a book by exploiting visual information as well as dialogs. We take shots as video units and sentences from subtitles to represent dialogs. Our goal is to match these to the sentences in the book. We propose several measures to compute similarities between pairs of sentences as well as between shots and sentences. We use our novel deep neural embedding, trained on our large corpus of books, to predict similarities between sentences. Note that an extended version of the sentence embedding is described in detail in [16], showing how to deal with million-word vocabularies and demonstrating its performance on a large variety of NLP benchmarks. For comparing shots with sentences we extend the neural embedding of images and text [15] to operate in the video domain. We next develop a novel contextual alignment model that combines information from various similarity measures and a larger time-scale in order to make better local alignment predictions. Finally, we propose a simple pairwise Conditional Random Field (CRF) that smooths the alignments by encouraging them to follow a linear timeline, both in the video and book domain.

We first describe our sentence similarity measure and our video-to-text embedding. We next propose our contextual model that combines similarities, and discuss the CRF in more detail.

4.1. Skip-Thought Vectors
In order to score the similarity between two sentences, we exploit our architecture for learning unsupervised representations of text [16]. The model is loosely inspired by the skip-gram [25] architecture for learning representations of words. In the word skip-gram model, a word w_i is chosen and must predict its surrounding context (e.g. w_{i+1} and w_{i-1} for a context window of size 1). Our model works in a similar way but at the sentence level. That is, given a sentence tuple (s_{i-1}, s_i, s_{i+1}), our model first encodes the sentence s_i into a fixed vector, then, conditioned on this vector, tries to reconstruct the sentences s_{i-1} and s_{i+1}, as shown in Fig. 2. The motivation for this architecture is inspired by the distributional hypothesis: sentences that have similar surrounding context are likely to be both semantically and syntactically similar. Thus, two sentences that have similar syntax and semantics are likely to be encoded to a similar vector. Once the model is trained, we can map any sentence through the encoder to obtain vector representations, then score their similarity through an inner product.

Figure 2: Sentence neural embedding [16]. Given a tuple (s_{i-1}, s_i, s_{i+1}) of contiguous sentences, where s_i is the i-th sentence of a book, the sentence s_i is encoded and tries to reconstruct the previous sentence s_{i-1} and the next sentence s_{i+1}. Unattached arrows are connected to the encoder output. Colors indicate which components share parameters. <eos> is the end-of-sentence token.

Query: he started the car, left the parking lot and merged onto the highway a few miles down the road.
- he drove down the street off into the distance.
- he shut the door and watched the taxi drive off.
- she watched the lights flicker through the trees as the men drove toward the road.

Query: a messy business to be sure, but necessary to achieve a fine and noble end.
- the most effective way to end the battle.
- they saw their only goal as survival and logically planned a strategy to achieve it.
- there would be far fewer casualties and far less destruction.

Table 3: Qualitative results from the sentence skip-gram model. For each query sentence on the left, we retrieve the 4 nearest neighbor sentences (by inner product) chosen from books the model has not seen before. More results in the supplementary.

The learning signal of the model depends on having contiguous text, where sentences follow one another in sequence. A natural corpus for training our model is thus a large collection of books. Given the size and diversity of genres, our BookCorpus allows us to learn very general representations of text. For instance, Table 3 illustrates the nearest neighbours of query sentences, taken from held-out books that the model was not trained on. These qualitative results demonstrate that our intuition is correct, with the resulting nearest neighbors corresponding largely to syntactically and semantically similar sentences. Note that the sentence embedding is general and can be applied to other domains not considered in this paper, which is explored in [16].

To construct an encoder, we use a recurrent neural network, inspired by the success of encoder-decoder models for neural machine translation [12, 3, 1, 35]. Two kinds of activation functions have recently gained traction: long short-term memory (LSTM) [10] and the gated recurrent unit (GRU) [4]. Both types of activation successfully solve the vanishing gradient problem through the use of gates to control the flow of information. The LSTM unit explicitly employs a cell that acts as a carousel with an identity weight. The flow of information through a cell is controlled by input, output and forget gates, which control what goes into a cell, what leaves a cell and whether to reset the contents of the cell. The GRU does not use a cell but employs two gates: an update and a reset gate. In a GRU, the hidden state is a linear combination of the previous hidden state and the proposed hidden state, where the combination weights are controlled by the update gate. GRUs have been shown to perform just as well as LSTM on several sequence prediction tasks [4] while being simpler. Thus, we use GRU as the activation function for our encoder and decoder RNNs.

Suppose we are given a sentence tuple (s_{i-1}, s_i, s_{i+1}); let w_t^i denote the t-th word of s_i and let x_t^i be its word embedding. We break the model description into three parts: the encoder, decoder and objective function.

Encoder. Let w_1^i, ..., w_N^i denote the words in sentence s_i, with N the number of words in the sentence. The encoder produces a hidden state h_t^i at each time step, which forms the representation of the sequence w_1^i, ..., w_t^i. Thus, the hidden state h_N^i is the representation of the whole sentence. The GRU produces the next hidden state as a linear combination of the previous hidden state and the proposed state update (we drop subscript i):

    h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̄_t    (1)

where h̄_t is the proposed state update at time t, z_t is the update gate and ⊙ denotes a component-wise product. The update gate takes values between zero and one. In the extreme cases, if the update gate is the vector of ones, the previous hidden state is completely forgotten and h_t = h̄_t. Alternatively, if the update gate is the zero vector, then the hidden state from the previous time step is simply copied over, that is h_t = h_{t-1}. The update gate is computed as

    z_t = σ(W_z x_t + U_z h_{t-1})    (2)

where W_z and U_z are the update gate parameters. The proposed state update is given by

    h̄_t = tanh(W x_t + U(r_t ⊙ h_{t-1}))    (3)

where r_t is the reset gate, which is computed as

    r_t = σ(W_r x_t + U_r h_{t-1})    (4)
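As a minimal sketch, the GRU recurrence of Eqs. (1)-(4), together with the inner-product scoring of encoded sentences used for the retrieval in Table 3, can be written in a few lines of NumPy. All names, dimensions and the parameter packing below are illustrative assumptions, not the authors' implementation; trained weights and real word embeddings would replace the random values.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step following Eqs. (1)-(4).

    params packs (Wz, Uz, Wr, Ur, W, U); '*' below is the
    component-wise product written as a circle-dot in the text.
    """
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)            # Eq. (2): update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)            # Eq. (4): reset gate
    h_bar = np.tanh(W @ x_t + U @ (r * h_prev))    # Eq. (3): proposed state
    return (1.0 - z) * h_prev + z * h_bar          # Eq. (1): new hidden state

def encode(word_vectors, params, hidden_dim):
    """Run the GRU over a sentence's word embeddings;
    the final hidden state h_N represents the sentence."""
    h = np.zeros(hidden_dim)
    for x_t in word_vectors:
        h = gru_step(x_t, h, params)
    return h

# Toy usage: random stand-ins for word embeddings of two sentences,
# scored by inner product. Dimensions are arbitrary.
rng = np.random.default_rng(0)
d, hdim = 4, 3
params = tuple(rng.normal(scale=0.1, size=(hdim, k)) for k in (d, hdim) * 3)
s1 = encode(rng.normal(size=(5, d)), params, hdim)
s2 = encode(rng.normal(size=(7, d)), params, hdim)
similarity = float(s1 @ s2)
```

Two sanity checks follow directly from the text: forcing the update gate toward zero copies the previous hidden state, and forcing it toward one replaces the state with the proposed update.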
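Returning to the alignment model of Section 4: as a simplified stand-in for the energy minimization that encourages alignments to follow a similar timeline (not the paper's actual CRF formulation), one can score every shot-sentence pair and penalize large jumps in the book, then solve for the best assignment with dynamic programming. The function name and the linear jump penalty are assumptions for illustration.

```python
import numpy as np

def align_shots_to_sentences(sim, jump_penalty=0.5):
    """Pick one book sentence per shot, maximizing the sum of
    similarities minus a penalty on timeline jumps between
    consecutive shots. Simplified stand-in for CRF smoothing.

    sim: (num_shots, num_sentences) similarity matrix.
    """
    n_shots, n_sents = sim.shape
    idx = np.arange(n_sents)
    # jumps[p, q] = distance from previous pick p to next pick q
    jumps = np.abs(idx[None, :] - idx[:, None])
    score = sim[0].copy()            # best score ending at each sentence
    back = np.zeros((n_shots, n_sents), dtype=int)
    for t in range(1, n_shots):
        cand = score[:, None] - jump_penalty * jumps  # (prev, next)
        back[t] = np.argmax(cand, axis=0)             # best predecessor
        score = cand[back[t], idx] + sim[t]
    # backtrack the highest-scoring path
    path = [int(np.argmax(score))]
    for t in range(n_shots - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```

With a strongly diagonal similarity matrix the recovered path follows the book in order, which is the behavior the timeline term is meant to encourage.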