Hierarchical Photo-Scene Encoder for Album Storytelling

Bairui Wang¹, Lin Ma²†, Wei Zhang¹†, Wenhao Jiang², Feng Zhang²

¹School of Control Science and Engineering, Shandong University  ²Tencent AI Lab

{bairuiwong, forest.linma, cswhjiang}@gmail.com, davidzhang@sdu.edu.cn, jayzhang@tencent.com

Abstract

In this paper, we propose a novel model with a hierarchical photo-scene encoder and a reconstructor for the task of album storytelling. The photo-scene encoder contains two sub-encoders, namely the photo and scene encoders, which are stacked together and behave hierarchically to fully exploit the structure information of the photos within an album. Specifically, the photo encoder generates a semantic representation for each photo while exploiting temporal relationships among them. The scene encoder, relying on the obtained photo representations, is responsible for detecting the scene changes and generating scene representations. Subsequently, the decoder dynamically and attentively summarizes the encoded photo and scene representations to generate a sequence of album representations, based on which a story consisting of multiple coherent sentences is generated. In order to fully extract the useful semantic information from an album, a reconstructor is employed to reproduce the summarized album representations based on the hidden states of the decoder. The proposed model can be trained in an end-to-end manner, which results in an improved performance over the state-of-the-art on the public visual storytelling (VIST) dataset. Ablation studies further demonstrate the effectiveness of the proposed hierarchical photo-scene encoder and reconstructor.

Introduction

Album storytelling (Yu, Bansal, and Berg 2017; Huang et al. 2016; Liu, Li, and Shi 2017) is the task of producing a paragraph that describes an ordered photo stream, and it has become a hot research topic in the vision and language community. Images in an album are usually redundant and diverse, since people tend to take multiple photos across multiple scenes. To describe an album, a model needs not only to extract the salient contents from the photo stream, but also to generate coherent sentences to describe them. Hence, album storytelling is substantially different from, and more challenging than, the image captioning task. Human-labeled examples of both image captioning and album storytelling are illustrated in Fig. 1. In this example, five representative images, together with their labeled captions and story, are selected from the album.

(This work was done while Bairui Wang was a Research Intern with Tencent AI Lab. †Corresponding authors.)

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

arXiv:1902.00669v1 [cs.CV] 2 Feb 2019

Album storytelling:
1) After a long summer day of playing hard.
2) Swinging and playing and playing with friends.
3) Making up dances and helping clean up after the picnic.
4) We headed for the city fireworks.
5) What a great ending to a great day!

Image captioning:
1) The picture is of a little boy sitting in a swing.
2) A young blonde girl soaking wet holding onto a ladder.
3) Two young girls wearing pink and posing the same for the picture.
4) The fireworks are shot off in the distance.
5) A large firework exploding in the sky on a dark night.

Figure 1: Differences between album storytelling and image captioning. Only five representative photos from an album of the visual storytelling (VIST) (Huang et al. 2016) dataset are shown. Sentences in image captioning describe what exactly happens in the current image, while the sentences in album storytelling focus on sentence coherence and story completeness. Please note that the blue and green boxes represent two different scenes in the album.

It can be observed that the sentences for the image captioning task are independent, only expressing the exact visual content of each image. On the contrary, the sentences for the album storytelling task take sentence coherence and story completeness into consideration. Lastly, some sentences in album storytelling might not describe any photo in the stream; the goal of such sentences is to preserve sentence coherence and story completeness. For example, the last sentence in the storytelling example, "What a great ending to a great day!", does not describe any image in the album, but it perfectly concludes the story. For these reasons, we need to consider how to extract the related salient information, detect the existing events or scenes in the album, and finally generate coherent sentences to present the story.

Album storytelling is usually realized with an encoder-decoder architecture. The encoder relies on a convolutional neural network (CNN), widely used in different works (Zhang, Yu, and He 2017; Zhang et al. 2018b; 2018a; Ma, Lu, and Li 2016; Qi et al. 2018), to extract the visual feature of each photo and fuse the features together to yield the whole album representation. The decoder usually employs a long short-term memory (LSTM) or gated recurrent unit (GRU) to generate the corresponding story. Liu et al. (Liu et al.
2017) bridge semantically related photos with large visual gaps by projecting them into one common semantic space to capture their visual variance, and construct a coherence matrix to enforce sentence coherence for storytelling. Yu et al. (Yu, Bansal, and Berg 2017) go a step further by introducing a photo selector between the encoder and decoder to automatically choose five photos as the summarization of an album, based on which five sequential sentences are generated as the album story. Among the photos in one album, some might reflect events in the same scene, although they may have significant visual variance. For example, in Fig. 1, the photos highlighted with blue boxes should belong to the same scene of "playing with friends", while the other two photos highlighted with green boxes should be related to another scene of "fireworks". These scene changes are important for album storytelling, but they are neglected by existing approaches.

To hierarchically exploit the image and scene information, we propose to employ a scene encoder, stacked on the photo encoder, to detect the scene changes and meanwhile aggregate the scene information. Afterwards, the decoder attentively summarizes the photo and scene representations to form a sequence of album representations and decodes them into sentences. With the scene changes taken into consideration, the problem of large visual variances in a photo stream is addressed, which helps improve the sentence coherence of a story. Additionally, having observed the effectiveness of dual learning in machine translation (Tu et al. 2017) and video captioning (Wang et al. 2018a), we employ the technique of dual learning to boost the album storytelling performance by reconstructing the album representations from the decoder hidden states. As such, the hierarchical image and scene information is fully exploited in our model.
The major contributions of this work are summarized as follows: 1) To detect scene changes and aggregate the scene representations, a hierarchical photo-scene encoder for album storytelling is proposed. 2) We propose to reconstruct the attentively aggregated album representations from the decoder hidden states, which helps exploit the image and scene information. 3) Extensive results on the visual storytelling (VIST) dataset indicate that the proposed photo-scene encoder and reconstructor help boost the performance, resulting in a new state-of-the-art on album storytelling.

Related Work

Album storytelling, a special case of generating natural sentences from visual contents, is related to image captioning (Karpathy, Joulin, and Fei-Fei 2014; Ma et al. 2015; Vinyals et al. 2015; Chen et al. 2018b; Jiang et al. 2018a; 2018b) and video captioning (Pan et al. 2017; Wang et al. 2018a; 2018b; Chen et al. 2018a), which share some common techniques. In this section, we present a short survey of the related works.

Image and Video Captioning. In the early stage, template-based methods were proposed to generate captions from images. The sentences are generated by filling a pre-defined template with contents detected from the input image. Later, inspired by advances in neural machine translation, the encoder-decoder framework (Vinyals et al. 2015) was introduced into image captioning. Nowadays, many variants have been proposed (Xu et al. 2015; He et al. 2016a), and reinforcement-learning-based methods have achieved remarkable results (Rennie et al. 2016; Ren et al. 2017). Similar to image captioning, encoder-decoder based methods were also proposed for video captioning (Venugopalan et al. 2015; Pan et al. 2016). Different from image captioning, video captioning models need to exploit the temporal information in videos, which is the key to boosting performance.

Album Storytelling. Different from image and video captioning, the task of album storytelling aims at generating several sentences to describe a set of images which may be visually uncorrelated. The first work in this area is (Park and Kim 2015), in which two datasets named NYC and Disneyland are released. The authors in (Park and Kim 2015) employed a local coherence model (Barzilay and Lapata 2008) to parse the patterns of local transitions of sentences in the whole text. After that, Huang et al. (Huang et al. 2016) constructed a dataset named VIST which contains more relevant stories. Liu et al. proposed to obtain a semantic space by jointly embedding each image and its corresponding sentence, to bridge the images that have similar semantics but large visual variances. Meanwhile, a semantic relation matrix is identified by distance measures in the semantic space, which is used to enforce the sentence coherence (Liu et al. 2017). To automatically summarize the contents of the album for the decoder, Yu et al. (Yu, Bansal, and Berg 2017) utilized a learnable selector on top of the visual encoder. Although previous works modeled the relationships between the photos in an album, the effects of scenes were never considered.

Architecture

For an album with m photos A = {a_1, a_2, ..., a_m}, where a_i denotes the i-th photo, album storytelling aims at generating a story composed of n sentences S = {S_1, S_2, ..., S_n} to describe the album, where S_j = {s^j_1, s^j_2, ..., s^j_t} is the j-th sentence and s^j_t denotes the t-th word in sentence S_j. In this paper, we propose an encoder-decoder-reconstructor architecture for album storytelling, as shown in Fig. 2. Specifically, a novel hierarchical photo-scene encoder, containing stacked photo and scene encoders, exploits the hierarchical structure information within the album photos. The decoder dynamically and attentively summarizes the outputs of the photo-scene encoder and decodes several sequential sentences to form a story. A reconstructor that relies on the decoder hidden states is employed to regenerate the summarization produced by the decoder, which further helps exploit the information from the album.

Figure 2: The proposed architecture: hierarchical photo-scene encoder, decoder, and reconstructor. The hierarchical photo-scene encoder is composed of two sub-encoders, namely the photo encoder and the scene encoder. The photo encoder extracts the semantic representations of the photos, and the scene encoder explores scene representations. The decoder attentively summarizes the photo and scene representations and generates multiple coherent sentences as one story for each album. The reconstructor translates the story back to the album representations. Superscripts of hidden states, such as penc, senc, attn, and dec, denote photo encoder, scene encoder, attention, and decoder, respectively. Two symbols in the figure (lost in this extraction) denote the weighted-sum and average operations.

Hierarchical Photo-Scene Encoder

The proposed photo-scene encoder contains two sub-encoders, namely the photo encoder and the scene encoder. The photo encoder models the contents and the temporal information of the album. The scene encoder detects the scene changes. We present the details in the following subsections.

Photo Encoder. In our model, the image contents are extracted with a CNN, specifically ResNet (He et al. 2016b), and the temporal information in the photo stream is captured with a bidirectional GRU (Bi-GRU). The details of the photo encoder are as follows:

    f_i = CNN(a_i),
    →h^(penc)_i = →GRU(f_i, →h^(penc)_{i-1}),
    ←h^(penc)_i = ←GRU(f_i, ←h^(penc)_{i+1}),
    v_i = ReLU([→h^(penc)_i; ←h^(penc)_i] + W_f f_i),    (1)

where W_f is a linear function, →h^(penc)_i and ←h^(penc)_i are the hidden states of the Bi-GRU, and v_i is the representation of the input photo a_i. Obviously, v_i not only contains the photo content but also captures the context information (the other photos) of the album in both the forward and backward directions. In this way, an album A can be encoded as a sequence of photo representations V = {v_1, v_2, ..., v_m}.

Scene Encoder. Different from videos, in which the visual appearances of adjacent frames are very similar, the photos in an album may not be visually relevant, as illustrated in Fig. 1. Although such photos differ greatly, they may be taken in the same scene and describe the same activities within an album. In this paper, we identify the semantic discontinuities between photos and thereby detect the scene changes. Meanwhile, each detected scene is summarized into a scene representation. We adopt a boundary detection technique similar to that used for video (Baraldi, Grana, and Cucchiara 2016) to detect scene changes in an album based on the obtained photo representations V. As shown in Fig. 3, the scene encoder consists of a linear classifier and one GRU to detect scene changes and summarize the scene information. The two components couple together to generate the final scene representations.

Figure 3: The scene encoder. Based on the photo representations, the scene encoder detects scene boundaries and generates scene representations with a GRU. Two symbols in the figure (lost in this extraction) denote the multiplication and subtraction operations.

For a given photo representation sequence, the scene detector acts as a judge that determines whether the current input marks the start of a new scene, taking the previous GRU hidden state, which relates to the context, into consideration via a linear classifier:

    k_i = 1,  if σ(w_sv^T v_i + w_sh^T h^(senc)_{i-1} + b_s) > 0.5;
    k_i = 0,  otherwise,    (2)

where k_i is the flag indicating whether a new scene is detected, h^(senc)_{i-1} denotes the previous hidden state of the GRU, w_sv, w_sh, and b_s are learnable parameters, and σ() denotes the sigmoid function. As the scene detector is a step function, which is a discrete operation, we employ the straight-through estimator (Bengio, Léonard, and Courville 2013) to back-propagate error signals.

Based on the result of the scene detector, the GRU updates its previous hidden state h^(senc)_{i-1} as follows:

    ĥ^(senc)_{i-1} = (1 - k_i) · h^(senc)_{i-1}.    (3)

Therefore, if the scene detector regards the current input v_i as the starting point of a new scene, the flag k_i is set to 1 and h^(senc)_{i-1} is collected as the final representation of the previous scene. Moreover, as a new scene begins, the hidden state h^(senc)_{i-1} is cleared to 0 and the encoding of the new scene begins. If the scene detector does not detect a new scene, the flag k_i is 0, and no scene representation needs to be generated. The hidden state updating rules are otherwise the same as in a vanilla GRU.

The scene encoder thus generates a sequence of scene representations X = {x_1, x_2, ..., x_u} for each album, with x_i denoting the hidden state of the GRU when the flag k_i equals 1, and u being the number of scenes detected.
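As a concrete illustration, the photo encoder of Eq. (1) and the scene encoder of Eqs. (2)-(3) can be sketched in NumPy. This is a minimal sketch, not the paper's implementation: the feature size, hidden size, album length, and all weights are random toy stand-ins (the paper uses ResNet features and learned parameters), a hand-rolled vanilla GRU cell replaces the framework GRUs, and the hard threshold of Eq. (2) is applied directly, without the straight-through estimator needed for training.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H, M = 16, 8, 5            # feature size, GRU size, photos per album (toy values)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One step of a vanilla GRU cell with parameters in dict p."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)            # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)            # reset gate
    cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))   # candidate state
    return (1 - z) * h + z * cand

def params(in_dim, hid):
    """Random toy weights for one GRU cell."""
    return {k: rng.normal(scale=0.1, size=(hid, in_dim if k.startswith("W") else hid))
            for k in ("Wz", "Wr", "Wh", "Uz", "Ur", "Uh")}

F = rng.normal(size=(M, D))               # stand-ins for the CNN features f_i

# --- Photo encoder, Eq. (1): Bi-GRU over the photo stream ---
fwd, bwd = params(D, H), params(D, H)
Wf = rng.normal(scale=0.1, size=(2 * H, D))
hf, hb = np.zeros(H), np.zeros(H)
h_fwd, h_bwd = [], [None] * M
for i in range(M):                         # forward pass
    hf = gru_step(F[i], hf, fwd)
    h_fwd.append(hf)
for i in reversed(range(M)):               # backward pass
    hb = gru_step(F[i], hb, bwd)
    h_bwd[i] = hb
# v_i = ReLU([h_fwd; h_bwd] + W_f f_i)
V = np.stack([np.maximum(np.concatenate([h_fwd[i], h_bwd[i]]) + Wf @ F[i], 0.0)
              for i in range(M)])

# --- Scene encoder, Eqs. (2)-(3): boundary detection + scene GRU ---
sg = params(2 * H, H)                      # the scene GRU consumes photo representations
w_sv = rng.normal(scale=0.1, size=2 * H)
w_sh = rng.normal(scale=0.1, size=H)
b_s = 0.0
h, X = np.zeros(H), []
for i in range(M):
    k = int(sigmoid(w_sv @ V[i] + w_sh @ h + b_s) > 0.5)   # Eq. (2), hard decision
    if k == 1:
        X.append(h)                        # collect the finished scene representation
    h = gru_step(V[i], (1 - k) * h, sg)    # Eq. (3): reset the state on a new scene
X.append(h)                                # close the final scene at the album's end

print(V.shape, len(X))
```

The number of collected scenes depends on the (random) classifier here; with learned weights, the boundaries would align with the semantic discontinuities discussed above.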

Decoder

The obtained photo and scene representations, i.e., V and X, capture the hierarchical semantic information of the album, and they contribute differently to the final story generation. We combine V and X to form a new matrix R = [V; X], and employ an attention mechanism to dynamically and attentively summarize the photo and scene representations. We denote the l-th column of R as r_l and the sequence of generated summarizations as Z = {z_1, z_2, ..., z_n}. The procedure of computing z_j is expressed as:

    h^(attn)_j = GRU(z_{j-1}, h^(attn)_{j-1}),
    α̃_j = w^T tanh(W_h h^(attn)_j 1^T + W_r R + b),
    α_j = softmax(α̃_j),
    z_j = R α_j,    (4)

where 1 is a vector of all ones, and w, W_h, W_r, and b are learnable parameters. (In the code implementation we have to keep the total number of photo and scene representations fixed; when k_i = 0 we therefore take 0 as a false scene representation and collect it, and a mask marks the false and true scene representations.)

It can be observed that the attention process over the photo and scene representations relies on the GRU. The benefits of this attention strategy are two-fold. First, employing attention on both photo and scene representations simultaneously bridges the semantic gaps between each photo and each scene. Second, as the summarization of the current content is affected by the previous attention state, it further enhances the sentence coherence of the storytelling.

Based on the n generated album representations {z_1, z_2, ..., z_n}, n sentences are sequentially generated, composing the final story. For each album representation z_j, we use another GRU to decode its related sentence, in the same way as the decoder in image and video captioning. At each time step, the GRU takes the embedding of the previous word s^j_{t-1} and the hidden state at the previous step h^(dec)_{t-1} as inputs:

    h^(dec)_t = GRU([E(s^j_{t-1}); z_j], h^(dec)_{t-1}),
    d^j_t = MLP([h^(dec)_t; z_j]),
    P(s^j_t | s^j_{<t}) = softmax(d^j_t),    (5)

where P denotes the word probability of the word s^j_t of the j-th sentence at time step t when the generated partial caption {s^j_1, s^j_2, ..., s^j_{t-1}} is known.
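The attention summarization of Eq. (4) can be sketched as below. All sizes and weights are random toy stand-ins for the learned ones, and the attention-GRU hidden state h^(attn)_j is replaced by a random vector rather than being produced by an actual GRU step:

```python
import numpy as np

rng = np.random.default_rng(2)

H, m, u = 16, 5, 2                    # representation size, #photos, #scenes (toy)
R = rng.normal(size=(H, m + u))       # R = [V; X], one column r_l per representation
h_attn = rng.normal(size=H)           # stand-in for the attention-GRU hidden state

w = rng.normal(scale=0.1, size=H)     # learnable attention parameters (random here)
Wh = rng.normal(scale=0.1, size=(H, H))
Wr = rng.normal(scale=0.1, size=(H, H))
b = rng.normal(scale=0.1, size=(H, 1))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

ones = np.ones((1, m + u))                                       # the all-ones vector 1^T
scores = w @ np.tanh(Wh @ h_attn[:, None] @ ones + Wr @ R + b)   # alpha-tilde in Eq. (4)
alpha = softmax(scores)               # attention weights over photo + scene columns
z = R @ alpha                         # the summarized album representation z_j

print(z.shape, alpha.shape)
```

Broadcasting `h_attn` against the all-ones row vector scores every column of R against the same hidden state, which is what makes the summarization depend on the previous decoding context.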

Reconstructor

On top of the decoder, we build a GRU-based reconstructor to reconstruct the generated album representations Z based on the decoder hidden states. As such, the information flow from the sentences back to the album can be further exploited, which is believed to benefit album storytelling.

As shown in Fig. 2, the logits D_j = {d^j_1, d^j_2, ..., d^j_n} for the j-th sentence contain the sentence semantic information. The reconstructor first performs mean pooling on D_j to obtain the global sentence information d̄_j = (1/n) Σ_{i=1}^n d^j_i. Then, at each time step, a GRU is used to reconstruct the corresponding album representation:

    c^j_t = GRU([d^j_t; d̄_j], c^j_{t-1}),    (6)

where c^j_t is the reconstructor hidden state for the j-th sentence. Here we use C_j = {c^j_1, c^j_2, ..., c^j_n} to represent the hidden states of the reconstructor. Finally, we obtain the reconstructed album representation z̃_j by averaging C_j, and z̃_j is compared with z_j to obtain the reconstruction loss.

The reconstructor we design in this paper differs from (Wang et al. 2018a) in two respects. First, we reproduce one album representation with all hidden states from the decoder after a sentence is generated, while the model in (Wang et al. 2018a) needs to reconstruct the feature of each frame. This is mainly attributed to the differences between the storytelling and captioning tasks. Second, what we reconstruct is the attentively summarized album representations, which contain the photo and scene information as well as their temporal relationships. In contrast, only video frame features are reconstructed in (Wang et al. 2018a).

Loss Function and Training Strategy

In this subsection, we present the loss functions used at each step and introduce the training strategy. Given the albums, we aim at minimizing the negative log probability of the story sentences in the decoder step:

    L_dec(θ) = -Σ_{y=1}^N Σ_{j=1}^n log P(S^y_j | A^y),    (7)

where N denotes the total number of albums and n is the number of sentences in the story for an album. The sentences S_j are generated word by word.
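The inner term of Eq. (7), the negative log probability of one sentence, factorizes over the word probabilities of Eq. (5). A toy computation, with random logits and targets standing in for real decoder outputs and ground-truth words:

```python
import numpy as np

rng = np.random.default_rng(4)

vocab, T = 50, 6                        # vocabulary size and sentence length (toy)

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = rng.normal(size=(T, vocab))    # stand-ins for the decoder logits d^j_t
target = rng.integers(0, vocab, size=T) # stand-ins for ground-truth words s^j_t

probs = softmax(logits)                 # per-step word distributions, Eq. (5)
# -log P(S_j | A): sum of negative log probabilities of the ground-truth words;
# the training loss of Eq. (7) further sums this over sentences and albums.
nll = -np.log(probs[np.arange(T), target]).sum()

print(nll > 0)
```

Because each softmax probability is strictly below one, the per-sentence negative log likelihood is always positive; minimizing it pushes probability mass toward the reference words.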