Characterisation and generation of expressivity in function of speaking styles for the construction of audiobooks

Doctoral thesis of

Doctoral School No. 601: Mathématiques et Sciences et Technologies de l'Information et de la Communication

Specialty: Computer Science

By

Aghilas SINI

Caractérisation et génération de l'expressivité en fonction des styles de parole pour la construction de livres audio [Characterisation and generation of expressivity in function of speaking styles for the construction of audiobooks]

Thesis presented and defended in Lannion on 2 October 2020

Research unit: IRISA UMR 6074

Thesis No.:

Reviewers before the defence:

Yannick Esteve, Professor at the Université d'Avignon et des Pays de Vaucluse

Anne-Catherine Simon, Professor at the Université Catholique de Louvain

Composition of the jury:

President: Sylvie Gibet, Professor at the Université de Bretagne Sud

Examiners: Laurent Besacier, Professor at the Université Joseph Fourier

Sylvie Gibet, Professor at the Université de Bretagne Sud

Simon King, Professor at the University of Edinburgh

Thesis supervisor: Damien Lolive, Maître de Conférences-HDR at the Université de Rennes 1

Thesis co-supervisor: Élisabeth Delais-Roussarie, CNRS Research Director, Université de Nantes

Acronyms

Summary in French 1

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 Proposed approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.1 Corpus construction . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2.2 Emotional study of the SynPaFlex corpus . . . . . . . . . . . . . . . 3

2.3 Discourse study of audiobooks . . . . . . . . . . . . . . . . . . . . . 5

2.4 Prosodic identity of a speaker in a multi-speaker speech synthesis system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

3 Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

3.1 Short-term perspective . . . . . . . . . . . . . . . . . . . . . . . . . 6

3.2 Long-term perspective . . . . . . . . . . . . . . . . . . . . . . . . . 7

4 General discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

Introduction 1

1 Text-to-Speech Synthesis 5

1 Text-To-Speech Synthesis System . . . . . . . . . . . . . . . . . . . . . . . 5

1.1 Front-End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.2 Back-End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2 Statistical Parametric Speech Synthesis . . . . . . . . . . . . . . . . . . . . 8

2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3 Expressive Speech Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.1 What do we mean by "expressive speech synthesis"? . . . . . . . . . 14

3.2 Transversal questions . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2 Speech Prosody 17

1 What is prosody? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2 Roles of speech prosody . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.1 Linguistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.2 Para-linguistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.3 Extra-linguistic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3 Prosody Modeling for Text-to-Speech Synthesis . . . . . . . . . . . . . . . 20

3.1 Rule-based methods . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.2 Statistical data-driven methods . . . . . . . . . . . . . . . . . . . . 21

3.3 Hybrid approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 What are the topics discussed in this manuscript? . . . . . . . . . . . . . . 22

3 Audiobooks Corpora For Expressive Speech Synthesis 23

1 SynPaFlex Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

1.2 Relation to previous work . . . . . . . . . . . . . . . . . . . . . . . 24

1.3 Data Collection and Pre-processing . . . . . . . . . . . . . . . . . . 25

2 MUltispeaker French Audiobooks corpus dedicated to expressive read Speech Analysis (MUFASA) Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

2.2 The novelty of this work . . . . . . . . . . . . . . . . . . . . . . . . 30

2.3 General Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 Gap between Text-to-Speech (TTS) designed corpora and amateur audiobook recording . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.1 Data and features extraction . . . . . . . . . . . . . . . . . . . . . . 32

3.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4 A Phonetic Comparison between Different French Corpora Types . . . . . 38

4.1 Corpus design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.2 Data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.3 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.4 Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4 Annotation Protocol and Emotional Studies of SynPaFlex-Corpus 45

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2 Speech annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

2.1 Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

2.2 Intonation Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

2.3 Characters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

2.4 Emotions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

2.5 Other Events . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3 Evaluation of the emotion annotation . . . . . . . . . . . . . . . . . . . . . 53

3.1 Data analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.2 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

3.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4 Emotion Lexicon Study of Audiobooks . . . . . . . . . . . . . . . . . . . . 57

4.1 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.2 Pre-processing stage . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.3 Features Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.4 Clustering Stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

4.5 Acoustic Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

4.6 Experiments and Results . . . . . . . . . . . . . . . . . . . . . . . . 62

4.7 Discussion and issues . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5 Automatic Annotation of discourses in Audiobooks 67

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

2 Corpus and material . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

3 Rule-based Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

3.1 Rule-based results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

4 Machine learning approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.1 General Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.2 Data used and feature extraction . . . . . . . . . . . . . . . . . . . 77

4.3 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78

5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6 Automatic prosodic analysis of discourse changes 81

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

2 Corpus Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

2.1 Experimental dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 83

2.2 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

2.3 Text annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84

3 Prosodic analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

3.1 Features Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

3.2 Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

4 Results and discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89

5 Conclusion and perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . 91

7 Speaker Prosodic Identity 93

1 General Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93

3 Speaker Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94

3.1 OneHot-Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

3.2 X-Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

3.3 P-Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95

4 Analysis Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

4.1 Input and Output features . . . . . . . . . . . . . . . . . . . . . . . 96

4.2 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

5 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

5.2 Models configuration . . . . . . . . . . . . . . . . . . . . . . . . . . 99

6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

6.1 Standard measurements . . . . . . . . . . . . . . . . . . . . . . . . 99

6.2 Visualizing the first hidden-layer output . . . . . . . . . . . . . . . 100

6.3 Subjective Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 102

7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

Conclusion 107

Summary of the Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

Further Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

Perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

General Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

A Audiobooks Corpora 117

1 SynPaFlex Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

2 SynPaFlex Annotated Subset . . . . . . . . . . . . . . . . . . . . . . . . . 119

3 MUFASA Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

4 MUFASA Parallel Subcorpus . . . . . . . . . . . . . . . . . . . . . . . . . 133

B Data visualization and high dimension reduction 137

1 Principal Component Analysis (PCA) . . . . . . . . . . . . . . . . . . . . . 137

C Discourses Annotation 139

1 Speech Verbs List . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

D Manual Annotation and Subjective Assessment Materials 143

1 Intonation Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143

1.1 Exclamation pattern . . . . . . . . . . . . . . . . . . . . . . . . . 143

1.2 Nopip pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

1.3 Nuance pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

1.4 Resolution pattern . . . . . . . . . . . . . . . . . . . . . . . . . . 146

1.5 Suspense pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

1.6 Note pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148

1.7 Singing pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149

2 List of stimuli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

3 Subjective Assessment Platform . . . . . . . . . . . . . . . . . . . . . . . 154

E Future Work 155

1 End-to-End Tacotron-2 Architecture . . . . . . . . . . . . . . . . . . . . . 155

Bibliography 156

List of Figures

1.1 Text-to-Speech (TTS) system pipeline . . . . . . . . . . . . . . . . . . . . 5

3.1 Overview of the Speech Segmentation process . . . . . . . . . . . . . . . . 26

3.2 The vowel trapezoids of the three cardinal vowels /u/, /i/, and /a/ . . . . 37

3.3 Pauses distribution and average duration for "Mademoiselle Albertine est partie" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.4 Pauses distribution and average duration for "Vingt mille lieues sous les mers, Chapter 3" . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

3.5 The vowel trapezoids of the three cardinal vowels in the context of the occlusives /p/, /t/, /k/ . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.6 The density distribution according to the duration of the three vowels preceded by an occlusive consonant /p/, /t/, /k/ . . . . . . . . . . . . . . . 43

4.1 The ten fundamental intonations defined in [Delattre 1966], illustrated by a dialogue: - Si ces oeufs étaient frais, j'en prendrais. Qui les vend ? C'est bien toi, ma jolie ? - Évidemment, Monsieur. - Allons donc ! Prouve-le-moi. [- If these eggs were fresh, I'd take some. Who sells them? Is it you, my pretty? - Of course it is, sir. - Come on, then! Prove it to me.] . . . . . . . . . . . . 48

4.2 Nuance intonation pattern example: "puis il me semblait avoir entendu sur l'escalier les pas légers de plusieurs femmes se dirigeant vers l'extrémité du corridor opposé à ma chambre." [then I thought I heard on the staircase the light steps of several women heading towards the end of the corridor opposite my room.] . . . . . . . . . . . . . . . . . . . . . . . . 49

4.3 A combination of three non-exclusive intonation patterns. The nuance pattern, recognized by its particular pitch contour described in Figure D.3, falls on "Dans cette cruelle position, elle ne s'est donc pas adressée" at the beginning of the utterance. It is followed by an emotional pattern characterized by dynamic pitch (high F0 range) on "à la marquise d'Harville, sa parente," and the utterance finishes with an explicit question pattern on "sa meilleure amie ?" [In this cruel position, did she not then turn to the Marquise d'Harville, her relative, her best friend?] . . . . . . . . . . . . 50

4.4 Scheme of proposed framework . . . . . . . . . . . . . . . . . . . . . . . . 58

4.5 The data points scattered into k = 18 groups (doc2vec features). The right-hand side shows the result of K-means, i.e., the data points of each cluster. The left-hand side shows the silhouette coefficient of each cluster. The thickness of each cluster plot depends on the number of data points lying in the cluster. The red bar is the average silhouette coefficient over all clusters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.6 The data points scattered into k = 7 groups (doc2vec + emotional vector features). The right-hand side shows the result of K-means, i.e., the data points of each cluster. The left-hand side shows the silhouette coefficient of each cluster. The thickness of each cluster plot depends on the number of
