
Quantifying Language Understanding of Neural Language Models

Prasanna Parthasarathi

Doctor of Philosophy

School of Computer Science

McGill University

Montreal, Quebec, Canada

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of

Doctor of Philosophy

© Prasanna Parthasarathi, 2022

Abstract

Language understanding has been a topic of study that has drawn attention from a variety of disciplines like linguistics, formal semantics, computer science, and psychology. Modern neural language understanding models with a trainable "end-to-end" setup have replaced the classical language pipeline. Although such approaches have shown remarkable success, the opaqueness of their mechanisms has recently raised concerns. Several contemporary works argue that such end-to-end neural models do mimic the classical pipeline, while a few works take a more critical stand. This thesis too takes a critical stand, and proposes novel techniques to quantify the lack of understanding of conventional syntax and semantics. First, we quantify the semantic understanding of neural models in the task of dialogue prediction by analysing the representation of the input learned by the end-to-end models. The results highlight a lack of correlation between the models' performance and the discriminative abilities of the representations learned by the neural language models. Following that, we propose a framework to evaluate the syntactic understanding of neural models by analyzing their performance on samples stripped of any conventional notion of syntax in the task of natural language inference. Observing the lack of understanding of syntax, we explore the cost of models hallucinating a probably correct input by analysing the trade-off between faithfulness and robustness in machine translation in the subsequent chapter. Finally, we attempt to quantify the unnaturalness in language understanding through novel metrics that capture the local and global order of tokens. The work compiled here aims at building interpretable techniques for language understanding in neural models and, towards that, presents a comprehensive study on quantifying language understanding in neural models across a variety of language tasks.

Abrégé

La compréhension du langage a été un sujet d'étude qui a attiré l'attention de diverses disciplines comme la linguistique, la sémantique formelle, l'informatique et la psychologie. Les modèles neuronaux modernes de compréhension du langage, pouvant être entraînés avec une configuration « de bout en bout », ont remplacé l'approche séquentielle classique de modules spécialisés. Bien que de tels modèles neuronaux aient connu un succès remarquable, l'opacité de leurs mécanismes a récemment soulevé des inquiétudes. Il existe plusieurs travaux contemporains qui soutiennent que de tels modèles neuronaux de bout en bout imitent certains modules spécialisés dans le traitement du langage, tandis que d'autres travaux adoptent une position plus critique. La thèse adopte également une position critique et propose de nouvelles techniques pour quantifier le manque de compréhension de la syntaxe et de la sémantique conventionnelles. Dans un premier temps, nous quantifions la compréhension sémantique des modèles neuronaux dans une tâche de prédiction de dialogue en analysant la représentation des entrées apprise de bout en bout. Les résultats mettent en évidence un manque de corrélation entre les performances du modèle neuronal et les capacités discriminantes des représentations apprises par celui-ci. Suite à cela, nous proposons un cadre pour évaluer la compréhension syntaxique des modèles neuronaux en analysant leurs performances sur des échantillons dépourvus de toute notion conventionnelle de syntaxe dans une tâche d'inférence de langage naturel. Observant le manque de compréhension de la syntaxe, nous explorons le coût des modèles hallucinant une entrée probablement correcte en analysant le compromis entre fidélité et robustesse en traduction automatique dans le chapitre suivant. Enfin, nous tentons de quantifier le manque de naturel dans la compréhension du langage grâce à de nouvelles métriques qui capturent l'ordre local et global des mots. L'ensemble du travail vise à construire des techniques interprétables pour la compréhension du langage dans les modèles neuronaux et, à cet effet, réalise une étude approfondie sur la quantification de la compréhension du langage dans les modèles neuronaux sur une variété de tâches linguistiques.

Contributions to Original Knowledge

The thesis contributes to the topic of natural language processing by proposing novel techniques to interpret the almost opaque and over-parameterized neural models. Specifically,

1. We highlight the discrepancies in language generation by evaluating the specificity of the representation learnt by neural dialogue models in estimating the dialogue state through linear probing.

2. We identify a serious issue of systematic unnatural syntactic understanding of neural language models in the language inference task. This work showcases that the neural models lack understanding of any conventional notion of syntax.

3. Towards pointing out a potential social issue of such unnaturally robust models, we propose a framework to understand the trade-off between faithfulness and robustness in the domain of neural machine translation.

4. We propose novel metrics to quantify the sensitivity to perturbations in an attempt to interpret the word-order insensitivity of neural models. We find evidence suggesting that models may be learning syntactic rules that are governed more at the level of sub-words and characters than at the word level.

Contribution of Authors

Most of Chapters 1 and 2 were written specifically for this thesis.

Chapter 3 is based on a conference paper for which I was the primary student author. My contributions were on developing the tools for setting up the probe tasks, devising the experiments, and setting up Amazon Mechanical Turk for human evaluations. The paper was jointly written with Sarath Chandar and Joelle Pineau.

Chapter 4 is based on a conference paper (Sinha et al., 2020a). The primary observation of the unnaturalness in syntax was by Koustuv Sinha. I contributed by extending the experiments to recurrent and convolutional language models. Although the experiments on transformer models were contributed by Koustuv, we present them in the thesis for a better understanding of the work. The idea of using a metric to quantify the unnaturalness, and of designing perturbations through the POS mini-tree hypothesis, germinated over discussions. Koustuv prototyped the metric and the word shuffling operation. We attribute a huge credit to Adina Williams, who advised us on the project and contributed the majority of the writing. This work received the "Outstanding Paper" award at ACL 2021. Joelle Pineau supervised the project by providing useful comments and helping with the writing.

Chapter 5 is based on a conference paper (Parthasarathi et al., 2021c). This is an extension of the previous work to the language generation task of neural machine translation. Here, I was the primary contributor to the experiments, taking suggestions from Koustuv for their design and organization. We attribute joint credit for the conceptualization of the metrics, for the writing, and for the narrative on faithfulness vs robustness. Adina Williams helped a lot with the writing of the paper. Joelle Pineau supervised the project by providing useful comments and helping with the writing.

Chapter 6 is based on a conference paper (Clouatre et al., 2021). This is a joint work with Louis Cloûatre, Sarath Chandar and Amal Zouaq. My contribution to this work was predominantly the conceptualization of quantifying perturbations and correlating it with the performance of neural models on language understanding tasks. Louis took the lead in the experiments, and I contributed to the writing of the paper. Louis made the initial observations drawing a connection between the IDC metric and the use of positional encoding, and formalized the experiments. Sarath Chandar and Amal Zouaq supervised the project by providing useful suggestions and helping with the writing.

Chapter 7 was written specifically for this thesis. The ideas presented in the future work section are attributable to the discussions I have had with Saujas Vaduguru, Marc Alexandre-Cote, Eric Yuan, Sarath Chandar, Koustuv Sinha, Adina Williams and Joelle Pineau.

Throughout my Ph.D., I have also been fortunate to collaborate on many other research projects that are not part of this thesis (Rajendran et al., 2017; Truong et al., 2017; Parthasarathi and Pineau, 2018; Gontier et al., 2018; Sinha et al., 2020b; Parthasarathi et al., 2020a, 2021a; McRae et al., 2021).

Acknowledgements

Thank you Joelle, for the opportunity you provided and your insightful advice. I am immensely thankful for your guidance towards a career in research. Thank you Radhika, for your unconditional support throughout my Ph.D. journey. I cannot thank you enough for being there for me and consistently cheering me on. Thank you Amma and Appa, for instilling in me the confidence to dream bigger and to be an independent person. Thank you Anand, for your support and constant motivation. Thank you Sarath, for being a friend, mentor and well-wisher.

1, Saujas, Mamal, Igor, Arjun, Srinivas, Varsha, Sruthika, Sankari, Sai Rajeshwar, Shagun, Disha and Gunshi for the many conversations and board game nights which were a welcome respite.

I would like to thank all my mentors, collaborators and colleagues at McGill University, Mila, Facebook AI Research (Meta AI) Montreal, Google Brain and Noah's Ark Lab (Huawei) for many great discussions over coffees and lunches. I would also like to thank the McGill and Mila administration for facilitating my academic journey.

1 Special thanks for helping me with the French abstract.

Contents

1 Introduction . . . 1
2 Background . . . 7
2.1 Language Model . . . 7
2.2 Generalized Language Models Through Distributed Representations . . . 8
2.2.1 Feed Forward Neural Network based LM . . . 9
2.2.2 Recurrent Neural Network based LM . . . 9
2.2.3 Transformer Language Model . . . 13
2.3 Natural Language Processing Tasks . . . 16
2.3.1 Text prediction . . . 16
2.3.2 Text Classification . . . 18
2.4 Overview of Metrics . . . 19
2.5 Probe tasks . . . 21
3 On quantifying the semantics of language encoders . . . 23
3.1 Related Work . . . 24
3.2 Probe Tasks . . . 25
3.2.1 Datasets . . . 25
3.2.2 Models . . . 26
3.2.3 Motivating Semantic Probe Tasks for Dialogue Generation . . . 27
3.2.4 Dialogue Probe Tasks . . . 28
3.3 Experiments . . . 30
3.3.1 Results . . . 30
3.4 Discussion . . . 37
3.5 Summary . . . 38
4 On the effective role of word-orders in natural language tasks . . . 39
4.1 Related Work . . . 40
4.2 Our Approach . . . 42
4.3 Methods . . . 43
4.4 Results . . . 46
4.5 Analyzing Syntactic Structure Associated with Tokens . . . 50
4.6 Human Evaluation . . . 53
4.7 Summary . . . 53
5 On the effects of unchecked normativity in language generation . . . 55
5.1 Related Work . . . 56
5.2 Metrics . . . 58
5.3 Perturbations . . . 60
5.3.1 Random Shuffles . . . 61
5.3.2 Part-of-Speech tag Based Perturbations . . . 61
5.3.3 Dependency Tree Based . . . 62
5.3.4 Distribution . . . 63
5.4 Experiments . . . 64
5.5 Results . . . 65
5.5.1 Faithfulness vs. Robustness . . . 65
5.5.2 Patterns in 1 and 2, and Length . . . 70
5.6 Discussion . . . 74
5.7 Summary . . . 78
6 On quantifying the perceived unnaturalness . . . 79
6.1 Related Work . . . 81
6.2 Proposed Metrics . . . 83
6.3 Perturbation Functions . . . 85
6.4 Experiments . . . 87
6.5 Analysis . . . 88
6.5.1 Correlation with other metrics . . . 88
6.5.2 Comparison of Perturbation Functions . . . 88
6.5.3 IDC/DND vs GLUE tasks . . . 89
6.5.4 Model specific analysis . . . 95
6.5.5 Character-Level Experimentation . . . 97
6.6 Summary . . . 98
7 Final Conclusion & Future Work . . . 100
7.1 Final Conclusion . . . 100
7.2 Future Work . . . 101
Bibliography . . . 104
Acronyms . . . 135

List of Figures

2.1 A Recurrent Neural Network Language Model. . . . 10
2.2 A single LSTM-RNN cell. . . . 11
2.3 Bidirectional Recurrent Neural Network. . . . 12
2.4 Transformer architecture (Vaswani et al., 2017). . . . 13
3.1 The mean of the distribution of tie in three different experiments. . . . 28
3.2 Progression of performance of models on the probe tasks in the MultiWoZ dataset. . . . 33
3.3 Progression of performance of models on the probe tasks in the MultiWoZ dataset (Continued). . . . 34
3.4 Downsampled encoder hidden states on the MultiWoZ dataset with PCA. . . . 36
4.1 Graphical representation of the Permutation Acceptance class of metrics. . . . 45
4.2 Average entropy of model confidences on permutations across different models. . . . 49
4.3 BLEU-2 score versus acceptability of permuted sentences across all test datasets. . . . 50
4.4 POS Tag Mini Tree overlap score. . . . 52
5.1 Effect of the different perturbation functions. . . . 60
5.2 Tree based perturbation example. . . . 63
5.3 Heatmap illustrating the average of Levenshtein distances between different perturbations. . . . 64
5.4 Plot showing the trend of 1 scores being generally higher than 2. . . . 66
5.5 Analysing the BLEURT score as a choice of . . . 67
5.6 Analysing the BERT-score as a choice of . . . 68
5.7 Analysing the Levenshtein score as a choice of . . . 69
5.8 The robustness of the NMT systems having a strong correlation with the performance of the machine translation system. . . . 70
5.9 Correlation of length of the text to the two different metrics. . . . 72
5.10 Le . . . 73
5.11 General trend of SoTA NMT systems to favor robust or faithful translations. . . . 74
5.12 Averaging the difference between faithfulness and robustness across languages and NMT systems. . . . 76
6.1 Example of perturbation operations applied at different granularity. . . . 80
6.2 Word-level full shuffling perturbation example. . . . 86
6.3 Subword-level phrase shuffling perturbation example. . . . 86
6.4 Character-level full neighbor flip perturbation example. . . . 87
6.5 Pairwise correlation between the different metrics on the GLUE tasks. . . . 89
6.6 Relation between the different choices of metrics measuring the amount of perturbation. . . . 90
6.7 Analysing different perturbation functions discussed in the literature with the proposed metrics - IDC and DND. . . . 91
6.9 Correlation between the models' performance on perturbed samples on the different GLUE tasks. . . . 91
6.8 Comparison of different neural architectures' performances with different levels of perturbation as measured by DND. . . . 92
6.10 Correlation between perturbations measured by different metrics and the performance on GLUE Tasks of pre-trained Transformers. . . . 93
6.11 Correlation between perturbations measured by different metrics and the performance on GLUE Tasks of different non-pretrained architectures. . . . 94
6.12 Correlation between perturbations measured by different metrics and the performance on GLUE Tasks of ConvNets and BiLSTMs using only characters as tokens. . . . 95
6.13 Difference in GLUE scores between a Transformer and the same Transformer without positional embeddings. . . . 97

List of Tables

3.1 Distribution of the dialogues in the datasets. . . . 25
3.2 Size of parameters of the models used in all the experiments on the two datasets. M for Million. . . . 26
3.3 BLEU scores of the models from runs with different seeds on the PersonaChat and MultiWoZ datasets. . . . 27
3.4 The difficulty levels of the different tasks as measured by the average performance of an untrained encoder. . . . 29
3.5 Performance of neural models on the probe tasks constructed with the PersonaChat dataset. . . . 31
3.6 Performance of different neural models on the probe tasks constructed with the MultiWoZ 2.0 dataset. . . . 32
3.7 Aggregate F1 scores of the models on performance in probe tasks on the MultiWoZ dataset. . . . 37
4.1 Statistics for Transformer-based models trained on the MNLI corpus. . . . 47
4.2 Results of evaluation on the OCNLI Dev set. . . . 48
4.3 Human (expert) evaluation on 200 permuted examples from the MNLI matched development set. . . . 53
5.1 Performances in BLEU-4 of our NMT models. . . . 75
5.2 Number of flips by language and NMT models. . . . 77
5.3 The distribution count of flips by every perturbation function across the languages and models. . . . 78
1 Introduction

Natural language, as we speak, write and plan with it, has myriad connections with the evolution, culture, and knowledge of humans as a species. The origins of the early (proto) languages can be traced back tens of thousands of years (Everett, 2017). Language served the purpose of communicating information about predator and prey to an individual or a group. As humans and their societies evolved, language too became sophisticated at describing events that occurred in the past, are occurring in the present, or are going to occur in the future. Such sophistication in description allowed humans to plan future actions, analyse the actions of the past, and so on. Although planning a task and communicating an event directly benefited from the development of language, language also served other purposes like community building, where speakers engaged in conversations to advise, argue or chat without a direct purpose.

For most of history, the evolution of languages was endemic (Harari, 2014). Within a region, learning a language primarily involved familiarizing oneself with the rules applicable in the different use cases (syntax), and learning to use the vocabulary appropriately (semantics). The geographical expansion of different communities added variety to the different tasks within natural languages, like translation, understanding a common grammar, and developing novel languages by fusing vocabularies, among others (Pinker, 2003). Modern-day studies on natural language attempt to learn the structure dictated by the grammar, generate language, understand language emergence, or perform natural language understanding (NLU) tasks like summarizing, answering questions from a passage, recognizing textual entailment, classifying sentiment, and machine translation, among many other tasks. Widespread usage of modern commercial applications that are powered by models solving the aforementioned language tasks has been aiding the advancement of such technologies.

Need for Neural Language Models. The use cases in several commercial language applications enabled the collection of large corpora of user interactions that could, in turn, be used to learn statistical solutions. The premise of such sophisticated data-driven models is straightforward: a sequence of projection operations is applied to identify a representation space that allows maximal distinction of the possible classes in the samples. Recently, such architectures with deep representations and over-parameterized models for the task of language learning have garnered attention (Vaswani et al., 2017). The prospects of transformers piqued the interest of the NLP community. Pre-training the transformer architectures with large corpora of data was observed to be an effective technique to learn sophisticated latent representations for several language understanding tasks (Radford et al., 2018; Devlin et al., 2018a; Liu et al., 2019d; Raffel et al., 2019).
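To make the probing theme of this thesis concrete, the following Python sketch shows one common way such pre-trained representations are inspected: freeze the encoder, pool its hidden states into a sentence vector, and fit a linear classifier (a probe) on top. This is a minimal illustration rather than the exact setup used in later chapters; the model name, the toy sentences, and the labels are placeholders, and it assumes the HuggingFace transformers and scikit-learn libraries are available.

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

# Placeholder pre-trained encoder; any pre-trained masked language model would do here.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.eval()

sentences = ["book a table for two", "what is the weather today"]  # toy inputs
labels = [0, 1]                                                     # toy intent labels

with torch.no_grad():
    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state   # (batch, seq_len, dim)
    features = hidden.mean(dim=1).numpy()         # mean-pool into one vector per sentence

# The probe is deliberately simple: if a linear model can recover the label from
# the frozen representation, the representation encodes that information.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print(probe.score(features, labels))

If such a probe performs no better than chance while the end-to-end model performs well on the task, the representation arguably does not encode the probed information; this is the kind of mismatch Chapter 3 examines for dialogue models.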
While applying powerful transformer networks to NLP tasks has shown success, verifying whether their predictions are indeed an entailment of the information in the text is necessary. For example, consider a sentence that conveys some specific instruction like, "Give the blue ball to the child playing with the green toy". The syntax, or the grammar rules, of the English language allows comprehending the instruction to appropriately extract the information (action and entities). If the words were ordered in a different way, the syntactic rules could not be effectively applied to decode the meaning. Psycholinguistic research has observed that humans find it easier to identify or recall words presented in canonical orders than in disordered, ungrammatical sentences; this phenomenon is called the "sentence superiority effect" (Cattell, 1886; Scheerer, 1981; Toyota, 2001; Baddeley et al., 2009; Snell and Grainger, 2017, 2019; Wen et al., 2019, i.a.).
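As a toy illustration of the word-order perturbations examined in later chapters, the short Python snippet below shuffles the instruction sentence above and reports how many adjacent word pairs survive the shuffle. The bigram count is only a crude stand-in for the order-sensitivity metrics developed in this thesis, and the fixed random seed is an arbitrary choice for reproducibility.

import random

sentence = "Give the blue ball to the child playing with the green toy"
words = sentence.split()

random.seed(0)
shuffled = words[:]
random.shuffle(shuffled)

# A crude "local order" measure: the fraction of adjacent word pairs from the
# original sentence that remain adjacent, in the same order, after shuffling.
original_bigrams = set(zip(words, words[1:]))
shuffled_bigrams = set(zip(shuffled, shuffled[1:]))
preserved = len(original_bigrams & shuffled_bigrams) / len(original_bigrams)

print(" ".join(shuffled))
print(f"adjacent word pairs preserved: {preserved:.2f}")

A reader can no longer reliably recover who should receive which object from the shuffled string, yet, as the following chapters show, many neural models barely change their predictions under such perturbations.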
The role of syntax in NLU. Generally, knowing the syntax of a sentence is taken to be a prerequisite for understanding what that sentence means (Heim and Kratzer, 1998). Models, then, should have to know the syntax first if they are to perform any particular NLU task that genuinely requires a humanlike understanding of meaning (cf. Bender and Koller, 2020). The null hypothesis for what a model should do when it encounters a sentence that probably does not make sense depends very much on the task. For a text classification task, where the objective is to predict a label from a finite set of categories, the lack of understanding could be correlated with high perplexity. On the other hand, in a generative setting (predicting an utterance or generating a translation) it becomes less straightforward. In such scenarios, arguments can be constructed in support of the model staying robust, being faithful to the noisy input, or, if possible (in the case of dialogue prediction), asking for clarification.
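As a brief aside on the perplexity mentioned above: perplexity is the exponentiated average negative log-likelihood that a language model assigns to the tokens of a sequence, so text the model finds less predictable receives a higher value. The per-token probabilities in the following sketch are made up purely for illustration.

import math

# Hypothetical probabilities a language model assigns to each token of a sentence.
token_probs = [0.20, 0.05, 0.30, 0.10, 0.15]

avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_neg_log_likelihood)
print(f"perplexity = {perplexity:.2f}")

Under this view, a nonsensical or heavily perturbed input would be expected to receive a high perplexity, which is why perplexity is a natural signal of a lack of understanding in the classification setting described above.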

out-of-domain data (Luong and Manning, 2015), or when trained with noisy input data containing small orthographic (Sakaguchi et al., 2017; Belinkov and Bisk,
