A Retrofitting Model for Incorporating Semantic Relations into Word Embeddings

Sapan Shah (1,2), Sreedhar Reddy (1), and Pushpak Bhattacharyya (2)

(1) TCS Research, Tata Consultancy Services, Pune
(2) Indian Institute of Technology Bombay, Mumbai

{sapan.hs, sreedhar.reddy}@tcs.com, pb@cse.iitb.ac.in

Proceedings of the 28th International Conference on Computational Linguistics, pages 1292-1298, Barcelona, Spain (Online), December 8-13, 2020.

This work is licensed under a Creative Commons Attribution 4.0 International Licence. Licence details: http://creativecommons.org/licenses/by/4.0/.

Abstract

We present a novel retrofitting model that can leverage relational knowledge available in a knowledge resource to improve word embeddings. The knowledge is captured in terms of relation inequality constraints that compare the similarity of related and unrelated entities in the context of an anchor entity. These constraints are used as training data to learn a non-linear transformation function that maps original word vectors to a vector space respecting these constraints. The transformation function is learned in a similarity metric learning setting using a Triplet network architecture. We applied our model to the synonymy, antonymy and hypernymy relations in WordNet and observed large gains in performance over original distributional models as well as other retrofitting approaches on the word similarity task, and a significant overall improvement on the lexical entailment detection task.

1 Introduction

Word embedding models (Pennington et al., 2014; Mikolov et al., 2013) are primarily inspired by the distributional hypothesis (Harris, 1954), viz. words that appear in similar contexts tend to have similar meaning. However, these models have one major drawback: they mix semantic similarity with other types of semantic relatedness (Hill et al., 2015). Consider, for example, cheap and expensive. Though

completely opposite in meaning, these words tend to occur in nearly identical contexts and end up having

similar distributional vectors. This is problematic for many end applications such as sentiment analysis,

text simplification, and so on. To address this issue, researchers have proposed various models to combine

information from external knowledge sources such as WordNet, Freebase, etc. into unsupervised learning

of word embeddings. These models mainly focus on the constraints extracted from various types of relations such as synonymy, antonymy, hypernymy, etc. At a high level, these models are categorized

into: Joint specialization models (Yu and Dredze, 2014; Liu et al., 2015; Xu et al., 2014); and Retrofitting

models (Faruqui et al., 2015; Wieting et al., 2015; Glavaš and Vulić, 2018; Kamath et al., 2019). Joint

specialization models typically modify the optimization objective of distributional models by integrating

the constraints into the objective function, whereas retrofitting models update the word vectors of

distributional models in a post-processing training step using data generated from the constraints. Current

retrofitting models have one limitation. They use constraints that tend to push cosine similarity to extremes

(+1 or -1). While this works well for relations such as synonymy and antonymy, it does not work so well

for relations such as hypernymy, holonymy, etc. We need an approach that works for all relations while

striking the right balance with distributional semantics. In this work, we present a new method to obtain constraints from all types of relations present in

a knowledge resource. The constraints are in the form of relation inequalities. The central idea is: if

an entity ent_a is related to entity ent_b with relation type rel and is not related to entity ent_c by the same relation, then ent_a is semantically closer to ent_b than ent_c in the context of the relation rel. The


corresponding inequality can then be stated as: sim_rel(ent_a, ent_b) > sim_rel(ent_a, ent_c). Using such

relation inequality constraints has the following advantages: 1) They are not just limited to synonymy and

antonymy relations but can also be generated from other lexical relations such as hypernymy, holonymy,

and so on. 2) They can also be generated from any relation type including relations from non-lexical knowledge graphs such as FreeBase.
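To make the form of these constraints concrete, the following minimal sketch (class and field names are our own illustration, not from the paper) represents one relation inequality as a data record:

```python
# A minimal sketch (class and field names are our own, not from the paper) of a
# relation inequality constraint sim_rel(ent_a, ent_b) > sim_rel(ent_a, ent_c).
from dataclasses import dataclass

@dataclass(frozen=True)
class RelationInequality:
    anchor: str      # ent_a
    positive: str    # ent_b, related to the anchor by `relation`
    negative: str    # ent_c, not related to the anchor by `relation`
    relation: str    # e.g. "syn", "subtype", or a relation from a knowledge graph
    margin: float    # minimum separation required between the two similarities

# e.g. RelationInequality("bright", "clever", "stupid", "syn", 0.6)
```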

We use the generated inequality constraints as training data to learn a non-linear transformation function

that maps original word vectors to a vector space respecting these constraints. The transformation

function is learned in a similarity metric learning setting using a Triplet network architecture (Wang et al.,

2014; Schroff et al., 2015). We applied our model to synonymy, antonymy and hypernymy relations in

WordNet and observed large gains in performance over retrofitting benchmarks on the word similarity task

and a significant overall improvement on the lexical entailment (LE) detection task. The main contributions of

this work are: (1) A new method to obtain constraints from all relation types in a retrofitting setting with

its demonstration on WordNet relations; (2) Using Triplet network based similarity metric learning for a

softer, more balanced integration of constraints; (3) A detailed experiment on word similarity as well as

LE detection tasks to show the effectiveness of the proposed approach.

2 Constraints from WordNet Relations

This section presents a set of rules to obtain relation inequality constraints from synonymy, antonymy

and hypernymy relations in WordNet (Miller, 1995). Each rule involves a triplet of entities (v_a, v_p, v_n), in which we refer to v_a, v_p and v_n as the anchor, positive and negative entities respectively. A constraint is generated such that the anchor is closer to the positive than the negative entity, with a margin to indicate minimum separation. The margin can be set in the range [0, 2], corresponding to minimum versus maximum separation on cosine distance.

Similarity Relationship Constraints:

We represent the set of unique words appearing in all synsets as

nodes of a graph G. We then add a labeled edge syn between two nodes if the corresponding pair of words

belongs to the same synset. The inequality constraints for r = syn are then obtained using

    ∀ v_a, v_p, v_n : (v_a, r, v_p) ∈ G, (v_a, r, v_n) ∉ G ⟹ sim(v_a, v_p) > sim(v_a, v_n) + margin_r        (1)

i.e. a pair of entities associated by a specific relation r are semantically closer than a pair of entities that are not in that relation. Specifically for r = syn, we sample negative words v_n using the antonymy relation. For instance, consider a triplet (bright, clever, stupid). With respect to the anchor word bright, clever is closer than stupid, as the former is a synonym whereas the latter is an antonym of bright. This generates the following constraint: sim(bright, clever) > sim(bright, stupid) + margin_syn.
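As an illustration of how such similarity relationship triplets can be extracted, here is a minimal sketch using NLTK's WordNet interface (our own simplification, not the authors' code; deduplication and word filtering are omitted):

```python
# A minimal sketch (our own simplification, not the authors' code) of Rule 1:
# generate similarity relationship triplets (anchor, positive, negative, margin)
# from WordNet, where positives share a synset with the anchor and negatives
# are sampled from the antonymy relation.
from nltk.corpus import wordnet as wn

MARGIN_SYN = 0.6  # value used for similarity constraints in Section 4

def similarity_triplets():
    triplets = []
    for synset in wn.all_synsets():
        lemmas = synset.lemmas()
        names = [lemma.name() for lemma in lemmas]
        for anchor in lemmas:
            negatives = [ant.name() for ant in anchor.antonyms()]
            if not negatives:
                continue  # no antonym available to act as the unrelated entity
            positives = [n for n in names if n != anchor.name()]
            for pos in positives:
                for neg in negatives:
                    # encodes sim(anchor, pos) > sim(anchor, neg) + margin_syn
                    triplets.append((anchor.name(), pos, neg, MARGIN_SYN))
    return triplets

# yields triplets of the form (anchor, synonym, antonym, margin),
# in the spirit of the paper's (bright, clever, stupid) example
```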

Type Hierarchy Constraints:

In addition to the word nodes, we also create type nodes to represent

synsets in G and add edges between them using the hypernymy relation to capture the type hierarchy (edge label: subtype). Moreover, we also add an edge between a word and the type corresponding to its synset (edge label: type). We then apply the following rule to generate type hierarchy constraints.

    ∀ v_a, v_p, v_n : (v_a, type, t_1), (v_p, type, t_2), (v_n, type, t_3), (t_1, subtype, t_2), (t_2, subtype, t_3) ∈ G ⟹ sim(v_a, v_p) > sim(v_a, v_n) + margin_hier        (2)

i.e. a pair of entities closer in the type hierarchy are semantically closer compared to a pair farther apart in the type hierarchy. For instance, consider a triplet (coach, railcar, vehicle). With respect to the anchor word coach, railcar is closer than vehicle, as the former is a direct hypernym of coach whereas the latter is an indirect hypernym through railcar. This generates the following constraint: sim(coach, railcar) > sim(coach, vehicle) + margin_hier.
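Analogously, type hierarchy triplets can be read off hypernym chains. The sketch below (again our own simplification over NLTK's WordNet API, ignoring the explicit type-node construction described above) samples the anchor from a synset, the positive from its direct hypernym, and the negative from the hypernym's hypernym:

```python
# A minimal sketch (our own simplification, not the authors' code) of Rule 2:
# generate type hierarchy triplets from WordNet hypernym chains. Words of a
# synset t1 act as anchors, words of its direct hypernym t2 as positives, and
# words of the indirect hypernym t3 (hypernym of t2) as negatives.
from nltk.corpus import wordnet as wn

MARGIN_HIER = 0.2  # value used for type hierarchy constraints in Section 4

def type_hierarchy_triplets():
    triplets = []
    for t1 in wn.all_synsets(pos=wn.NOUN):
        for t2 in t1.hypernyms():          # t1 subtype t2
            for t3 in t2.hypernyms():      # t2 subtype t3
                for anchor in t1.lemma_names():
                    for pos in t2.lemma_names():
                        for neg in t3.lemma_names():
                            # encodes sim(anchor, pos) > sim(anchor, neg) + margin_hier
                            triplets.append((anchor, pos, neg, MARGIN_HIER))
    return triplets

# yields triplets in the spirit of the paper's (coach, railcar, vehicle) example
```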

It should be noted that rules 1 and 2 above can be used to obtain constraints from any knowledge graph.

Rule 1 in its general form encodes any relation r where a triplet (v_a, v_p, v_n) is sampled such that v_a and v_p are in relationship r while v_a and v_n are not. Similarly, rule 2 can obtain type hierarchy constraints.

[Figure 1: Retrofitting model for learning a non-linear transformation function. A training instance such as (bright, clever, stupid, margin) is mapped through a pre-trained embedding lookup to (W_a, W_p, W_n); three weight-sharing multi-layer perceptrons apply the transformation T(x_i) = x_i^t; a distance layer computes dist_ap = cosine-distance(T(W_a), T(W_p)) and dist_an = cosine-distance(T(W_a), T(W_n)); and a hinge loss max(0, margin + dist_ap − dist_an) is applied.]

3 Transformation Function

We use the generated inequality constraints as training data (D) to learn a transformation function that

maps pre-trained word embeddings to a vector space that respects these constraints. This function is

learned using a Triplet network architecture in a similarity metric learning setting. The Triplet architecture

(Hoffer and Ailon, 2015; Wang et al., 2014) provides a way to learn a transformation from the input space to a representation space such that distances in the representation space approximate semantic distances in the input space. Figure 1 shows the architecture and an example training instance used to learn the transformation function. Let X ∈ R^(n×d) represent the pre-trained embeddings for a vocabulary of size n. For a training instance (w_a, w_p, w_n, margin) ∈ D, we first obtain the corresponding pre-trained embeddings from X, i.e. (x_wa, x_wp, x_wn). These embeddings are then passed as input to a transformation function T(x_i) = x_i^t, a multi-layer feed-forward neural network with weights W_T. Our model contains three identical copies of

this network with shared parameters. These copies are then joined using a distance layer that computes

two cosine distances, viz. the distance of the anchor word w_a from the positive word w_p and its distance from the negative word w_n, i.e.

    dist_ap = cosine-distance(T(x_wa), T(x_wp))
    dist_an = cosine-distance(T(x_wa), T(x_wn))

These distances in the transformed vector space are then fed to a margin-based hinge loss function. To reduce overfitting, we apply L2 regularization on the weights W_T of the network. The loss function L_hinge used by our model is then

    L_hinge = Σ_{(w_a, w_p, w_n, margin) ∈ D} max(0, margin + dist_ap − dist_an) + λ_w ‖W_T‖²

Similar to (Mrkšić et al., 2016; Glavaš and Vulić, 2018), we also include a regularization term L_vsr that penalizes vector space transformations that drastically change the topology of the input vector space:

    L_vsr = Σ_{(w_a, w_p, w_n, margin) ∈ D} [ cosine-distance(x_wa, T(x_wa)) + cosine-distance(x_wp, T(x_wp)) + cosine-distance(x_wn, T(x_wn)) ]

The final loss function used by our model is then L = L_hinge + λ_vsr · L_vsr, where λ_vsr is a hyper-parameter that determines how strictly the topology of the original vector space is preserved. Once the

network is trained, the learned transformation function T(x_i) is applied to all the words in X to map the pre-trained word embeddings to a new transformed vector space X^t ∈ R^(n×d).
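The following PyTorch sketch illustrates this training setup. It is our own reconstruction from the description above, so the layer sizes, activation, and optimizer settings are assumptions rather than the authors' exact configuration:

```python
# A minimal PyTorch sketch (an assumed implementation, not the authors' released
# code) of the retrofitting model in Figure 1: a shared MLP transformation T,
# cosine distances for (anchor, positive, negative), the margin-based hinge
# loss, and the L_vsr topology regularizer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TransformationNet(nn.Module):
    """Non-linear transformation T applied to pre-trained embeddings."""
    def __init__(self, dim, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))

    def forward(self, x):
        return self.mlp(x)

def cosine_distance(u, v):
    return 1.0 - F.cosine_similarity(u, v, dim=-1)

def triplet_loss(T, x_a, x_p, x_n, margin, lambda_vsr=1.0):
    t_a, t_p, t_n = T(x_a), T(x_p), T(x_n)
    dist_ap = cosine_distance(t_a, t_p)
    dist_an = cosine_distance(t_a, t_n)
    # hinge loss: the anchor must be closer to the positive than to the
    # negative by at least `margin`
    l_hinge = torch.clamp(margin + dist_ap - dist_an, min=0.0).sum()
    # L_vsr: penalize transformations that move vectors far from their originals
    l_vsr = (cosine_distance(x_a, t_a) + cosine_distance(x_p, t_p)
             + cosine_distance(x_n, t_n)).sum()
    return l_hinge + lambda_vsr * l_vsr

# Training loop sketch: X is the (n x d) pre-trained embedding matrix and D
# yields index triplets with margins; weight_decay supplies the L2 term on W_T.
# T = TransformationNet(dim=300)
# opt = torch.optim.Adam(T.parameters(), lr=1e-3, weight_decay=1e-5)
# for a_idx, p_idx, n_idx, margin in D:
#     loss = triplet_loss(T, X[a_idx], X[p_idx], X[n_idx], margin)
#     opt.zero_grad(); loss.backward(); opt.step()
# X_t = T(X)  # retrofitted embeddings for the whole vocabulary
```

Here the optimizer's weight decay stands in for the λ_w‖W_T‖² term, while the remaining terms mirror L_hinge and L_vsr above.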

4 Experimental Setup

To evaluate our retrofitting approach, we experimented with three pre-trained word embeddings that are

learned using different distributional models: (1) GloVe (Pennington et al., 2014): trained on Common

Crawl data; (2) Word2Vec (Mikolov et al., 2013): trained on the Wikipedia dump available from the Polyglot project (Al-Rfou et al., 2013); (3) FastText (Bojanowski et al., 2017): trained on Wikipedia 2017. As explained

in section 2, we use WordNet to obtain two types of constraints: (1) Similarity relationship constraints: a

total of 425,732 constraints from synonymy and antonymy relations; (2) Type hierarchy constraints: a total of 100,100 constraints from hypernymy relation. The margin parameter is set to0:6and0:2for the similarity relationship constraints and the type hierarchy constraints respectively1.

We compare our model (referred to as TripletNet hereafter) with three state-of-the-art retrofitting models: (1) Counterfit (Mrkšić et al., 2016): it defines the loss function as a weighted sum of terms that bring synonymous words closer and push antonymous words apart; however, it retrofits only those words that are present in the constraints. (2) ExplRetrofit (Glavaš and Vulić, 2018): it retrofits vectors of all words in the vocabulary by learning a global specialization function using synonym and antonym constraints. (3) AuxGAN (Ponti et al., 2018): it learns the global specialization function using a generative adversarial network architecture. We also compare our model with the joint specialization approach of Liu et al. (2015), which updates the word2vec optimization objective (referred to as SWE).

                       SimLex-999                        SimVerb-3500
                       GloVe    FastText  Word2Vec       GloVe    FastText  Word2Vec

Lexical Overlap
  PreTrained           0.3738   0.4409    0.3625         0.2264   0.3558    0.2531
  Counterfit           0.6038   0.5949    0.5869         0.4468   0.4725    0.4505
  ExplRetrofit         0.6252   0.5331    0.5364         0.5362   0.4182    0.5290
  AuxGAN               0.6317   0.3618    0.5672         0.4875   0.2980    0.4589
  SWE                  -        -         0.5017         -        -         0.4001
  TripletNet-Sim       0.6014   0.5149    0.5316         0.5055   0.4348    0.4615
  TripletNet-Type      0.4288   0.4337    0.3687         0.3257   0.3494    0.2687
  TripletNet           0.6139   0.5349    0.5386         0.5525   0.4404    0.4953

Lexical Disjoint
  PreTrained           0.3738   0.4409    0.3625         0.2264   0.3558    0.2531
  Counterfit           0.3702   0.4381    0.3631         0.2257   0.3578    0.2561
  ExplRetrofit         0.5265   0.526     0.3905         0.3553   0.4042    0.2634
  AuxGAN               0.5630   0.3339    0.4704         0.4194   0.2541    0.3540
  SWE                  -        -         0.4612         -        -         0.3620
  TripletNet-Sim       0.5725   0.5124    0.4986         0.5105   0.4199    0.4336
  TripletNet-Type      0.4301   0.4319    0.3674         0.3081   0.3402    0.2650
  TripletNet           0.5742   0.5314    0.5026         0.5025   0.4311    0.4541

Table 1: Spearman's correlation (ρ) scores of our model and other benchmarks for three distributional embeddings on two word similarity datasets: SimLex-999 and SimVerb-3500.

4.1 Word Similarity Task

We evaluate our approach on two word similarity datasets: SimLex-999 (Hill et al., 2015) and SimVerb-

3500 (Gerz et al., 2016) using Spearman's rank correlation (ρ). Similar to Glavaš and Vulić (2018), we use two evaluation settings. Lexical disjoint: to effectively evaluate the generalization capability of retrofitting approaches, all words appearing in the evaluation datasets are removed from the training set. Lexical overlap: in this setting, all words appearing in the evaluation datasets are retained in the training set.

Table 1 shows the results of our experiments. In both the evaluation settings, our model performs

substantially better than the baseline pre-trained embeddings and the joint specialization model (SWE). In

lexical disjoint setting, the Counterfit model does not improve beyond baseline as all the words present in

the evaluation set are excluded from training. The ExplRetrofit and AuxGAN models perform better than

Counterfit as they retrofit vectors of all words using a global specialization function. Our model performs

even better as it does not put hard constraints on cosine similarity values. Instead, these values are learned such that the anchor words are relatively closer to positive words than negative words, thereby striking

the right balance between distributional and relational semantics. In the lexical overlap setting, retrofitting

models perform significantly better than the baseline. Since the Counterfit model directly updates input word vectors, pushing the cosine similarity of synonyms to 1 and that of antonyms to -1, and many of the words in the evaluation set are already included in the training set, it seems to perform better overall.

We also performed an ablation test on the types of constraints. The embeddings retrofitted using only the

similarity relationship constraints (TripletNet-Sim) perform significantly better than other benchmarks.

However, the embeddings retrofitted only using the type hierarchy constraints (TripletNet-Type) bring

only marginal improvement over the baseline. This is along expected lines, as similarity constraints are more important for the word similarity task. Combining both constraints (TripletNet) brings further improvement, suggesting positive interaction between the embeddings of words across both types of constraints.

4.2 Lexical Entailment (LE) Detection Task

LE detection is a classification task to identify if a given pair of words is in a lexical entailment relation

such as hypernymy, causality, and so on. We evaluate our approach on four datasets: Baroni (Baroni et al., 2012), WBLESS (Weeds et al., 2014), Kotlerman (Kotlerman et al., 2010) and Turney (Turney and Mohammad, 2014). The dataset splits provided by Levy et al. (2015) are used for the experiments on Baroni, Kotlerman and Turney, whereas for WBLESS we randomly split the data into train (70%) and test (30%) sets. Given a pair of words (x, y) as input, we first represent it as the concatenation of their word embeddings (i.e. x ⊕ y) and then learn a logistic regression classifier. In addition to the models explained earlier, we also compare our model with LEAR (Vulić and Mrkšić, 2018), which uses vector norms to define an asymmetric distance metric in order to leverage LE relations during training.
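A minimal sketch of this classification setup (our own illustration; the data-loading names are hypothetical and the exact classifier settings used in the paper are not specified):

```python
# A minimal sketch (illustrative, not the paper's exact pipeline) of the LE
# detection setup: each word pair (x, y) is represented as the concatenation of
# its retrofitted embeddings and fed to a logistic regression classifier.
# `embeddings` maps words to vectors; `train_pairs`/`test_pairs` are
# hypothetical lists of (word_x, word_y, label) tuples from a dataset split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def featurize(pairs, embeddings):
    X = np.stack([np.concatenate([embeddings[x], embeddings[y]])
                  for x, y, _ in pairs])
    y = np.array([label for _, _, label in pairs])
    return X, y

def evaluate_le(train_pairs, test_pairs, embeddings):
    X_tr, y_tr = featurize(train_pairs, embeddings)
    X_te, y_te = featurize(test_pairs, embeddings)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```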

             PreTrained  Counterfit  ExplRetrofit  AuxGAN  LEAR    TripletNet  LEAR-M
Baroni       0.7313      0.7282      0.7519        0.7649  0.7687  0.8078      0.903
WBLESS       0.9349      0.9058      0.9346        0.9442  0.9385  0.9269      0.8885
Kotlerman    0.7021      0.7264      0.7262        0.7472  0.7118  0.7407      0.599
Turney       0.6982      0.6888      0.6923        0.714   0.6371  0.7357      0.6765

Table 2: Accuracy scores of our model and other benchmarks on LE detection datasets (GloVe embeddings).

Table 2 reports accuracy for various models on the LE detection task. Overall, approaches that learn a specialization function perform better than other approaches. Our model performs even better on Baroni and Turney, while being comparable to AuxGAN on WBLESS and Kotlerman. We also learnt a model (LEAR-M) based on the asymmetric distance metric to identify LE pairs. This model performed significantly better on Baroni; however, it did not perform well on the other datasets.

5 Conclusions and Future work

We present a novel retrofitting model that first generates inequality constraints from relational knowledge

present in a knowledge resource. These constraints are then used as training data to learn a non-linear

transformation function in a similarity metric learning setting using a Triplet network architecture. We applied our model to the synonymy, antonymy, and hypernymy relations in WordNet and observed large gains in performance over the original pretrained embeddings as well as other retrofitting benchmarks on the word similarity task, and a significant overall improvement on the LE detection task.

We are currently evaluating our model on extrinsic tasks such as sentiment analysis, NER, etc. We also

plan to incorporate relational knowledge present in non-lexical knowledge resources such as Freebase to

further improve word embeddings.

References

Rami Al-Rfou, Bryan Perozzi, and Steven Skiena. 2013. Polyglot: Distributed word representations for multilingual NLP. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 183-192, Sofia, Bulgaria, August. Association for Computational Linguistics.

Marco Baroni, Raffaella Bernardi, Ngoc-Quynh Do, and Chung-chieh Shan. 2012. Entailment above the word level in distributional semantics. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 23-32, Avignon, France, April. Association for Computational Linguistics.

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135-146.

Manaal Faruqui, Jesse Dodge, Sujay Kumar Jauhar, Chris Dyer, Eduard Hovy, and Noah A. Smith. 2015. Retrofitting word vectors to semantic lexicons. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1606-1615, Denver, Colorado, May-June. Association for Computational Linguistics.

Daniela Gerz, Ivan Vulić, Felix Hill, Roi Reichart, and Anna Korhonen. 2016. SimVerb-3500: A large-scale evaluation set of verb similarity. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2173-2182, Austin, Texas, November. Association for Computational Linguistics.

Goran Glavaš and Ivan Vulić. 2018. Explicit retrofitting of distributional word vectors. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 34-45, Melbourne, Australia, July. Association for Computational Linguistics.

Zellig Harris. 1954. Distributional structure. Word, 10(23):146-162.

Felix Hill, Roi Reichart, and Anna Korhonen. 2015. SimLex-999: Evaluating semantic models with (genuine) similarity estimation. Computational Linguistics, 41(4):665-695, December.

Elad Hoffer and Nir Ailon. 2015. Deep metric learning using triplet network. In ICLR.

Aishwarya Kamath, Jonas Pfeiffer, Edoardo Maria Ponti, Goran Glavaš, and Ivan Vulić. 2019. Specializing distributional vectors of all words for lexical entailment. In Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019), pages 72-83, Florence, Italy, August. Association for Computational Linguistics.

Lili Kotlerman, Ido Dagan, Idan Szpektor, and Maayan Zhitomirsky-Geffet. 2010. Directional distributional similarity for lexical inference. Natural Language Engineering, 16(4):359-389, October.