
Prefix Embeddings for In-context Machine Translation

Suzanna Sia  ssia1@jhu.edu
Kevin Duh  kevinduh@cs.jhu.edu
Johns Hopkins University

Abstract

The class of large generative pretrained (GPT) language models has demonstrated the ability to translate with in-context examples, a phenomenon known as few-shot prompting. However, these models have not achieved state-of-the-art results for translating out of English. In this work, we investigate an extremely lightweight fixed-parameter method for conditioning a large language model to better translate into the target language. Our method introduces additional embeddings, referred to as prefix embeddings, which do not interfere with the existing weights of the model. Using unsupervised and weakly supervised methods that train only 0.0001% of the model parameters, the simple method improves up to around 5 BLEU points over the baseline when a single prompt example is provided, and up to around 2 BLEU points when 20 prompt examples are provided, across 3 domains and 3 languages. We analyze the resulting embeddings' training dynamics and where they lie in the embedding space, and show that these conditional prefixes can be used for both in-context translation and diverse generation of monolingual target sentences.

1 Introduction

Under the paradigm of in-context learning,¹ large language models have been shown to generate translations when provided with several priming examples, each of which consists of a source sentence and the translated target sentence. These examples, also known as "prompts", are prefixed to the test source sentence, which then conditions the model to generate the test target sentence. Table 1 shows an example of this format, where [S1] and [S2] are separator tokens prefixing the source and target sentence respectively.

This prompt-and-translate phenomenon, or in-context translation, presents itself as a new paradigm for Machine Translation applications. First, the ability to adapt to different task specifications using prompts suggests that the same model can be used in multiple settings and domains. While there have been several multilingual translation models (Fan et al., 2021; Xue et al., 2021; Ma et al., 2021), the ability to perform unrelated tasks such as Question-Answering in addition to Translation is relatively new. This also presents an interesting shift from supervised Neural Machine Translation (NMT) in terms of data requirements. These models are trained on massive amounts of web text which are not explicitly parallel.² In contrast, modern NMT models are trained with millions of lines of parallel text. Unsurprisingly, the lack of supervision comes at a cost. Translating out of English with in-context models still lags behind the state of the art, possibly due to low data quality and/or disproportionate amounts of English.

1 This has also been termed "few-shot prompting" (Brown et al., 2020), but the field is increasingly converging on "in-context learning" (Bommasani et al., 2021).
2 This does not preclude the possibility that parallel sentences may exist in various forms in the crawled web text.

[S1] So at this point, music diverged [S2] Donc à partir de là, la musique a divergé.
[S1] The actual rigging on the reins on the horse are made from the same sort of thing. [S2] Les attaches sur les rennes du cheval sont faites du même genre de choses.
[...]
[S1] And that was done with a particle. [S2]

Table 1: A single continuous input sequence presented to the model for decoding a single test source sentence "And that was done with a particle". Given the entire sequence as input, the model proceeds to generate the target sequence after the final [S2]. [...] refers to several more [S1] en [S2] fr pairs.

In this work, we propose the training of target language prefix embeddings to improve in-context translation. Targeting specific languages has been explored in NMT models (Yang et al., 2021), but much less so for the in-context setting. In contrast to fine-tuning, we do not change existing model weights. This falls into the class of "fixed-parameter" methods, where the original parameters of the model are held fixed and additional parameters are introduced which influence the activation states of the model. Our proposed method differs from the various approaches to "prefix tuning" (Li and Liang, 2021; Qin and Eisner, 2021; Asai et al., 2022; Lester et al., 2021) in that these all require explicit task supervision. Learning the weights of these prefix embeddings is technically straightforward using gradient descent optimisation machinery. We show that these embeddings can be trained unsupervised (subsection 3.2)³ and also explore the use of a very small set of bitext sentences for weakly supervised training (subsection 3.3). Experiments were conducted across 3 en-fr domains (subsection 4.1) and from English into three languages, French (fr), Portuguese (pt), and German (de) (subsection 4.2). Overall, for a very small amount of engineering, data collection, and storage effort, training prefix embeddings can give up to 5 BLEU points in the 1-prompt setting, and up to around 2 BLEU points in the 20-prompt setting with a very small amount of bitext (we used 100 parallel sentences).

2 Related Work

Large language models which perform in-context translation  Following GPT-3 (Brown et al., 2020), which first reported the in-context translation phenomenon, subsequent autoregressive Transformer decoder-only architectures such as XGLM (Lin et al., 2021) and mGPT (Shliazhko et al., 2022) have explicitly trained in-context models to be multilingual. However, decoding out of English still performs more poorly than decoding into English. Hence we focus on the first scenario of decoding out of English.

Prefix Tuning  Unlike previous work which directly prefixes the task by prepending to the input (Li and Liang, 2021; Qin and Eisner, 2021; Asai et al., 2022; Lester et al., 2021), we substitute the trained prefixes for the delimiters throughout the prompts before the target language sentences. This small but significant difference allows monolingual training for the target language without explicit translation task supervision.

Embedding Tuning vs Prefix Tuning across all layers  We adopt the embedding-level tuning approach, which was shown to be competitive with model tuning with an increasing number of parameters on SuperGLUE tasks (Lester et al., 2021). The focus on training prefix embeddings, instead of training additional parameters to directly influence activations across all layers, is a design choice primarily to accommodate very large models. Li and Liang (2021) report using 250K-500K of parameter training vs. a 345M RoBERTa model (Liu et al., 2019), which is 4-7% of the parameter space. If we had applied the same parameter ratio to our current model of 2.7B parameters, this would be equivalent to having to train 195M parameters, which is in the same order of magnitude as the RoBERTa model. We do acknowledge that embedding tuning is less expressive by virtue of having fewer entry points to influence the model's activations, and leave a middle-ground solution, such as combining with adaptor layers (Houlsby et al., 2019), to future work.

Language ID token training  is a typical method in multilingual models to condition the model for the source and target language. However, these tokens are typically trained together with the rest of the model parameters, and this is a design choice that needs to be made upfront. In contrast, we use a generic large language model that was pretrained with minimal design choices, and then post-hoc train a language-specific prefix to condition the model to generate sentences in the target language, with the goal of improving in-context translation.

3 Unsupervised in the terminology of Machine Translation means without parallel bitext sentences.

3 Methods

Our approach is motivated by the knowledge that for very large language models trained on web corpora, there is a weaker target language (the one being translated into) because English is the dominant language on the web. This trend persists even for explicitly multilingual language models (Lin et al., 2021). Our method therefore aims to condition the language model to decode the weaker target language by learning a language-specific prefix. We first describe the in-context translation setup at test time (subsection 3.1), followed by unsupervised training (subsection 3.2) and weakly supervised training (subsection 3.3) of the target language prefix embedding. At inference time, the corresponding prefix is used as the separator token between source and target language. Figure 1 illustrates this process.

3.1 In-context Translation

Let (x, y) ∈ D_b be a set of translation pairs that the model has access to at inference time, where x refers to the source sentence and y refers to the target sentence. Given the separator tokens [Sx], [Sy] and the test source sentence x_test, we can define a prompt layout format u(x_test; D_b, [Sx], [Sy]) (Table 2), where [...] refers to several similarly formatted x, y examples from D_b. The default in-context learning model autoregressively generates the target sequence by greedily decoding ŷ = argmax_y p(y | u(x_test; D_b, [Sx], [Sy])). Our goal is to learn a target-specific prefix [S*] that achieves higher p(y | u(x_test; D_b, [Sx], [S*])) for the correct sequence y. We use "*" to indicate that the prefix can be of any length.⁴

[Sx] x_1 [Sy] y_1
[...]
[Sx] x_test [Sy] _

Table 2: The prompt layout format from u(x_test; D_b, [Sx], [Sy]).
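To make the prompt layout and greedy decoding concrete, the following is a minimal sketch using the Hugging Face GPT Neo 2.7B checkpoint (the model used in section 4). The surface forms of the separators, the newline-based truncation of the hypothesis, and the generation settings are assumptions of this sketch rather than the paper's exact configuration.

```python
# Minimal sketch of the in-context translation setup in subsection 3.1.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-2.7B"   # model used in section 4
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

SX, SY = "[Sx]", "[Sy]"   # separator tokens (illustrative surface forms)

def build_prompt(prompt_pairs, x_test, sx=SX, sy=SY):
    """u(x_test; D_b, [Sx], [Sy]): priming pairs followed by the test source."""
    lines = [f"{sx} {x} {sy} {y}" for x, y in prompt_pairs]
    lines.append(f"{sx} {x_test} {sy}")
    return "\n".join(lines)

def translate_greedy(prompt_pairs, x_test, max_new_tokens=64):
    prompt = build_prompt(prompt_pairs, x_test)
    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding: y_hat = argmax_y p(y | u(x_test; D_b, [Sx], [Sy]))
    out = model.generate(**inputs, do_sample=False, max_new_tokens=max_new_tokens)
    continuation = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                                    skip_special_tokens=True)
    # Heuristic: keep only the first line of the continuation as the hypothesis.
    return continuation.strip().split("\n")[0]

pairs = [("So at this point, music diverged",
          "Donc à partir de là, la musique a divergé.")]
print(translate_greedy(pairs, "And that was done with a particle."))
```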

3.2 Unsupervised Training (monolingual)

The primary strategy is simple: train [S*] such that it conditions the model to generate sequences y from the target language. We expand the tokenizer and the corresponding embedding matrix by the number of prefix tokens, and then prepend the special token [S*] to monolingual sentences during training. A single training sequence is given by "[S*] y", where y is typically a sentence or paragraph. Given m sequences from a target language training set y_1, ..., y_m ∈ D_y, we train the embedding parameters θ = Embed([S*]), where [S*] indexes the additional rows in the embedding matrix. We use cross-entropy loss, as is standard with language modeling, and freeze the parameters of the entire network except for θ.

4 In practice, we use special tokens such as [0], [1], ..., [n] for a prefix of length n and verify that these do not have a collision in the tokenizer namespace.

[Figure 1 image not recoverable from the PDF text; panel labels: "Inference", "In-context Translate".]
Figure 1: Prompt format for training and inference time. Training loss is computed on the sequences in blue. The token [S*] corresponds to additional row(s) of the embedding matrix, which are the only parameters trained by backpropagation. At inference time, we replace [Sy] with the trained [S*] to conditionally generate y_test. Note that [S*] can also be used to generate sequences in the target language directly (subsection 5.4).
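The training procedure above can be sketched as follows, assuming the Hugging Face implementation of GPT Neo, a prefix of length 1, and an Adam optimiser with an illustrative learning rate. The gradient-masking hook is our own device for "freeze everything except θ", not necessarily the released implementation.

```python
# Sketch of unsupervised prefix-embedding training (subsection 3.2).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "EleutherAI/gpt-neo-2.7B"
PREFIX_TOKENS = ["[S*]"]   # footnote 4: [0], [1], ..., [n] for longer prefixes

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Expand the tokenizer and embedding matrix by the number of prefix tokens.
tokenizer.add_tokens(PREFIX_TOKENS, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
prefix_ids = tokenizer.convert_tokens_to_ids(PREFIX_TOKENS)

# Freeze the entire network; only theta = Embed([S*]) will receive updates.
for p in model.parameters():
    p.requires_grad = False
emb = model.get_input_embeddings()
emb.weight.requires_grad = True

def keep_prefix_rows_only(grad):
    # Zero gradients for all pre-existing vocabulary rows.
    # (If input/output embeddings are tied, the same rows are shared with
    # the output layer; the hook still restricts updates to the prefix rows.)
    mask = torch.zeros_like(grad)
    mask[prefix_ids] = 1.0
    return grad * mask

emb.weight.register_hook(keep_prefix_rows_only)

optimizer = torch.optim.Adam([emb.weight], lr=1e-3)   # illustrative lr
tokenizer.pad_token = tokenizer.eos_token

def train_step(monolingual_batch):
    """One LM step on '[S*] y' sequences with cross-entropy loss."""
    texts = [f"{PREFIX_TOKENS[0]} {y}" for y in monolingual_batch]
    batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```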

3.3 Weakly Supervised Training (In-context Translate with Bitext)

In the previous section, training of θ uses only the target language, without any bilingual supervision for the in-context translation task. To guide θ to a better local optimum, we include a very small amount of bitext: 100 parallel sentence pairs which are a subset of the training set. We adopt a weakly supervised setup where we initialise the prefix embeddings using existing tokens, and also where the prefix embeddings are initialised from the monolingual trained prefix embeddings (referred to as mono-trained-lang in section 4).⁵ An alternative training approach is a multi-task setup where losses from the monolingual language modeling and translation tasks are minimised in alternating batches; however, this was thought to be less effective due to the extreme data imbalance of the setting that we consider (30k monolingual sentences to 100 bitext pairs), which might require arbitrary reweighting schemes. Since the end goal is translation, directly tuning towards it is a more straightforward approach.

Figure 1 shows a single in-context translate training sample for the model. Note that loss is computed only for the last target sentence y_k^train. In all our experiments we use k=5 for training, i.e., 5 priming examples. For each datapoint, we randomly sample from D_train to construct the prompt set, so that the parameters of [S*] do not overfit to any particular choice of prompt set. Note that for large language models, |D_train| is less than the number of parameters being trained; a single prefix token already has over 2000 dimensions. We do not expect D_train to allow the model to learn a mapping for translations; its role is merely to weakly supervise the training of the monolingual prefix towards loss basins that are compatible with the prompt-translate paradigm.

5 Note that since monolingual data had been used to initialise the prefix, this can be interpreted as a continual semi-supervised learning setup.
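A sketch of how the loss can be restricted to the final target sentence of each in-context training sample, as described above. The separator strings, the use of the trained prefix token as the target-side delimiter, and the label value -100 (the ignore index of the Hugging Face causal-LM loss) are assumptions of this sketch.

```python
# Sketch of the weakly supervised "in-context translate" objective (3.3):
# k=5 priming pairs plus one training pair, with loss only on the final y.
import random
import torch

def make_incontext_sample(bitext, k=5, sx="[Sx]", s_star="[S*]"):
    """Randomly sample k priming pairs and one (x, y) pair to supervise."""
    x, y = random.choice(bitext)
    priming = random.sample([p for p in bitext if p != (x, y)], k)
    context = "\n".join(f"{sx} {px} {s_star} {py}" for px, py in priming)
    prompt = f"{context}\n{sx} {x} {s_star}"   # everything before y: no loss
    return prompt, " " + y                      # y: the supervised target

def encode_with_masked_labels(tokenizer, prompt, target):
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    target_ids = tokenizer(target, return_tensors="pt")["input_ids"]
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100     # ignore loss on the context
    return input_ids, labels

# Usage: loss = model(input_ids=input_ids, labels=labels).loss, with the same
# gradient mask as in the monolingual sketch so only Embed([S*]) is updated.
```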

4 Experiments

We organise our experiments investigating the effects of prefix embedding tuning 1) across three en-fr domains (medical, social media, and TED talks), and 2) across three languages in TED Talks.⁶ In both sets of experiments, we explore three basic initializations (described in the Prefix Embedding Initialisation paragraph below). We also use priming examples of various sizes to investigate whether the effects persist across different prompt sizes. To account for prompt selection and ordering effects, all inference runs were repeated with 5 randomly sampled prompt sets from the training data, where each of the source sentences in the prompt examples is between 10 to 20 words long. Scores are reported using SacreBLEU (Post, 2018).⁷

Model  We use GPT Neo 2.7B (32 layers, 20 heads) (Black et al., 2021), which has been pretrained on The Pile (Gao et al., 2020). The Pile contains Europarl, which has been fed into the model at a document level and not at a sentence level.⁸ Note that unlike most dedicated Machine Translation models, which have an encoder-decoder architecture, this model is trained autoregressively and is decoder-only.⁹

8 To our knowledge, there have not been any reports of sentence-level parallel corpora in the training dataset of this model.
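The evaluation protocol above (5 randomly sampled prompt sets, prompt sources between 10 and 20 words, SacreBLEU scoring) can be sketched as follows. `translate_fn` stands in for the in-context decoding sketch of subsection 3.1, and averaging the scores over runs is our assumption about how the repeated runs are summarised.

```python
# Sketch of the evaluation protocol in section 4.
import random
import sacrebleu

def sample_prompt_set(train_pairs, k):
    # Prompt examples whose source side is between 10 and 20 words long.
    eligible = [(x, y) for x, y in train_pairs if 10 <= len(x.split()) <= 20]
    return random.sample(eligible, k)

def evaluate(translate_fn, test_pairs, train_pairs, k=20, n_runs=5):
    scores = []
    for _ in range(n_runs):
        prompt_set = sample_prompt_set(train_pairs, k)
        hyps = [translate_fn(prompt_set, x) for x, _ in test_pairs]
        refs = [y for _, y in test_pairs]
        scores.append(sacrebleu.corpus_bleu(hyps, [refs]).score)
    return sum(scores) / len(scores)
```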

Data  We adopt three datasets: multilingual TED talks (Duh, 2018), MED (Bawden et al., 2019), and MTNT (Michel and Neubig, 2018). We use 30,000 monolingual sentences for unsupervised training of the prefix embeddings (mono in the results table). For TED, MED and MTNT, the monolingual sentences are obtained from their bitext training data. We use 100 bitext sentence pairs for the weakly supervised case (bitext in the results table). These bitext sentence pairs served as a self-contained prompt set and training data instances as described in subsection 3.3. In both the unsupervised and weakly supervised scenarios, during testing we sample sentence pairs for prompt examples from the training set. The sentence pairs used in weakly supervised training, validation, and inference-time prompt selection are all separate splits; there is no overlap between prompt sets seen across these phases. [...] for the MED and TED domains, without compromising on the ability to copy or generate digits. We run a langid check and restrict training sentence length to 3 to 25 words to avoid trivial sequences and out-of-memory errors.

Prefix Embedding Initialisation  We investigate two simple forms of initialisation:
• random refers to the default behavior of the model when adding new parameters to the embedding. For the GPT Neo model, this is drawn from N(0, 0.02), as the model uses GELU activation units (Hendrycks and Gimpel, 2016). We report results for random using the best out of 3 trained prefix embeddings based on the dev set.
• lang uses existing words from the vocabulary which are related to the language and the domain. For fr, pt, de, we initialise with the words "French", "Portuguese", "German"; for MTNT, MED and TED we use "social", "medical", "talks" respectively. This means that for French MED, we would initialise the first prefix with the embedding corresponding to "French" and the second prefix with the embedding corresponding to "medical" (see the sketch below).¹⁰
For the weakly supervised setting, the mono-trained-lang initialisation uses the prefix embeddings trained monolingually in subsection 3.2 (from the lang starting point) for further (weakly) supervised training using 100 additional parallel sentences.

6 Code at https://github.com/suzyahyah/prefixes_incontext_machinetranslation.
9 We report SOTA results on the datasets, although these are not directly comparable because of the completely different training data setup of the base model: TED en-fr: 35.9, en-pt: 38.3, en-de: 28.1 (Renduchintala et al., 2019); MED: 39.5 (Bawden et al., 2019); MTNT: 29.7 (Michel and Neubig, 2018).
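A sketch of the lang initialisation described above: the new prefix rows are overwritten with the embeddings of existing vocabulary items such as "French" and "medical" before training. Taking the first subword piece of each (space-prefixed) word is our simplification; `model`, `tokenizer` and `prefix_ids` are as in the subsection 3.2 sketch.

```python
# Sketch of the "lang" initialisation: copy embeddings of existing words
# (e.g. " French", " medical") into the new prefix rows before training.
import torch

def init_prefix_from_words(model, tokenizer, prefix_ids, words):
    emb = model.get_input_embeddings()
    with torch.no_grad():
        for prefix_id, word in zip(prefix_ids, words):
            # Take the first subword piece of the word as the initialiser.
            word_id = tokenizer(" " + word)["input_ids"][0]
            emb.weight[prefix_id] = emb.weight[word_id].clone()

# e.g. French MED with a prefix of length 2:
# init_prefix_from_words(model, tokenizer, prefix_ids, ["French", "medical"])
```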

Validation Loss  For both monolingual training and weakly supervised bitext training, we use the prompt-translation paradigm as the validation loss. This avoids overfitting to the monolingual target sentence at the expense of being able to translate in the in-context setup. The translation prompts for the validation set are randomly drawn from within that set itself, removing dependency on any particular prompt set used at inference time. It may be possible to achieve better performance if practitioners were to use the same prompt set at train, validation and test time.

Training Details  We apply early stopping with patience over 5 epochs and a loss threshold of 0.001. We adopt 4 gradient accumulation steps with a batch size of 8 for an effective batch size of 32 for the monolingual training, and 4 gradient accumulation steps with a batch size of 2 for an effective batch size of 8 for the weakly supervised bitext training, to avoid out-of-memory errors. All experiments can be run on a single NVIDIA TITAN RTX GPU (24GB). Monolingual training takes about 1 hour per epoch and can range from 8-20 hours to convergence.

Prompt Format (u)  We tried several manual variants of [Sx] and [Sy] but did not optimise over this extensively. Our preliminary experiments showed that using untrained lang tokens in the separator performed slightly better, i.e., using the token "French" as [Sy] performed better than a separator choice such as "A:". We also experimented with prepending the entire prompt sequence with natural language instructions ("Translate English to French") but found that this did not help consistently across datasets, hence we opted to exclude it to simplify design choices and isolate the effects of the trained prefix.
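The Training Details paragraph maps onto a standard accumulation loop; below is a minimal sketch under assumed settings, where `compute_loss` returns the batch loss (as in the subsection 3.2 sketch, but without the optimiser step) and `validate` computes the prompt-translation validation loss.

```python
# Sketch of training with gradient accumulation and early stopping:
# 4 accumulation steps x batch size 8 (effective 32) for monolingual training;
# patience 5 epochs, improvement threshold 0.001. max_epochs is illustrative.
def train(compute_loss, optimizer, batches_per_epoch, validate,
          accum_steps=4, patience=5, threshold=1e-3, max_epochs=100):
    best_val, epochs_without_improvement = float("inf"), 0
    for epoch in range(max_epochs):
        optimizer.zero_grad()
        for step in range(batches_per_epoch):
            loss = compute_loss(step) / accum_steps   # scale for accumulation
            loss.backward()
            if (step + 1) % accum_steps == 0:
                optimizer.step()
                optimizer.zero_grad()
        val_loss = validate()   # prompt-translation validation loss
        if best_val - val_loss > threshold:
            best_val, epochs_without_improvement = val_loss, 0
        else:
            epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break   # early stopping
```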

4.1 Results for Performance Across Domains [Table 3]

We present the results for the 1-prompt and 20-prompt settings in Table 3. The 1-prompt setting shows the extreme case of having no bitext data. While this is perhaps an overly restrictive assumption, especially in industrial settings, the goal of this experimental setting is to illustrate the effect of the extreme monolingual scenario. The 20-prompt setting simulates a "saturated" prompt setting, which we also investigate with more prompt intervals in Figure 2.

Unsupervised (monolingual) Prefix Training helps the 1-prompt setting.  Across all domains, unsupervised (mono) prefix training tends to improve the BLEU score. This improvement is much more prominent in the 1-prompt setting, with improvements of around 5 BLEU points across the three data domains of MED, TED and MTNT. Recall that the mono-trained, lang-initialised token embedding has no knowledge of translation and only serves to condition the model to generate the target language.

Weakly supervised (bitext) Prefix Training helps the 20-prompt "saturated" setting.  A very small amount of supervision with 100 examples can be used to do better than the baseline (0.3 to 1.3 BLEU point gains).¹¹ It is not always clear whether initialising from a mono-trained-lang embedding helps, as the performance is the same for TED and MED, but slightly better (0.5 gains) for MTNT. Looking at the 1-prompt case for bitext, mono-trained-lang always does [...]