English-Hindi Transliteration using Multiple Similarity Metrics

Niraj Aswani, Robert Gaizauskas

Department of Computer Science

University of Sheffield

Regent Court, Sheffield, S1 4DP, UK

Abstract

In this paper, we present an approach to measure the transliteration similarity of English-Hindi word pairs. Our approach has two components. First we propose a bi-directional mapping between one or more characters in the Devanagari script and one or more characters in the Roman script (pronounced as in English). This allows a given Hindi word written in Devanagari to be transliterated into the Roman script and vice-versa. Second, we present an algorithm for computing a similarity measure that is a variant of Dice's coefficient measure and the LCSR measure and which also takes into account the constraints needed to match English-Hindi transliterated words. Finally, by evaluating various similarity metrics individually and together under a multiple measure agreement scenario, we show that it is possible to achieve a 0.92 f-measure in identifying English-Hindi word pairs that are transliterations. In order to assess the portability of our approach to other similar languages we adapt our system to the Gujarati language.

1. Introduction

Transliteration is defined as the task of transcribing a word from one writing system into another writing system such that the pronunciation of the word remains the same and a person reading the transcribed word can read it in the original language. Cognates (the words derived from another language) and Named Entities (NE) such as person names, names of places and organizations are the types of words that need transcribing into another writing system. In India, English is one of the most popular foreign languages. Following this trend, it is becoming very popular for people to use words from both the English and the Hindi vocabulary in the same sentence. According to Clair (2002), use of such a mixed language, also known as Hinglish, is said to have prestige, as the amount of mixing corresponds with the level of education and is an indicator of membership in the elite group.

Although the use of such a mixed code language is becoming very common, the English and the Hindi languages remain widely different in both structure and style. According to Rao et al. (2000) these differences can be categorized in two broad categories, namely structural differences and style differences. These include differences such as the difference in word order, placement of modifiers, absence of articles and different types of genders in Hindi. For example, in Hindi verbs are mostly placed at the end of a sentence and postpositions are used instead of prepositions. Similarly, whilst the modifiers of an object can occur both before and after the object in English, modifiers only occur before the object they modify in Hindi. In contrast to the English language, where there are three genders for pronouns (masculine, feminine and neuter), Hindi has only two: masculine and feminine.

Apart from these structural differences there are several other differences in the alphabets of the two languages. For example, the English alphabet has five vowels whereas in [...] twenty one consonants and the Devanagari has thirty three. There are three compound letters in Devanagari for which there is no equivalent sound in English. There are certain sounds in English (for example s in pleasure) which are not present in Hindi. It is more common to have consonant clusters at the beginning or end of words in English than in Hindi, which leads to errors in the pronunciation of words such [...] Such differences can result in an inaccurate transliteration. Fortunately, unlike the Chinese language, which has an ideographic writing system where each symbol is equivalent to a concept rather than to a sound (e.g. Beethoven in English is represented in Pinyin (Swofford, 2005) as bej-do-fen), Hindi does not have an ideographic writing system (Pouliquen et al., 2005). Therefore, it is possible to come up with a list of possible phonetic mappings in English for each sound in Hindi. Using these phonetic mappings, one can transcribe a given Hindi word into one or more English words or vice-versa and then compare the strings using various string similarity metrics.

In this paper, we present a Transliteration Similarity metric (TSM) that is based on the letter correspondences between the writing systems of the English and the Hindi languages. It is a part of our effort to develop a general framework for text alignment (Aswani and Gaizauskas, 2009) where it is currently used in an English-Hindi word alignment system for aligning words such as proper names and cognates. We give a mapping for one or more characters in the Devanagari script into one or more characters in the Roman script. Given a Hindi word, this mapping allows one or more candidate transliterated forms in the Roman script to be obtained. To choose which of these candidates most closely matches a candidate target word requires a string similarity measure. We review some of the well known string similarity metrics and propose an algorithm for computing a similarity measure. We evaluate the performance of these similarity metrics individually and in various combinations to discover the best combination of similarity metrics and a threshold value that can be used to maintain the optimal balance between accuracy and coverage. To test the portability of our approach to other similar languages we adapt our system to the Gujarati language.

2. Related Work

Sinha and Thakur (2005) discuss the mixed usage of the English and the Hindi languages. They provide various examples of mixed usage of the two languages and present an MT system that is capable of dealing with text written in such a mixed code language. They show that although there are certain constraints that should be imposed on the usage of the Hinglish language, people do not follow them strictly. Giving more details on the same, they explain that there are three types of constraints that are mentioned in the literature and should be imposed on the usage of the Hinglish language: the free morpheme constraint (footnote 1); the closed class constraint (footnote 2); and finally the principle of the dual structure (footnote 3).

They show by examples that out of the three constraints the morpheme constraint does not hold true and there are a large number of English words that are used in Hindi sentences according to the grammar rules of the Hindi language. For example computer[on] (where the English noun is computer and [on] is used for indicating more than one computer) and [barati]es (where [barati] means marriage guest and es is used to indicate the plural of the Hindi noun [barati]). The latter example shows that although the TS approaches can match the first parts of these words, special mappings are needed to match the suffixes (such as [on] in computer[on] and es in [barati]es). They also discuss the situation whereby their system has to deal with sentences in which the Devanagari script is used for writing English words.

TS approaches can be very helpful in identifying named entities and cognates. Kondrak et al. (2003) show that cognates not only help in improving results in word alignment but can also be very useful when machine-readable bilingual dictionaries are not available. To locate cognates in the text, they experimented with three similarity metrics: Simard's condition, Dice's coefficient and LCSR. They create a file with all possible one-to-one translation pairs from each aligned sentence pair and calculate the similarity between each pair. The pairs above a certain threshold are then considered as possible cognates and aligned with each other. They report a 10% reduction in the error-rate as a result of injecting cognate pairs into their alignment system.

One approach to identify NEs is to use precompiled lists of named entities (H. Cunningham et al., 2002). However, precompiled lists might not work on unseen new documents and therefore locating named entities needs more than just using precompiled lists. Huang et al. (2003) suggest that equivalent NE pairs in bilingual texts can be found by way of surface string transliteration. Similarly Bikel et al. (1997) explain that it is possible to detect source language named entities by projecting target language named entities cross-lingually if their phonetic similarity or transliteration cost are obtained. Similar to this, Kumar and Bhattacharyya (2006) describe an approach that identifies named entities in Hindi text using the Maximum Entropy Markov model. The features they use for training a model can also be trained using the TS approach. To learn a model they define a boolean function that captures various labels from the annotated named entities. These labels contain information such as their position in the context and the prefix or suffix of the annotated named entities. For example, a word is the name of a person if it is preceded by a Hindi word [shri] or [shirimati] (i.e. Mr or Mrs). They use various features including word features (i.e. if a NE starts or ends with a specific affix), context features (i.e. common words in the context), dictionary features (e.g. if it appears to be a proper noun in the dictionary) and compound features (i.e. if the next word is a proper noun). Since TS approaches, given a bilingual text, can identify possible candidates for named entities, these approaches can be used, in a fully automatic or a semi automatic way, to gather the training data (e.g. the words in context, common suffixes, their POS tags and compound features etc.). Having obtained this information a model can be trained and used on a monolingual corpus.

Huang et al. (2003) derive a transliteration model between the Romanized Hindi and the English letters. They apply this model to a parallel corpus and extract Hindi-English named entity pairs based on their similarities in written form. They achieved 91.8% accuracy in identifying named entities across the parallel texts. Another use of a TS is explained by Balajapally et al. (2008). They describe a book reader tool that allows people to read books in different languages. In order to achieve this, they use sentence, phrase, word-to-word and phonetic dictionaries. When they cannot find a match in any of the sentence, phrase or word-to-word dictionaries they use the phonetic dictionary to obtain a phonetic transliteration. Their transliteration system (known as OM), given words in one language, provides their equivalent transliterations in a language that the user has requested to read the book in. The OM transliteration system gives a unified presentation for Indian languages which is similar to the ITRANS encoding (footnote 4). OM exploits the commonality of the alphabet of Indian languages and therefore the representation of a letter is the same across the many languages. Since the phonetic mappings are based on the sound each letter produces, if their search fails in locating a mapping for a specific character, they consider another character that sounds similar to the original character.

Pouliquen et al. (2005) highlight various approaches that have been employed by researchers to recognize NEs in the text (Kumar and Bhattacharyya, 2006). These approaches include a lookup procedure in a list of known names, analysis of local lexical contexts, use of a well known word which is part of the named entity, and part of speech tags which suggest that the words might be forming a NE. They mention that the existing transliteration systems either use hand-crafted linguistic rules, or machine learning methods, or a combination of both. Similar to Kumar and Bhattacharyya (2006), they collect trigger words from various open source systems and write simple local patterns in PERL that recognize names in the text. Having obtained these data, they analyze the words in the left and right contexts of found NEs and collect the frequently occurring words to be used for identifying NEs in the unseen data. Before they match strings in two different languages, they perform a normalization process on one of the two words. For this they use a set of approximately 30 substitution rules such as replacing accented characters with non-accented equivalents, double consonants with a single consonant, wl (at the beginning of the word) with vl, ph with f, and so on. All possible strings obtained as a result of this process are then compared with the source string and if any of them has a similarity above a specified threshold, it is considered as a possible match. To calculate a similarity score, they use three different similarity measures and the average of the three is considered as the similarity score. These measures are based on letter n-gram similarity, where the first two measures are the cosine of bigrams and trigrams and the third measure is the cosine of bigrams with no vowels in the text. In the following section, we give details of some of the popular string similarity metrics and how they are calculated.

Footnotes:

1 Morpheme constraint means that the words from one language cannot be inflected according to the grammar rules of the other language.

2 Closed class constraint means that the words categorized as closed class of grammar such as possessives, ordinals, determiners, pronouns etc. are not used from the English when the head noun used in the sentence is in Hindi.

3 Principle of the dual structure means that the internal structure of the English constituent need not conform to the constituent structure rules of the Hindi language provided the placement of the English phrase obeys the rules of the Hindi language.

4 http://www.aczoom.com/itrans/

3. String Similarity Metrics

In this section we look at some of the various methods that have been employed by researchers to compare strings. These include methods such as Dice's Coefficient, Matching Coefficient, Overlap Coefficient, Levenshtein distance algorithm, Longest Common Subsequence Ratio (LCSR), Soundex distance metric, Jaro-Winkler metric (Jaro; Winkler, 1999) and n-gram metric. There are several variants of these methods or combinations of variants of these methods that are mentioned in the literature. For example, the similarity metric used in Pouliquen et al. (2005) is an example of a combination of three variants of the n-gram metric.

The matching coefficient is the simplest of all, where only the count of characters that match is considered as a similarity measure. The higher the score, the more similar the strings are. [...] strings. An immediate variant of the matching coefficient is Dice's coefficient. It allows comparing variable length strings. The similarity is defined as twice the number of matching characters divided by the total number of characters in the two strings. Another variant of the matching coefficient is the overlap coefficient, where the similarity is calculated as the number of identical characters in the two strings divided by the minimum length of the two strings. It is based on the assumption that if a string s1 is a subset of the string s2, or the converse, then the similarity is a full match. LCSR is another variant of the Dice coefficient algorithm where the ratio of two words is computed by dividing the length of their longest common subsequence by the length of the longer word. For example LCSR(colour, couleur) = 5/7 as their longest common subsequence is c-o-l-u-r. In such approaches, where the number of matching characters is more important, the positions of the characters are not taken into consideration and therefore they can wrongly identify words such as teacher and cheater.
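The character-based measures described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the work cited: it assumes that "matching characters" means characters shared by the two strings regardless of position (a multiset intersection), and all function names are hypothetical.

```python
from collections import Counter


def matching_count(s, t):
    """Characters shared by s and t, counted with multiplicity (assumed reading)."""
    return sum((Counter(s) & Counter(t)).values())


def dice(s, t):
    """Twice the matching characters over the total length of both strings."""
    total = len(s) + len(t)
    return 2 * matching_count(s, t) / total if total else 0.0


def overlap(s, t):
    """Matching characters over the length of the shorter string."""
    shorter = min(len(s), len(t))
    return matching_count(s, t) / shorter if shorter else 0.0


def lcs_length(s, t):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            # Extend the diagonal on a match, otherwise carry the best neighbour.
            curr.append(prev[j - 1] + 1 if a == b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(s, t):
    """LCS length divided by the length of the longer string."""
    longer = max(len(s), len(t))
    return lcs_length(s, t) / longer if longer else 0.0
```

Under this reading, lcsr("colour", "couleur") evaluates to 5/7, while dice("teacher", "cheater") evaluates to 1.0 because both words contain exactly the same letters, which illustrates the position-insensitivity noted above.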

Gravano et al. (2001) explain an approach which is based on the n-gram similarity metric. For example, while comparing the two strings teacher and cheater, a window of 2 characters can be considered and all possible bigrams can be collected for the two strings: te, ea, ac, ch, he, er and ch, he, ea, at, te, er. In this case the five bigrams te, ea, ch, he and er are found to be identical, giving a result of 2*5/12 = 0.83. Even though the strings are different, because they use the same characters, the similarity figure is high. One can change the window size to higher values. For example by changing the window size to 3, we get a similarity of 0.1 only. Experiments carried out by Natrajan et al. (1997) on a Hindi song database show that a window size of 3 is the optimum value for the n-gram algorithm. In their experiments, users submitted their queries in Romanized Hindi script which were then matched with the Hindi database.

The basic Levenshtein edit distance algorithm was introduced by Levenshtein (1996). It is used for calculating the minimum cost of transforming one string into the other. The cost of deleting one character, inserting a new one, or substituting one character for another is 1. The distance is measured between 0 and 1, 0 equating to identical strings and 1 being no match. For each character, the operation with minimum cost is considered among all other possibilities. The advantage of this method is that it also takes into account the positions of characters and returns the minimum cost that is required to change one string into the other. One of the variants of the Levenshtein edit distance algorithm is the Needleman-Wunsch distance or Sellers algorithm. It allows adding a variable cost adjustment to the cost of insertion and deletion.
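The n-gram comparison and the edit distance described above can be sketched as follows. This is an illustrative sketch with hypothetical function names. The text does not say how the edit distance is scaled into the range 0 to 1, so dividing by the length of the longer string is an assumption made here.

```python
from collections import Counter


def ngrams(s, n=2):
    """All overlapping character n-grams of s, in order."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def ngram_dice(s, t, n=2):
    """Twice the shared n-grams (with multiplicity) over the total n-gram count."""
    a, b = ngrams(s, n), ngrams(t, n)
    if not a or not b:
        return 0.0
    shared = sum((Counter(a) & Counter(b)).values())
    return 2 * shared / (len(a) + len(b))


def levenshtein(s, t):
    """Minimum number of unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a from s
                            curr[j - 1] + 1,           # insert b into s
                            prev[j - 1] + (a != b)))   # substitute a with b
        prev = curr
    return prev[-1]


def normalized_distance(s, t):
    """Edit distance scaled to [0, 1] by the longer length (assumed scaling)."""
    longest = max(len(s), len(t))
    return levenshtein(s, t) / longest if longest else 0.0
```

For the worked example, ngram_dice("teacher", "cheater") gives 10/12, which rounds to the 0.83 quoted above.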
The Jaro-Winkler metric is a measure of similarity between two strings. The metric is seen as more suitable for short string names. The score is normalized such that 0 means no similarity and 1 means equal strings. Given two strings $s_1$ and $s_2$, their distance is calculated as $d(s_1, s_2) = \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right)$ where $m$ is the number of characters that are common in the two strings. To be considered as a common character, a character at position $i$ in the string $s_1$ has to be within the window $H$ of the equivalent $j$-th character in the string $s_2$. Here $H = \max(|s_1|, |s_2|)/2 - 1$. Similarly $t$ is equal to the number of characters matched from the window but not at the same index, divided by 2.

Soundex is an algorithm that groups consonants according to their sound similarity. It is a phonetic algorithm which is used for indexing names by sound as pronounced in English. The basic idea here is to encode the words that are pronounced in a similar way with the same code. Each word is given a code that consists of a letter and three numbers between 0 and 6, e.g. Aswani is A215. The first step in the algorithm is to preserve the first letter of the word and remove all the vowels and the consonants h, w and y unless they appear at the beginning of a word. Also the consecutive letters that belong to the same group are removed except the first letter. Letters B, F, P and V belong to the group 1, C, G, J, K, Q, S, X and Z to the group 2, D and T to the group 3, L to the group 4, M and N to the group 5 and the letter R belongs to the group 6. In a standard Soundex algorithm, only the first letter and three following numbers are used for indexing. If there are fewer than three numbers, the remaining places are filled with zeros and otherwise only the first three numbers are considered for indexing. There are several other implementations of the Soundex algorithm. Although it is very helpful for fuzzy searching, there are certain limitations of the algorithm such as the higher number of false positives due to its reliance on consonant grouping and inaccurate handling of words that start with silent letters.
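The two measures just described can be sketched as follows. This is a hedged illustration with hypothetical function names, not a reference implementation. The jaro function implements only the formula given above (the Winkler prefix adjustment is omitted) and assumes that the window H is rounded down to an integer. The soundex function follows the steps literally as listed: it drops vowels and h, w, y after the first letter, then collapses adjacent letters of the same group.

```python
def jaro(s1, s2):
    """Jaro score: (m/|s1| + m/|s2| + (m - t)/m) / 3."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)  # H, rounded down (assumed)
    used1, used2 = [False] * len(s1), [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        # Look for an unused equal character within H positions of i.
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    a = [c for c, u in zip(s1, used1) if u]
    b = [c for c, u in zip(s2, used2) if u]
    t = sum(x != y for x, y in zip(a, b)) / 2  # out-of-order matches, halved
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3


GROUPS = {"BFPV": "1", "CGJKQSXZ": "2", "DT": "3", "L": "4", "MN": "5", "R": "6"}
CODE = {ch: digit for letters, digit in GROUPS.items() for ch in letters}


def soundex(word):
    """First letter plus three digits, following the steps listed in the text."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    # Keep the first letter; elsewhere keep only letters that belong to a group.
    kept = [first] + [c for c in letters[1:] if c in CODE]
    digits, prev = [], None
    for c in kept:
        d = CODE.get(c)  # None when the first letter is a vowel or h, w, y
        if d is not None and d != prev:
            digits.append(d)
        prev = d
    if first in CODE:
        digits = digits[1:]  # the first letter is kept as a letter, not a digit
    return (first + "".join(digits) + "000")[:4]
```

With this sketch, Robert and Rupert both encode as R163. Under this literal reading, Aswani encodes as A250. The A215 quoted in the text may reflect a different treatment of w, which is not reproduced here.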

4. Our Approach

Figure 1 lists letter correspondences between the writing systems of the two languages where one or more Hindi characters are associated with one or more English characters. For example [f] can be f, or ph (e.g. frame, photo). This transliteration mapping (TM) was derived manually and provides a two way lookup facility. The following illustration explains how to use the TM to obtain possible transliterations for the Hindi word [kensar] which means cancer in English. For the Hindi letter at the $i$-th position in the Hindi word $HW$ (where $i = 1..n$ and $n = |HW|$, i.e. the length of $HW$), we define a set $TS_i$ that contains all possible phonetic mappings for that letter.

In order to optimize the process, we remove from the $TS_i$ all mapped characters that do not exist in the candidate target string. Below, we list mappings for the letters of the word [kensar]. The mappings which need to be removed from the $TS_i$ are enclosed in round brackets: [k] = [c, (k), (ch)]; [e] = [e, a, (ai)]; [n] = [n]; [s] = [c, (s)]; [r] = [r].

From these mappings we define a set $TS$ of n-tuples such that $TS = TS_1 \times TS_2 \times \dots \times TS_n$ (i.e. $TS$ is a Cartesian product of all the previously defined sets $TS_i$, $i = 1..n$, one for each letter in the Hindi word). Each n-tuple in $TS$ is one possible transliteration of the original Hindi word. In total there are $|TS|$ transliterated strings. In the above example the value of $|TS|$ is 2 ($1 \times 2 \times 1 \times 1 \times 1$) (i.e. Cencr and Cancr). Each transliterated string $S_j \in TS$ ($j = 1..|TS|$) is compared with the English word using one of the string similarity metrics (explained in the next subsection). If the English word and any of the transliterated strings has a similarity score above a specified threshold, the strings are deemed to be transliterations.

4.1. String Similarity Metrics

In the case of English-Hindi strings it was observed during our experiments that for the two strings to be similar, the first and the last characters of both strings, the English word (E) and the transliterated string (T), must match. This ensures that the words have the same phonetic starting and the same ending. However some English words start or end with silent letters (e.g. p in psychology and e in programme). Therefore in such cases the first character of the transliterated string should be compared with the second character of E and similarly the last character of the transliterated string should be compared with the second last character of E. Our experiments show that unless the length of the shorter string is at least 65% of the length of the other string, they are unlikely to be phonetically similar.

The similarity algorithm (see table 1) takes two strings, $S$ and $T$, as input where $S_i$ ($i = 1..n$) and $T_j$ ($j = 1..m$) refer to characters at position $i$ and position $j$ in the two strings. With [...] $j = 1$, character $S_i$ is compared with characters $T_j$, $T_{j+1}$ and $T_{j+2}$. If $S_i$ matches with one of $T_j$, $T_{j+1}$ and $T_{j+2}$, the pointer $i$ advances one position and the pointer $j$ is set to one position after the letter that matches with $S_i$. If there is no match, the pointer $i$ advances and $j$ does not. We award