[PDF] Minimum&Edit& Distance



Previous PDF Next PDF







Implementation of Levenshtein Distance Algorithm for E

A Levenshtein Algorithm Phonetic distances between Standard Danish (Lyngby) and each of the other 17 Scandinavian language varieties were calculated by means of the Levenshtein algorithm When phonetic transcriptions of two pronunciations are compared with each other, Levenshtein distance is equal to the number of



Thread-cooperative, Bit-parallel Computation of Levenshtein

distance, by less than a value specified by the user In the context of biological sequence alignment one often employs Levenshtein distance, i e the minimum number of edit operations needed to transform the query into the match Each operation can be either a substitution, or an insertion, or a deletion of a single character Levenshtein



Plagiarism Detection Using Levenshtein Distance With Dynamic

sentence per sentence using a modified levenshtein distance solver implemented with dynamic programming The result of the distance solver will be passed on to another function which will determine whether or not a potential plagiarism is detected in a certain sentence by taking into factor the distance of sentence and the number of words B



Structural Off-line Handwriting Character Recognition Using

distance, where the character’s graph feature converted into string representation, and the edit distance between strings measured The string edit distance algorithm using Levenshtein distance method19 and related work using levenshtein distance for on-line handwriting recognition is also mentioned20



RESEARCH NOTES Distributions of cognates JOB SCHEPENS in

normalized Levenshtein distance function can efficiently and reliably simulate bilingual orthographic similarity ratings Orthographic similarity distributions of cognates and non-cognates were identified across pairs of six European languages: English, German, French, Spanish, Italian, and Dutch



DNA Multiple Sequence Alignment by a Hidden Markov Model and

The Fuzzy Levenshtein Distance of each of the sequences in the chromosome is then computed from the consensus sequence The Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other



Minimum&Edit& Distance

DanJurafsky Where did the name, dynamic programming, come from? & The 1950s were not good years for mathematical research [the] Secretary of

[PDF] lever de soleil

[PDF] lever l'indétermination d'une limite

[PDF] lever une forme indéterminée

[PDF] lever une forme indéterminée 0*infini

[PDF] léviathan 2014

[PDF] levier inter resistant

[PDF] levinas altérité

[PDF] levinas ethique et infini pdf

[PDF] levinas le temps et l'autre pdf

[PDF] levinas visage citation

[PDF] levure ade2

[PDF] levure respiration fermentation

[PDF] Levures dans 4 milieux différents

[PDF] LEWIS ET CLARK LEUR EXPEDITIONT ,AIDE MOI SVP SES POUR DEMAIN !

[PDF] lex de physique "la réfraction"

MinimumEditDistance

Defini'onofMinimum

EditDistance

DanJurafsky

Howsimilararetwostrings?

• Spellcorrec'on• Theusertyped"graffe"Whichisclosest? • graf• graB• grail• giraffe • Computa'onalBiology• Aligntwosequencesofnucleo'des• Resul'ngalignment: • AlsoforMachineTransla'on,Informa'onExtrac'on,SpeechRecogni'on AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

DanJurafsky

EditDistance

• Theminimumeditdistancebetweentwostrings • Istheminimumnumberofedi'ngopera'ons • Inser'on• Dele'on• Subs'tu'on • Neededtotransformoneintotheother

DanJurafsky

MinimumEditDistance

• Twostringsandtheiralignment:

DanJurafsky

MinimumEditDistance

• Ifeachopera'onhascostof1• Distancebetweentheseis5 • Ifsubs'tu'onscost2(Levenshtein) • Distancebetweenthemis8

DanJurafsky

AlignmentinComputa9onalBiology

• Givenasequenceofbases • Analignment:• Giventwosequences,aligneachleXertoaleXerorgap

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

DanJurafsky

OtherusesofEditDistanceinNLP

• Evalua'ngMachineTransla'onandspeechrecogni'on

R Spokesman confirms senior government adviser was shot!H Spokesman said the senior adviser was shot dead! S I D I!

• NamedEn'tyExtrac'onandEn'tyCoreference

• IBMInc.announcedtoday• IBMprofits• StanfordPresidentJohnHennessyannouncedyesterday• forStanfordUniversityPresidentJohnHennessy

DanJurafsky

HowtofindtheMinEditDistance?

• Searchingforapath(sequenceofedits)fromthestartstringtothefinalstring:• Ini9alstate:thewordwe'retransforming• Operators:insert,delete,subs'tute• Goalstate:thewordwe'retryingtogetto• Pathcost:whatwewanttominimize:thenumberofedits

8

DanJurafsky

MinimumEditasSearch

• Butthespaceofalleditsequencesishuge!• Wecan'taffordtonavigatenaïvely• Lotsofdis'nctpathswindupatthesamestate.• Wedon'thavetokeeptrackofallofthem• Justtheshortestpathtoeachofthoserevistedstates.

9

DanJurafsky

DefiningMinEditDistance

• Fortwostrings • Xoflengthn• Yoflengthm • WedefineD(i,j) • theeditdistancebetweenX[1..i]andY[1..j] • i.e.,thefirsticharactersofXandthefirstjcharactersofY • TheeditdistancebetweenXandYisthusD(n,m)

MinimumEditDistance

Defini'onofMinimum

EditDistance

MinimumEditDistance

Compu'ngMinimum

EditDistance

DanJurafsky

DynamicProgrammingforMinimumEditDistance

• Dynamicprogramming:Atabularcomputa'onofD(n,m) • Solvingproblemsbycombiningsolu'onstosubproblems.• BoXom-up

• WecomputeD(i,j)forsmalli,j• AndcomputelargerD(i,j)basedonpreviouslycomputedsmallervalues• i.e.,computeD(i,j)foralli(0

DanJurafsky

DefiningMinEditDistance(Levenshtein)

• Ini'aliza'on

D(i,0) = i!D(0,j) = j

• RecurrenceRela'on:

For each i = 1...M!! For each j = 1...N

D(i-1,j) + 1! D(i,j)= min D(i,j-1) + 1! D(i-1,j-1) + 2; if X(i) ≠ Y(j) ! 0; if X(i) = Y(j)!

• Termina'on:

D(N,M) is distance

DanJurafsky

N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N

TheEditDistanceTable

DanJurafsky

N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N

TheEditDistanceTable

DanJurafsky

N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N

EditDistance

DanJurafsky

N 9 8 9 10 11 12 11 10 9 8 O 8 7 8 9 10 11 10 9 8 9 I 7 6 7 8 9 10 9 8 9 10 T 6 5 6 7 8 9 8 9 10 11 N 5 4 5 6 7 8 9 10 11 10 E 4 3 4 5 6 7 8 9 10 9 T 3 4 5 6 7 8 7 8 9 8 N 2 3 4 5 6 7 8 7 8 7 I 1 2 3 4 5 6 7 6 7 8 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N

TheEditDistanceTable

MinimumEditDistance

Compu'ngMinimum

EditDistance

MinimumEditDistance

Backtracefor

Compu'ngAlignments

DanJurafsky

Compu9ngalignments

• Editdistanceisn'tsufficient• WeoBenneedtoaligneachcharacterofthetwostringstoeachother

• Wedothisbykeepinga"backtrace"• Every'meweenteracell,rememberwherewecamefrom• Whenwereachtheend,

• Tracebackthepathfromtheupperrightcornertoreadoffthealignment

DanJurafsky

N 9 O 8 I 7 T 6 N 5 E 4 T 3 N 2 I 1 # 0 1 2 3 4 5 6 7 8 9 # E X E C U T I O N

EditDistance

DanJurafsky

MinEditwithBacktrace

DanJurafsky

AddingBacktracetoMinimumEditDistance

• Basecondi'ons:Termina'on: D(i,0) = i D(0,j) = j D(N,M) is distance • RecurrenceRela'on:

For each i = 1...M!

! For each j = 1...N

D(i-1,j) + 1!

D(i,j)= min D(i,j-1) + 1! D(i-1,j-1) + 2; if X(i) ≠ Y(j) ! 0; if X(i) = Y(j)! LEFT! ptr(i,j)= DOWN! DIAG!!

DanJurafsky

TheDistanceMatrix

SlideadaptedfromSerafimBatzoglou

DanJurafsky

ResultofBacktrace

• Twostringsandtheiralignment:

DanJurafsky

Performance• Time:

O(nm) • Space: O(nm) • Backtrace

O(n+m)

MinimumEditDistance

Backtracefor

Compu'ngAlignments

MinimumEditDistance

WeightedMinimumEdit

Distance

DanJurafsky

WeightedEditDistance

• Whywouldweaddweightstothecomputa'on?• SpellCorrec'on:someleXersaremorelikelytobemistypedthanothers• Biology:certainkindsofdele'onsorinser'onsaremorelikelythan

others

DanJurafsky

Confusionmatrixforspellingerrors

DanJurafsky

DanJurafsky

WeightedMinEditDistance

• RecurrenceRela'on: D(i-1,j) + del[x(i)]!D(i,j)= min D(i,j-1) + ins[y(j)]! D(i-1,j-1) + sub[x(i),y(j)]! • Termina'on:

D(N,M) is distance

DanJurafsky

Where did the name, dynamic programming, come from?

...The 1950s were not good years for mathematical research. [the] Secretary of Defense ...had a pathological fear and hatred of the word, research...

I decided therefore to use the word, "programming". I wanted to get across the idea that this was dynamic, this was multistage... I thought,

let's

take a word that has an absolutely precise meaning, namely dynamic... it's impossible to use the word, dynamic, in a pejorative sense. Try thinking of some combination that will possibly give it a pejorative meaning. It's impossible.

Thus, I thought dynamic programming was a good name. It was something not even a

Congressman could object to."

Richard Bellman, "Eye of the Hurricane: an autobiography" 1984.

MinimumEditDistance

WeightedMinimumEdit

Distance

MinimumEditDistance

MinimumEditDistance

inComputa'onalBiology

DanJurafsky

SequenceAlignment

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGGTCGATTTGCCCGAC

DanJurafsky

Whysequencealignment?

• Comparinggenesorregionsfromdifferentspecies • tofindimportantregions• determinefunc'on• uncoverevolu'onaryforces • AssemblingfragmentstosequenceDNA• Compareindividualstolookingformuta'ons

DanJurafsky

Alignmentsintwofields• InNaturalLanguageProcessing • Wegenerallytalkaboutdistance(minimized) • Andweights • InComputa'onalBiology• Wegenerallytalkaboutsimilarity(maximized) • Andscores

DanJurafsky

TheNeedleman-WunschAlgorithm

• Ini'aliza'on:D(i,0) = -i * d!D(0,j) = -j * d • RecurrenceRela'on: D(i-1,j) - d!D(i,j)= min D(i,j-1) - d! D(i-1,j-1) + s[x(i),y(j)]! • Termina'on:

D(N,M) is distance

DanJurafsky

TheNeedleman-WunschMatrix

SlideadaptedfromSerafimBatzoglou

(NotethattheoriginisattheupperleB.)

DanJurafsky

Avariantofthebasicalgorithm:

• MaybeitisOKtohaveanunlimited#ofgapsinthebeginningandend:

SlidefromSerafimBatzoglou

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCGAGTTCATCTATCAC--GACCGC--GGTCG-------------- • If so, we don't want to penalize gaps at the ends

DanJurafsky

Differenttypesofoverlaps

SlidefromSerafimBatzoglou

Example: 2 overlapping"reads" from a sequencing project Example: Search for a mouse gene within a human chromosome

DanJurafsky

TheOverlapDetec9onvariant

Changes:

1. Ini'aliza'on

For all i, j,!!F(i, 0) = 0!!F(0, j) = 0!

2. Termina'on

! max i

F(i, N)!F

OPT = max !! max j

F(M, j)

SlidefromSerafimBatzoglou

DanJurafsky

Giventwostringsx=x

1 ......x M y=y 1 ......y N

SlidefromSerafimBatzoglou

TheLocalAlignmentProblem

DanJurafsky

TheSmith-Watermanalgorithm

Idea:Ignorebadlyaligningregions

Modifica'onstoNeedleman-Wunsch:Ini9aliza9on:F(0, j) = 0!!!!F(i, 0) = 0!0!!Itera9on:F(i, j) = max F(i - 1, j) - d!!!!! F(i, j - 1) - d!!!!! F(i - 1, j - 1) + s(x

i , y j

SlidefromSerafimBatzoglou

DanJurafsky

TheSmith-Watermanalgorithm

Termina9on:

1. Ifwewantthebestlocalalignment...

F OPT =max i,j

F(i,j)FindF

OPT andtraceback

2. Ifwewantalllocalalignmentsscoring>t

??Foralli,jfindF(i,j)>t,andtraceback?

Complicatedbyoverlappinglocalalignments

SlidefromSerafimBatzoglou

DanJurafsky

Localalignmentexample

A !T!T!A!T!C!0!0!0!0!0!0!0!A!0!T!0!C!0!A!0!T!0!X = ATCAT!Y = ATTATC!

DanJurafsky

Localalignmentexample

A

!T!T!A!T!C!0!0!0!0!0!0!0!A!0!1!0!0!1!0!0!T!0!0!2!1!0!2!0!C!0!0!1!1!0!1!3!A!0!1!0!0!2!1!2!T!0!0!2!0!1!3!2!X = ATCAT!Y = ATTATC!

DanJurafsky

Localalignmentexample

A

!T!T!A!T!C!0!0!0!0!0!0!0!A!0!1!0!0!1!0!0!T!0!0!2!1!0!2!0!C!0!0!1!1!0!1!3!A!0!1!0!0!2!1!2!T!0!0!2!0!1!3!2!X = ATCAT!Y = ATTATC!

DanJurafsky

Localalignmentexample

A

!T!T!A!T!C!0!0!0!0!0!0!0!A!0!1!0!0!1!0!0!T!0!0!2!1!0!2!0!C!0!0!1!1!0!1!3!A!0!1!0!0!2!1!2!T!0!0!2!0!1!3!2!X = ATCAT!Y = ATTATC!

MinimumEditDistance

MinimumEditDistance

inComputa'onalBiologyquotesdbs_dbs47.pdfusesText_47