Clustering Writing Styles with a Self-Organizing Map

Vuokko Vuori

Laboratory of Computer and Information Science

Helsinki University of Technology

P.O. Box 9800, FIN-02015 HUT, Finland

Abstract

This work shows how a Self-Organizing Map (SOM) can be applied in the analysis of different handwriting styles. The analyzed handwriting samples have been collected in on-line fashion with special writing equipment such as pressure-sensitive tablets. The handwriting style of an individual subject is represented by a vector, the components of which reflect the tendencies of the writer to use certain prototypical styles for isolated alphanumeric characters. This study shows that correlations between different writing styles, both character-wise and writer-wise, can be found. Clusters of different personal writing styles can be found by studying the U-matrix visualization of the SOM trained with data collected from over 700 subjects. An examination of the component planes of the SOM reveals some interesting correlations between the prototypical character styles.

1. Introduction

In this work, the natural writing styles of several hundred writers are analyzed. The aim of the study is to find a representation for personal writing styles which enables their comparison and the detection of possible clusters. In addition, correlations between the writing styles of characters of different classes are searched for. This work tries to find answers to questions such as: "If I know how you write the letter 'a', can I infer something about the way you write the letter 'd' based on what I know about other writers?". This kind of information might be useful in the automatic recognition of handwritten characters [7] by helping to distinguish easily confused characters without using any linguistic or geometrical context of the characters, a dictionary, or any other language model. In addition, it might speed up the recognition process when used in the pruning or ordering of the prototype set representing the different writing styles of the characters. For earlier studies on the automatic characterization of handwriting styles, see [1], [3], [10], [11], and [17].

The writing style of a single writer is represented by a vector, the components of which indicate the writer's tendencies to use the writing styles identified by the character prototypes. The prototypes have been selected by hand from the results of four different clustering algorithms applied to a database of handwritten character samples collected in an on-line mode from over 700 subjects [14]. In order to find correlations between and within the writing styles of different writers, the writing style vectors are analyzed and visualized with a Self-Organizing Map (SOM) [6]. The SOM algorithm performs a nonlinear mapping which preserves the local topological properties of the data set. Clusters of the writing style vectors can be found by studying the U-matrix [12] of a SOM. The clusters can be explained by examining the component planes of the SOM. Also, correlated writing styles for isolated characters can be detected easily, as they produce similar component planes.

2. Writing style vectors

The writing style of an individual writer is represented by a vector called here a writing style vector. Each component of a writing style vector corresponds to a specific character prototype and indicates the tendency of the writer to use that particular style for writing characters of the class of the prototype. The next sections explain in detail the steps which have been taken in order to form the writing style vectors for the writers. First, the dissimilarity measure between the character samples is described. Next, the clustering algorithms and the final prototype selection procedure are explained. Finally, the transformation from a dissimilarity measure into a similarity measure is presented and the formation of the writing style vectors from averaged similarity measures is explained.

2.1. Dissimilarity measure

The dissimilarity measure used in the character comparisons is based on the Dynamic Time Warping (DTW) algorithm [9], which is a nonlinear curve matching method. The connected parts of a drawn curve in which the pen is pressed down on the writing surface are considered as strokes. The dissimilarity measure is defined on a stroke basis so that it is infinite between two characters having different numbers of strokes. The strokes and data points are matched in the same order as they were produced, and the first and last data points of the two curves are strictly matched against each other. The DTW algorithm finds the point-to-point correspondence between the curves which satisfies these constraints and yields the minimum sum of the costs associated with the matchings of the data points. The cost of matching two data points is their squared Euclidean distance.
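The measure described above can be sketched roughly as follows. This is a minimal illustration, not the original implementation: the allowed warping steps (advance one curve, the other, or both) are an assumption, since the text does not specify the step pattern, and the names `stroke_dtw` and `character_dissimilarity` are hypothetical. The normalizations of Section 2.3 are not applied here.

```python
import math

def stroke_dtw(a, b):
    """Minimum summed matching cost between two strokes (lists of (x, y) points).

    The first points and the last points are forced to match, and points are
    matched in drawing order. The step pattern is an assumption.
    """
    n, m = len(a), len(b)
    inf = math.inf
    # cost[i][j] = minimum cost of matching a[0..i] with b[0..j]
    cost = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            # Cost of matching two data points: squared Euclidean distance.
            d = (a[i][0] - b[j][0]) ** 2 + (a[i][1] - b[j][1]) ** 2
            if i == 0 and j == 0:
                best = 0.0
            else:
                best = min(
                    cost[i - 1][j] if i > 0 else inf,
                    cost[i][j - 1] if j > 0 else inf,
                    cost[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            cost[i][j] = d + best
    return cost[n - 1][m - 1]

def character_dissimilarity(char_a, char_b):
    """Characters are lists of strokes; strokes are compared in drawing order."""
    if len(char_a) != len(char_b):
        return math.inf  # different numbers of strokes
    return sum(stroke_dtw(sa, sb) for sa, sb in zip(char_a, char_b))
```

Because the comparison is stroke by stroke, two visually similar characters drawn with different numbers of strokes never match, which is why stroke-number variations are treated as separate styles later on.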

Prototype-based classifiers using DTW-based distances have been shown to be well suited for the handwriting recognition task by several researchers, and good recognition accuracies can be obtained if the prototype set has a good coverage of the different handwriting styles [15]. In this work, the DTW-based dissimilarity measure is used in the clustering algorithms as a distance measure.

2.2. Clustering and prototype selection

The character database was clustered in order to find all the different writing styles for each character class and to select a set of prototypes which captures the within-class style variations well. All the character classes and stroke number variations were treated separately. This approach does not take into account the between-class variations, and the found prototypes are not optimized in the sense of their classification capacity. For some previous works on prototype selection, see [2], [8], and [18].

Four different algorithms were used for the clustering of the character samples: TreeClust, MinSwap, and two variations of the C-means algorithm [4], named here CMeans1 and CMeans2. All four clustering algorithms were agglomerative and hierarchical. Clusters were represented by prototypes, which were the samples having the minimum sum of distances to the other samples in the same cluster.
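The prototype rule stated above (the sample with the minimum sum of distances to the other samples of its cluster) can be sketched as below. The function name `select_prototype` and the `dist` argument are hypothetical; in the paper's setting `dist` would be the DTW-based dissimilarity.

```python
def select_prototype(cluster, dist):
    """Return the index of the cluster member whose summed distance
    to all other members is smallest."""
    best_index, best_sum = None, float("inf")
    for i, candidate in enumerate(cluster):
        total = sum(dist(candidate, other)
                    for j, other in enumerate(cluster) if j != i)
        if total < best_sum:
            best_index, best_sum = i, total
    return best_index
```

Choosing an actual sample rather than an average is natural here, because DTW-compared pen traces of different lengths have no obvious mean.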

TreeClust, MinSwap, and CMeans2 started from a situation in which all the samples were prototypes, i.e. formed their own clusters, while in the beginning of the CMeans1 algorithm, only a random subset of the samples was selected to be the initial prototype set.

As the clustering algorithms proceeded, the number of clusters was reduced by merging clusters. In the TreeClust, CMeans1, and CMeans2 algorithms, the two clusters whose prototypes were most similar to each other were merged into one. The MinSwap algorithm tried several alternative mergings: first the clusters with the most similar prototype pair, then the clusters with the next most similar pair, etc.

A new prototype was selected among the samples which belonged to the new cluster. After that, MinSwap, CMeans1, and CMeans2 reassigned the samples into the clusters according to the closest prototypes and then reselected the prototypes. This was continued until a stable division was found. MinSwap did the same thing but also calculated how many of the samples were swapped out from the new cluster into the other clusters, or vice versa, and selected the alternative merging which gave rise to the minimum number of these swappings.
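The reassign-and-reselect loop described above might look like the following sketch. It covers only the stabilization step, not the merging or MinSwap's swap counting, and all names (`medoid`, `reassign_until_stable`, `max_rounds`) are hypothetical.

```python
def medoid(members, samples, dist):
    """Index (into samples) of the member with the minimum sum of distances."""
    return min(members,
               key=lambda c: sum(dist(samples[c], samples[o]) for o in members))

def reassign_until_stable(samples, prototypes, dist, max_rounds=100):
    """prototypes: indices into samples. Returns the stable prototype indices."""
    prototypes = sorted(prototypes)
    for _ in range(max_rounds):
        # Assign every sample to the cluster of its closest prototype.
        clusters = {p: [] for p in prototypes}
        for i, s in enumerate(samples):
            nearest = min(prototypes, key=lambda p: dist(s, samples[p]))
            clusters[nearest].append(i)
        # Reselect one prototype per non-empty cluster.
        updated = sorted(medoid(m, samples, dist)
                         for m in clusters.values() if m)
        if updated == prototypes:
            break  # stable division found
        prototypes = updated
    return prototypes
```

For example, with samples `[0, 1, 2, 10, 11, 12]`, absolute difference as the distance, and initial prototypes at indices 0 and 3, the loop settles on indices 1 and 4.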

The number of clusters was first determined automatically by using two clustering indices. However, it turned out that much better results could be obtained by selecting the prototypes by hand among the cluster centers found by the four clustering algorithms, because the results obtained with the different clustering algorithms and indices varied considerably [14]. This guaranteed that each different writing style found with any of the clustering algorithms was present in the final prototype set and that the prototypes were not too similar to each other. The total number of selected prototypes was 2591. Some of the selected prototypes can be seen in Figure 2. Even if some of the prototypes look very similar to each other, say the prototypes of the letter 'I' in the 5th and 6th rows or the prototypes of the digit '5' in the last row, they do have different numbers of strokes, different drawing orders, or different directions for the strokes.

2.3. Transforming dissimilarity into similarity

The dissimilarity measure obtained with the DTW algorithm has a range from zero to infinity and it depends on the numbers of data points and strokes. Therefore, the dissimilarities between strokes have been normalized by the number of data point matchings and the total dissimilarities have been divided by the number of strokes. After these normalizations, the dissimilarities have been transformed into similarity measures in the following way:

[Equation (1) is missing from the source text.]

The similarity measure is a decreasing function of the normalized dissimilarity measure and its range is between zero and one. The value of the transformation's parameter was selected so that the distribution of the similarity measures between character samples and their best matching correct prototypes is approximately even. In practice, this was achieved by fitting a linear function, which was defined by the same parameter, in the minimum squared error sense to the logarithm of the cumulative probability function of the dissimilarity measures.
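Equation (1) does not survive in this text, so the sketch below substitutes an exponential form, s = exp(-alpha * d), purely as an assumption: it is one function that decreases with the dissimilarity and stays between zero and one. The fitting step is likewise one possible reading of the description, a least-squares line through the origin fitted to the logarithm of the complementary empirical cumulative distribution; the names `similarity`, `fit_alpha`, and `alpha` are hypothetical.

```python
import math

def similarity(d, alpha):
    """Assumed transform: 1 at d = 0, tending to 0 as d grows."""
    return math.exp(-alpha * d)

def fit_alpha(dissimilarities):
    """Fit log(1 - F(d)) ~ -alpha * d by least squares through the origin,
    where F is the empirical cumulative distribution of the dissimilarities."""
    ds = sorted(dissimilarities)
    n = len(ds)
    num = den = 0.0
    for i, d in enumerate(ds):
        survival = 1.0 - (i + 0.5) / n  # midpoint estimate, never 0 or 1
        num += d * math.log(survival)
        den += d * d
    return -num / den
```

Under this assumption, if the dissimilarities were exactly exponentially distributed with rate alpha, the resulting similarity values would be uniformly distributed on the unit interval, which matches the stated goal of an approximately even distribution.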

2.4. Forming the writing style vectors

A writer's tendencies to use the prototypical styles for isolated characters are measured by average similarity values. The average similarity value of a prototype is calculated by: 1) evaluating the similarity values between the prototype and all the writer's character samples of the same class and having the same number of strokes, 2) summing up the similarity values, and 3) finally dividing the sum by the number of its terms. The average similarity values are concatenated into a writing style vector. The dimensionality of a writing style vector is the same as the size of the prototype set. If a subject had no samples at all for some class, all the average similarity values corresponding to that class were considered to be missing from the writing style vector and did not have any effect in the training of the SOM. If a writer had only one character sample for some class, his or her tendencies to use the prototypical styles of that class were estimated by single similarity values instead of averaged similarity values. In such cases, the writer's tendencies to use the prototypical styles consisting of a different number of strokes than the collected sample are zero. In addition, a single sample leads to an assumption that the writer uses only the writing style corresponding to the best matching prototype, as the similarity values between the sample and the other prototypes are in most cases very close to zero.

For the same reason, the sum of the average similarity values calculated for prototypes of the same class and having the same number of strokes is rarely over one.
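The three-step averaging and the treatment of missing classes can be sketched as follows. The tuple layout and the name `writing_style_vector` are hypothetical. Setting a component to zero when the writer has samples of the class but none with the prototype's stroke count is stated in the text only for the single-sample case; applying it generally is an assumption of this sketch.

```python
import math

def writing_style_vector(prototypes, writer_samples, similarity):
    """prototypes and writer_samples are lists of (char_class, n_strokes, data).
    similarity(prototype_data, sample_data) returns a value in [0, 1]."""
    classes_written = {c for c, _, _ in writer_samples}
    vector = []
    for p_class, p_strokes, p_data in prototypes:
        if p_class not in classes_written:
            vector.append(math.nan)  # missing value, ignored in SOM training
            continue
        # Steps 1-3: evaluate, sum, and divide by the number of terms.
        values = [similarity(p_data, s_data)
                  for s_class, s_strokes, s_data in writer_samples
                  if s_class == p_class and s_strokes == p_strokes]
        vector.append(sum(values) / len(values) if values else 0.0)
    return vector
```

Marking absent classes as NaN rather than zero keeps "never observed" distinct from "observed but not used", which matters when about a tenth of the values are missing.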

3. Data

The experiments were performed with two public databases: IRONOFF [13] and UNIPEN train_r01_v07 [5].

Only isolated digits and upper- and lowercase letters were used in the experiments. The two databases were combined into one, all the character samples were manually checked, and obviously erroneous ones were removed. Most of the erroneous samples were incorrectly segmented. In total, 3174 erroneous samples were found. The total number of samples in the cleaned database was 130831. These samples were written by 728 subjects. The subjects were of various ages and from several countries, and both handedness groups were represented. In my opinion, it is justified to assume that the database has a rather good coverage of the existing writing styles.

The character samples have been collected with pressure-sensitive displays or tablets which are able to record the x- and y-coordinates of a moving pen point. As there were several contributors and therefore many different collection software packages and devices, all the character samples were preprocessed so that their data points were similarly distributed. This was done by first interpolating straight lines between the original data points and then resampling new data points which were equally spaced on the estimated pen trace. In order to make the DTW-based comparison of the character samples reasonable, the size and location variations of characters were normalized. The mass centers of the characters were moved to the origin of the coordinate system. The characters were scaled so that the longer sides of the smallest boxes drawn around the characters and aligned with the coordinate axes had a constant value. The scaling of the characters was performed prior to the resampling. No other features were used for representing pen traces but the x- and y-coordinates.
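A rough sketch of this preprocessing is given below. Several details are assumptions not stated in the text: the mass center is taken as the mean of the data points, centering is done after resampling, and the `target` size and `spacing` values are placeholders. All function names are hypothetical.

```python
import math

def scale_character(strokes, target=1.0):
    """Scale so the longer side of the axis-aligned bounding box equals target."""
    xs = [x for s in strokes for x, _ in s]
    ys = [y for s in strokes for _, y in s]
    longer = max(max(xs) - min(xs), max(ys) - min(ys))
    k = target / longer if longer > 0 else 1.0
    return [[(x * k, y * k) for x, y in s] for s in strokes]

def resample_stroke(points, spacing):
    """Emit points equally spaced in arc length along the straight-line
    interpolation of the original points."""
    out = [points[0]]
    carried = 0.0  # arc length travelled since the last emitted point
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        pos = spacing - carried  # offset of the next output point on this segment
        while pos <= seg:
            t = pos / seg
            out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            pos += spacing
        carried = seg - (pos - spacing)
    return out

def center_character(strokes):
    """Move the mean of all data points to the origin."""
    pts = [p for s in strokes for p in s]
    cx = sum(x for x, _ in pts) / len(pts)
    cy = sum(y for _, y in pts) / len(pts)
    return [[(x - cx, y - cy) for x, y in s] for s in strokes]

def preprocess(strokes, target=1.0, spacing=0.05):
    scaled = scale_character(strokes, target)  # scaling precedes resampling
    resampled = [resample_stroke(s, spacing) for s in scaled]
    return center_character(resampled)
```

Scaling before resampling means that the point spacing is the same for every character in normalized units, so the number of points reflects the relative length of the pen trace rather than the original writing size.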

4. Creating a SOM of different writing styles

A SOM is a neural network in which the neurons are connected to each other so that they form a regular lattice. Each neuron acts both as an input and output neuron and is associated with a reference vector. The reference vectors are compared with the network's input. The outputs of the neurons depend on how similar the input and reference vectors are. The neuron whose reference vector is most similar to the input vector is called the best-matching map unit (BMU). During the training of the network, the reference vectors of the BMUs and their neighboring neurons are updated so that they better represent the input vectors, in this work the writing style vectors. Due to such training, different neurons will specialize in representing different areas of the input space. In addition, neurons near to each other in the neuron lattice tend to correspond to areas close to each other in the input space. Therefore, a SOM can be seen as a nonlinear mapping from the input space to the lower-dimensional lattice space. The SOM's ability to represent the training data faithfully depends on the true dimensionality of the data set and on the size and dimensionality of the neuron lattice.

As the main interest of this work is to find correlations between the writers, all the styles used by only a single writer were omitted from the writing style vectors. So, all the prototypes for which the average similarity was above 0.05 only for a single writer were considered uninteresting. This way, the dimensionality of the writing style vectors was reduced from 2591 down to 1764. The kept prototypes were used by 146 subjects on average. Approximately 11% of the average similarity values were missing from the writing style vectors. The 1764-dimensional writing style vectors were further analyzed with a SOM in the hope of finding interesting structures such as clusters of writers.
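The pruning rule can be sketched as a filter over the vector components. The function name `keep_shared_prototypes` is hypothetical, and missing values are assumed to be stored as NaN. The sketch also drops components that no writer uses above the threshold, a case the text does not discuss.

```python
import math

def keep_shared_prototypes(vectors, threshold=0.05):
    """Return the indices of components whose value exceeds the threshold
    for at least two writers."""
    kept = []
    for j in range(len(vectors[0])):
        users = sum(1 for v in vectors
                    if not math.isnan(v[j]) and v[j] > threshold)
        if users >= 2:
            kept.append(j)
    return kept
```

The returned indices can then be used to slice every writing style vector down to the shared components before SOM training.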

Various alternatives for the SOM's size, lattice, neighborhood function, training algorithm, training parameters and epochs, initialization, and updating rule were experimented with. Different SOMs were compared with each other by using two quality measures: quantization error and ability to preserve the topology of the data. The former measure is the average distance between each writing style vector and its BMU. The latter one is the proportion of all data vectors for which the first and second BMUs are not adjacent units.
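The two quality measures can be sketched as below. Skipping missing (NaN) components in the distance is an assumption about how missing values are handled, and the caller supplies the lattice adjacency through the hypothetical `adjacent` argument; the other names are hypothetical as well.

```python
import math

def masked_distance(x, w):
    """Euclidean distance over the components of x that are not missing."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, w)
                         if not math.isnan(a)))

def two_best_units(x, codebook):
    order = sorted(range(len(codebook)),
                   key=lambda k: masked_distance(x, codebook[k]))
    return order[0], order[1]

def quality(data, codebook, adjacent):
    """Return (quantization error, proportion of vectors whose first and
    second BMUs are not adjacent). adjacent(i, j) -> bool."""
    q_sum = 0.0
    non_adjacent = 0
    for x in data:
        first, second = two_best_units(x, codebook)
        q_sum += masked_distance(x, codebook[first])
        if not adjacent(first, second):
            non_adjacent += 1
    return q_sum / len(data), non_adjacent / len(data)
```

A low quantization error alone can be reached by a map that folds over itself, so the second measure is needed to check that neighboring units really represent neighboring regions of the data.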

Figure 1. U-matrix formed for the 1764-dimensional writing style vectors.

The size of the SOM was fixed to 20 × 10 neuron units, which is approximately 30% of the number of writers. The topology of the map was selected to be a sheet with a hexagonal lattice and a Gaussian neighborhood function. A linear initialization along the first two principal directions of the data proved to produce better results than a random initialization. The batch training algorithm was applied with the Euclidean metric, as their combination provided much faster and more reliable convergence than an on-line training algorithm or a metric based on the angle between two vectors.

The training was carried out in three phases. In the first phase, rough training, the radius of the neighborhood was linearly decreased from 10 to 6 during 10 training epochs. In the second phase, the radius was decreased from 5 to 3 during 50 epochs. Finally, in the fine-tuning phase, the radius was decreased from 2 to 1 during 100 epochs. An epoch means that the BMUs are found for all the training samples and total errors are calculated for all the map units, both for the BMUs and their neighboring map units on the hexagonal lattice. The neighborhood function and its radius determine how the errors are distributed to the map units around the BMUs. After finding the total errors, all the map units are then updated simultaneously on the basis of the total errors so that they better represent the training samples. The number of epochs in the fine-tuning phase is perhaps unnecessarily large, but there was no need to optimize it, as the batch training was rather fast, taking about ten minutes in total. The proportion of all writing style vectors for which the first and second best-matching map units were not adjacent was 0.01. Therefore, it can be said that the map preserves the local topological relations of the writing style vectors rather well.
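A compact sketch of batch training with the three-phase radius schedule follows. It uses the common batch-SOM update in which each unit becomes the neighborhood-weighted mean of the training vectors; treating this as equivalent to the "total errors" description is an assumption. Other assumptions: odd rows of the hexagonal lattice are shifted by half a unit, missing values are NaN and skipped, and the caller supplies the initial codebook (the linear initialization is not shown). All names are hypothetical.

```python
import math

def hex_positions(rows, cols):
    """Lattice coordinates; adjacent units are at distance 1."""
    return [(c + 0.5 * (r % 2), r * math.sqrt(3) / 2)
            for r in range(rows) for c in range(cols)]

def bmu(x, codebook):
    def d2(w):
        return sum((a - b) ** 2 for a, b in zip(x, w) if not math.isnan(a))
    return min(range(len(codebook)), key=lambda k: d2(codebook[k]))

def batch_epoch(data, codebook, positions, radius):
    dim = len(codebook[0])
    bmus = [bmu(x, codebook) for x in data]  # found before any unit changes
    updated = []
    for k, (px, py) in enumerate(positions):
        num = [0.0] * dim
        den = [0.0] * dim
        for x, b in zip(data, bmus):
            bx, by = positions[b]
            g2 = (px - bx) ** 2 + (py - by) ** 2
            h = math.exp(-g2 / (2.0 * radius ** 2))  # Gaussian neighborhood
            for j, v in enumerate(x):
                if not math.isnan(v):
                    num[j] += h * v
                    den[j] += h
        # Keep the old value where no data contributed to a component.
        updated.append([n / d if d > 0 else w
                        for n, d, w in zip(num, den, codebook[k])])
    return updated  # all units are replaced simultaneously

def train(data, codebook, positions,
          phases=((10, 6, 10), (5, 3, 50), (2, 1, 100))):
    """phases: (start radius, end radius, epochs), radius decreasing linearly."""
    for start, end, epochs in phases:
        for e in range(epochs):
            radius = start + (end - start) * e / max(epochs - 1, 1)
            codebook = batch_epoch(data, codebook, positions, radius)
    return codebook
```

Because every unit is recomputed from the same set of BMUs, the result of an epoch does not depend on the order of the training vectors, which is one reason batch training tends to converge more predictably than on-line updates.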

5. Analysis of the writing style map

The U-matrix of a SOM is helpful in detecting clusters on the map. Its coloring is based on the distances between neighboring map units. Areas in which the neighboring map units are similar to each other are colored with dark gray, whereas light shades indicate that the differences between the neighboring units are more significant. Therefore, clusters of personal writing styles can be seen on the U-matrix as dark areas surrounded by lighter areas. The SOM can also be visualized with images colored according to the values of the components of the reference vectors. These images are called component planes. Component planes show how the tendencies to use the corresponding prototypical character styles vary over the map.
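A simplified version of the U-matrix idea can be sketched as one value per map unit: the average distance between its reference vector and those of its lattice neighbors. A full U-matrix also has cells between the units; this reduced form and the names `u_heights` and `reach` are assumptions made for brevity.

```python
import math

def u_heights(codebook, positions, reach=1.01):
    """Average reference-vector distance from each unit to the units
    within lattice distance `reach` (its immediate neighbors)."""
    heights = []
    for k, (px, py) in enumerate(positions):
        dists = []
        for m, (qx, qy) in enumerate(positions):
            if m != k and math.hypot(px - qx, py - qy) <= reach:
                dists.append(math.sqrt(sum(
                    (a - b) ** 2 for a, b in zip(codebook[k], codebook[m]))))
        heights.append(sum(dists) / len(dists) if dists else 0.0)
    return heights
```

On a four-unit chain with reference vectors 0.0, 0.1, 5.0, and 5.1, the two end units get small values and the two middle units get large ones, which corresponds to two dark clusters separated by a light border.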

The U-matrix and some interesting component planes of the constructed SOM are shown in Figures 1 and 2. It can be seen from the U-matrix of the SOM that the writing styles can roughly be divided into several clusters. There are small clusters in the left and right lower corners of the SOM surface, a slightly bigger one above them on the vertical middle line of the map, three small clusters on the right edge of the map, a triangular-shaped cluster near the upper edge of the map, and three clusters on the left edge on and above the horizontal middle line of the map.

The interesting component planes are those which show significant variance between the map units. Here, the component planes whose range is at least 0.30 have been selected for further examination. In these cases, it can be claimed that there really are some differences in the ten-

Figure 2. Some interesting component planes with the corresponding prototypes.