
Contents

1 Introduction
  1.1 Quick refresher: Supervised learning
  1.2 The basic idea of Unsupervised learning

2 Cluster analysis
  2.1 K-means clustering
    2.1.1 Variations of the K-means algorithm
    2.1.2 Gaussian Mixture Models
    2.1.3 Python implementations
  2.2 Hierarchical clustering
    2.2.1 Basic Agglomerative Hierarchical Clustering Algorithm
    2.2.2 Defining Proximity between Clusters
    2.2.3 Group Average
    2.2.4 Ward's method
    2.2.5 An instructive example
  2.3 DBSCAN
    2.3.1 The DBSCAN algorithm

3 Anomaly detection
  3.1 Local Outlier Factor - LOF
    3.1.1 Python implementation

Chapter 1

Introduction

1.1 Quick refresher: Supervised learning

In the previous semester, we had the opportunity to study the concepts of supervised learning, the mathematical structures underneath it, and to analyze various models like the Perceptron, Adaline, SVM and so on. The basic idea was to use the N pre-labeled data points (x^(i), y^(i)), 1 ≤ i ≤ N, and build a hyperplane that would separate (in the "one versus the rest" case) one class of data points from the other class(es). The points x^(i) ∈ R^(n+1) were called data points, each having n characteristic features (where the first component was chosen to be 1 by convention), whereas the y^(i) were called labels. The labels can have, in the case of a binary classification problem, the values −1 and 1. In the multiclass regime, the labels can take a value from a larger range of values. Afterwards, we would insert other data points x^(i), i > N, and expect our algorithm to classify them correctly, i.e. label them with the correct value y^(i), i > N.

So, we formulated the binary classification problem: given pre-labeled data, we want to find a hyperplane (which is characterized by a vector normal to the plane, which we usually called w) which separates our pre-labeled data accurately, and also successfully separates new data. We took a step further by introducing a certain class of functions, which we called activation functions (Adaline), which would help us improve our algorithm, afterwards allowing us to use the tools of optimization theory, defining loss functions, minimizing them, and so on (like we did for the SVM).
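As a tiny reminder of that setup, here is an illustrative Python sketch (the weight values are made up for the example): with the first component fixed to 1, a learned normal vector w labels a point by the side of the hyperplane w·x = 0 it falls on:

import numpy as np

w = np.array([-0.5, 1.0, 1.0])             # hypothetical learned weights, bias folded into w[0]

def predict(features):
    """Return the label -1 or 1 according to the sign of w.x."""
    x = np.concatenate(([1.0], features))  # prepend the conventional first component 1
    return 1 if w @ x >= 0 else -1

print(predict([2.0, 3.0]), predict([-2.0, -3.0]))   # -> 1 -1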


Figure 1.1: The Perceptron model in action

1.2 The basic idea of Unsupervised learning

Unsupervised learning is a branch of machine learning that learns from test data that has not been labeled, classified or categorized. Instead of responding to feedback, unsupervised learning identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data.

The data given to an unsupervised algorithm are not labelled, which means only the input variables, i.e. the data points x^(i), are given, with no corresponding output variables. So the machine has to learn on its own to "separate data", without any training. This is the first and most important difference in comparison to supervised learning.

The first idea that one could come up with would be to visualize the data, and the person would be in charge of finding structure/patterns in the data points. This is, however, a problem if the number of data points gets huge, not to mention the problem that the feature space could have a dimension larger than 3. But suppose that the number of dimensions is the only problem. Another method that could work is called dimensionality reduction. Using this method, we focus only on the relevant dimensions. However, we would then need a criterion for defining "relevant" and "irrelevant" dimensions.

So the problem must be tackled with some other methods. Let us therefore look at a simple example of non-labeled data points:


Figure 1.2: Two clusters of unlabeled data points

We clearly see that there are (at least) two "bunches" of data, so-called clusters. But how can we quantitatively conclude that there are two clusters? On one hand, for each cluster, there is a certain domain of the feature space characteristic for it. On the other hand, we can measure the distance between the points; the distance between the points that belong to the same cluster is smaller than the distance between two points that are from different clusters. This will be a guiding principle for building up algorithms.

Some of the most popular algorithms for unsupervised learning include clustering algorithms, among which are the K-means clustering algorithm, hierarchical clustering, DBSCAN, mixture models and so on. Other algorithms work on the principle of anomaly detection (like the Local Outlier Factor algorithm). Of course, there are far more algorithms than we have mentioned, with other underlying principles, which we will briefly discuss later.

The algorithms for these methods can be found in the Python module scikit-learn. scikit-learn is a Python module for machine learning built on top of

SciPy: https://github.com/scikit-learn
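As a first taste, here is a minimal scikit-learn sketch (the synthetic two-bunch data stands in for Figure 1.2):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal((0.0, 0.0), 0.3, size=(50, 2)),    # first "bunch"
               rng.normal((3.0, 3.0), 0.3, size=(50, 2))])   # second "bunch"

km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(X)           # cluster index (0 or 1) for every point
print(km.cluster_centers_)           # estimated centres, close to (0,0) and (3,3)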

Chapter 2

Cluster analysis

Cluster analysis groups data objects based only on information found in the data that describes the objects and their relationships. The goal is that the objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better or more distinct the clustering.

An entire collection of clusters is commonly referred to as a clustering, and in this section, we distinguish various types of clusterings: hierarchical (nested) versus partitional (unnested), exclusive versus overlapping versus fuzzy, and complete versus partial.

A partitional clustering is simply a division of the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. If we permit clusters to have subclusters, then we obtain a hierarchical clustering, which is a set of nested clusters that are organized as a tree. Each node (cluster) in the tree (except for the leaf nodes) is the union of its children (subclusters), and the root of the tree is the cluster containing all the objects. Exclusive clusterings assign each object to a single cluster. There are many situations in which a point could reasonably be placed in more than one cluster, and these situations are better addressed by non-exclusive clustering. In the most general sense, an overlapping or non-exclusive clustering is used to reflect the fact that an object can simultaneously belong to more than one group (class). In a fuzzy clustering, every object belongs to every cluster with a membership weight that is between 0 (absolutely doesn't belong) and 1 (absolutely belongs). With such a clustering, we address situations where data points could belong to more clusters. Usually, fuzzy clustering is combined with partitional clustering, by assigning data points to the clusters with the highest probability.


Now that we have defined the basic terms, let us investigate some algorithms.

2.1 K-means clustering

In this section, we will mainly follow Bishop (2006)¹.

Suppose we have a data set of N data points x^(i), 1 ≤ i ≤ N, which live in the d-dimensional feature space. Our goal is to partition the data set into some number K of clusters, where we shall suppose for the moment that the value of K is given. We can formalize this notion by first introducing a set of d-dimensional vectors μ_k, where k = 1, ..., K, in which μ_k is a prototype associated with the k-th cluster. As we shall see shortly, we can think of the μ_k as representing the centres of the clusters. Our goal is then to find an assignment of data points to clusters, as well as a set of vectors μ_k, such that the sum of the squares of the distances of each data point to its closest vector μ_k is a minimum.

It is convenient at this point to define some notation to describe the assignment of data points to clusters. For each data point x^(n), we introduce a corresponding set of binary indicator variables r_nk ∈ {0, 1}, where k = 1, ..., K, describing which of the K clusters the data point x^(n) is assigned to, so that if data point x^(n) is assigned to cluster k, then r_nk = 1 and r_nj = 0 for j ≠ k. This is known as the 1-of-K coding scheme. We can then define a loss function, sometimes called a distortion measure, given by:

    J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_nk ||x_n − μ_k||²    (2.1)

which represents the sum of the squares of the distances of each data point to its assigned vector μ_k. Our goal is to find values for the r_nk and the {μ_k} so as to minimize J. We can do this through an iterative procedure in which each iteration involves two successive steps corresponding to successive optimizations with respect to the r_nk and the μ_k.
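To make the distortion measure (2.1) concrete, here is a minimal NumPy sketch (the toy numbers are made up for illustration):

import numpy as np

def distortion(X, mu, r):
    """J = sum_n sum_k r_nk ||x_n - mu_k||^2, equation (2.1)."""
    # X: (N, d) data points, mu: (K, d) prototypes, r: (N, K) 1-of-K assignments
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    return (r * sq_dists).sum()

X  = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0]])
mu = np.array([[0.0, 0.5], [4.0, 4.0]])
r  = np.array([[1, 0], [1, 0], [0, 1]])    # first two points in cluster 1, third in cluster 2
print(distortion(X, mu, r))                # 0.25 + 0.25 + 0.0 = 0.5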

First we choose some initial values for the μ_k. Then in the first phase we minimize J with respect to the r_nk, keeping the μ_k fixed. In the second phase we minimize J with respect to the μ_k, keeping the r_nk fixed. This two-stage optimization is then repeated until convergence. We shall see that these two stages of updating the r_nk and updating the μ_k correspond respectively to the E (expectation) and M (maximization) steps of the EM algorithm², and to emphasize this we shall use the terms E step and M step in the context of the K-means algorithm.

¹ Christopher M. Bishop - Pattern Recognition and Machine Learning, Springer Verlag
² The EM algorithm is a much broader and more general algorithm, but we will not deal with this algorithm in detail.

Consider first the determination of the r_nk. Because J in (2.1) is a linear function of r_nk, this optimization can be performed easily to give a closed form solution. The terms involving different n are independent, and so we can optimize for each n separately by choosing r_nk to be 1 for whichever value of k gives the minimum value of ||x_n − μ_k||². In other words, we simply assign the n-th data point to the closest cluster centre. More formally, this can be expressed as

    r_nk = 1 if k = argmin_j ||x_n − μ_j||²,
    r_nk = 0 otherwise.

Now consider the optimization of the μ_k with the r_nk held fixed. The loss function J is a quadratic function of μ_k, and it can be minimized by setting its derivative with respect to μ_k to zero. This will give us:

    2 Σ_{n=1}^{N} r_nk (x_n − μ_k) = 0

which we can easily solve for μ_k:

    μ_k = (Σ_n r_nk x_n) / (Σ_n r_nk)    (2.2)

The denominator in this expression is equal to the number of points assigned to cluster k, and so this result has a simple interpretation, namely: set μ_k equal to the mean of all of the data points x_n assigned to cluster k. For this reason, the procedure is called the K-means algorithm.

The two phases of re-assigning data points to clusters and re-computing the cluster means are repeated in turn until there is no further change in the assignments (or until some maximum number of iterations is exceeded). Because each phase reduces the value of the objective function J, convergence of the algorithm is assured. However, it may converge to a local rather than a global minimum of J. The convergence properties of the K-means algorithm were studied by MacQueen (1967)³. Let us now summarize how the algorithm looks and also prove its convergence.

³ J. MacQueen - Some methods for classification and analysis of multivariate observations
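In NumPy, the two closed-form updates are essentially one line each; the following is an illustrative sketch (e_step and m_step are names of mine, not from the text):

import numpy as np

def e_step(X, mu):
    """Assign each point to its closest centre: k = argmin_j ||x_n - mu_j||^2."""
    sq_dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    return sq_dists.argmin(axis=1)                                   # cluster index per point

def m_step(X, assign, K):
    """Re-estimate each centre as the mean of its assigned points, equation (2.2)."""
    # An empty cluster would need re-seeding; ignored here for brevity.
    return np.array([X[assign == k].mean(axis=0) for k in range(K)])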


K-means algorithm (pseudocode)

1  for k = 1 to K do
2      μ_k ← some random location         // randomly initialize mean for k-th cluster
3  end for
4  repeat
5      for n = 1 to N do
6          r_nk ← argmin_k ||x_n − μ_k||²  // E step: assign n-th data point to closest centre
7      end for
8      for k = 1 to K do
9          μ_k ← MEAN_k(x_n, r_nk)         // M step: re-estimate mean of cluster k with (2.2)
10     end for
11 until converged
12 return r                                // return cluster assignments
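As a rough Python translation of this pseudocode (a sketch of mine; the document's own implementations come later in section 2.1.3):

import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Lloyd's K-means, following the pseudocode above."""
    rng = np.random.default_rng(seed)
    # Lines 1-3: initialize the means; here we draw K distinct data points,
    # one common variant of "some random location".
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = None
    for _ in range(max_iter):                                   # line 4: repeat
        sq = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        new_assign = sq.argmin(axis=1)                          # lines 5-7: E step
        if assign is not None and np.array_equal(new_assign, assign):
            break                                               # line 11: until converged
        assign = new_assign
        for k in range(K):                                      # lines 8-10: M step, eq. (2.2)
            if np.any(assign == k):                             # guard against empty clusters
                mu[k] = X[assign == k].mean(axis=0)
    return assign, mu                                           # line 12: return assignments

On the two-bunch data from the earlier scikit-learn sketch, kmeans(X, 2) recovers the same grouping up to a permutation of the labels.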

Theorem (K-means Convergence Theorem): For any data set D and any number of clusters K, the K-means algorithm converges in a finite number of iterations, where convergence is measured by J ceasing to change. We follow the proof given by Hal Daumé III in his book A Course in Machine Learning.

Proof: The proof works as follows. There are only two points at which the K-means algorithm changes the values of μ_k or r_nk: lines 6 and 9. We will show that neither of these operations can ever increase the value of J. Assuming this is true, the rest of the argument is as follows. After the first pass through the data, there are only finitely many possible assignments to r_nk and μ_k, because r_nk is discrete and because μ_k can only take on a finite number of values: means of some subset of the data. Furthermore, J is lower-bounded by zero (since J is just a sum of Euclidean (L²) distances). Together, this means that J cannot decrease more than a finite number of times. Thus, it must stop decreasing at some point, and at that point the algorithm has converged.

It remains to show that lines 6 and 9 decrease J. For line 6, when looking at example n, suppose that the value of r_na was 1, and now r_nb is 1 and r_na is zero. It must be the case that ||x_n − μ_b|| ≤ ||x_n − μ_a||. Thus, changing from a to b can only decrease J. For line 9, consider the second form of J. Line 9 computes μ_b as the mean of the data points for which r_nb = 1, which are precisely the points that minimize the squared distances. Thus, this update to μ_k can only decrease J. ∎

There are several aspects of K-means that are unfortunate. First, the convergence is only to a local optimum of J. In practice, this means that you should usually run it 10 times with different initializations and pick the one with minimal resulting J. Second, one can show that there are input data sets and initializations on which it might take an exponential amount of time to converge. Fortunately, these cases almost never happen in practice, and in fact it has recently been shown that (roughly) if you limit the floating point precision of your machine, K-means will converge in polynomial time (though still only to a local optimum), using techniques of smoothed analysis.⁴
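In code, this restart strategy is a few lines on top of the kmeans sketch given after the pseudocode (again an illustration; best_of_restarts is a made-up helper name). scikit-learn's KMeans does the same thing via its n_init parameter.

def best_of_restarts(X, K, n_restarts=10):
    """Run K-means from several random initializations and keep the smallest J."""
    best = None
    for seed in range(n_restarts):
        assign, mu = kmeans(X, K, seed=seed)
        J = ((X - mu[assign]) ** 2).sum()        # distortion (2.1) of this solution
        if best is None or J < best[0]:
            best = (J, assign, mu)
    return best                                  # (J, assignments, means) with minimal J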

2.1.1 Variations of the K-means algorithm

A direct implementation of the K-means algorithm as discussed here can be relatively slow, because in each E step it is necessary to compute the Euclidean distance between every prototype vector and every data point. Various schemes have been proposed for speeding up the K-means algorithm, some of which are based on precomputing a data structure such as a tree such that nearby points are in the same subtree (Ramasubramanian and Paliwal, 1990; Moore, 2000). Other approaches make use of the triangle inequality for distances, thereby avoiding unnecessary distance calculations (Hodgson, 1998; Elkan, 2003).

The biggest practical issue in K-means is initialization. If the cluster means are initialized poorly, you often get convergence to uninteresting solutions. It can also happen that we have two centroids concentrated in one cluster. A useful heuristic is the furthest-first heuristic. This gives a way to perform a semi-random initialization that attempts to pick initial means as far from each other as possible. The heuristic is sketched below:

• Pick a random example m and set μ_1 = x^(m).
• For k = 2, ..., K: find the example m that is as far as possible from all previously selected means, namely m = argmax_m min_{k'<k} ||x^(m) − μ_{k'}||², and set μ_k = x^(m).
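A minimal NumPy sketch of this heuristic (illustrative; furthest_first_init is a name of mine, not from the text):

import numpy as np

def furthest_first_init(X, K, seed=0):
    """Pick K initial means: the first at random, each next one furthest from those chosen so far."""
    rng = np.random.default_rng(seed)
    mu = [X[rng.integers(len(X))]]                       # mu_1 = a random example
    for _ in range(1, K):
        # squared distance of every point to its nearest already-chosen mean
        d = np.min([((X - m) ** 2).sum(axis=1) for m in mu], axis=0)
        mu.append(X[np.argmax(d)])                       # argmax_m min_k' ||x^(m) - mu_k'||^2
    return np.array(mu)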