Cluster Analysis

© A. Kassambara 2015. Multivariate Analysis

Alboukadel Kassambara. Practical Guide To Cluster Analysis in R

Edition 1. sthda.com

Unsupervised Machine Learning

Large amounts of data are collected every day from satellite images, biomedical, security, marketing, web search, geospatial or other automatic equipment. Mining knowledge from these big data far exceeds human abilities. Clustering is one of the important data mining methods for discovering knowledge in multidimensional data. The goal of clustering is to identify patterns or groups of similar objects within a data set of interest. In the literature, it is referred to as "pattern recognition" or "unsupervised machine learning": "unsupervised" because we are not guided by a priori ideas of which variables or samples belong in which clusters, and "learning" because the machine algorithm "learns" how to cluster.




Cluster analysis is popular in many fields, including:

• In cancer research, for classifying patients into subgroups according to their gene expression profile. This can be useful for identifying the molecular profile of patients with good or bad prognosis, as well as for understanding the disease.
• In marketing, for market segmentation by identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising.
• In city planning, for identifying groups of houses according to their type, value and location.

This book provides a practical guide to unsupervised machine learning or cluster analysis using R software. Additionally, we developed an R package named factoextra to easily create elegant ggplot2-based plots of cluster analysis results. Factoextra official online documentation: http://www.sthda.com/english/rpkgs/factoextra

0.2 About the author

Alboukadel Kassambara has a PhD in Bioinformatics and Cancer Biology. He has worked for many years on genomic data analysis and visualization. He created a bioinformatics tool named GenomicScape (www.genomicscape.com), which is an easy-to-use web tool for gene expression data analysis and visualization. He also developed a website called STHDA (Statistical Tools for High-throughput Data Analysis, www.sthda.com/english), which contains many tutorials on data analysis and visualization using R software and packages.

He is the author of the R packages survminer (for analyzing and drawing survival curves), ggcorrplot (for drawing correlation matrices using ggplot2) and factoextra (to easily extract and visualize the results of multivariate analyses such as PCA, CA, MCA and clustering). You can learn more about these packages at: http://www.sthda.com/english/wiki/r-packages

Recently, he published two books on data visualization:

1. Guide to Create Beautiful Graphics in R (at: https://goo.gl/vJ0OYb).

2. Complete Guide to 3D Plots in R (at: https://goo.gl/v5gwl0).


0.1 Preface
0.2 About the author
0.3 Key features of this book
0.4 How this book is organized?
0.5 Book website
0.6 Executing the R codes from the PDF

I Basics

1 Introduction to R
1.1 Install R and RStudio
1.2 Installing and loading R packages
1.3 Getting help with functions in R
1.4 Importing your data into R
1.5 Demo data sets
1.6 Close your R/RStudio session

2 Data Preparation and R Packages
2.1 Data preparation
2.2 Required R Packages

3 Clustering Distance Measures
3.1 Methods for measuring distances
3.2 What type of distance measures should we choose?
3.3 Data standardization
3.4 Distance matrix computation
3.5 Visualizing distance matrices
3.6 Summary



II Partitioning Clustering

4 K-Means Clustering
4.1 K-means basic ideas
4.2 K-means algorithm
4.3 Computing k-means clustering in R
4.4 K-means clustering advantages and disadvantages
4.5 Alternative to k-means clustering
4.6 Summary

5 K-Medoids
5.1 PAM concept
5.2 PAM algorithm
5.3 Computing PAM in R
5.4 Summary

6 CLARA - Clustering Large Applications
6.1 CLARA concept
6.2 CLARA Algorithm
6.3 Computing CLARA in R
6.4 Summary

III Hierarchical Clustering

7 Agglomerative Clustering
7.1 Algorithm
7.2 Steps to agglomerative hierarchical clustering
7.3 Verify the cluster tree
7.4 Cut the dendrogram into different groups
7.5 Cluster R package
7.6 Application of hierarchical clustering to gene expression data analysis
7.7 Summary

8 Comparing Dendrograms
8.1 Data preparation
8.2 Comparing dendrograms

9 Visualizing Dendrograms
9.1 Visualizing dendrograms
9.2 Case of dendrogram with large data sets
9.3 Manipulating dendrograms using dendextend
9.4 Summary

10 Heatmap: Static and Interactive
10.1 R Packages/functions for drawing heatmaps
10.2 Data preparation
10.3 R base heatmap: heatmap()
10.4 Enhanced heatmaps: heatmap.2()
10.5 Pretty heatmaps: pheatmap()
10.6 Interactive heatmaps: d3heatmap()
10.7 Enhancing heatmaps using dendextend
10.8 Complex heatmap
10.9 Application to gene expression matrix
10.10 Summary

IV Cluster Validation

11 Assessing Clustering Tendency
11.1 Required R packages
11.2 Data preparation
11.3 Visual inspection of the data
11.4 Why assessing clustering tendency?
11.5 Methods for assessing clustering tendency
11.6 Summary

12 Determining the Optimal Number of Clusters
12.1 Elbow method
12.2 Average silhouette method
12.3 Gap statistic method
12.4 Computing the number of clusters using R
12.5 Summary

13 Cluster Validation Statistics
13.1 Internal measures for cluster validation
13.2 External measures for clustering validation
13.3 Computing cluster validation statistics in R
13.4 Summary

14 Choosing the Best Clustering Algorithms
14.1 Measures for comparing clustering algorithms
14.2 Compare clustering algorithms in R
14.3 Summary

15 Computing P-value for Hierarchical Clustering
15.1 Algorithm
15.2 Required packages
15.3 Data preparation
15.4 Compute p-value for hierarchical clustering

V Advanced Clustering

16 Hierarchical K-Means Clustering
16.1 Algorithm
16.2 R code
16.3 Summary

17 Fuzzy Clustering
17.1 Required R packages
17.2 Computing fuzzy clustering
17.3 Summary

18 Model-Based Clustering
18.1 Concept of model-based clustering
18.2 Estimating model parameters
18.3 Choosing the best model
18.4 Computing model-based clustering in R
18.5 Visualizing model-based clustering

19 DBSCAN: Density-Based Clustering
19.1 Why DBSCAN?
19.2 Algorithm
19.3 Advantages
19.4 Parameter estimation
19.5 Computing DBSCAN
19.6 Method for determining the optimal eps value
19.7 Cluster predictions with DBSCAN algorithm

20 References and Further Reading


0.3 Key features of this book

Although there are several good books on unsupervised machine learning/clustering and related topics, we felt that many of them are either too high-level, theoretical or too advanced. Our goal was to write a practical guide to cluster analysis, elegant visualization and interpretation.

The main parts of the book include:

• distance measures,
• partitioning clustering,
• hierarchical clustering,
• cluster validation methods, as well as
• advanced clustering methods such as fuzzy clustering, density-based clustering and model-based clustering.

The book presents the basic principles of these tasks and provides many examples in R. This book offers solid guidance in data mining for students and researchers.

Key features:

• Covers clustering algorithms and their implementation
• Key mathematical concepts are presented
• Short, self-contained chapters with practical examples. This means that you don't need to read the different chapters in sequence.

At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter.


0.4 How this book is organized?

This book contains 5 parts. Part I (Chapters 1-3) provides a quick introduction to R (Chapter 1) and presents required R packages and data format (Chapter 2) for clustering analysis and visualization.

The classification of objects into clusters requires some methods for measuring the distance or the (dis)similarity between the objects. Chapter 3 covers the common distance measures used for assessing similarity between observations.

Part II starts with partitioning clustering methods, which include:

• K-means clustering (Chapter 4),
• K-Medoids or PAM (partitioning around medoids) algorithm (Chapter 5) and
• CLARA algorithms (Chapter 6).

Partitioning clustering approaches subdivide the data sets into a set of k groups, where k is the number of groups pre-specified by the analyst.
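To make the two ideas above concrete (a distance between observations, and a partition into k groups chosen in advance), here is a minimal sketch using only base R. The choice of the built-in USArrests data set, the Euclidean distance, and k = 4 are assumptions made for illustration only; they are not prescribed by the text above.

```r
# Demo data (assumed for illustration): built-in USArrests,
# 50 US states described by 4 numeric variables.
data("USArrests")
df <- scale(na.omit(USArrests))  # standardize: each variable gets mean 0, sd 1

# Pairwise Euclidean distances between observations.
d <- dist(df, method = "euclidean")
round(as.matrix(d)[1:3, 1:3], 2)  # distances among the first three states

# Partitioning clustering: k is fixed by the analyst before running.
set.seed(123)                     # k-means starts from random centers
k <- 4                            # arbitrary choice for this sketch
km <- kmeans(df, centers = k, nstart = 25)

table(km$cluster)                 # number of observations in each cluster
head(km$cluster)                  # cluster assigned to the first few states
```

Changing `k` changes the partition, which is why the analyst's choice of the number of groups matters for partitioning methods.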




























