Multivariate Analysis
Alboukadel Kassambara
Practical Guide To Cluster Analysis in R
© A. Kassambara 2015
Edition 1
sthda.com
Unsupervised Machine Learning

Copyright ©2017 by Alboukadel Kassambara. All rights reserved.
Published by STHDA (http://www.sthda.com), Alboukadel Kassambara
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, without the prior written permission of the Publisher. Requests to the Publisher for permission should be addressed to STHDA (http://www.sthda.com).
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials.
Neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.
For general information, contact Alboukadel Kassambara.
0.1 Preface
Large amounts of data are collected every day from satellite images, bio-medical, security, marketing, web search, geo-spatial or other automatic equipment. Mining knowledge from these big data far exceeds human abilities.
Clustering is one of the important data mining methods for discovering knowledge in multidimensional data. The goal of clustering is to identify patterns or groups of similar objects within a data set of interest.
In the literature, it is referred to as "pattern recognition" or "unsupervised machine learning": "unsupervised" because we are not guided by a priori ideas of which variables or samples belong in which clusters, and "learning" because the machine algorithm "learns" how to cluster.
Cluster analysis is popular in many fields, including:
• In cancer research, for classifying patients into subgroups according to their gene expression profile. This can be useful for identifying the molecular profile of patients with good or bad prognosis, as well as for understanding the disease.
• In marketing, for market segmentation by identifying subgroups of customers with similar profiles who might be receptive to a particular form of advertising.
• In city planning, for identifying groups of houses according to their type, value and location.
This book provides a practical guide to unsupervised machine learning or cluster analysis using R software. Additionally, we developed an R package named factoextra to easily create elegant ggplot2-based plots of cluster analysis results. Factoextra official online documentation: http://www.sthda.com/english/rpkgs/factoextra

0.2 About the author
Alboukadel Kassambara holds a PhD in Bioinformatics and Cancer Biology. He has worked for many years on genomic data analysis and visualization. He created a bioinformatics tool named GenomicScape (www.genomicscape.com), an easy-to-use web tool for gene expression data analysis and visualization.
He also developed a website called STHDA (Statistical Tools for High-throughput Data Analysis, www.sthda.com/english), which contains many tutorials on data analysis and visualization using R software and packages.
He is the author of the R packages survminer (for analyzing and drawing survival curves), ggcorrplot (for drawing correlation matrices using ggplot2) and factoextra (to easily extract and visualize the results of multivariate analyses such as PCA, CA, MCA and clustering). You can learn more about these packages at: http://www.sthda.com/english/wiki/r-packages
Recently, he published two books on data visualization:
1. Guide to Create Beautiful Graphics in R (at: https://goo.gl/vJ0OYb).
2. Complete Guide to 3D Plots in R (at: https://goo.gl/v5gwl0).

Contents
0.1 Preface
0.2 About the author
0.3 Key features of this book
0.4 How this book is organized?
0.5 Book website
0.6 Executing the R codes from the PDF
I Basics
  1 Introduction to R
    1.1 Install R and RStudio
    1.2 Installing and loading R packages
    1.3 Getting help with functions in R
    1.4 Importing your data into R
    1.5 Demo data sets
    1.6 Close your R/RStudio session
  2 Data Preparation and R Packages
    2.1 Data preparation
    2.2 Required R Packages
  3 Clustering Distance Measures
    3.1 Methods for measuring distances
    3.2 What type of distance measures should we choose?
    3.3 Data standardization
    3.4 Distance matrix computation
    3.5 Visualizing distance matrices
    3.6 Summary
II Partitioning Clustering
  4 K-Means Clustering
    4.1 K-means basic ideas
    4.2 K-means algorithm
    4.3 Computing k-means clustering in R
    4.4 K-means clustering advantages and disadvantages
    4.5 Alternative to k-means clustering
    4.6 Summary
  5 K-Medoids
    5.1 PAM concept
    5.2 PAM algorithm
    5.3 Computing PAM in R
    5.4 Summary
  6 CLARA - Clustering Large Applications
    6.1 CLARA concept
    6.2 CLARA Algorithm
    6.3 Computing CLARA in R
    6.4 Summary
III Hierarchical Clustering
  7 Agglomerative Clustering
    7.1 Algorithm
    7.2 Steps to agglomerative hierarchical clustering
    7.3 Verify the cluster tree
    7.4 Cut the dendrogram into different groups
    7.5 Cluster R package
    7.6 Application of hierarchical clustering to gene expression data analysis
    7.7 Summary
  8 Comparing Dendrograms
    8.1 Data preparation
    8.2 Comparing dendrograms
  9 Visualizing Dendrograms
    9.1 Visualizing dendrograms
    9.2 Case of dendrogram with large data sets
    9.3 Manipulating dendrograms using dendextend
    9.4 Summary
  10 Heatmap: Static and Interactive
    10.1 R Packages/functions for drawing heatmaps
    10.2 Data preparation
    10.3 R base heatmap: heatmap()
    10.4 Enhanced heatmaps: heatmap.2()
    10.5 Pretty heatmaps: pheatmap()
    10.6 Interactive heatmaps: d3heatmap()
    10.7 Enhancing heatmaps using dendextend
    10.8 Complex heatmap
    10.9 Application to gene expression matrix
    10.10 Summary
IV Cluster Validation
  11 Assessing Clustering Tendency
    11.1 Required R packages
    11.2 Data preparation
    11.3 Visual inspection of the data
    11.4 Why assessing clustering tendency?
    11.5 Methods for assessing clustering tendency
    11.6 Summary
  12 Determining the Optimal Number of Clusters
    12.1 Elbow method
    12.2 Average silhouette method
    12.3 Gap statistic method
    12.4 Computing the number of clusters using R
    12.5 Summary
  13 Cluster Validation Statistics
    13.1 Internal measures for cluster validation
    13.2 External measures for clustering validation
    13.3 Computing cluster validation statistics in R
    13.4 Summary
  14 Choosing the Best Clustering Algorithms
    14.1 Measures for comparing clustering algorithms
    14.2 Compare clustering algorithms in R
    14.3 Summary
  15 Computing P-value for Hierarchical Clustering
    15.1 Algorithm
    15.2 Required packages
    15.3 Data preparation
    15.4 Compute p-value for hierarchical clustering
V Advanced Clustering
  16 Hierarchical K-Means Clustering
    16.1 Algorithm
    16.2 R code
    16.3 Summary
  17 Fuzzy Clustering
    17.1 Required R packages
    17.2 Computing fuzzy clustering
    17.3 Summary
  18 Model-Based Clustering
    18.1 Concept of model-based clustering
    18.2 Estimating model parameters
    18.3 Choosing the best model
    18.4 Computing model-based clustering in R
    18.5 Visualizing model-based clustering
  19 DBSCAN: Density-Based Clustering
    19.1 Why DBSCAN?
    19.2 Algorithm
    19.3 Advantages
    19.4 Parameter estimation
    19.5 Computing DBSCAN
    19.6 Method for determining the optimal eps value
    19.7 Cluster predictions with DBSCAN algorithm
  20 References and Further Reading

0.3 Key features of this book
Although there are several good books on unsupervised machine learning/clustering and related topics, we felt that many of them are either too high-level, theoretical or too advanced. Our goal was to write a practical guide to cluster analysis, elegant visualization and interpretation.
The main parts of the book include:
• distance measures,
• partitioning clustering,
• hierarchical clustering,
• cluster validation methods, as well as
• advanced clustering methods such as fuzzy clustering, density-based clustering and model-based clustering.
The book presents the basic principles of these tasks and provides many examples in R. This book offers solid guidance in data mining for students and researchers.
Key features:
• Covers clustering algorithms and implementation
• Key mathematical concepts are presented
• Short, self-contained chapters with practical examples. This means that you don't need to read the different chapters in sequence.
At the end of each chapter, we present R lab sections in which we systematically work through applications of the various methods discussed in that chapter.

0.4 How this book is organized?
This book contains 5 parts. Part I (Chapters 1-3) provides a quick introduction to R (Chapter 1) and presents the required R packages and data format (Chapter 2) for clustering analysis and visualization.
The classification of objects into clusters requires some method for measuring the distance or the (dis)similarity between the objects. Chapter 3 covers the common distance measures used for assessing similarity between observations.
Part II starts with partitioning clustering methods, which include:
• K-means clustering (Chapter 4),
• K-Medoids or PAM (partitioning around medoids) algorithm (Chapter 5) and
• CLARA algorithms (Chapter 6).
Partitioning clustering approaches subdivide the data set into a set of k groups, where k is the number of groups pre-specified by the analyst.
[Figure: Partitioning Clustering Plot. US states plotted on Dim1 (62%) and Dim2 (24.7%), colored by cluster (1-4).]
In Part III, we consider the agglomerative hierarchical clustering method, which is an alternative approach to partitioning clustering for identifying groups in a data set. It does not require pre-specifying the number of clusters to be generated. The result of hierarchical clustering is a tree-based representation of the objects, which is also known as a dendrogram (see the figure below).
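The contrast drawn above (a tree is built first, and the number of groups is chosen only afterwards) can be illustrated with a small sketch. The book itself works in R; the Python below is only an illustration under stated assumptions: single linkage on Euclidean distance is assumed (the overview does not name a linkage), the five points are made up, and the function names agglomerate and cut are hypothetical.

```python
from itertools import combinations
from math import dist


def agglomerate(points):
    """Single-linkage agglomerative clustering (assumed linkage).

    Returns the merge history as (members_a, members_b, height) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [(i,) for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        # Pick the two clusters whose closest members are nearest to each other.
        height, a, b = min(
            (min(dist(points[i], points[j])
                 for i in clusters[a] for j in clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        merges.append((clusters[a], clusters[b], height))
        merged = tuple(sorted(clusters[a] + clusters[b]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


def cut(points, merges, k):
    """Replay the first n - k merges, which leaves exactly k groups."""
    groups = {(i,) for i in range(len(points))}
    for left, right, _height in merges[: len(points) - k]:
        groups -= {left, right}
        groups.add(tuple(sorted(left + right)))
    return sorted(groups)


# Made-up data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges = agglomerate(points)
two_groups = cut(points, merges, 2)
three_groups = cut(points, merges, 3)
```

The tree is computed once and then cut at different levels, so k is never passed to the clustering step itself.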
In this part, we describe how to compute, visualize, interpret and compare dendrograms:
• Agglomerative clustering (Chapter 7)
  - Algorithm and steps
  - Verify the cluster tree
  - Cut the dendrogram into different groups
• Compare dendrograms (Chapter 8)
  - Visual comparison of two dendrograms
  - Correlation matrix between a list of dendrograms
• Visualize dendrograms (Chapter 9)
  - Case of small data sets
  - Case of dendrogram with large data sets: zoom, sub-tree, PDF
  - Customize dendrograms using dendextend
• Heatmap: static and interactive (Chapter 10)
  - R base heatmaps
  - Pretty heatmaps
  - Interactive heatmaps
  - Complex heatmap
  - Real application: gene expression data
In this section, you will learn how to generate and interpret the following plots.
• Standard dendrogram with filled rectangle around clusters:
[Figure: Cluster Dendrogram of US states; vertical axis: Height.]
• Compare two dendrograms:
[Figure: two dendrograms of the same ten US states, drawn side by side.]
• Heatmap:
[Figure: heatmap of car models (rows) against the variables carb, wt, hp, cyl, disp, qsec, vs, mpg, drat, am and gear (columns).]
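One common way to put a number on the similarity of two dendrograms, as in the "correlation matrix between a list of dendrograms" item above, is the cophenetic correlation. This overview does not say which measure Chapter 8 uses, so the following is a sketch under that assumption. It is written in Python for illustration rather than in the book's R, the data are made up, and the function names cophenetic and pearson are hypothetical.

```python
from itertools import combinations
from math import dist, sqrt


def cophenetic(points, linkage):
    """Cophenetic distances of an agglomerative tree.

    For every pair of observations, records the height of the merge that
    first puts them in the same cluster. `linkage` reduces the distances
    between two clusters to one number: min is single linkage, max is
    complete linkage.
    """
    clusters = [[i] for i in range(len(points))]
    heights = {}
    while len(clusters) > 1:
        height, a, b = min(
            (linkage(dist(points[i], points[j])
                     for i in clusters[a] for j in clusters[b]), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        for i in clusters[a]:
            for j in clusters[b]:
                heights[tuple(sorted((i, j)))] = height
        clusters[a] = clusters[a] + clusters[b]  # a < b, so index a stays valid
        del clusters[b]
    # Fixed pair order so that two trees can be compared element by element.
    return [heights[pair] for pair in sorted(heights)]


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    var_x = sum((u - mx) ** 2 for u in x)
    var_y = sum((v - my) ** 2 for v in y)
    return cov / sqrt(var_x * var_y)


# Made-up data: the same points clustered with two different linkages.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
single = cophenetic(points, min)
complete = cophenetic(points, max)
similarity = pearson(single, complete)
```

A value close to 1 indicates that the two trees join the observations in a similar order and at proportionally similar heights.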