
tclust: An R Package for a Trimming Approach to Cluster Analysis

Heinrich Fritz
Department of Statistics and Probability Theory
Vienna University of Technology

Luis A. García-Escudero
Departamento de Estadística e Investigación Operativa
Universidad de Valladolid

Agustín Mayo-Iscar
Departamento de Estadística e Investigación Operativa
Universidad de Valladolid

Abstract

This introduction to the R package tclust is a (slightly) modified version of Fritz et al. (2012), published in the Journal of Statistical Software.

Outlying data can heavily influence standard clustering methods. At the same time, clustering principles can be useful when robustifying statistical procedures. These two reasons motivate the development of feasible robust model-based clustering approaches. With this in mind, an R package for performing non-hierarchical robust clustering, called tclust, is presented here. Instead of trying to "fit" noisy data, a proportion α of the most outlying observations is trimmed. The tclust package efficiently handles different cluster scatter constraints. Graphical exploratory tools are also provided to help the user make sensible choices for the trimming proportion as well as the number of clusters to search for.

Keywords: Model-based clustering, trimming, heterogeneous clusters.

1. Introduction to robust clustering and tclust

Methods for cluster analysis attempt to detect homogeneous clusters with large heterogeneity among them. As happens with other (non-robust) statistical procedures, clustering methods may be heavily influenced by even a small fraction of outlying data. For instance, two or more clusters might be joined artificially, due to outlying observations, or "spurious" non-informative clusters may be composed of only a few outlying observations (see, e.g., García-Escudero and Gordaliza 1999; García-Escudero et al. 2010). Therefore, the application of robust methods in this context is very advisable, especially in fully automatic clustering (unsupervised learning) problems. Certain relations between cluster analysis and robust methods (Rocke and Woodruff 2002; Hardin and Rocke 2004; García-Escudero et al. 2003; Woodruff and Reiners 2004) also motivate interest in robust clustering techniques. For example, robust clustering techniques can be used to handle "clusters" of highly concentrated outliers, which are especially dangerous in (non-robust) estimation. García-Escudero et al. (2010) provide a recent survey of robust clustering methods. The tclust package for the R environment for statistical computing (R Development Core Team 2010) implements different robust non-hierarchical clustering algorithms where trimming plays a key role. This package is available at http://CRAN.R-project.org/package=tclust.
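It can be installed from CRAN and loaded in the usual way (standard R commands, shown here only for completeness):

R> install.packages ("tclust")
R> library ("tclust")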


When trimming allows the removal of a fraction α of the "most outlying" data, the strong influence of outlying observations can be avoided. This trimming approach to clustering has been introduced in Cuesta-Albertos et al. (1997), Gallegos (2002), Gallegos and Ritter (2005) and García-Escudero et al. (2008). Trimming also serves to identify potentially interesting anomalous observations.

Trimming is not a new concept in statistics. For instance, the trimmed mean for one-dimensional data removes a proportion α/2 each of the largest and smallest observations before computing the mean. However, it is not straightforward to extend this philosophy to cluster analysis, because most of these problems are of multivariate nature. Moreover, it is often the case that "bridge points" lying between clusters ought to be trimmed. Instead of forcing the statistician to define the regions to be trimmed in advance, the procedures implemented in tclust take the whole data structure into account in order to decide which parts of the sample should be discarded. By considering this type of trimming, these procedures are even able to trim outlying bridge points. The "self-trimming" philosophy behind these procedures is exactly the same as adopted by some well-known high breakdown-point methods (see, e.g., Rousseeuw and Leroy 1987).

As a first example of this trimming approach, let us consider the trimmed k-means method introduced in Cuesta-Albertos et al. (1997). The function tkmeans from the tclust package implements this method. In the following example, this function is applied to a bivariate data set based on the Old Faithful geyser, called geyser2, which accompanies the tclust package. The code given below creates Figure 1.

R> library ("tclust")
R> data ("geyser2")
R> clus <- tkmeans (geyser2, k = 3, alpha = 0.03)
R> plot (clus)

In the data set geyser2, we are searching for k = 3 clusters, and a proportion α = 0.03 of the data is trimmed. The clustering results are shown in Figure 1. Among this 3% of trimmed data, we can see 6 anomalous "short followed by short" eruption lengths. Notice that an observation situated between the clusters is also trimmed.

The package presented here adopts a "crisp" clustering approach, meaning that each observation is either trimmed or fully assigned to a cluster. In comparison, mixture approaches estimate a cluster pertinence probability for each observation. Robust mixture alternatives have also been proposed in which noisy data are fitted through additional mixture components. For instance, the package mclust (Fraley and Raftery 2012; Banfield and Raftery 1993; Fraley and Raftery 1998) and the Fortran program emmix (McLachlan 1999; McLachlan and Peel 2000) implement such robust mixture fitting approaches. Mixture fitting results can be easily converted into a "crisp" clustering result by converting the cluster pertinence probabilities into 0-1 probabilities. Contrary to these mixture fitting approaches, the procedures implemented in the tclust package simply remove outlying observations and do not intend to fit them at all. Package tlemix (see Neytchev et al. 2012; Neykov et al. 2007) also implements a closely related trimming approach. As described in Section 3, the tclust package focuses on offering adequate cluster scatter matrix constraints, leading to a wide range of clustering procedures depending on the chosen constraint, and avoiding the occurrence of spurious non-interesting clusters.
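As a small generic sketch of this conversion (base R only; the 3 × 2 probability matrix below is hypothetical, standing in for, e.g., the z matrix of posterior membership probabilities returned by mclust):

R> ## hypothetical cluster pertinence probabilities:
R> ## one row per observation, one column per cluster
R> z <- matrix (c (0.9, 0.1, 0.2, 0.8, 0.55, 0.45), ncol = 2, byrow = TRUE)
R> max.col (z)   # crisp assignment: most probable cluster per observation
[1] 1 2 1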

[Figure 1: scatter plot of the geyser2 data, "Previous eruption length" versus "Eruption length", with panel title "Classification k = 3, α = 0.03".]

Figure 1: Trimmed k-means results with k = 3 and α = 0.03 for the bivariate Old Faithful Geyser data. Trimmed observations are denoted by a separate plotting symbol (a convention followed in all the figures in this work).

The outline of the paper is as follows: In Section 2 we briefly review the so-called "spurious outliers" model and show how to derive two different clustering criteria from it. Different constraints on the cluster scatter matrices and their implementation in the tclust package are addressed in Section 3. Section 4 presents the numerical output returned by this package. Section 5 provides some brief comments concerning the algorithms implemented, and a comparison of tclust and several other robust clustering approaches is given in Section 6. Section 7 shows some graphical outputs that help advise the choice of the number of clusters and the trimming proportion. Other useful plots summarizing the robust clustering results are shown in Section 8. Finally, Section 9 applies the tclust package to a well-known real data set.


2. Trimming and the spurious outliers model

Gallegos (2002) and Gallegos and Ritter (2005) propose the "spurious outliers model" as a probabilistic framework for robust crisp clustering. Let f(·; μ, Σ) denote the probability density function of the p-variate normal distribution with mean μ and covariance matrix Σ. The "spurious-outlier model" is defined through "likelihoods" like

\prod_{j=1}^{k} \prod_{i \in R_j} f(x_i; \mu_j, \Sigma_j) \prod_{i \in R_0} g_i(x_i)    (1)

with {R_0, ..., R_k} being a partition of the set of indices {1, 2, ..., n} such that #R_0 = ⌈nα⌉. R_0 contains the indices of the "non-regular" observations generated by other (not necessarily normal) probability density functions g_i. "Non-regular" observations can be clearly considered as "outliers" if we assume certain sensible assumptions for the g_i (see details in Gallegos 2002; Gallegos and Ritter 2005). Under these assumptions, the search of a partition {R_0, ..., R_k} with #R_0 = ⌈nα⌉, vectors μ_j and positive definite matrices Σ_j maximizing (1) can be simplified to the same search (of a partition, vectors and positive definite matrices) by just maximizing

\sum_{j=1}^{k} \sum_{i \in R_j} \log f(x_i; \mu_j, \Sigma_j).    (2)

Notice that observations x_i with i ∈ R_0 are not taken into account in (2). Maximizing (2) with k = 1 yields the Minimum Covariance Determinant (MCD) estimator (Rousseeuw 1985).
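As a rough illustration of this connection (a sketch: cov.rob from the recommended MASS package computes the MCD and is unrelated to tclust itself; the two fits use different algorithms and trimming levels, so they agree only approximately):

R> library ("MASS")
R> ## classical MCD estimate of location and scatter
R> mcd <- cov.rob (geyser2, method = "mcd")
R> ## trimmed fit with a single cluster, k = 1
R> clus1 <- tclust (geyser2, k = 1, alpha = 0.05)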

Unfortunately, the direct maximization of (2) is not a well-defined problem when k > 1. It is easy to see that (2) is unbounded without any constraint on the cluster scatter matrices Σ_j. The tclust function from the tclust package approximately maximizes (2) under different cluster scatter matrix constraints, which will be shown in Section 3.

The maximization of (2) implicitly assumes equal cluster weights. In other words, we are ideally searching for clusters with equal sizes. The function tclust provides this option by setting the argument equal.weights = TRUE. The use of this option does not guarantee that all resulting clusters contain exactly the same number of observations, but the method then prefers this type of solution. Alternatively, different cluster sizes or cluster weights can be considered by searching for a partition {R_0, ..., R_k} (with #R_0 = ⌈nα⌉), vectors μ_j, positive definite matrices Σ_j and weights π_j ∈ [0, 1] maximizing

\sum_{j=1}^{k} \sum_{i \in R_j} \big( \log \pi_j + \log f(x_i; \mu_j, \Sigma_j) \big).    (3)

The (default) option equal.weights = FALSE is used in this case. Again, the scatter matrices also have to be constrained such that the maximization of (3) becomes a well-defined problem.
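As a brief illustration (a sketch reusing the geyser2 data from the first example; only documented tclust arguments are used, and the resulting partitions depend on the random initializations):

R> ## ideally equally sized clusters, maximizing (2)
R> clus.eq <- tclust (geyser2, k = 3, alpha = 0.03, equal.weights = TRUE)
R> ## freely estimated cluster weights pi_j, maximizing (3) -- the default
R> clus.w <- tclust (geyser2, k = 3, alpha = 0.03, equal.weights = FALSE)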

Note that (3) simplifies to (2) when assuming equal.weights = TRUE and all weights are equally set to π_j = 1/k.

                   equal.weights = TRUE                        equal.weights = FALSE
restr = "eigen"    k-means; Cuesta-Albertos et al. (1997)      García-Escudero et al. (2008)
restr = "deter"    Gallegos (2002)                             This work
restr = "sigma"    Friedman and Rubin (1967);                  This work
                   Gallegos and Ritter (2005)

Table 1: Clustering methods handled by tclust. Names in cursive letters are untrimmed (α = 0) methods.

3. Constraints on the cluster scatter matrices

As already mentioned, the function tclust implements different algorithms aimed at approximately maximizing (2) and (3) under different types of constraints which can be applied on the scatter matrices Σ_j. The type of constraint is specified by the argument restr of the tclust function. Table 1 gives an overview of the different clustering approaches implemented by the tclust function depending on the chosen type of constraint.

Imposing constraints is compulsory because maximizing (2) or (3) without any restriction is not a well-defined problem. Notice that an almost degenerate scatter matrix Σ_j would cause the trimmed log-likelihoods (2) and (3) to tend to infinity. This issue can cause a (robust) clustering algorithm of this type to end up finding "spurious" clusters almost lying in lower-dimensional subspaces. Moreover, the resulting clustering solutions might heavily depend on the chosen constraint. The strength of the constraint is controlled by the argument restr.fact ≥ 1 in the tclust function. The larger restr.fact is chosen, the looser is the restriction on the scatter matrices, allowing for more heterogeneity among the clusters. On the contrary, small values of restr.fact close to 1 imply very "equally scattered" clusters. This idea of constraining cluster scatters to avoid spurious solutions goes back to Hathaway (1985), who proposed it in mixture fitting problems. Also arising from the spurious outlier model, other types of constraints have recently been introduced by Gallegos and Ritter (2009, 2010). These (closely related) constraints also serve to avoid degeneracy of trimmed likelihoods, but they are not implemented in the current version of the tclust package.
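For instance (a sketch along the lines of the previous examples, again on the geyser2 data):

R> ## restr.fact = 1 with restr = "eigen" forces all eigenvalues to be equal,
R> ## i.e., spherical and equally scattered clusters
R> clus.strict <- tclust (geyser2, k = 3, alpha = 0.03, restr = "eigen",
+    restr.fact = 1)
R> ## a large restr.fact loosens the constraint, allowing heterogeneous scatters
R> clus.loose <- tclust (geyser2, k = 3, alpha = 0.03, restr = "eigen",
+    restr.fact = 50)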

3.1. Constraints on the eigenvalues

Based on the eigenvalues of the cluster scatter matrices, a scatter similarity constraint may be defined. With λ_l(Σ_j) as the eigenvalues of the cluster scatter matrices Σ_j and

M_n = \max_{j=1,\ldots,k} \max_{l=1,\ldots,p} \lambda_l(\Sigma_j) \quad \text{and} \quad m_n = \min_{j=1,\ldots,k} \min_{l=1,\ldots,p} \lambda_l(\Sigma_j)    (4)

as the maximum and minimum eigenvalues, the restriction restr = "eigen" constrains the ratio M_n/m_n to be smaller than or equal to a fixed value restr.fact ≥ 1. A theoretical study of the properties of this approach with equal.weights = FALSE can be found in García-Escudero et al. (2008).

This type of constraint limits the relative size of the axes of the equidensity ellipsoids defined through the obtained Σ_j when assuming normality. This way we are simultaneously …
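As a quick numerical check of this constraint (a sketch; it assumes the fitted object stores the k cluster scatter matrices in its cov component as a p × p × k array, as tclust returns them):

R> clus <- tclust (geyser2, k = 3, alpha = 0.03, restr = "eigen",
+    restr.fact = 12)
R> ## eigenvalues of each cluster scatter matrix, one column per cluster
R> ev <- apply (clus$cov, 3, function (S) eigen (S, only.values = TRUE)$values)
R> max (ev) / min (ev)   # the ratio M_n / m_n of (4); at most restr.fact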