
Journal of Statistical Software
MMMMMM YYYY, Volume VV, Issue II. http://www.jstatsoft.org/

DeBaCl: A Python Package for Interactive DEnsity-BAsed CLustering

Brian P. Kent, Carnegie Mellon University
Alessandro Rinaldo, Carnegie Mellon University
Timothy Verstynen, Carnegie Mellon University

Abstract

The level set tree approach of Hartigan (1975) provides a probabilistically based and highly interpretable encoding of the clustering behavior of a dataset. By representing the hierarchy of data modes as a dendrogram of the level sets of a density estimator, this approach offers many advantages for exploratory analysis and clustering, especially for complex and high-dimensional data. Several R packages exist for level set tree estimation, but their practical usefulness is limited by computational inefficiency, absence of interactive graphical capabilities and, from a theoretical perspective, reliance on asymptotic approximations. To make it easier for practitioners to capture the advantages of level set trees, we have written the Python package DeBaCl for DEnsity-BAsed CLustering. In this article we illustrate how DeBaCl's level set tree estimates can be used for difficult clustering tasks and interactive graphical data analysis. The package is intended to promote the practical use of level set trees through improvements in computational efficiency and a high degree of user customization. In addition, the flexible algorithms implemented in DeBaCl enjoy finite sample accuracy, as demonstrated in recent literature on density clustering. Finally, we show the level set tree framework can be easily extended to deal with functional data.

Keywords: density-based clustering, level set tree, Python, interactive graphics, functional data analysis.

1. Introduction

Clustering is one of the most fundamental tasks in statistics and machine learning, and numerous algorithms are available to practitioners. Some of the most popular methods, such as K-means (MacQueen 1967; Lloyd 1982) and spectral clustering (Shi and Malik 2000), rely on the key operational assumption that there is one optimal partition of the data into K well-separated groups, where K is assumed to be known a priori. While effective in some cases, this flat or scale-free notion of clustering is inadequate when the data are very noisy or corrupted, or exhibit complex multimodal behavior and spatial heterogeneity, or simply when the value of K is unknown. In these cases, hierarchical clustering affords a more realistic and flexible framework in which the data are assumed to have multi-scale clustering features that can be captured by a hierarchy of nested subsets of the data. The expression of these subsets and their order of inclusions, typically depicted as a dendrogram, provide a great deal of information that goes beyond the original clustering task. In particular, it frees the practitioner from the requirement of knowing in advance the "right" number of clusters, provides a useful global summary of the entire dataset, and allows the practitioner to identify and focus on interesting sub-clusters at different levels of spatial resolution.

There are, of course, myriad algorithms just for hierarchical clustering. However, in most cases their usage is advocated on the basis of heuristic arguments or computational ease, rather than well-founded theoretical guarantees. The high-density hierarchical clustering paradigm put forth by Hartigan (1975) is an exception. It is based on the simple but powerful definition of clusters as the maximal connected components of the super-level sets of the probability density specifying the data-generating distribution. This formalization has numerous advantages: (1) it provides a probabilistic notion of clustering that conforms to the intuition that clusters are the regions with the largest probability-to-volume ratio; (2) it establishes a direct link between the clustering task and the fundamental problem of nonparametric density estimation; (3) it allows for a clear definition of clustering performance and consistency (Hartigan 1981) that is amenable to rigorous theoretical analysis; and (4) as we show below, the dendrogram it produces is highly interpretable, offers a compact yet informative representation of a distribution, and can be interactively queried to extract and visualize subsets of data at desired resolutions. Though the notion of high-density clustering has been studied for quite some time (Polonik 1995), recent theoretical advances have further demonstrated the flexibility and power of density clustering. See, for example, Rinaldo, Singh, Nugent, and Wasserman (2012); Rinaldo and Wasserman (2010); Kpotufe and Luxburg (2011); Chaudhuri and Dasgupta (2010); Steinwart (2011); Sriperumbudur and Steinwart (2012); Lei, Robins, and Wasserman (2013); Balakrishnan, Narayanan, Rinaldo, Singh, and Wasserman (2013); and the references therein.

This paper introduces the Python package DeBaCl for efficient and statistically principled DEnsity-BAsed CLustering. DeBaCl is not the first implementation of level set tree estimation and clustering; the R packages denpro (Klemelä 2004), gslclust (Stuetzle and Nugent 2010), and pdfCluster (Azzalini and Menardi 2012) also contain various level set tree estimators. However, they tend to be too inefficient for most practical uses and rely on methods lacking rigorous theoretical justification. The popular nonparametric density-based clustering algorithm DBSCAN (Ester, Kriegel, and Xu 1996) is implemented in the R package fpc (Hennig 2013) and the Python library scikit-learn (Pedregosa, Varoquaux, Gramfort, Michel, Thirion, Grisel, Blondel, Prettenhofer, Weiss, Dubourg, Vanderplas, Passos, Cournapeau, Brucher, Perrot, and Duchesnay 2011), but this method does not provide an estimate of the level set tree.

DeBaCl handles much larger datasets than existing software, improves computational speed, and extends the utility of level set trees in three important ways: (1) it provides several novel visualization tools to improve the readability and interpretability of density cluster trees; (2) it offers a high degree of user customization; and (3) it implements several recent methodological advances. In particular, it enables construction of level set trees for arbitrary functions over a dataset, building on the idea that level set trees can be used even with data that lack a bona fide probability density function. DeBaCl also includes the first practical implementation of the recent, theoretically well-supported algorithm from Chaudhuri and Dasgupta (2010).

2. Level set trees

Suppose we have a collection of points $X_n = \{x_1, \ldots, x_n\}$ in $\mathbb{R}^d$, which we model as i.i.d. draws from an unknown probability distribution with probability density function $f$ (with respect to Lebesgue measure). Our goal is to identify and extract clusters of $X_n$ without any a priori knowledge about $f$ or the number of clusters. Following the statistically principled approach of Hartigan (1975), clusters can be identified as modes of $f$. For any threshold value $\lambda \geq 0$, the $\lambda$-upper level set of $f$ is

$$L_\lambda(f) = \{x \in \mathbb{R}^d : f(x) \geq \lambda\}. \quad (1)$$

The connected components of $L_\lambda(f)$ are called the $\lambda$-clusters of $f$, and high-density clusters are $\lambda$-clusters for any value of $\lambda$. It is easy to see that $\lambda$-clusters associated with larger values of $\lambda$ are regions where the ratio of probability content to volume is higher. Also note that for a fixed value of $\lambda$, the corresponding set of clusters will typically not give a partition of $\{x : f(x) > 0\}$.

The level set tree is simply the set of all high-density clusters. This collection is a tree because it has the following property: for any two high-density clusters $A$ and $B$, either $A$ is a subset of $B$, $B$ is a subset of $A$, or they are disjoint. This property allows us to visualize the level set tree with a dendrogram that shows all high-density clusters simultaneously and can be queried quickly and directly to obtain specific cluster assignments. Branching points of the dendrogram correspond to density levels where two or more modes of the pdf, i.e., new clusters, emerge. Each vertical line segment in the dendrogram represents the high-density clusters within a single pdf mode; these clusters are all subsets of the cluster at the level where the mode emerges. Line segments that do not branch are considered high-density modes, which we call the leaves of the tree. For simplicity, we tend to refer to the dendrogram as the level set tree itself.

Because $f$ is unknown, the level set tree must be estimated from the data. Ideally we would use the high-density clusters of a suitable density estimate $\hat{f}$ to do this; for a well-behaved $f$ and a large sample size, $\hat{f}$ is close to $f$ with high probability, so the level set tree for $\hat{f}$ would be a good estimate for the level set tree of $f$ (Chaudhuri and Dasgupta 2010). Unfortunately, this approach is not computationally feasible even for low-dimensional data, because finding the upper level sets of $\hat{f}$ requires evaluating the function on a dense mesh and identifying $\lambda$-clusters requires a combinatorial search over all possible paths connecting any two points in the mesh.

Many methods have been proposed to overcome these computational obstacles. The first category includes techniques that remain faithful to the idea that clusters are regions of the sample space. Members of this family include histogram-based partitions (Klemelä 2004), binary tree partitions (Klemelä 2005) (implemented in the R package denpro), and Delaunay triangulation partitions (Azzalini and Torelli 2007) (implemented in the R package pdfCluster). These techniques tend to work well for low-dimensional data, but suffer from the curse of dimensionality because partitioning the sample space requires an exponentially increasing number of cells or algorithmic complexity (Azzalini and Torelli 2007).

In contrast, another family of estimators produces high-density clusters of data points rather than sample space regions; this is the approach taken by our package. Conceptually, these methods estimate the level set tree of $f$ by intersecting the level sets of $f$ with the sample points $X_n$ and then evaluating the connectivity of each set by graph-theoretic means. This typically consists of three high-level steps: estimation of the probability density $\hat{f}(x)$ from the data; construction of a graph $G$ that describes the similarity between each pair of data points; and a search for connected components in a series of subgraphs of $G$ induced by removing nodes and/or edges of insufficient weight, relative to various density levels. The variations within the latter category are found in the definition of $G$, the set of density levels over which to iterate, and the way in which $G$ is restricted to a subgraph for a given density level.

Edge iteration methods assign a weight to the edges of $G$ based on the proximity of the incident vertices in feature space (Chaudhuri and Dasgupta 2010), on the value of $\hat{f}(x)$ at the incident vertices (Wong and Lane 1983), or on a line segment connecting them (Stuetzle and Nugent 2010). For these procedures, the relevant density levels are the edge weights of $G$. Frequently, iteration over these levels is done by initializing $G$ with an empty edge set and adding successively more heavily weighted edges, in the manner of traditional single linkage clustering. In this family, the Chaudhuri and Dasgupta algorithm (which is a generalization of Wishart (1969)) is particularly interesting because the authors prove finite sample rates for convergence to the true level set tree (Chaudhuri and Dasgupta 2010). To the best of our knowledge, however, only Stuetzle and Nugent (2010) has a publicly available implementation, in the R package gslclust.

Point iteration methods construct $G$ so that the vertex for observation $x_i$ is weighted according to $\hat{f}(x_i)$, but the edges are unweighted. In the simplest form, there is an edge between the vertices for observations $x_i$ and $x_j$ if the distance between $x_i$ and $x_j$ is smaller than some threshold value, or if $x_i$ and $x_j$ are among each other's $k$-closest neighbors (Kpotufe and Luxburg 2011; Maier, Hein, and von Luxburg 2009). A more complicated version places an edge $(x_i, x_j)$ in $G$ if the amount of probability mass that would be needed to fill the valleys along a line segment between $x_i$ and $x_j$ is smaller than a user-specified threshold (Menardi and Azzalini 2013). The latter method is available in the R package pdfCluster.
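To make the distinction between the two families concrete, the following is a minimal sketch written for this discussion (it is not taken from DeBaCl or any of the cited packages, and the variable names are illustrative). It shows the two ways a similarity graph $G$ is restricted at a given level: edge iteration keeps only edges whose weight reaches the level, while point iteration keeps only vertices whose estimated density reaches it, and both then take connected components. It assumes $G$ is stored as a NetworkX graph, with an edge attribute "weight" for the edge-iteration case and a per-node density estimate in the mapping f_hat for the point-iteration case.

import networkx as nx

def edge_iteration_clusters(G, level):
    """Edge iteration: keep edges with weight >= level, then take the
    connected components of the resulting graph."""
    H = nx.Graph()
    H.add_nodes_from(G.nodes())
    H.add_edges_from((u, v) for u, v, w in G.edges(data="weight") if w >= level)
    return list(nx.connected_components(H))

def point_iteration_clusters(G, f_hat, level):
    """Point iteration: keep vertices with estimated density >= level, then
    take the connected components of the induced subgraph."""
    keep = [v for v in G.nodes() if f_hat[v] >= level]
    return list(nx.connected_components(G.subgraph(keep)))

The point-iteration form is the one relevant to the algorithm described in the next section; the level set tree is obtained by repeating this restriction over a sequence of levels and recording how the components split.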

3. Implementation

The default level set tree algorithm in DeBaCl is described in Algorithm 1, based on the method proposed by Kpotufe and Luxburg (2011) and Maier et al. (2009). For a sample with $n$ observations in $\mathbb{R}^d$, the $k$-nearest neighbor (kNN) density estimate is:

$$\hat{f}(x_j) = \frac{k}{n \, v_d \, r_k^d(x_j)} \quad (2)$$

where $v_d$ is the volume of the Euclidean unit ball in $\mathbb{R}^d$ and $r_k(x_j)$ is the Euclidean distance from point $x_j$ to its $k$'th closest neighbor. The process of computing subgraphs and finding connected components of those subgraphs is implemented with the igraph package (Csardi and Nepusz 2006). Our package also depends on the NumPy and SciPy packages for basic computation (Jones, Oliphant, and Peterson 2001) and the Matplotlib package for plotting (Hunter 2007). We use this algorithm because it is straightforward and fast; although it does require computation of all $\binom{n}{2}$ pairwise distances, the procedure can be substantially shortened by estimating connected components on a sparse grid of density levels. The implementation of this algorithm is novel in its own right (to the best of our knowledge), and DeBaCl includes several other new visualization and methodological tools.

Algorithm 1: Baseline DeBaCl level set tree estimation procedure
Input: $\{x_1, \ldots, x_n\}$, $k$, $\gamma$
Output: $\hat{T}$, a hierarchy of subsets of $\{x_1, \ldots, x_n\}$

$G \leftarrow$ $k$-nearest neighbor similarity graph on $\{x_1, \ldots, x_n\}$;
$\hat{f}(\cdot) \leftarrow$ $k$-nearest neighbor density estimate based on $\{x_1, \ldots, x_n\}$;
for $j \leftarrow 1$ to $n$ do
    $\lambda_j \leftarrow \hat{f}(x_j)$;
    $L_j \leftarrow \{x_i : \hat{f}(x_i) \geq \lambda_j\}$;
    $G_j \leftarrow$ subgraph of $G$ induced by $L_j$;
    find the connected components of $G_j$;
$\hat{T} \leftarrow$ dendrogram of the connected components of graphs $G_1, \ldots, G_n$, ordered by inclusions;
$\hat{T} \leftarrow$ remove components of size smaller than $\gamma$;
return $\hat{T}$
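As a rough illustration of Algorithm 1, the sketch below computes the kNN density estimate of Equation 2, builds the kNN similarity graph, and records the connected components above each of the $n$ density levels. It is written for this article rather than taken from DeBaCl's source (which uses igraph, assembles the components into a tree, and prunes small components), and it assumes NumPy, SciPy, and NetworkX; all names are illustrative.

import numpy as np
import networkx as nx
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(X, k):
    """k-nearest-neighbor density estimate of Equation 2."""
    n, d = X.shape
    tree = cKDTree(X)
    # query returns each point itself first, so ask for k + 1 neighbors
    r_k = tree.query(X, k=k + 1)[0][:, -1]
    v_d = np.pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the unit ball in R^d
    return k / (n * v_d * r_k ** d)

def level_set_components(X, k):
    """Connected components of the kNN graph above each density level.

    Returns a list of (level, components) pairs; turning these nested
    component lists into a dendrogram and pruning small components is the
    bookkeeping the package performs on top of this step.
    """
    n = X.shape[0]
    f_hat = knn_density(X, k)
    _, idx = cKDTree(X).query(X, k=k + 1)
    G = nx.Graph((i, j) for i in range(n) for j in idx[i, 1:])   # kNN graph
    out = []
    for lam in np.sort(f_hat):                  # one level per observation
        keep = np.flatnonzero(f_hat >= lam)
        out.append((lam, list(nx.connected_components(G.subgraph(keep)))))
    return out

Iterating over all $n$ levels, as here, mirrors the baseline procedure; as noted above, the loop can be run over a sparse grid of levels instead to shorten the computation.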

3.1. Visualization tools

Our level set tree plots increase the amount of information contained in a tree visualization and greatly improve interpretability relative to existing software. Suppose we have a sample of 2,000 observations in $\mathbb{R}^2$ drawn from a mixture of three Gaussian distributions (Figure 1a). The traditional level set tree is illustrated in Figure 1b and the DeBaCl version in Figure 1c. A plot based only on the mathematical definition of a level set tree conveys the structure of the mode hierarchy and indicates the density levels where each tree node begins and ends, but does not indicate how many points are in each branch or visually associate the branches with a particular subset of data. In the proposed software package, level set trees are plotted to emphasize the empirical mass in each branch (i.e., the fraction of data in the associated cluster): tree branches are sorted from left to right by decreasing empirical mass, branch widths are proportional to empirical mass, and the white space around the branches is proportional to empirical mass. For matching tree nodes to the data, branches can be colored to correspond to high-density data clusters (Figures 1c and 1d). Clicking on a tree branch produces a banner that indicates the start and end levels of the associated high-density cluster as well as its empirical mass (Figure 5a). The level set tree plot is an excellent tool for interactive exploratory data analysis because it acts as a handle for identifying and plotting spatially coherent, high-density subsets of data. The full power of this feature can be seen clearly with the more complex data of Section 4.

Figure 1: Level set tree plots and cluster labeling for a simple simulation. A level set tree is constructed from a sample of 2,000 observations drawn from a mix of three Gaussians in $\mathbb{R}^2$. a) The kNN density estimator evaluated on the data. b) A plot of the tree based only on the mathematical definition of level set trees. c) The new level set tree plot, from DeBaCl. Tree branches emphasize empirical mass through ordering, spacing, and line width, and they are colored to match the cluster labels in d. A second vertical axis is added that indicates the fraction of background mass at each critical density level. d) Cluster labels from the all-mode labeling technique, where each leaf of the level set tree is designated as a cluster.
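For readers who want to reproduce this kind of session, a hypothetical usage sketch follows. The function and argument names are assumptions based on a later PyPI release of DeBaCl and may not match the version described in this article; treat them as placeholders and consult the package documentation.

# Hypothetical usage; names assumed from a later DeBaCl release and may differ.
import numpy as np
import debacl as dcl

X = np.random.randn(2000, 2)                        # stand-in for the Gaussian mixture
tree = dcl.construct_tree(X, k=100, prune_threshold=20)
print(tree)                                         # tabular summary of the tree nodes
tree.plot(form="density")                           # dendrogram scaled by density level
labels = tree.get_clusters(method="leaf")           # all-mode / leaf cluster labels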


3.2. Alternate scales

By construction, the nodes of a level set tree are indexed by density levels $\lambda$, which determine the scale of the vertical axis in a plot of the tree. While this does encode the parent-child relationships in the tree, interpretability of the $\lambda$ scale is limited by the fact that it depends on the height of the density estimate $\hat{f}$. It is not clear, for example, whether $\lambda = 1$ would be a low- or a high-density threshold; this depends on the particular distribution. To remove the scale dependence we can instead index level set tree nodes based on the probability content of upper level sets. Specifically, let $\alpha$ be a number between 0 and 1 and define

$$\lambda_\alpha = \sup\left\{\lambda : \int_{x \in L_\lambda(f)} f(x)\,dx \geq \alpha\right\} \quad (3)$$

to be the value of $\lambda$ for which the upper level set of $f$ has probability content no smaller than $\alpha$ (Rinaldo et al. 2012). The map $\alpha \mapsto \lambda_\alpha$ gives a monotonically decreasing one-to-one correspondence between values of $\alpha$ in $[0, 1]$ and values of $\lambda$ in $[0, \max_x f(x)]$. In particular, $\lambda_1 = 0$ and $\lambda_0 = \max_x f(x)$. For an empirical level set tree, $\lambda_\alpha$ is set to the corresponding empirical quantile of $\{\hat{f}(x_i)\}_{i=1}^n$.

Expressing the height of the tree in terms of $\alpha$ instead of $\lambda$ does not change the topology (i.e., number and ordering of the branches) of the tree; the re-indexed tree is a deformation of the original tree in which some of its nodes are stretched out and others are compressed. $\alpha$-indexing is more interpretable and useful for several reasons. The $\alpha$ level of the tree indexes clusters corresponding to the most "clusterable" fraction of the data points; in particular, larger threshold values yield more compact and well-separated clusters, while smaller values can be used for de-noising and outlier removal. Because $\alpha$ is always between 0 and 1, scaling by probability content also enables comparisons of level set trees arising from data sets drawn from different pdfs, possibly in spaces of different dimensions. Finally, the $\alpha$-index is more effective than $\lambda$-indexing in representing regions of large probability content but low density and is less affected by small fluctuations in density estimates.

A common (incorrect) intuition when looking at an $\alpha$-indexed level set tree plot is to interpret the height of the branches as the size of the corresponding cluster, as measured by its empirical mass. However, with $\alpha$-indexing the height of any branch depends on its empirical mass as well as the empirical mass of all other branches that coexist with it. In order to obtain trees that do conform to this intuition, we introduce the $\kappa$-indexed level set tree.
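Before turning to the $\kappa$-tree, the empirical side of the $\lambda \leftrightarrow \alpha$ re-indexing can be illustrated with a short sketch. This is illustrative only (the direction in which the package states the $\alpha$ convention may differ), and it uses the estimated densities at the sample points as a stand-in for the probability content of an upper level set.

import numpy as np

def upper_set_mass(f_hat, level):
    """Fraction of sample points whose estimated density is at least `level`,
    an empirical stand-in for the probability content of L_level(f)."""
    return np.mean(f_hat >= level)

def level_for_mass(f_hat, mass):
    """Smallest density level whose upper level set still holds at least a
    `mass` fraction of the sample: an empirical analogue of Equation 3."""
    return np.quantile(f_hat, 1.0 - mass)

Because the two functions are inverse to one another up to ties, either axis can be used to label the same tree without changing its topology, which is exactly the point made above.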

Recall from Section 2 that clusters are defined as maximal connected components of the sets $L_\lambda(f)$ (see Equation 1) as $\lambda$ varies from 0 to $\max_x f(x)$, and that the level set tree is the dendrogram representing the hierarchy of all clusters. Assume the tree is binary and rooted. Let $\{1, 2, \ldots, K\}$ be an enumeration of the nodes of the level set tree and let $\mathcal{C} = \{C_0, \ldots, C_K\}$ be the corresponding clusters. We can always choose the enumeration in a way that is consistent with the hierarchy of inclusions of the elements of $\mathcal{C}$; that is, $C_0$ is the support of $f$ (which we assume for simplicity to be a connected set) and if $C_i \subset C_j$, then $i > j$. For a node $i > 0$, we denote with $\mathrm{parent}_i$ the unique node $j$ such that $C_j$ is the smallest element of $\mathcal{C}$ such that $C_j \supset C_i$. Similarly, $\mathrm{kid}_i$ is the pair of nodes $(j, j')$ such that $C_j$ and $C_{j'}$ are the maximal subsets of $C_i$. Finally, for $i > 0$, $\mathrm{sib}_i$ is the node $j$ such that there exists a $k$ for which $\mathrm{kid}_k = (i, j)$. For a cluster $C_i \in \mathcal{C}$, we set

$$M_i = \int_{C_i} f(x)\,dx, \quad (4)$$

which we refer to as the mass of $C_i$.

The true $\kappa$-tree can be defined recursively by associating with each node $i$ two numbers $\kappa'_i$ and $\kappa''_i$ such that $\kappa'_i - \kappa''_i$ is the salient mass of node $i$. For leaf nodes, the salient mass is the mass of the cluster, and for non-leaves it is the mass of the cluster boundary region. $\kappa'$ and $\kappa''$ are defined differently for each node type.

1. Internal nodes, including the root node:
$$\kappa''_0 = M_0 = 1, \qquad \kappa'_i = \kappa''_{\mathrm{parent}_i}, \qquad \kappa''_i = \sum_{j \in \mathrm{kid}_i} M_j + \sum_{k \in \mathrm{sib}_i} M_k.$$

2. Leaf nodes:
$$\kappa'_i = \kappa''_{\mathrm{parent}_i}, \qquad \kappa''_i = \kappa'_i - M_i.$$

To estimate the $\kappa$-tree, we use $\hat{f}$ instead of $f$ and let $m_i$ be the fraction of data contained in the cluster for tree node $i$ at birth. Again, define the estimated tree recursively:

$$\hat{\kappa}''_0 = 1, \qquad \hat{\kappa}'_i = \hat{\kappa}''_{\mathrm{parent}_i}, \qquad \hat{\kappa}''_i = \hat{\kappa}'_i - m_i + \sum_{j \in \mathrm{kid}_i} m_j.$$

In practice we subtract the above quantities from 1 to get an increasing scale that matches the $\lambda$ and $\alpha$ scales. Note that switching from the $\lambda$ to the $\alpha$ index does not change the overall shape of the tree, but switching to the $\kappa$ index does. In particular, the tallest leaf of the $\kappa$-tree corresponds to the cluster with the largest empirical mass. In both the $\lambda$ and $\alpha$ trees, on the other hand, leaves correspond to clusters composed of points with high density values. The difference can be substantial. Figure 3 illustrates the differences between the three types of indexing for the "crater" example in Figure 2.
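A small sketch of the estimated $\hat{\kappa}$ recursion, as reconstructed above, follows. The node and mass bookkeeping is illustrative rather than DeBaCl's internal representation: each node's empirical mass at birth and its list of children are assumed to be given.

def kappa_hat(masses, children, root=0):
    """Per-node (kappa-hat', kappa-hat'') values for the estimated kappa-tree.

    masses[i]   -- m_i, fraction of data in node i's cluster at its birth
    children[i] -- list of node i's children (empty list for a leaf)
    Recursion (as reconstructed above): kappa''_root = 1,
    kappa'_i = kappa''_parent(i),
    kappa''_i = kappa'_i - m_i + sum of the children's masses.
    """
    kp, kpp = {}, {root: 1.0}
    stack = [root]
    while stack:
        i = stack.pop()
        for c in children[i]:
            kp[c] = kpp[i]
            kpp[c] = kp[c] - masses[c] + sum(masses[j] for j in children[c])
            stack.append(c)
    return kp, kpp

For a leaf $c$ the child sum is empty, so $\hat{\kappa}''_c = \hat{\kappa}'_c - m_c$ and the branch's extent on the $\kappa$ scale equals its empirical mass, which is precisely the intuition the $\kappa$-tree is designed to satisfy.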