[PDF] fastcluster: Fast Hierarchical Agglomerative Clustering Routines for R and Python








Modern hierarchical agglomerative clustering algorithms

12 Sept. 2011 · Hierarchical agglomerative clustering is an important and ... to the statistical software R and the programming language Python (van Rossum et al.).



fastcluster: Fast Hierarchical Agglomerative Clustering Routines for R and Python

1 May 2013 · It provides a fast C++ implementation of each algorithm and currently offers interfaces to R and Python. We improve both the asymptotic worst-case ...



Ward's Hierarchical Agglomerative Clustering Method: Which ...

18 Oct. 2014 · The criterion is denoted by him as ... 4.3 Implementation of Ward: Ward2. We now look at the Ward2 algorithm described in Kaufman and ...



Scalable Hierarchical Agglomerative Clustering

14 Aug. 2021 · ... examples. We find the clusters to be precise and coherent. The algorithm discovers a clustering related to tennis strategies which contains ...



A Comparative Study of Divisive and Agglomerative Hierarchical

31 March 2019 · Hierarchical Clustering Algorithms. Maurice Roux, Aix-Marseille Université, France. Abstract: A general scheme for divisive hierarchical ...



Package fastcluster

Fast hierarchical agglomerative clustering routines for R and Python. Description: ... package, but with much faster algorithms. Usage: ...



Higra: Hierarchical Graph Analysis

9 Oct. 2019 · Higra provides a Python API to ease its usage and to benefit ... classical agglomerative clustering algorithms hierarchies cannot ...



Overlapping Hierarchical Clustering (OHC)

29 May 2020 · ... may bias the data analysis process if, for example, ...



PACk: An Efficient Partition-based Distributed Agglomerative

The Agglomerative Hierarchical Clustering (AHC) algorithm is ... For example, the Microsoft Dynamics 365 service applies clustering ...



An Online Hierarchical Algorithm for Extreme Clustering

6 Apr. 2017 · ... clustering algorithms and scales well with both N and K ... with existing clustering methods; for example, under separability ...



Hierarchical Agglomerative Clustering Algorithm Example In Python

Hierarchical clustering algorithms group similar objects into groups called clusters. Learn how to implement hierarchical clustering in Python.



[PDF] Modern hierarchical agglomerative clustering algorithms - arXiv

12 Sept. 2011 · This paper presents algorithms for hierarchical agglomerative clustering which perform most efficiently in the general-purpose setup that ...



[PDF] Chapter 19 Hierarchical clustering

A hierarchical clustering can be represented as a dendrogram (a tree) by joining ... An example of a divisive hierarchical clustering algorithm is the ...



[PDF] Hierarchical clustering implementation - csPrinceton

Single-Link Hierarchical Clustering Iteration: the closest pair of clusters (i, j) is one with the smallest dist value.



[PDF] a framework for parallel hierarchical agglomerative clustering using

This paper studies the hierarchical clustering problem, where the goal is to produce a dendrogram that represents clusters at varying scales of a data set. We ...



Fast Hierarchical Agglomerative Clustering Routines for R and Python

18 Jan. 2023 · It provides a fast implementation of the most efficient current algorithms when the input is a dissimilarity index. Moreover, it features memory- ...



Modern hierarchical agglomerative clustering algorithms

27 Nov. 2022 · Request PDF: Modern hierarchical agglomerative clustering algorithms. This paper presents algorithms for hierarchical agglomerative ...



Hierarchical Clustering Python - Analytics Vidhya

27 May 2019 · Agglomerative Hierarchical Clustering; Divisive Hierarchical ... In the first example we have to predict the count of bikes based on ...



[PDF] Fast Hierarchical Agglomerative Clustering Routines for R and Python

1 May 2013 · It provides a fast implementation of the most efficient current algorithms when the input is a dissimilarity index. Moreover, it features ...



Fast Hierarchical Agglomerative Clustering Routines for R and Python

29 May 2013 · The fastcluster package is a C++ library for hierarchical agglomerative clustering that provides a fast implementation of the most ...


Journal of Statistical Software
May 2013, Volume 53, Issue 9. http://www.jstatsoft.org/

fastcluster: Fast Hierarchical, Agglomerative Clustering Routines for R and Python

Daniel Müllner
Stanford University

Abstract

The fastcluster package is a C++ library for hierarchical, agglomerative clustering. It provides a fast implementation of the most efficient, current algorithms when the input is a dissimilarity index. Moreover, it features memory-saving routines for hierarchical clustering of vector data. It improves both asymptotic time complexity (in most cases) and practical performance (in all cases) compared to the existing implementations in standard software: several R packages, MATLAB, Mathematica, Python with SciPy. The fastcluster package presently has interfaces to R and Python. Part of the functionality is designed as a drop-in replacement for the methods hclust and flashClust in R and scipy.cluster.hierarchy.linkage in Python, so that existing programs can be effortlessly adapted for improved performance.

Keywords: clustering, algorithm, hierarchical, agglomerative, linkage, single, complete, average, UPGMA, weighted, WPGMA, McQuitty, Ward, centroid, UPGMC, median, WPGMC, MATLAB, Mathematica, Python, SciPy, C++.

1. Introduction

Hierarchical clustering is an important, well-established technique in unsupervised machine learning. The common hierarchical, agglomerative clustering methods share the same algorithmic definition but differ in the way in which inter-cluster distances are updated after each clustering step (Anderberg 1973, page 133). The seven common clustering schemes are called single, complete, average (UPGMA), weighted (WPGMA, McQuitty), Ward, centroid (UPGMC) and median (WPGMC) linkage (Everitt, Landau, Leese, and Stahl 2011, Table 4.1). A variety of algorithms has been developed in the past decades to improve performance compared to the primitive algorithmic setup, in particular Anderberg (1973, page 135), Rohlf (1973), Sibson (1973), Day and Edelsbrunner (1984, Table 5), Murtagh (1984), Eppstein (2000) and Cardinal and Eppstein (2004). Also, hierarchical clustering methods have been implemented in standard scientific software such as R (R Core Team 2011), MATLAB (The MathWorks, Inc. 2011), Mathematica (Wolfram Research, Inc. 2010) and the SciPy library for the programming language Python (Jones, Oliphant, Peterson et al. 2001; van Rossum et al. 2011). Specifically, there are the following functions:

- hclust in R's stats package (R Core Team 2011),
- flashClust in R's flashClust package (Langfelder 2011),
- agnes in R's cluster package (Mächler, Rousseeuw, Struyf, Hubert, and Hornik 2011),
- linkage in MATLAB's statistics toolbox (The MathWorks, Inc. 2011),
- Agglomerate and DirectAgglomerate in Mathematica (Wolfram Research, Inc. 2010),
- linkage in the Python module scipy.cluster.hierarchy (Eads 2008).

We found that all these implementations do not give satisfactory performance because, insofar as this can be checked only for open-source software, they use inferior algorithms. The fastcluster package, which is available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=fastcluster, builds upon the author's work (Müllner 2011), where we identified three algorithms, including a new one by the author, as the most efficient current algorithms for the seven clustering schemes above, when the input is given by pairwise dissimilarities between elements. The fastcluster package also has memory-saving algorithms for some of the clustering schemes when the input is given as vector data. It provides a fast C++ implementation of each algorithm and currently offers interfaces to R and Python. We improve both the asymptotic worst-case time complexity (in those cases where this can be determined due to open source) and the practical performance (in all cases) of the existing implementations listed above.

The paper is structured as follows: In Section 2, we briefly introduce the clustering methods which our package provides in order to establish the technical context. The presentation is selective and focuses on the aspects which are important for this paper. For a general introduction to the well-known hierarchical clustering methods, we refer to textbooks, e.g., Anderberg (1973) or Everitt et al. (2011). Section 3 contains information about the algorithms and the implementation. Section 4 compares the performance of the package with existing software, both in a theoretical, asymptotic complexity analysis in Section 4.1 and by use case performance experiments in Sections 4.2 and 4.3. In Section 5, we explain how to use the fastcluster package and point to differences between the interfaces. This part is independent of the two preceding sections. A detailed user's manual is available in the package distribution. The paper finishes with a short conclusion in Section 6.
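As a concrete illustration of the drop-in design mentioned in the abstract, the following minimal Python sketch clusters a synthetic data set once with SciPy and once with fastcluster. The data and the choice of the "average" method are ours for illustration; only the interchangeable call reflects the package's documented interface.

    import numpy as np
    from scipy.spatial.distance import pdist
    import scipy.cluster.hierarchy as hierarchy
    import fastcluster

    X = np.random.rand(500, 3)    # 500 synthetic points in R^3
    D = pdist(X)                  # condensed dissimilarity index

    # Both calls take the same arguments and return the same kind of
    # (N - 1) x 4 linkage matrix, so switching an existing program to
    # fastcluster amounts to changing the module name.
    Z_scipy = hierarchy.linkage(D, method="average")
    Z_fast = fastcluster.linkage(D, method="average")

    # Downstream SciPy tools keep working on fastcluster's output.
    tree = hierarchy.dendrogram(Z_fast, no_plot=True)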

2. SAHN clustering methods

The clustering methods in this paper have been characterized by the acronym SAHN (sequential, agglomerative, hierarchic, nonoverlapping methods) by Sneath and Sokal (1973, Sections 5.4 and 5.5). The input to the methods is a dissimilarity index on a finite set (see Hansen and Jaumard 1997, Section 2.1). For a set S, this is by definition a map d: S × S → [0, ∞) which is reflexive and symmetric, i.e., we have d(x, x) = 0 and d(x, y) = d(y, x) for all x, y ∈ S. A metric on S is certainly a dissimilarity index. The dissimilarity index can be given directly to a clustering algorithm as the $\binom{N}{2}$ pairwise dissimilarities. This is the stored matrix approach (Anderberg 1973, Section 6.2).
Alternatively, the input points can be specified in a different manner, e.g., as points in a normed vector space, where the dissimilarity information is given implicitly by the metric on the ambient space. This input format is called the stored data approach (Anderberg 1973, Section 6.3). The first option has input size Θ(N²) for N elements, and the second option Θ(ND) for N points in a D-dimensional vector space. We call an algorithm memory-saving in this paper if it accepts vector data and the required memory is of class O(ND).

The common procedural definition for all clustering methods in this paper is as follows (cf. Anderberg 1973, Section 6.1):

1. Let S be the current set of nodes, with implicit or explicit dissimilarity information. Determine a pair of mutually closest points (a, b).
2. Join a and b into a new node n. Delete a and b from the set of nodes and add n to it.
3. Output the node labels a and b and their dissimilarity d(a, b).
4. Update the dissimilarity information by specifying the distance from n to all other nodes. This can be done explicitly by specifying the distances, or by defining a cluster representative in the stored data approach.
5. Repeat steps 1 to 4 until there is a single node left, which contains all the initial nodes.

The clustering schemes differ in the update formula for cluster dissimilarities in step 4. Table 1 lists the formulas for the seven common clustering schemes. The output of the clustering procedure is a list of N - 1 triples (a_i, b_i, δ_i), which encodes a stepwise dendrogram (see Müllner 2011, Section 2.2, for the difference to non-stepwise variants): the i-th triple contains the information which nodes are joined into a new node in the i-th step, and what was the cluster dissimilarity between a_i and b_i. This is sufficient information to draw the usual representation of the dendrogram as a rooted tree, where the leaves are the initial nodes, and a branching point at a given height δ_i represents the joining of nodes a_i, b_i with mutual distance δ_i := d(a_i, b_i).

Note that the output of the clustering procedure above is not unique: if more than one pair of nodes realizes the current minimal distance in step 1, any of them might be chosen, and this influences later steps. The algorithms in the fastcluster package are correct in the sense that they always return one of the possible outputs of the procedure above, and hence resolve ties in one of possibly several correct ways. For detailed information on why the handling of ties is a non-trivial matter and how it influences the choice of algorithms in the fastcluster package, see Müllner (2011, Sections 3 and 5).

The procedural definition of the clustering scheme above already constitutes a primitive algorithm. This algorithm has a time complexity of Θ(N³) for N input points, since in the i-th iteration a pair of closest nodes is searched among N - i + 1 nodes in step 1, which requires Θ((N - i)²) comparisons by an exhaustive search. The fastcluster package reduces the time complexity from cubic to quadratic: as a theoretically guaranteed worst-case complexity for five of the seven methods, and as a heuristic behavior in all observed cases for the "centroid" and "median" distance update formulas.
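For concreteness, the following Python sketch spells out the primitive cubic procedure above for the single linkage update. It is an illustration only: the function name is ours, and it makes no attempt at the package's faster algorithms.

    import numpy as np

    def naive_sahn_single(D):
        """Primitive O(N^3) SAHN clustering with single linkage.

        D is a full (N x N) symmetric dissimilarity matrix. Returns a list
        of N - 1 triples (a, b, delta): the labels joined in each step and
        their cluster dissimilarity, i.e., a stepwise dendrogram.
        """
        D = np.asarray(D, dtype=float).copy()
        N = D.shape[0]
        labels = list(range(N))      # current node label for each row
        active = list(range(N))      # row/column indices still in use
        steps = []
        next_label = N               # new nodes get labels N, N+1, ...
        for _ in range(N - 1):
            # Step 1: exhaustive search for a closest pair of nodes.
            best = np.inf
            for i_pos, i in enumerate(active):
                for j in active[i_pos + 1:]:
                    if D[i, j] < best:
                        best, a, b = D[i, j], i, j
            # Steps 2-3: join a and b into a new node and record the merge.
            steps.append((labels[a], labels[b], best))
            # Step 4: single linkage update, d(n, k) = min(d(a, k), d(b, k)).
            for k in active:
                if k not in (a, b):
                    D[a, k] = D[k, a] = min(D[a, k], D[b, k])
            labels[a] = next_label   # reuse row a for the merged node n
            next_label += 1
            active.remove(b)         # node b is deleted
        return steps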

Table 1: Agglomerative clustering schemes. Let I, J be two clusters joined into a new cluster, and let K be any other cluster. Denote by n_I, n_J and n_K the sizes of (i.e., number of elements in) clusters I, J, K, respectively. Each entry gives the distance update formula for d(I ∪ J, K) and, where available, the cluster dissimilarity between clusters A and B.

- Single: update $\min(d(I,K),\, d(J,K))$; dissimilarity $\min_{a\in A,\, b\in B} d[a,b]$.
- Complete: update $\max(d(I,K),\, d(J,K))$; dissimilarity $\max_{a\in A,\, b\in B} d[a,b]$.
- Average: update $\frac{n_I\, d(I,K) + n_J\, d(J,K)}{n_I + n_J}$; dissimilarity $\frac{1}{|A|\,|B|} \sum_{a\in A}\sum_{b\in B} d[a,b]$.
- Weighted/McQuitty: update $\frac{d(I,K) + d(J,K)}{2}$.
- Ward: update $\sqrt{\frac{(n_I+n_K)\, d(I,K)^2 + (n_J+n_K)\, d(J,K)^2 - n_K\, d(I,J)^2}{n_I+n_J+n_K}}$; dissimilarity $\sqrt{\frac{2\,|A|\,|B|}{|A|+|B|}}\; \lVert \vec c_A - \vec c_B \rVert_2$.
- Centroid: update $\sqrt{\frac{n_I\, d(I,K)^2 + n_J\, d(J,K)^2}{n_I+n_J} - \frac{n_I\, n_J\, d(I,J)^2}{(n_I+n_J)^2}}$; dissimilarity $\lVert \vec c_A - \vec c_B \rVert_2$.
- Median: update $\sqrt{\frac{d(I,K)^2}{2} + \frac{d(J,K)^2}{2} - \frac{d(I,J)^2}{4}}$; dissimilarity $\lVert \vec w_A - \vec w_B \rVert_2$.

The update formulas for the "Ward", "Centroid" and "Median" methods assume that the input points are given as vectors in Euclidean space with the Euclidean distance as dissimilarity measure. The expression $\vec c_X$ denotes the centroid of a cluster X. The point $\vec w_X$ is defined iteratively and depends on the order of clustering steps: if the cluster L is formed by joining I and J, we define $\vec w_L$ as the midpoint $\frac{1}{2}(\vec w_I + \vec w_J)$. References: Lance and Williams (1967); Kaufman and Rousseeuw (1990, Section 5.5.1).
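A compact way to read Table 1, not spelled out in this excerpt but implicit in the reference to Lance and Williams (1967): every update formula above is an instance of their recurrence, applied to the distances themselves for the combinatorial methods and to squared distances for "Ward", "centroid" and "median",

    \[
      d(I \cup J, K) = \alpha_I \, d(I,K) + \alpha_J \, d(J,K)
                       + \beta \, d(I,J) + \gamma \, \lvert d(I,K) - d(J,K) \rvert ,
    \]

with, for example, $\alpha_I = \alpha_J = \tfrac{1}{2}$, $\beta = 0$, $\gamma = -\tfrac{1}{2}$ reproducing the single linkage row and $\gamma = +\tfrac{1}{2}$ the complete linkage row.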

The formulas in Table 1 show that inter-cluster distances can be conveniently defined and computed in constant time from (weighted or unweighted) cluster centroids in Euclidean space for the "Ward", "centroid" and "median" methods. Hence, these clustering schemes are predestined for a memory-saving clustering algorithm where not all distance values are stored but distances are computed as they are needed. Moreover, there are algorithms for single linkage clustering (including the one which is used in the fastcluster package) which read in every pairwise dissimilarity value between initial nodes exactly once, and otherwise need only O(N) temporary memory. Hence, also single linkage clustering can be implemented in a memory-saving manner. The fastcluster package contains memory-saving algorithms for these four methods. It assumes Euclidean distances between input vectors for the "Ward", "centroid" and "median" methods and provides a variety of metrics for the "single" method.
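As a small illustration of the constant-time evaluation claimed above, the Ward dissimilarity of Table 1 can be computed from cluster sizes and centroids alone. The sketch below (function name ours, NumPy assumed) recomputes the centroids from the raw points; a memory-saving implementation would instead keep running sizes and centroids.

    import numpy as np

    def ward_dissimilarity(points_a, points_b):
        """Ward cluster dissimilarity from Table 1:
        sqrt(2 |A| |B| / (|A| + |B|)) * || c_A - c_B ||_2."""
        a = np.asarray(points_a, dtype=float)   # shape (n_A, D)
        b = np.asarray(points_b, dtype=float)   # shape (n_B, D)
        n_a, n_b = len(a), len(b)
        gap = a.mean(axis=0) - b.mean(axis=0)   # difference of centroids
        return np.sqrt(2.0 * n_a * n_b / (n_a + n_b)) * np.linalg.norm(gap)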


3. Algorithms and implementation

The efficient algorithms which are used in the fastcluster package were described in detail by the author in Müllner (2011). In that paper, the author introduced an algorithm which works with any distance formula, proved the correctness of two existing algorithms by Rohlf and Murtagh, and identified the most efficient algorithms. The fastcluster package builds upon this knowledge and implements the most efficient algorithms. There are three algorithms in total:

For single linkage clustering, an algorithm by Rohlf (1973) is used. It links Prim's algorithm for the minimum spanning tree of a graph (see Cormen, Leiserson, Rivest, and Stein 2009, Section 23.2) with the observation that a single linkage dendrogram can be obtained from the minimum spanning tree of a graph. More precisely, this graph is the weighted, complete graph whose adjacency matrix is the dissimilarity index (Gower and Ross 1969).

The "complete", "average", "weighted" and "Ward" (for dissimilarity input) methods use the nearest-neighbor chain algorithm by Murtagh (1984, page 86). This algorithm exploits the fact that pairs of mutually closest points can be merged in any order, if the distance update formula fulfills certain criteria (see also Müllner 2011, Theorem 3). The "centroid", "median" and "Ward" (vector input) methods use the author's algorithm (Müllner 2011, Section 3.1), which delays repeated searches for nearest neighbors as long as possible.
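The following Python sketch shows the nearest-neighbor chain idea for the average linkage update. It is our illustration of the principle, not the package's optimized C++ code, and beyond a simple tie-breaking rule it ignores the subtleties of tied distances discussed in Section 2.

    import numpy as np

    def nn_chain_average(D):
        """Nearest-neighbor chain clustering with average linkage (UPGMA).

        D: full symmetric (N x N) dissimilarity matrix. Returns N - 1 merge
        triples (label_a, label_b, dissimilarity), i.e., a stepwise
        dendrogram in merge order. Runs in O(N^2) time, relying on the
        reducibility property that the average update satisfies.
        """
        D = np.asarray(D, dtype=float).copy()
        N = D.shape[0]
        size = {i: 1 for i in range(N)}    # cluster sizes, keyed by row
        label = {i: i for i in range(N)}   # output labels, keyed by row
        active = set(range(N))
        next_label = N
        chain, merges = [], []
        while len(active) > 1:
            if not chain:
                chain.append(next(iter(active)))
            a = chain[-1]
            prev = chain[-2] if len(chain) > 1 else None
            # Nearest active neighbor of a; prefer the chain predecessor
            # on ties so reciprocal pairs are detected reliably.
            b, best = None, np.inf
            for k in active:
                if k == a:
                    continue
                d = D[a, k]
                if d < best or (d == best and k == prev):
                    b, best = k, d
            if b == prev:
                # a and b are mutually closest: merge them.
                chain.pop(); chain.pop()
                merges.append((label[a], label[b], best))
                na, nb = size[a], size[b]
                for k in active:
                    if k not in (a, b):
                        D[a, k] = D[k, a] = (na * D[a, k] + nb * D[b, k]) / (na + nb)
                size[a], label[a] = na + nb, next_label
                next_label += 1
                active.remove(b)
            else:
                chain.append(b)
        return merges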

The algorithms are fairly optimized for speed. For example, there are two variants of the author's algorithm, which differ by the indexing of intermediate results and how the list of nearest neighbors for all nodes is organized. These two variants perform differently enough to make the distinction worthwhile. The final choice which algorithm to use for which method was then made according to performance on test datasets.

All algorithms in the fastcluster package are implemented in a C++ library. The core code is the same for both interfaces. It is accompanied by language-specific wrappers, which handle the input and output in the interface-specific array data structures and also handle the different indexing conventions and extra output (like the order field in the R interface). The interface code in the interpreted languages R and Python is very short, so that the overhead is low. The C++ code extensively uses template programming to avoid code duplication among the seven methods. Also the data types are flexible through template programming. The current setup always uses double precision for floating-point numbers, since this is the default in Python and R. The indices to the data points are represented by at least 32-bit wide signed integers, and indices to the dissimilarity array of size $\binom{N}{2}$ are represented by signed 64-bit integers. This is in theory sufficient to handle 2^29 - 1 data points, hence does not pose an actual obstruction.
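To make the size-$\binom{N}{2}$ array concrete: the Python interfaces (scipy.cluster.hierarchy.linkage and its fastcluster drop-in) accept dissimilarities in the "condensed" form produced by scipy.spatial.distance.pdist, i.e., the upper triangle stored row by row. The small helper below is ours, added only to illustrate that layout.

    def condensed_index(n, i, j):
        """Offset of the pair (i, j), i != j, in a condensed dissimilarity
        array of length n * (n - 1) // 2 (upper triangle, row-major,
        no diagonal)."""
        if i > j:
            i, j = j, i
        return n * i - i * (i + 1) // 2 + (j - i - 1)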

4. Performance

In this section, we compare the performance of the fastcluster package with the other implementations in standard software. In Section 4.1, we analyze the asymptotic time complexity of the algorithms whose source code is available, both in the worst and in the best case. We then compare the performance of all implementations experimentally on a range of test datasets. Section 4.2 deals with the case when the input is a dissimilarity index, and Section 4.3 covers the memory-saving routines for vector input.


4.1. Asymptotic run-time complexity

The asymptotic worst-case time complexity of the methods in open-source packages (agnes in cluster, flashClust in flashClust, hclust in stats and linkage in SciPy) is Θ(N³) throughout. These bounds were determined by careful inspection of the source code and constructing series of worst-case examples. The fastcluster package improves performance to Θ(N²) worst-case time complexity for the "single", "complete", "average", "weighted" and "Ward" (dissimilarity input case only) methods. For the "centroid", "median" and "Ward" (vector input) methods, we still use O(N³) algorithms due to their better performance in practice, even though quadratic run-time algorithms are available (Eppstein 2000). Regarding the best-case complexity, only the fastcluster and flashClust packages achieve the theoretically optimal bound of Θ(N²). SciPy, agnes and hclust have a best-case complexity of only Θ(N³).

4.2. Use case performance: Dissimilarity methods

As indicated by the asymptotic best-case complexity, the various algorithms indeed perform differently (quadratic vs. cubic time) in test cases. The diagrams in Figure 1 show the performance on a range of test datasets for all seven methods. The fastcluster package consistently outperforms all other packages, and by rather significant factors in the majority of cases. Also, the performance on very small datasets is very good, due to a low overhead. To be fair, the measurements for small datasets depend on the software environment in addition to the clustering algorithms, so, e.g., the R fastcluster package can only be directly compared to the other three R packages: here it has the lowest overhead. The other graphs were generated by different interpreters (Python, MATLAB, Mathematica). fastcluster's Python interface has even lower overhead than the R version and gives otherwise similar timings. In summary, the fastcluster package not only improves the theoretical, asymptotic bounds of the clustering algorithms, but it also significantly improves the run-time performance of the existing implementations.

The test sets were synthetic datasets: i.i.d. samples from a mixture of multivariate Gaussian distributions in Euclidean space with standard covariance. For each number of input points N, we generated 12 test sets by varying the following parameters:

- Dimension of the Euclidean space: 2, 3, 10, 200.
- Number of centers of Gaussian distributions: 1, 5, round(√N). The centers are also distributed according to a Gaussian distribution.

Moreover, for the methods for which it makes sense (single, complete, average, weighted: the "combinatorial" methods), we also generated 10 test sets per number of input points with a uniform distribution of dissimilarities. The timings were obtained on a PC with an Intel dual-core CPU T7500 with 2.2 GHz clock speed and 4 GB of RAM and no swap space. The operating system was Ubuntu 11.04, 64-bit.
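A minimal sketch of generating one such synthetic test set (NumPy assumed; the scale of the Gaussian that places the centers is not specified in the text, so a unit Gaussian is used here, and the function name is ours):

    import numpy as np

    rng = np.random.default_rng(0)

    def gaussian_mixture_test_set(n_points, dim, n_centers):
        """i.i.d. sample from a mixture of n_centers Gaussians with identity
        covariance; the centers are themselves Gaussian-distributed."""
        centers = rng.standard_normal((n_centers, dim))
        which = rng.integers(n_centers, size=n_points)
        return centers[which] + rng.standard_normal((n_points, dim))

    # The experiments vary dim over {2, 3, 10, 200} and n_centers over
    # {1, 5, round(sqrt(n_points))} for each number of input points.
    X = gaussian_mixture_test_set(n_points=1000, dim=10, n_centers=5)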

Software versions: R 2.13.0, fastcluster 1.1.5, flashClust 1.01, cluster 1.13.3, stats 2.13.0, Python 2.7.1, NumPy 1.5.1, SciPy 0.8.0.
[Figure 1 (excerpt): CPU time in s versus number of points, log-log axes (10^1 to 10^4 points, 10^-5 to 10^3 s), one panel per method; the panels visible in this excerpt are Single, Complete, Average and Weighted.]