[PDF] [PDF] CSE601 Hierarchical Clustering

Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1 Motivation, definition, evaluation the dendrogram at the desired level, then each



Previous PDF Next PDF





[PDF] Hierarchical Clustering / Dendrograms

The agglomerative hierarchical clustering algorithms available in this In this example we can compare our interpretation with an actual plot of the data



[PDF] Chapter 7 Hierarchical cluster analysis

describing the algorithm, or set of instructions, which creates the dendrogram results In this chapter we demonstrate hierarchical clustering on a small example  



[PDF] Hierarchical Agglomerative Clustering - Université Lumière Lyon 2

Hierarchical cluster analysis The cluster dendrogram is very important to Example #HAC - single linkage cah



[PDF] Hierarchical Clustering - Princeton University Computer Science

For example, we partition organisms into different species, but science has also Hierarchical clustering constructs a (usually binary) tree over the data The leaves are distance matrix that was used to compute the dendrogram Notice how 



[PDF] Hierarchical clustering - CMU Statistics

29 jan 2013 · Given the linkage, hierarchical clustering produces a sequence of clustering Example of a dendrogram with no inversions 1 7 4 5 6 2 3



[PDF] CSE601 Hierarchical Clustering

Clustering Lecture 3: Hierarchical Methods Jing Gao SUNY Buffalo 1 Motivation, definition, evaluation the dendrogram at the desired level, then each



[PDF] Hierarchical clustering for gene expression data analysis

clusters) 2 Clustering of samples (columns) => identification of sub-types of related Dendrogram 1 For agglomerative hierarchical clustering we start with



[PDF] Hierarchical Clustering - Introduction to Information Retrieval

An HAC clustering is typically visualized as a dendrogram as shown in DENDROGRAM We will see an example of an inversion in Figure 17 12 Hierarchical 



[PDF] Automatic Extraction of Clusters from Hierarchical Clustering

dendrogram or reachability plot, guided by a visual inspection of the graphical representation The second Figure 1: Example dataset 1: hierarchical clustering  

[PDF] hierarchical clustering dendrogram python example

[PDF] hierarchical clustering elbow method python

[PDF] hierarchical clustering in r datanovia

[PDF] hierarchical clustering python scikit learn

[PDF] hierarchical clustering python scipy example

[PDF] hierarchical inheritance in java

[PDF] hierarchical network

[PDF] hierarchical network design pdf

[PDF] hierarchical regression table apa

[PDF] hierarchical structure journal article

[PDF] hierarchy java example

[PDF] hierarchy of law reports

[PDF] hifly a321

[PDF] hifly a380 interior

[PDF] hifly a380 model

Clustering

Lecture 3: Hierarchical Methods

Jing Gao

SUNY Buffalo

1

Outline

Basics

Motivation, definition, evaluation

Methods

Partitional

Hierarchical

Density-based

Mixture model

Spectral methods

Advanced topics

Clustering ensemble

Clustering in MapReduce

Semi-supervised clustering, subspace clustering, co-clustering, etc. 2 3

Hierarchical Clustering

Agglomerative approach

b d c e a a b d e c d e a b c d e

Step 0 Step 1 Step 2 Step 3 Step 4 bottom-up

Initialization:

Each object is a cluster

Iteration:

Merge two clusters which are

most similar to each other;

Until all objects are merged

into a single cluster 4

Hierarchical Clustering

Divisive Approaches

b d c e a a b d e c d e a b c d e

Step 4 Step 3 Step 2 Step 1 Step 0 Top-down

Initialization:

All objects stay in one cluster

Iteration:

Select a cluster and split it into

two sub clusters

Until each leaf cluster contains

only one object 5

Dendrogram

A tree that shows how clusters are merged/split

hierarchically Each node on the tree is a cluster; each leaf node is a singleton cluster 6

Dendrogram

A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster

Agglomerative Clustering Algorithm

More popular hierarchical clustering technique

Basic algorithm is straightforward

1.Compute the distance matrix

2.Let each data point be a cluster

3.Repeat

4. Merge the two closest clusters

5. Update the distance matrix

6.Until only a single cluster remains

Key operation is the computation of the distance between two clusters Different approaches to defining the distance between clusters distinguish the different algorithms 7

Starting Situation

Start with clusters of individual points and a distance matrix p1 p3 p5 p4 p2 p1 p2 p3 p4 p5 . . . . Distance Matrix p1p2p3p4p9p10p11p128

Intermediate Situation

After some merging steps, we have some clusters

Choose two clusters that has the smallest

distance (largest similarity) to merge C1 C4 C2 C5 C3 C2 C1 C1 C3 C5 C4 C2

C3 C4 C5

Distance Matrix

p1p2p3p4p9p10p11p12 9

Intermediate Situation

We want to merge the two closest clusters (C2 and C5) and update the distance matrix. C1 C4 C2 C5 C3 C2 C1 C1 C3 C5 C4 C2

C3 C4 C5

Distance Matrix

p1p2p3p4p9p10p11p12 10

After Merging

C1 C4

C2 U C5

C3 C2 U C5 C1 C1 C3 C4

C2 U C5

C3 C4

Distance Matrix

p1p2p3p4p9p10p11p12 11

How to Define Inter-Cluster Distance

p1 p3 p5 p4 p2 p1 p2 p3 p4 p5 . . .

Distance?

YMIN YMAX

YGroup Average

YDistance Between Centroids

Distance Matrix

12

MIN or Single Link

Inter-cluster distance

The distance between two clusters is represented by the distance of the closest pair of data objects belonging to different clusters. Determined by one pair of points, i.e., by one link in the proximity graph ),(min),( ,minqpdCCd MIN

Nested Clusters Dendrogram

1 2 3 4 5 6 1 2 3 4 5

3625410

0.05 0.1 0.15 0.2 14

Strength of MIN

Original Points Two Clusters

ͻ Can handle non-elliptical shapes

15

Limitations of MIN

Original Points

Two Clusters

ͻ Sensitive to noise and outliers

16

MAX or Complete Link

Inter-cluster distance

The distance between two clusters is represented by the distance of the farthest pair of data objects belonging to different clusters ),(max),( ,minqpdCCd MAX

Nested Clusters Dendrogram

3641250

0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 1 2 3 4 5 6 1 2 5 3 4 18

Strength of MAX

Original Points

Two Clusters

ͻ Less susceptible to noise and outliers

19

Limitations of MAX

Original Points

ͻTends to break large clusters

20 21

MIN (2 clusters) MAX (2 clusters)

Limitations of MAX

ͻBiased towards globular clusters

Group Average or Average Link

Inter-cluster distance

The distance between two clusters is represented by the average distance of all pairs of data objects belonging to different clusters Determined by all pairs of points in the two clusters ,minqpdavgCCd

Group Average

Nested Clusters Dendrogram

3641250

0.05 0.1 0.15 0.2 0.25 1 2 3 4 5 6 1 2 5 3 4 23

Group Average

Compromise between Single and Complete

Link

Strengths

Less susceptible to noise and outliers

Limitations

Biased towards globular clusters

24

Centroid Distance

Inter-cluster distance

The distance between two clusters is represented by the distance between the centers of the clusters

Determined by cluster centroids

),(),(jijimeanmmdCCd 25
Similarity of two clusters is based on the increase in squared error when two clusters are merged Similar to group average if distance between points is distance squared

Less susceptible to noise and outliers

Biased towards globular clusters

Hierarchical analogue of K-means

Can be used to initialize K-means

26

Comparison

Group Average

Ward's Method

1 2 3 4 5 6 1 2 5 3 4

MIN MAX

1 2 3 4 5 6 1 2 5 3 4 1 2 3 4 5 6 1 2 5 3 4 1 2 3 4 5 6 1 2 3 4 5 27

Time and Space Requirements

O(N2) space since it uses the distance matrix

N is the number of points

O(N3) time in many cases

There are N steps and at each step the size, N2,

distance matrix must be updated and searched

Complexity can be reduced to O(N2 log(N) ) time

for some approaches 28

Strengths

Do not have to assume any particular number

of clusters

Any desired number of clusters can be obtained

They may correspond to meaningful

taxonomies camera, ..), furniture, groceries 29

Problems and Limitations

Once a decision is made to combine two clusters, it cannot be undone

No objective function is directly minimized

Different schemes have problems with one or more of the following:

Sensitivity to noise and outliers

Difficulty handling different sized clusters and irregular shapes

Breaking large clusters

30

Take-away Message

Agglomerative and divisive hierarchical clustering

Several ways of defining inter-cluster distance

The properties of clusters outputted by different

approaches based on different inter-cluster distance definition

Pros and cons of hierarchical clustering

31
quotesdbs_dbs17.pdfusesText_23