[PDF] [PDF] (I) Cluster Analysis - Mining Latent Entity Structures

CS 412 Intro to Data Mining Chapter 10 3 Chapter 10 Cluster Analysis: Basic Concepts and Methods User-given preferences or constraints; domain knowledge; user queries Given K, the number of clusters, the K-Means clustering algorithm is outlined as follows From wikipedia and http://home dei polimi it 



Previous PDF Next PDF





[PDF] Cluster Analysis - Computer Science & Engineering User Home Pages

We then describe three specific clustering techniques that represent Page 4 490 Chapter 8 Cluster Analysis: Basic Concepts and Algorithms broad categories of  



[PDF] Data Mining Cluster Analysis - Computer Science & Engineering

Data Mining Cluster Analysis: Basic Concepts and Algorithms Fannie-Mae- DOWN,Fed-Home-Loan-DOWN, Hierarchical clustering algorithms typically have local objectives Traditional hierarchical algorithms use a similarity or distance 



[PDF] CS6220: Data Mining Techniques

5 oct 2014 · Cluster Analysis: Basic Concepts clustering • Land use: Identification of areas of similar land use in an earth Partitioning Algorithms: Basic Concept http:// webdocs cs ualberta ca/~yaling/Cluster/Applet/Code/Cluster html



[PDF] Data Mining Cluster Analysis: Basic Concepts and - DidaWiki

Cluster Analysis: Basic Concepts and Algorithms Fannie-Mae-DOWN,Fed- Home-Loan-DOWN, Source: http://cs jhu edu/~razvanm/fs-expedition/tux3 html  



[PDF] (I) Cluster Analysis - Mining Latent Entity Structures

CS 412 Intro to Data Mining Chapter 10 3 Chapter 10 Cluster Analysis: Basic Concepts and Methods User-given preferences or constraints; domain knowledge; user queries Given K, the number of clusters, the K-Means clustering algorithm is outlined as follows From wikipedia and http://home dei polimi it 



[PDF] Introduction to Data Mining

8 Cluster Analysis: Basic Concepts and Algorithms 125 9 Cluster Analysis: them to the user in a more concise form, e g , by reporting the 10 most frequent 



[PDF] Clustering techniques and unsupervised learning - Berkeley bCourses

Cluster Analysis: Basic Concepts and Algorithms, Chapter 8, Tan, Steinbach, Kumar, University of http://www-users cs umn edu/~kumar/dmbook/ch8 pdf



[PDF] Some Key Concepts in Data Mining – Clustering - DIMACS

and Theoretical Computer Science Volume tain large numbers of variables of different types: geographic (home address, work Data users need to be aware of all these effects before We begin our discussion of clustering algorithms with a simple to describe the significance and meaning of the results of clustering



[PDF] Cluster Analysis - UCL

Aggarwal, C C and Reddy, C K (2014), Data Clustering: Algorithms and Applications, Further (somewhat outdated) books on cluster analysis are for example Gordon basic tasks for the development of human language and conceptual thinking This assumes that the dataset in in the directory in which R is run;



A Data-Clustering Algorithm On Distributed Memory Multiprocessors

WWW home page: http://www cs utexas edu/users/inderjit 2 IBM Almaden Our interest in clustering stems from the need to mine and analyze heaps of unstructured concepts” in sets of unstructured text documents, and to summarize and label In this paper, as our main contribution, we propose a parallel clustering al-

[PDF] Manipulation des donnees avec Pandas

[PDF] Base R cheat sheet - RStudio

[PDF] Spark SQL: Relational Data Processing in Spark - UC Berkeley

[PDF] Cours 4 data frames

[PDF] Data Mart Consolidation - IBM Redbooks

[PDF] Data mining 1 Exploration Statistique - Institut de Recherche

[PDF] Cours de Data Mining

[PDF] Cours IFT6266, Exemple d'application: Data-Mining

[PDF] Introduction au Data Mining - Cedric/CNAM

[PDF] Defining a Data Model - CA Support

[PDF] Learning Data Modelling by Example - Database Answers

[PDF] Nouveaux prix à partir du 1er août 2017 Mobilus Mobilus - Proximus

[PDF] règlement général de la consultation - Inventons la Métropole du

[PDF] Data science : fondamentaux et études de cas

[PDF] Bases du data scientist - Data science Master 2 ISIDIS - LISIC

CS 412 Intro. to Data Mining

Chapter 10. Cluster Analysis: Basic Concepts and

Methods

Jiawei Han, Computer Science, Univ. Illinois at Urbana-Champaign, 2017 1 2 3

Chapter 10.

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: An Introduction

Partitioning Methods

Hierarchical Methods

Density-and Grid-Based Methods

Evaluation of Clustering (Coverage will be based on the available time)

Summary

4

Cluster Analysis:

An Introduction

What Is Cluster Analysis?

Applications of Cluster Analysis

Cluster Analysis: Requirements and Challenges

Cluster Analysis: A Multi-Dimensional Categorization

An Overview of Typical Clustering Methodologies

An Overview of Clustering Different Types of Data

An Overview of User Insights and Clustering

5

What Is

Cluster Analysis

What is a cluster?

A cluster is a collection of data objects which are Similar (or related) to one another within the same group (i.e., cluster) Dissimilar (or unrelated) to the objects in other groups (i.e., clusters) Given a set of data points, partition them into a set of groups (i.e., clusters) which are as similar as possible Cluster analysis is unsupervised learning (i.e., no predefined classes) This contrasts with classification(i.e., supervised learning)

Typical ways to use/apply cluster analysis

As a stand-alone tool to get insight into data distribution, or As a preprocessing (or intermediate) step for other algorithms 6

What Is Good Clustering?

A good clustering method will produce high quality clusters which should have High intra-class similarity:Cohesivewithin clusters Low inter-class similarity: Distinctivebetween clusters

Quality function

a cluster

The answer is typically highly subjective

There exist many similarity measures and/or functions for different applications Similarity measure is critical for cluster analysis 7

Cluster Analysis

: Applications A key intermediate step for other data mining tasks Generating a compact summary of data for classification, pattern discovery, hypothesis generation and testing, etc.

Data summarization, compression, and reduction

Ex. Image processing: Vector quantization

Collaborative filtering, recommendation systems, or customer segmentation

Find like-minded users or similar products

Dynamic trend detection

Clustering stream data and detecting trends and patterns Multimedia data analysis, biological data analysis and social network analysis Ex. Clustering images or video/audio clips, gene/protein sequences, etc. 8

Considerations for Cluster Analysis

Partitioning criteria

Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable, e.g., grouping topical terms)

Separation of clusters

Exclusive (e.g., one customer belongs to only one region) vs. non- exclusive (e.g., one document may belong to more than one class)

Similarity measure

Distance-based (e.g., Euclidean, road network, vector) vs. connectivity- based (e.g., density or contiguity)

Clustering space

Full space (often when low dimensional) vs. subspaces (often in high- dimensional clustering) 9

Requirements and Challenges

Quality

Ability to deal with different types of attributes: Numerical, categorical, text, multimedia, networks, and mixture of multiple types

Discovery of clusters with arbitrary shape

Ability to deal with noisy data

Scalability

Clustering all the data instead of only on samples

High dimensionality

Incremental or stream clustering and insensitivity to input order

Constraint-based clustering

User-given preferences or constraints; domain knowledge; user queries

Interpretability and usability

The final generated clusters should be semantically meaningful and useful 10

Chapter 10.

Cluster Analysis: Basic Concepts and Methods

Cluster Analysis: An Introduction

Partitioning Methods

Hierarchical Methods

Density-and Grid-Based Methods

Evaluation of Clustering

Summary

11

Partitioning

Based Clustering Methods

Basic Concepts of Partitioning Algorithms

The K-Means Clustering Method

Initialization of K-Means Clustering

The K-MedoidsClustering Method

The K-Medians and K-Modes Clustering Methods

The Kernel K-Means Clustering Method

12

Partitioning Algorithms: Basic Concepts

Partitioning method:Discovering the groupings in the data by optimizing a specific objective function and iteratively improving the quality of partitions K-partitioning method: Partitioning a dataset Dof nobjects into a set of Kclusters so that an objective function is optimized (e.g., the sum of squared distances is minimized, where ckis the centroid or medoidof cluster Ck) A typical objective function: Sum of Squared Errors (SSE) Problem definition: Given K, find a partition of K clusters that optimizes the chosen partitioning criterion Global optimal: Needs to exhaustively enumerate all partitions Heuristic methods (i.e., greedy algorithms): K-Means, K-Medians, K-Medoids, etc. 2

1( ) || ||

iCk K ik kxSSE C x c 13 The K Means

Clustering Method

Each cluster is represented by the center of the cluster Given K, the number of clusters, the K-Meansclustering algorithm is outlined as follows

Select Kpoints as initial centroids

Repeat

Form Kclusters by assigning each point to its closest centroid Re-compute the centroids (i.e., mean point)of each cluster

Until convergence criterion is satisfied

Different kinds of measures can be used

Manhattan distance (L1norm), Euclidean distance (L2norm),Cosine similarity 14

Example:

K Means

Clustering

The original data points &

randomly select K = 2 centroids

Select Kpoints as initial centroids

Repeat

Form Kclusters by assigning each point to its closest centroid Re-compute the centroids (i.e., mean point)of each cluster

Until convergence criterion is satisfied

Assign

points to clusters

Recompute

cluster centers

Redo point assignment

Execution of the K-MeansClustering Algorithm

15

Discussion on the

K Means

Method

Efficiency: O(tKn) where n: # of objects, K: # of clusters, and t: # of iterations

Normally, K, t<< n; thus, an efficient method

K-means clustering often terminates at a local optimal Initialization can be important to find high-quality clusters Need to specify K, the numberof clusters, in advance

Sensitive to noisy data and outliers

Variations: Using K-medians, K-medoids, etc.

K-means is applicable only to objects in a continuous n-dimensional space

Using the K-modes for categorical data

Not suitable to discover clusters with non-convex shapes Using density-based clustering, kernel K-means, etc. 16

Variations of

K Means There are many variants of the K-Meansmethod, varying in different aspects

Choosing better initial centroid estimates

K-means++, Intelligent K-Means, Genetic K-Means

Choosing different representative prototypes for the clusters

K-Medoids, K-Medians, K-Modes

Applying feature transformation techniques

Weighted K-Means, Kernel K-MeansTo be discussed in this lecture

To be discussed in this lecture

To be discussed in this lecture

17

Poor Initialization in K

Means May Lead to Poor Clustering

Rerun of the K-Meansusing another random Kseeds

This run of K-Means generates a poor quality clustering

Recompute

cluster centers

Assign

points to clusters

Another random selection of k

centroids for the same data points 18

Initialization of K

Means: Problem and Solution

Different initializations may generate rather different clustering results (some could be far from optimal) Need to run the algorithm multiple times using different seeds There are many methods proposed for better initialization of kseeds

The first centroid is selected at random

The next centroid selected is the one that is farthest from the currently selected (selection is based on a weighted probability score) The selection continues until Kcentroids are obtained 19

Handling Outliers: From

K Means to K

Medoids

large value may substantially distort the distribution of the data K-Medoids: Instead of taking the meanvalue of the object in a cluster as a reference point, medoidscan be used, which is the most centrally locatedobject in a cluster

The K-Medoidsclustering algorithm:

Select Kpoints as the initial representative objects (i.e., as initial K medoids)

Repeat

Assigning each point to the cluster with the closest medoid

Randomly select a non-representative object oi

Compute the total cost Sof swapping the medoidmwith oi If S< 0, then swap mwith oito form the new set of medoids

Until convergence criterion is satisfied

quotesdbs_dbs20.pdfusesText_26