CS6220: DATA MINING TECHNIQUES

Instructor: Yizhou Sun

yzsun@ccs.neu.edu

October 5, 2014

Matrix Data: Clustering: Part 1

Methods to Learn

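Data types (columns): Matrix Data; Set Data; Sequence Data; Time Series; Graph & Network

Classification

Matrix Data: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN

Sequence Data: HMM

Graph & Network: Label Propagation

Clustering

Matrix Data: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means

Graph & Network: SCAN; Spectral Clustering

Frequent Pattern Mining

Set Data: Apriori; FP-growth

Sequence Data: GSP; PrefixSpan

Prediction

Matrix Data: Linear Regression

Time Series: Autoregression

Similarity Search

Time Series: DTW

Graph & Network: P-PageRank

Ranking

Graph & Network: PageRank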

Matrix Data: Clustering: Part 1

Cluster Analysis: Basic Concepts

Partitioning Methods

Hierarchical Methods

Density-Based Methods

Evaluation of Clustering

Summary


What is Cluster Analysis?

Cluster: A collection of data objects

Similar (or related) to one another within the same group

Dissimilar (or unrelated) to the objects in other groups

Cluster analysis: Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters

Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised)

Typical applications

As a stand-alone tool to get insight into data distribution

As a preprocessing step for other algorithms

Applications of Cluster Analysis

Data reduction

Summarization: Preprocessing for regression, PCA, classification, and association analysis

Compression: Image processing: vector quantization

Prediction based on groups

Cluster & find characteristics/patterns for each group

Finding K-nearest Neighbors

Localizing search to one or a small number of clusters

Outlier detection: Outliers are often viewed as those "far away" from any cluster

Clustering: Application Examples

Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species

Information retrieval: document clustering

Land use: Identification of areas of similar land use in an earth observation database

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

City planning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

Climate: understanding earth climate, find patterns of atmospheric and ocean

Basic Steps to Develop a Clustering Task

Feature selection

Select info concerning the task of interest

Minimal information redundancy

Proximity measure

Similarity of two feature vectors

Clustering criterion

Expressed via a cost function or some rules

Clustering algorithms

Choice of algorithms

Validation of the results

Validation test (also, clustering tendency test)

Interpretation of the results

Integration with applications


Requirements and Challenges

Scalability

Clustering all the data instead of only on samples

Ability to deal with different types of attributes

Numerical, binary, categorical, ordinal, linked, and mixture of these

Constraint-based clustering

User may give inputs on constraints

Use domain knowledge to determine input parameters

Interpretability and usability

Others

Discovery of clusters with arbitrary shape

Ability to deal with noisy data

Incremental clustering and insensitivity to input order

High dimensionality



Partitioning Algorithms: Basic Concept

Partitioning method: Partitioning a dataset D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where $c_i$ is the centroid or medoid of cluster $C_i$)

$$E = \sum_{i=1}^{k} \sum_{p \in C_i} \big(d(p, c_i)\big)^2$$

Given k, find a partition of k clusters that optimizes the chosen partitioning criterion

Global optimal: exhaustively enumerate all partitions

Heuristic methods: k-means and k-medoids algorithms

k-means (MacQueen'67, Lloyd'57/'82): Each cluster is represented by the center of the cluster

k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster

The K-Means Clustering Method

Given k, the k-means algorithm is implemented in four steps:

Step 0: Partition objects into k nonempty subsets

Step 1: Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster)

Step 2: Assign each object to the cluster with the nearest seed point

Step 3: Go back to Step 1, stop when the assignment does not change
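Steps 0 to 3 above can be sketched in a few lines of NumPy. This is a minimal illustration, not the course's reference implementation: the function names `k_means` and `sse`, the `seed` and `max_iter` parameters, the use of Euclidean distance, and the rule of keeping the old centroid when a cluster becomes empty are all assumptions made for the example. The `sse` helper evaluates the sum-of-squared-distances criterion for the final partition.

```python
import numpy as np

def k_means(points, k, seed=0, max_iter=100):
    """Sketch of k-means following Steps 0-3 (assumes len(points) >= k)."""
    rng = np.random.default_rng(seed)
    n = len(points)
    # Step 0: arbitrary partition into k nonempty subsets.
    labels = np.arange(n) % k      # every label 0..k-1 appears at least once
    rng.shuffle(labels)
    centroids = np.zeros((k, points.shape[1]))
    for _ in range(max_iter):
        # Step 1: the centroid of each cluster is the mean of its members.
        for i in range(k):
            members = points[labels == i]
            if len(members) > 0:   # assumption: keep the old centroid if a cluster empties
                centroids[i] = members.mean(axis=0)
        # Step 2: assign each object to the cluster with the nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 3: stop when the assignment does not change, otherwise repeat.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, centroids

def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return float(((points - centroids[labels]) ** 2).sum())

# Hypothetical toy data: two well-separated groups of three points.
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = k_means(points, k=2)
print(labels, sse(points, labels, centroids))
```

On data this cleanly separated, the loop should recover the two groups regardless of the initial partition; on harder data the result depends on the starting partition.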

An Example of K-Means Clustering

[Figure: k-means with K=2. Panels: the initial data set; arbitrarily partition objects into k groups; update the cluster centroids; reassign objects; loop if needed.]

Partition objects into k nonempty subsets

Repeat

Compute centroid (i.e., mean point) for each partition

Assign each object to the cluster of its nearest centroid

Until no change

Comments on the K-Means Method

Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.

Comment: Often terminates at a local optimum

Weakness

Applicable only to objects in a continuous n-dimensional space

Using the k-modes method for categorical data

In comparison, k-medoids can be applied to a wide range of data

Need to specify k, the number of clusters, in advance (there are ways to automatically determine the best k; see Hastie et al., 2009)

Sensitive to noisy data and outliers

Not suitable to discover clusters with non-convex shapes
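The comment that k-means often terminates at a local optimum can be seen on a small hand-made example. The sketch below is illustrative only: the helper `lloyd`, the four-point data set, and both sets of starting centroids are hypothetical choices. It runs the same assign-and-update loop from two different starting points and compares the resulting sum of squared distances.

```python
import numpy as np

def lloyd(points, centroids, max_iter=100):
    """Alternate nearest-centroid assignment and mean update from given starting centroids."""
    centroids = np.array(centroids, dtype=float)
    for _ in range(max_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break                      # centroids stopped moving
        centroids = new_centroids
    total = float(((points - centroids[labels]) ** 2).sum())
    return labels, centroids, total

# Four corners of a wide rectangle (hypothetical toy data).
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# Start A groups bottom vs. top corners and is already stable there.
_, _, sse_bad = lloyd(points, [[2.0, 0.0], [2.0, 1.0]])
# Start B groups left vs. right corners, the better partition.
_, _, sse_good = lloyd(points, [[0.0, 0.5], [4.0, 0.5]])
print(sse_bad, sse_good)  # 16.0 1.0
```

Both runs converge, but the first stays at a partition whose criterion value is sixteen times worse. A common remedy, not covered in these slides, is to run several initializations and keep the result with the smallest criterion value.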

Variations of the K-Means Method

Most variants of k-means differ in

Selection of the initial k means

Dissimilarity calculations

Strategies to calculate cluster means

Handling categorical data: k-modes

Replacing means of clusters with modes

Using new dissimilarity measures to deal with categorical objects

Using a frequency-based method to update modes of clusters

A mixture of categorical and numerical data: k-prototype method
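The k-modes idea can be sketched as follows. The slides do not spell out the dissimilarity measure, so this example assumes the simple matching dissimilarity (the count of attributes on which two records differ) and an attribute-wise most-frequent-value update for the modes. The function names, the toy records, and the choice of initial modes are all hypothetical.

```python
from collections import Counter

def mismatch(a, b):
    """Simple matching dissimilarity: number of attributes where a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def cluster_mode(members):
    """Frequency-based update: most frequent category per attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*members))

def k_modes(records, initial_modes, max_iter=100):
    """Same loop as k-means, with modes in place of means."""
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        # Assign each record to the cluster with the least dissimilar mode.
        new_labels = [min(range(len(modes)), key=lambda i: mismatch(r, modes[i]))
                      for r in records]
        if new_labels == labels:       # stop when the assignment does not change
            break
        labels = new_labels
        # Recompute each cluster's mode from its current members.
        for i in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == i]
            if members:                # assumption: keep the old mode if a cluster empties
                modes[i] = cluster_mode(members)
    return labels, modes

# Hypothetical categorical records: (color, size, shape).
records = [("red", "small", "round"), ("red", "small", "square"),
           ("red", "large", "round"), ("blue", "large", "square"),
           ("blue", "large", "round"), ("blue", "small", "square")]
labels, modes = k_modes(records, [records[0], records[3]])
print(labels, modes)
```

With these records the first three and the last three end up in separate clusters, with modes `("red", "small", "round")` and `("blue", "large", "square")`.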

What Is the Problem of the K-Means Method?

The k-means algorithm is sensitive to outliers!

Since an object with an extremely large value may substantially distort the distribution of the data

K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster
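A small numeric example makes the contrast concrete. The sketch below assumes Euclidean distance and takes the medoid to be the member with the smallest total distance to all other members, which is one common reading of "most centrally located". The helper `medoid` and the one-dimensional data are hypothetical.

```python
import numpy as np

def medoid(points):
    """Return the member with the smallest total distance to all other members."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

# Hypothetical 1-D cluster with a single extreme value.
cluster = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

mean_point = cluster.mean(axis=0)   # 22.0, pulled far toward the outlier
medoid_point = medoid(cluster)      # 3.0, an actual central member
print(mean_point, medoid_point)
```

The mean lands at 22, a value far from every ordinary member, while the medoid stays at 3.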