
Statistical Data Mining

B. D. Ripley

May 2002

© B. D. Ripley 1998-2002. Material from Ripley (1996) is © B. D. Ripley 1996.

Material from Venables and Ripley (1999, 2002) is © Springer-Verlag, New York, 1994-2002.

Introduction

This material is partly based on Ripley (1996), Venables & Ripley (1999, 2002) and the on-line complements available at http://www.stats.ox.ac.uk/pub/MASS4/. My copyright agreements allow me to use the material on courses, but no further distribution is allowed.

The S code in this version of the notes was tested with S-PLUS 6.0 for Unix/Linux and Windows, and S-PLUS 2000 release 3. With minor changes it works with R version 1.5.0.

The specific add-ons for the material in this course are available at ... All the other add-on libraries mentioned are available for Unix and for Windows. Compiled versions for S-PLUS 2000 are available from http://www.stats.ox.ac.uk/pub/SWin/ and for S-PLUS 6.x from ...

Contents

1 Overview of Data Mining
  1.1 Multivariate analysis
  1.2 Graphical methods
  1.3 Cluster analysis
  1.4 Kohonen's self-organizing maps
  1.5 Exploratory projection pursuit
  1.6 An example of visualization
  1.7 Categorical data

2 Tree-based Methods
  2.1 Partitioning methods
  2.2 Implementation in rpart

3 Neural Networks
  3.1 Feed-forward neural networks
  3.2 Multiple logistic regression and discrimination
  3.3 Neural networks in classification
  3.4 A look at support vector machines

4 Near-neighbour Methods
  4.1 Nearest neighbour methods
  4.2 Learning vector quantization
  4.3 Forensic glass

5 Assessing Performance
  5.1 Practical ways of performance assessment
  5.2 Calibration plots
  5.3 Performance summaries and ROC curves
  5.4 Assessing generalization

References

Index


Chapter 1

Overview of Data Mining

Ten years ago data mining was a pejorative phrase amongst statisticians, but the English language evolves and that sense is now encapsulated in the phrase data dredging. In its current sense data mining means finding structure in large-scale databases. It is one of many newly-popular terms for this activity, another being KDD (Knowledge Discovery in Databases), and is a subject at the boundaries of statistics, engineering, machine learning and computer science.

Such phrases are to a large extent fashion, and finding structure in datasets is emphatically not a new activity. What is new is the scale of databases that are becoming available through the computer-based acquisition of data, either through new instrumentation (fMRI machines can collect 100 Mb of images in an hour's session) or through the by-product of computerised accounting records (for example, spotting fraudulent use of credit cards or telephones).

This is a short course in statistical data mining. As such we will not cover the aspects of data mining that are concerned with querying very large databases, although building efficient database interfaces to statistical software is becoming a very important area in statistical computing. Indeed, many of the problems arise with quite modest datasets with a thousand or so examples, but even those were not common a decade or two ago.

We will always need to bear in mind the 'data dredging' aspect of the term. When (literally) mining or dredging, the proportion of good material to dross is usually very low, and when mining for minerals it can often be too low to cover the costs of extraction. Exactly the same issues occur in looking for structure in data: it is all too easy to find structure that is only characteristic of the particular set of data to hand. We want generalization in the terminology of the psychologists, that is to find structure that will help with future examples too.

Looking for structure in data is something statisticians have done for many years: that is what the design of experiments developed in the inter-war years had as its aims. Generally that gave a single outcome ('yield') on a hundred or so experimental points. Multivariate analysis was concerned with multiple (usually more than two and often fewer than twenty) measurements on different subjects. In engineering, very similar (often identical) methods were being developed under the heading of pattern recognition.


Engineers tend to distinguish between

  statistical pattern recognition, where everything is 'learnt from examples', and

  structural pattern recognition, where most of the structure is imposed from a priori knowledge.

This used to be called syntactic pattern recognition, in which the structure was imposed by a formal grammar, but that has proved to be pretty unsuccessful. Note that structure is imposed in statistical pattern recognition via prior assumptions on the difference between signal and noise, but that structure is not deterministic as in structural pattern recognition. It is the inability to cope with exceptions that has bedevilled structural pattern recognition (and much of the research on expert systems).

However, a practically much more important distinction is between

  unsupervised methods, in which there is no known grouping of the examples, and

  supervised methods, in which the examples are known to be grouped in advance, or ordered by some response, and the task is to group future examples or predict which are going to give a 'good' response.

It is important to bear in mind that unsupervised pattern recognition is like looking for needles in haystacks. It covers the formation of good scientific theories and the search for therapeutically useful pharmaceutical compounds. It is best thought of as hypothesis formation, and independent confirmation will be needed.

There are a large number of books now in this area, including Duda et al. (2001); Hand et al. (2001); Hastie et al. (2001); Ripley (1996); Webb (1999); Witten & Frank (2000).

1.1 Multivariate analysis

Multivariate analysis is concerned with datasets which have more than one response variable for each observational or experimental unit. The datasets can be summarized by data matrices X with n rows and p columns, the rows representing the observations or cases, and the columns the variables. The matrix can be viewed either way, depending whether the main interest is in the relationships between the cases or between the variables. Note that for consistency we represent the variables of a case by the row vector x.

The main division in multivariate methods is between those methods which assume a given structure, for example dividing the cases into groups, and those which seek to discover structure from the evidence of the data matrix alone.

One of our examples is the (in)famous iris data collected by Anderson (1935) and given and analysed by Fisher (1936). This has 150 cases, which are stated to be 50 of each of the three species Iris setosa, I. virginica and I. versicolor. Each case has four measurements on the length and width of its petals and sepals. A priori this is a supervised problem, and the obvious questions are to use measurements on a future case to classify it, and perhaps to ask how the variables vary between the species.


(In fact, Fisher (1936) used these data to test a genetic hypothesis which placed I. versicolor as a hybrid two-thirds of the way from I. setosa to I. virginica.) However, the classification of species is uncertain, and similar data have been used to identify species by grouping the cases. (Indeed, Wilson (1982) and McLachlan (1992, §6.9) consider whether the iris data can be split into sub-species.) We end the chapter with a similar example on splitting a species of crab.
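As a concrete illustration of the data-matrix view, here is a minimal R sketch, assuming the built-in iris data frame (whose variable and species names differ slightly from the ir matrix used later in these notes):

# Illustrative sketch: the iris data as an n x p data matrix,
# assuming R's built-in `iris` data frame.
data(iris)
X <- as.matrix(iris[, 1:4])   # n = 150 rows (cases), p = 4 columns (variables)
dim(X)                        # 150 cases by 4 measurements
table(iris$Species)           # 50 cases of each of the three species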

1.2 Graphical methods

The simplest way to examine multivariate data is via a pairs plot, enhanced to show the groups. More dynamic versions are available in XGobi, GGobi and S-PLUS's brush.

Figure 1.1: S-PLUS brush plot of the iris data.
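A static pairs plot along these lines can be produced as follows; this is a minimal R sketch, assuming the built-in iris data frame rather than the ir matrix, and it only approximates the brush display of Figure 1.1:

# Illustrative sketch: pairs plot of the four iris measurements,
# with colour and plotting symbol showing the species group.
data(iris)
pairs(iris[, 1:4],
      col = as.integer(iris$Species),   # one colour per species
      pch = as.integer(iris$Species))   # one symbol per species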

Principal component analysis

Linear methods are the heart of classical multivariate analysis, and depend on seeking linear combinations of the variables with desirable properties. For the unsupervised case the main method is principal component analysis, which seeks linear combinations of the columns of X with maximal (or minimal) variance. Because the variance can be scaled by rescaling the combination, we constrain the combinations to have unit length.


Let $\Sigma$ denote the covariance matrix of the data $X$, which is defined¹ by

$$ n\Sigma = (X - n^{-1}\mathbf{1}\mathbf{1}^T X)^T (X - n^{-1}\mathbf{1}\mathbf{1}^T X) = X^T X - n\,\bar{x}^T \bar{x} $$

where $\bar{x} = \mathbf{1}^T X / n$ is the row vector of means of the variables. Then the sample variance of a linear combination $xa$ of a row vector $x$ is $a^T \Sigma a$, and this is to be maximized (or minimized) subject to $\|a\|^2 = a^T a = 1$. Since $\Sigma$ is a non-negative definite matrix, it has an eigendecomposition

$$ \Sigma = C^T \Lambda C $$

where $\Lambda$ is a diagonal matrix of (non-negative) eigenvalues in decreasing order. Let $b = Ca$, which has the same length as $a$ (since $C$ is orthogonal). The problem is then equivalent to maximizing $b^T \Lambda b = \sum_i \lambda_i b_i^2$ subject to $\sum_i b_i^2 = 1$. Clearly the variance is maximized by taking $b$ to be the first unit vector, or equivalently taking $a$ to be the column eigenvector corresponding to the largest eigenvalue of $\Sigma$. Taking subsequent eigenvectors gives combinations with as large as possible variance which are uncorrelated with those which have been taken earlier. The $i$th principal component is then the $i$th linear combination picked by this procedure. (It is only determined up to a change of sign; you may get different signs in different implementations of S.)
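This derivation can be checked numerically: the loadings returned by princomp are the eigenvectors of the covariance matrix, and the component variances are its eigenvalues. A minimal R sketch, assuming the built-in iris measurements as the data matrix X (and bearing in mind the n versus n-1 divisor noted in the footnote below):

# Illustrative sketch: principal components as eigenvectors of the
# covariance matrix, assuming X is the 150 x 4 iris measurement matrix.
data(iris)
X  <- as.matrix(iris[, 1:4])
S  <- cov(X)                    # covariance matrix (divisor n - 1 here)
e  <- eigen(S)                  # eigenvalues decreasing; eigenvectors in columns
pc <- princomp(X)               # princomp uses divisor n (via cov.wt)
# loadings agree with the eigenvectors up to sign:
round(abs(unclass(pc$loadings)) - abs(e$vectors), 6)
# component variances are the eigenvalues, up to the n versus n-1 divisor:
pc$sdev^2 * nrow(X) / (nrow(X) - 1) - e$values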

Another point of view is to seek new variables $y_j$ which are rotations of the old variables to explain best the variation in the dataset. Clearly these new variables should be taken to be the principal components, in order. Suppose we use the first $k$ principal components. Then the subspace they span contains the 'best' $k$-dimensional view of the data. It both has maximal covariance matrix (both in trace and determinant) and best approximates the original points in the sense of minimizing the sum of squared distance from the points to their projections. The first few principal components are often useful to reveal structure in the data. The principal components corresponding to the smallest eigenvalues are the most nearly constant combinations of the variables, and can also be of interest.

Note that the principal components depend on the scaling of the original variables, and this will be undesirable except perhaps if (as in the iris data) they are in comparable units. (Even in this case, correlations would often be used.) Otherwise it is conventional to take the principal components of the correlation matrix, implicitly rescaling all the variables to have unit sample variance.

The function princomp computes principal components. The argument cor controls whether the covariance or correlation matrix is used (via re-scaling the variables).

> ir.pca <- princomp(log(ir), cor=T)
> ir.pca
Standard deviations:
 Comp.1  Comp.2  Comp.3  Comp.4
 1.7125 0.95238  0.3647 0.16568

¹ A divisor of n-1 is more conventional, but princomp calls cov.wt, which uses n.
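The figure below shows the cases plotted against the first two principal components, labelled by species. A minimal R sketch of how such a plot can be produced, assuming the built-in iris data frame in place of the ir matrix and assuming single-letter species codes matching the figure:

# Illustrative sketch: scores on the first two principal components,
# labelled by species (assumed codes: s = I. setosa, c = I. versicolor,
# v = I. virginica).
data(iris)
ir.pca <- princomp(log(iris[, 1:4]), cor = TRUE)
scores <- predict(ir.pca)            # principal component scores for each case
plot(scores[, 1:2], type = "n",
     xlab = "first principal component",
     ylab = "second principal component")
text(scores[, 1:2], labels = c("s", "c", "v")[as.integer(iris$Species)])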


[Figure: the iris data plotted on the first two principal components (x-axis: first principal component, y-axis: second principal component), with each case labelled by a single-letter species code (s, c, v).]
[PDF] Cours IFT6266, Exemple d'application: Data-Mining

[PDF] Introduction au Data Mining - Cedric/CNAM

[PDF] Defining a Data Model - CA Support

[PDF] Learning Data Modelling by Example - Database Answers

[PDF] Nouveaux prix à partir du 1er août 2017 Mobilus Mobilus - Proximus

[PDF] règlement général de la consultation - Inventons la Métropole du

[PDF] Data science : fondamentaux et études de cas

[PDF] Bases du data scientist - Data science Master 2 ISIDIS - LISIC

[PDF] R Programming for Data Science - Computer Science Department

[PDF] Sashelp Data Sets - SAS Support

[PDF] Introduction au domaine du décisionnel et aux data warehouses

[PDF] DESIGNING AND IMPLEMENTING A DATA WAREHOUSE 1

[PDF] Datawarehouse

[PDF] Definition • a database is an organized collection of - Dal Libraries

[PDF] DBMS tutorials pdf