




10/20/2022

Data Mining Input: Concepts, Instances, Attributes ...and Pre-Processing

Chapter 2 of Data Mining

Terminology

Components of the input:

Concepts: kinds of things that can be learned

Goal: intelligible and operational concept description

E.g.: "Under what conditions should we play?"

This concept is located somewhere in the input data

Instances: the individual, independent examples of a concept

Note: more complicated forms of input are possible

Attributes: measuring characteristics/aspects of an instance

We will focus on nominal and numeric attributes

What is a concept?

Styles of learning:

Classification learning:

understanding/predicting a discrete class

Association learning:

detecting associations between features

Clustering:

grouping similar instances into clusters

Numeric estimation:

understanding/predicting a numeric quantity

Concept: thing to be learned

Concept description:

output of learning scheme

Classification learning

Example problems: weather data, medical diagnosis, contact lenses, irises, labor negotiations, etc.

Can you think of others?

Classification learning is supervised

Algorithm is provided with actual outcomes

Outcome is called the class attribute of the example

Measure success on fresh data for which class labels are known (test data, as opposed to training data)

In practice success is often measured subjectively

How acceptable the learned description is to a human user

Association learning

Can be applied if no class is specified and any kind of structure is considered "interesting"

Difference from classification learning:

Unsupervised

I.e., not told what to learn

Can predict any attribute's value, not just the class, and more than one attribute's value at a time

Hence: far more association rules than classification rules

Thus: constraints are necessary

Minimum coverage and minimum accuracy
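The two constraints can be made concrete with a small sketch. It assumes one common reading of the terms: coverage is the number of instances a rule predicts correctly, and accuracy is that number as a fraction of the instances the rule applies to. The toy instances and the function names are hypothetical and only serve the illustration.

```python
# Toy instances (hypothetical, for illustration only).
instances = [
    {"outlook": "sunny", "windy": "false", "play": "no"},
    {"outlook": "sunny", "windy": "true", "play": "no"},
    {"outlook": "overcast", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "false", "play": "yes"},
    {"outlook": "rainy", "windy": "true", "play": "no"},
]


def matches(instance, conditions):
    """True if the instance satisfies every attribute = value condition."""
    return all(instance.get(attr) == value for attr, value in conditions.items())


def coverage_and_accuracy(instances, antecedent, consequent):
    """Return (coverage, accuracy) for the rule antecedent -> consequent."""
    applies = [inst for inst in instances if matches(inst, antecedent)]
    correct = [inst for inst in applies if matches(inst, consequent)]
    coverage = len(correct)
    accuracy = len(correct) / len(applies) if applies else 0.0
    return coverage, accuracy


def keep_rule(instances, antecedent, consequent, min_coverage, min_accuracy):
    """Apply the minimum-coverage and minimum-accuracy constraints."""
    coverage, accuracy = coverage_and_accuracy(instances, antecedent, consequent)
    return coverage >= min_coverage and accuracy >= min_accuracy


# Rule: if windy = false then play = yes
cov, acc = coverage_and_accuracy(instances, {"windy": "false"}, {"play": "yes"})
```

Because any attribute can appear on the right-hand side, the number of candidate rules grows quickly; thresholds such as `min_coverage` and `min_accuracy` are what keep the rule set manageable.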

Clustering

Finding groups of items that are similar

Clustering is unsupervised

The class of an example is not known

Success often measured subjectively

       Sepal length   Sepal width   Petal length   Petal width   Type
1      5.1            3.5           1.4            0.2           Iris setosa
2      4.9            3.0           1.4            0.2           Iris setosa
51     7.0            3.2           4.7            1.4           Iris versicolor
52     6.4            3.2           4.5            1.5           Iris versicolor
101    6.3            3.3           6.0            2.5           Iris virginica
102    5.8            2.7           5.1            1.9           Iris virginica

Numeric estimation

Variant of classification learning where the output attribute is numeric (also called "regression")

Learning is supervised

Algorithm is provided with target values

Measure success on test data

Outlook    Temperature   Humidity   Windy   Play-time
Sunny      Hot           High       False   5
Sunny      Hot           High       True    0
Overcast   Hot           High       False   55
Rainy      Mild          Normal     False   40

Some input terminology

... or instance.

- the input attributes

- everything else

• Example:

[Figure: input attributes (fever, swollen glands, headache, ...) -> model (rules or tree or ...) -> output attribute (diagnosis)]

What's in an example?

Instance: specific type of example

Thing to be classified, associated, or clustered

Individual, independent example of target concept

Characterized by a predetermined set of attributes

Input to learning scheme: set of independent instances (the dataset)

Represented as a single relation/flat file

Note difference from relational database

Rather restricted form of input

No relationships between objects/instances

Most common form in practical data mining

Example: A family tree

Peter (M) = Peggy (F)
    children: Steven (M), Graham (M), Pam (F)

Grace (F) = Ray (M)
    children: Ian (M), Pippa (F), Brian (M)

Pam (F) = Ian (M)
    children: Anna (F), Nikki (F)

Family tree represented as a table

Name     Gender   Parent1   Parent2
Peter    Male     ?         ?
Peggy    Female   ?         ?
Steven   Male     Peter     Peggy
Graham   Male     Peter     Peggy
Pam      Female   Peter     Peggy
Ian      Male     Grace     Ray
Pippa    Female   Grace     Ray
Brian    Male     Grace     Ray
Anna     Female   Pam       Ian
Nikki    Female   Pam       Ian

The "sister-of" relation: Two versions

Version 1 (every pair listed):

First person   Second person   Sister of?
Peter          Peggy           No
Peter          Steven          No
Steven         Peter           No
Steven         Graham          No
Steven         Pam             Yes
Ian            Pippa           Yes
Anna           Nikki           Yes
Nikki          Anna            Yes

Version 2 (only the positive pairs listed):

First person   Second person   Sister of?
Steven         Pam             Yes
Graham         Pam             Yes
Ian            Pippa           Yes
Brian          Pippa           Yes
Anna           Nikki           Yes
Nikki          Anna            Yes
All the rest                   No

Closed-world assumption

A full representation in one flat file table

First person                         Second person                        Sister of?
Name     Gender   Parent1   Parent2   Name     Gender   Parent1   Parent2
Steven   Male     Peter     Peggy     Pam      Female   Peter     Peggy     Yes
Graham   Male     Peter     Peggy     Pam      Female   Peter     Peggy     Yes
Ian      Male     Grace     Ray       Pippa    Female   Grace     Ray       Yes
Brian    Male     Grace     Ray       Pippa    Female   Grace     Ray       Yes
Anna     Female   Pam       Ian       Nikki    Female   Pam       Ian       Yes
Nikki    Female   Pam       Ian       Anna     Female   Pam       Ian       Yes
All the rest                                                                No

If second person's gender = female
and first person's parent1 = second person's parent1
then sister-of = yes
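As a rough sketch of how the flat file and the rule fit together, the person table can be joined with itself and the rule applied to each joined row. The tuple layout, the use of `None` for the unknown parents marked "?", and the function names are assumptions made for this illustration; the slides do not prescribe an implementation.

```python
from itertools import permutations

# (name, gender, parent1, parent2); None stands for an unknown parent ("?").
PEOPLE = [
    ("Peter", "Male", None, None),
    ("Peggy", "Female", None, None),
    ("Steven", "Male", "Peter", "Peggy"),
    ("Graham", "Male", "Peter", "Peggy"),
    ("Pam", "Female", "Peter", "Peggy"),
    ("Ian", "Male", "Grace", "Ray"),
    ("Pippa", "Female", "Grace", "Ray"),
    ("Brian", "Male", "Grace", "Ray"),
    ("Anna", "Female", "Pam", "Ian"),
    ("Nikki", "Female", "Pam", "Ian"),
]


def denormalize(people):
    """Self-join: one flat row per ordered pair of distinct people."""
    return [first + second for first, second in permutations(people, 2)]


def sister_of(row):
    """The rule from the slide, applied to one flat row of eight fields."""
    _, _, first_parent1, _, _, second_gender, second_parent1, _ = row
    return (
        second_gender == "Female"
        and first_parent1 is not None  # unknown parents must not count as equal
        and first_parent1 == second_parent1
    )


flat = denormalize(PEOPLE)
sisters = sorted((row[0], row[4]) for row in flat if sister_of(row))
```

Two details are worth noting. Pairing only distinct people keeps a person from being reported as her own sister, and treating an unknown parent as "no match" keeps Peggy from being reported as Peter's sister merely because both have "?" recorded. Every pair the rule does not mark is taken to be "No", which is the closed-world assumption.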

Generating a flat file

Process of flattening is called "denormalization"

Several relations are joined together to make one

Possible with any finite set of finite relations

More on this in CSC-341

Denormalization may produce spurious regularities that reflect structure of database

Examples (functional dependencies):

"supplier" predicts "supplier address"

"cournum" predicts "courname"

"model" predicts "make"

May also cause data inconsistencies and redundancies

From same data stored in different tables
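One way to spot such spurious regularities before mining is to test whether one column functionally determines another in the flattened table. The sketch below is an illustration under assumptions: the rows and the course names in them are invented, and `determines` is a hypothetical helper, not part of any library.

```python
def determines(rows, lhs, rhs):
    """True if every value of column lhs maps to exactly one value of column rhs."""
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False  # same lhs value, different rhs value
        seen[key] = row[rhs]
    return True


# Hypothetical denormalized enrollment rows.
rows = [
    {"student": "A", "cournum": "CSC-101", "courname": "Intro"},
    {"student": "B", "cournum": "CSC-101", "courname": "Intro"},
    {"student": "A", "cournum": "CSC-341", "courname": "Databases"},
]

cournum_predicts_courname = determines(rows, "cournum", "courname")
student_predicts_cournum = determines(rows, "student", "cournum")
```

A learner run on this table would readily "discover" that `cournum` predicts `courname`, which says something about the database design and nothing about the students.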

Multi-instance concepts

Individual instances are not independent

One or more instances within an example may be responsible for its classification

Examples:

multi-day game activity (the weather data)

performance of a student over multiple classes

What's in an attribute?

Each instance is described by a fixed predefined set of features, its "attributes"

But: number of relevant attributes may vary

Example: table of baseball statistics

Possible solution: "irrelevant value" flag

Related problem: value of an attribute may depend on value of another one

Potential impact on learning beyond prior discussion

Possible solution: methods of data reduction

Possible attribute types ("levels of measurement"):

Nominal, ordinal, interval and ratio

Simplifies to nominal and numeric

Types of attributes

Nominal attributes

- there is a small set of possible values

attribute    possible values
Fever        {Yes, No}
Diagnosis    {Allergy, Cold, Strep Throat}
Outlook      {sunny, overcast, raining}

• No ordering or distance measure

• Can only test for equality

Numeric attributes have values that come from a range of numbers

attribute    possible values
Body Temp    any value in 96.0-106.0
Salary       any value in $15,000-250,000

$210,000 > $125,000

98.6 < 101.3

Types of attributes

• What about this one?

attribute      possible values
Product Type   {0, 1, 2, 3}

... such attributes.

- example: product type 2 > product type 1 doesn't have any meaning

- hot > mild > cool

- young ...

Ordinal quantities

Impose order on values

But no distance between values defined

Example:

attribute "temperature" in weather data

Values: "hot" > "mild" > "cool"

Note: addition and subtraction don't make sense

Example rule:

temperature < hot => play = yes

Distinction between nominal and ordinal not always clear (e.g. attribute "outlook" - is there an ordering?)
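A minimal sketch of how an ordinal attribute might be handled in code: the values get a rank so that comparisons work, while arithmetic on the ranks is deliberately not offered. The names `TEMPERATURE_ORDER`, `less_than` and `example_rule` are hypothetical.

```python
# Order of the ordinal values, lowest first (from the slide: hot > mild > cool).
TEMPERATURE_ORDER = ["cool", "mild", "hot"]
RANK = {value: position for position, value in enumerate(TEMPERATURE_ORDER)}


def less_than(a, b):
    """Ordinal comparison: defined. Differences such as hot - mild: not defined."""
    return RANK[a] < RANK[b]


def example_rule(temperature):
    """temperature < hot => play = yes; the rule says nothing otherwise."""
    return "yes" if less_than(temperature, "hot") else None


predictions = {t: example_rule(t) for t in TEMPERATURE_ORDER}
```

The ranks 0, 1, 2 carry order only. Treating them as numbers would imply that the step from cool to mild equals the step from mild to hot, which an ordinal scale does not claim.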

Nominal vs. ordinal

Attribute "age" nominal

If age = young and astigmatic = no
and tear production rate = normal then recommendation = soft

If age = pre-presbyopic and astigmatic = no
and tear production rate = normal then recommendation = soft

Attribute "age" ordinal

(e.g. "young" < "pre-presbyopic" < "presbyopic")

If age <= pre-presbyopic and astigmatic = no
and tear production rate = normal then recommendation = soft

Attribute types used in practice

Most schemes accommodate just two levels of measurement: nominal and numeric, by which we typically only mean ordinal

Nominal attributes are also called "categorical", "enumerated", or "discrete"

Ordinal attributes are also called "numeric", or "continuous"

Preparing the input

... data cleaning, and reduction comprise the majority of the work of building a dataset for effective data mining

Preparing the input

Extraction: acquiring the data (more later)

Integration

Denormalization is not the only challenge

Problem: different data sources (e.g. sales department, customer billing department, ...)

Differences: styles of record keeping, conventions, time periods, primary keys, errors

External data may be required ("overlay data")

Transformation: reformat for specific data mining algorithms (we'll come back to this)

Many potential dataset problems requiring "cleaning"

Missing values

Frequently indicated by out-of-range entries

E.g. -999, "?"

Types: unknown, unrecorded, irrelevant

Reasons:

malfunctioning equipment

changes in experimental design (e.g., new survey questions)

collation of different datasets

measurement not possible

user refusal to answer survey question

Missing value may have significance in itself (e.g. missing test in a medical examination)

Most schemes assume that is not the case: "missing" may need to be coded as additional value
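A short sketch of the first cleaning step this implies: turning sentinel entries into an explicit missing marker, and optionally coding "missing" as a value of its own when its absence is informative. The marker set, the sample rows and the function names are assumptions for illustration; which sentinels a real dataset uses has to be checked against its documentation.

```python
# Entries assumed to mean "missing" in this hypothetical dataset.
MISSING_MARKERS = {"-999", "?", ""}


def clean_value(raw):
    """Return None for a sentinel entry, otherwise the stripped text."""
    text = str(raw).strip()
    return None if text in MISSING_MARKERS else text


def clean_rows(rows):
    """Apply clean_value to every field of every row."""
    return [{name: clean_value(value) for name, value in row.items()} for row in rows]


def code_missing_as_value(value, label="missing"):
    """Treat 'missing' as an additional nominal value when it may be significant."""
    return label if value is None else value


rows = [
    {"age": "34", "test_result": "negative"},
    {"age": "-999", "test_result": "?"},
]
cleaned = clean_rows(rows)
coded = [code_missing_as_value(row["test_result"]) for row in cleaned]
```

Leaving -999 in a numeric column would silently distort means and ranges, so converting sentinels early is usually the safer order of operations.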

Inaccurate values

Reason: data has not been collected for the purpose of mining

Result: errors and omissions that don't affect original purpose of data but are critical to mining

E.g. data on hobbies of university students and faculty

Typographical errors in nominal attributes: values need to be checked for consistency

Typographical, measurement, rounding errors in numeric attributes: outliers need to be identified

What facility of Weka did we learn in lab that might be useful here?

Errors may be deliberate

E.g. wrong zip codes

Unbalanced data

• Suppose the diagnosis dataset had 97 instances of allergy, 2 of cold, and 1 of strep

- Consequences?

• Another lesson about raw accuracy percentages not telling the whole story

... evaluation

... anything interesting about the data
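The consequence can be worked out directly from the 97/2/1 split. The sketch below builds a "classifier" that always predicts the most frequent class and measures it two ways; the helper names are hypothetical.

```python
from collections import Counter

# Class labels with the split from the slide: 97 allergy, 2 cold, 1 strep.
labels = ["allergy"] * 97 + ["cold"] * 2 + ["strep"] * 1


def majority_baseline(labels):
    """Return the most frequent class and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def recall(actual, predicted, target):
    """Fraction of the instances of class target that were predicted as target."""
    relevant = [p for a, p in zip(actual, predicted) if a == target]
    return sum(p == target for p in relevant) / len(relevant)


majority, accuracy = majority_baseline(labels)
predictions = [majority] * len(labels)
cold_recall = recall(labels, predictions, "cold")
```

Always answering "allergy" scores 97% accuracy while finding none of the cold or strep cases, which is why the raw percentage alone says little on data this unbalanced.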

Other problems

Duplicate/redundant data

Instances

Outliers

Stale data

Different formats

2022-09-13 vs. Sep. 13, 2022
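For the date example, one possible approach is to try a list of known formats and normalize everything to a single representation. This is a sketch under assumptions: the two format strings cover only the two spellings shown, `to_iso` is a hypothetical helper, and `%b` matches English month abbreviations only under an English or C locale.

```python
from datetime import datetime

# Formats assumed to occur in the data: 2022-09-13 and Sep. 13, 2022.
KNOWN_FORMATS = ("%Y-%m-%d", "%b. %d, %Y")


def to_iso(text):
    """Parse a date in any known format and return it as YYYY-MM-DD."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized date format: {text!r}")


normalized = [to_iso("2022-09-13"), to_iso("Sep. 13, 2022")]
```

Raising on an unrecognized format, rather than guessing, keeps ambiguous entries such as 03/04/2022 visible so they can be resolved by someone who knows the source.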

Noise

• Noisy data is meaningless data

- Not useful for prediction

• The term has often been used as a synonym for corrupt data

• Its meaning has expanded to include any data that cannot be understood and interpreted correctly by machines - unstructured text for example

• Distinguishing signal from noise is the task at the heart of data mining

Data cleaning

• Also called pre-processing, or

• Data wrangling (sometimes)

Getting to know the data

Simple visualization tools are very useful

Nominal attributes: histograms

Q: Is the distribution consistent with background knowledge?

Build hypotheses about which attributes to study closely

Numeric attributes: graphs

Q: Any obvious outliers?

2-D and 3-D plots show dependencies

Need to consult domain experts

Too much data to inspect? Take a sample!

More complex data viz tools represent an entire subdiscipline of Computer Science

The ARFF format

% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
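To show how the pieces of the file relate, here is a deliberately minimal reader for files shaped like the example above. It is a sketch, not a full ARFF parser: it assumes unquoted attribute names, comma-separated dense data rows, and only nominal and numeric attributes. The function name `parse_arff` is hypothetical.

```python
ARFF_TEXT = """\
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""


def parse_arff(text):
    """Return (relation, attributes, rows) for a simple dense ARFF file."""
    relation, attributes, rows, in_data = None, [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = {}
            for (name, kind), value in zip(attributes, values):
                if value == "?":  # ARFF marks a missing value with "?"
                    row[name] = None
                elif kind == "numeric":
                    row[name] = float(value)
                else:
                    row[name] = value
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):  # nominal: list of allowed values
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
```

The header declares each attribute's type once, so the data section can stay a bare list of values; the reader uses the declared type to decide whether a field becomes a number or stays a nominal label.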

Additional attribute types

ARFF supports string attributes:

Values are not pre-specified

It also supports date attributes:

Uses the ISO-8601 combined date and time format yyyy-MM-dd'T'HH:mm:ss

@attribute description string
@attribute today date

Sparse data

In some applications most attribute values in a dataset are zero

word counts in a text categorization problem

product counts in market basket analysis

ARFF supports sparse data

0, 26, 0, 0, 0 ,0, 63, 0, 0, 0, "class A"

0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"
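In ARFF's sparse notation, each instance is written in braces as index-value pairs for the non-zero attributes only, with attribute indices starting at 0. The sketch below converts the two dense rows above; `to_sparse` is a hypothetical helper and quotes every string value, which is a simplification.

```python
def to_sparse(values):
    """Render one dense instance in ARFF sparse notation (0-based indices)."""
    parts = []
    for index, value in enumerate(values):
        if value == 0:
            continue  # zeros are implied and therefore omitted
        text = f'"{value}"' if isinstance(value, str) else str(value)
        parts.append(f"{index} {text}")
    return "{" + ", ".join(parts) + "}"


row_a = to_sparse([0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"])
row_b = to_sparse([0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"])
# row_a is {1 26, 6 63, 10 "class A"}
# row_b is {3 42, 10 "class B"}
```

Only zeros are left out. A missing value is a different thing and still has to be written explicitly as "?".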
