
10/20/2022

1

Data Mining Input: Concepts, Instances, Attributes ...and Pre-Processing

Chapter 2 of Data Mining

Terminology

Components of the input:

Concepts: kinds of things that can be learned

Goal: intelligible and operational concept description

E.g.: "Under what conditions should we play?"

This concept is located somewhere in the input data

Instances: the individual, independent examples of a concept

Note: more complicated forms of input are possible

Attributes: measuring characteristics/aspects of an instance

We will focus on nominal and numeric attributes


What is a concept?

Styles of learning:

Classification learning:

understanding/predicting a discrete class

Association learning:

detecting associations between features

Clustering:

grouping similar instances into clusters

Numeric estimation:

understanding/predicting a numeric quantity

Concept: thing to be learned

Concept description:

output of learning scheme

Classification learning

Example problems: weather data, medical diagnosis, contact lenses, irises, labor negotiations, etc.

Can you think of others?

Classification learning is supervised

Algorithm is provided with actual outcomes

Outcome is called the class attribute of the example

Measure success on fresh data for which class labels are known (test data, as opposed to training data)

In practice success is often measured subjectively

How acceptable the learned description is to a human user


Association learning

Can be applied if no class is specified and any kind of structure is considered "interesting"

Difference from classification learning:

Unsupervised

I.e., not told what to learn

Can predict any attribute's value, not just the class, and more than one attribute's value at a time

Hence: far more association rules than classification rules

Thus: constraints are necessary

Minimum coverage and minimum accuracy
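
The two constraints can be made concrete with a small sketch. It assumes the usual definitions (coverage: the number of instances a rule predicts correctly; accuracy: that number as a proportion of the instances the rule applies to) and reuses the three weather instances from the ARFF example in these notes. The function name `coverage_and_accuracy` is illustrative only.

```python
# Three weather instances (taken from the ARFF example in these notes).
instances = [
    {"outlook": "sunny", "temperature": 85, "humidity": 85, "windy": False, "play": "no"},
    {"outlook": "sunny", "temperature": 80, "humidity": 90, "windy": True, "play": "no"},
    {"outlook": "overcast", "temperature": 83, "humidity": 86, "windy": False, "play": "yes"},
]


def coverage_and_accuracy(instances, antecedent, consequent):
    """Return (coverage, accuracy) of the rule 'if antecedent then consequent'.

    Both arguments are dicts mapping attribute names to required values.
    """
    # Instances the rule applies to (antecedent holds).
    applies = [x for x in instances
               if all(x[a] == v for a, v in antecedent.items())]
    # Of those, the instances the rule predicts correctly.
    correct = [x for x in applies
               if all(x[a] == v for a, v in consequent.items())]
    accuracy = len(correct) / len(applies) if applies else 0.0
    return len(correct), accuracy


# Rule: if outlook = sunny then play = no
cov, acc = coverage_and_accuracy(instances, {"outlook": "sunny"}, {"play": "no"})
print(cov, acc)
```

A rule miner would keep only rules whose coverage and accuracy both exceed the chosen minimums.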

Clustering


Finding groups of items that are similar

Clustering is unsupervised

The class of an example is not known

Success often measured subjectively

      Sepal length   Sepal width   Petal length   Petal width   Type
1     5.1            3.5           1.4            0.2           Iris setosa
2     4.9            3.0           1.4            0.2           Iris setosa
51    7.0            3.2           4.7            1.4           Iris versicolor
52    6.4            3.2           4.5            1.5           Iris versicolor
101   6.3            3.3           6.0            2.5           Iris virginica
102   5.8            2.7           5.1            1.9           Iris virginica
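
As a rough illustration of grouping similar instances without using the class, the sketch below runs plain k-means on the petal length and petal width of the six iris rows. K-means is one possible clustering method, not one the slides prescribe; the choice of k = 2 and of the first and last rows as starting centroids are assumptions made only to keep the example deterministic.

```python
import math

# (petal length, petal width) for the six iris rows in the table above.
points = [(1.4, 0.2), (1.4, 0.2), (4.7, 1.4), (4.5, 1.5), (6.0, 2.5), (5.1, 1.9)]


def kmeans(points, centroids, max_iter=100):
    """Plain k-means: returns (labels, centroids) once assignments stop changing."""
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids


# Assumption: k = 2, seeded with the first and last rows.
labels, centroids = kmeans(points, [points[0], points[-1]])
print(labels)
```

With these seeds the two setosa rows end up in one cluster and the remaining four rows in the other; whether that grouping is useful is the kind of judgment the slide calls subjective.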


Numeric estimation

Variant of classification learning where the output attribute is numeric (also called "regression")

Learning is supervised

Algorithm is provided with target values

Measure success on test data

Outlook    Temperature   Humidity   Windy   Play-time
Sunny      Hot           High       False   5
Sunny      Hot           High       True    0
Overcast   Hot           High       False   55
Rainy      Mild          Normal     False   40

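Some input terminology

Each row of the data is an example or instance

The output attribute is the one the model predicts; the input attributes are everything else

• Example: input attributes (fever, swollen glands, headache, ...) -> model (rules or tree or ...) -> output attribute (diagnosis)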


What's in an example?

Instance: specific type of example

Thing to be classified, associated, or clustered

Individual, independent example of target concept

Characterized by a predetermined set of attributes

Input to learning scheme: set of independent instances (dataset)

Represented as a single relation/flat file

Note difference from relational database

Rather restricted form of input

No relationships between objects/instances

Most common form in practical data mining

Example: A family tree

Peter (M) = Peggy (F): children Steven (M), Graham (M), Pam (F)

Grace (F) = Ray (M): children Ian (M), Pippa (F), Brian (M)

Pam (F) = Ian (M): children Anna (F), Nikki (F)


Family tree represented as a table

Name     Gender   Parent1   Parent2
Peter    Male     ?         ?
Peggy    Female   ?         ?
Steven   Male     Peter     Peggy
Graham   Male     Peter     Peggy
Pam      Female   Peter     Peggy
Ian      Male     Grace     Ray
Pippa    Female   Grace     Ray
Brian    Male     Grace     Ray
Anna     Female   Pam       Ian
Nikki    Female   Pam       Ian

The "sister-of" relation: two versions

Version 1: pairs listed explicitly

First person   Second person   Sister of?
Peter          Peggy           No
Peter          Steven          No
Steven         Peter           No
Steven         Graham          No
Steven         Pam             Yes
Ian            Pippa           Yes
Anna           Nikki           Yes
Nikki          Anna            Yes

Version 2: closed-world assumption

First person   Second person   Sister of?
Steven         Pam             Yes
Graham         Pam             Yes
Ian            Pippa           Yes
Brian          Pippa           Yes
Anna           Nikki           Yes
Nikki          Anna            Yes
All the rest                   No


A full representation in one flat file table

First person                           Second person                          Sister of?
Name     Gender   Parent1   Parent2    Name     Gender   Parent1   Parent2
Steven   Male     Peter     Peggy      Pam      Female   Peter     Peggy      Yes
Graham   Male     Peter     Peggy      Pam      Female   Peter     Peggy      Yes
Ian      Male     Grace     Ray        Pippa    Female   Grace     Ray        Yes
Brian    Male     Grace     Ray        Pippa    Female   Grace     Ray        Yes
Anna     Female   Pam       Ian        Nikki    Female   Pam       Ian        Yes
Nikki    Female   Pam       Ian        Anna     Female   Pam       Ian        Yes
All the rest                                                                  No

If second person's gender = female and first person's parent1 = second person's parent1 then sister-of = yes
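
The rule can be checked against the family table with a short sketch. The dictionary layout and the function name `sister_of` are illustrative; two guards not stated in the slide's rule are added as assumptions (a person is not her own sister, and unknown parents never match).

```python
# Family table from the notes: name -> (gender, parent1, parent2).
# None stands for the unknown parents shown as "?".
people = {
    "Peter": ("Male", None, None),
    "Peggy": ("Female", None, None),
    "Steven": ("Male", "Peter", "Peggy"),
    "Graham": ("Male", "Peter", "Peggy"),
    "Pam": ("Female", "Peter", "Peggy"),
    "Ian": ("Male", "Grace", "Ray"),
    "Pippa": ("Female", "Grace", "Ray"),
    "Brian": ("Male", "Grace", "Ray"),
    "Anna": ("Female", "Pam", "Ian"),
    "Nikki": ("Female", "Pam", "Ian"),
}


def sister_of(first, second):
    """True if `second` is a sister of `first` according to the rule."""
    _, first_parent1, _ = people[first]
    second_gender, second_parent1, _ = people[second]
    return (
        first != second                  # added guard: nobody is their own sister
        and second_gender == "Female"
        and first_parent1 is not None    # added guard: unknown parents never match
        and first_parent1 == second_parent1
    )


# Closed-world assumption: every pair not produced here counts as "no".
yes_pairs = [(a, b) for a in people for b in people if sister_of(a, b)]
for pair in yes_pairs:
    print(pair)
```

The six pairs produced are the six "Yes" rows of the table; all other pairs are "No" under the closed-world assumption.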

Generating a flat file

Process of flattening is called "denormalization"

Several relations are joined together to make one

Possible with any finite set of finite relations

More on this in CSC-341

Denormalization may produce spurious regularities that reflect structure of database

Examples (functional dependencies):

"supplier" predicts "supplier address"

"cournum" predicts "courname"

"model" predicts "make"

May also cause data inconsistencies and redundancies

From same data stored in different tables
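
A minimal sketch of denormalization and of the spurious regularity it creates. The two relations, their column names and their values are hypothetical, invented only for this illustration; the helper `determines` is likewise an assumption, not part of any library.

```python
# Hypothetical normalized relations (names and values are made up for illustration).
suppliers = {
    "S1": {"supplier": "Acme", "supplier_address": "1 Main St"},
    "S2": {"supplier": "Bolt", "supplier_address": "9 Oak Ave"},
}
orders = [
    {"order_id": 1, "supplier_id": "S1", "quantity": 10},
    {"order_id": 2, "supplier_id": "S1", "quantity": 4},
    {"order_id": 3, "supplier_id": "S2", "quantity": 7},
]

# Denormalize: join every order with its supplier row to get one flat table.
flat = [{**order, **suppliers[order["supplier_id"]]} for order in orders]


def determines(rows, a, b):
    """True if attribute a functionally determines attribute b in rows."""
    seen = {}
    for row in rows:
        # Remember the first b seen for each a; any later mismatch breaks the dependency.
        if seen.setdefault(row[a], row[b]) != row[b]:
            return False
    return True


print(determines(flat, "supplier", "supplier_address"))
print(determines(flat, "supplier", "quantity"))
```

A learner run on `flat` would "discover" that supplier predicts supplier address with perfect accuracy, which only reflects how the tables were joined. The address "1 Main St" is also stored twice, which is the redundancy the slide mentions.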


Multi-instance concepts

Individual instances are not independent

One or more instances within an example may be responsible for its classification

Examples

Multi-day game activity (the weather data)

Performance of a student over multiple classes

What's in an attribute?

Each instance is described by a fixed predefined set of features, its "attributes"

But: number of relevant attributes may vary

Example: table of baseball statistics

Possible solution: "irrelevant value" flag

Related problem: value of an attribute may depend on value of another one

Potential impact on learning beyond prior discussion

Possible solution: methods of data reduction

Possible attribute types ("levels of measurement"):

Nominal, ordinal, interval and ratio

Simplifies to nominal and numeric


Types of attributes

Nominal:

- there is a small set of possible values

attribute    possible values
Fever        {Yes, No}
Diagnosis    {Allergy, Cold, Strep Throat}
Outlook      {sunny, overcast, raining}

• No ordering or distance measure

• Can only test for equality

Numeric:

attribute    possible values
BodyTemp     any value in 96.0-106.0
Salary       any value in $15,000-250,000

Values can be ordered and compared, e.g. $210,000 > $125,000 and 98.6 < 101.3

Types of attributes

• What about this one?

attribute     possible values
ProductType   {0, 1, 2, 3}

such attributes.

- example: product type 2 > product type 1 doesn't have any meaning

- hot > mild > cool

- young

Ordinal quantities

Impose order on values

But no distance between values defined

Example:

attribute "temperature" in weather data

Values: "hot" > "mild" > "cool"

Note: addition and subtraction don't make sense

Example rule:

temperature < hot => play = yes

Distinction between nominal and ordinal not always clear (e.g. attribute "outlook" - is there an ordering?)
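
One way to make ordinal tests such as "temperature < hot" executable is to map each value to its position in the order and compare positions only. This is a sketch under that assumption; the names `TEMPERATURE_ORDER`, `RANK`, `less_than` and `play` are illustrative.

```python
# Order of the ordinal values, lowest first (from the weather example).
TEMPERATURE_ORDER = ["cool", "mild", "hot"]
RANK = {value: i for i, value in enumerate(TEMPERATURE_ORDER)}


def less_than(a, b):
    """Ordinal comparison: only the order of the ranks is meaningful."""
    return RANK[a] < RANK[b]


def play(temperature):
    """Example rule from the slide: temperature < hot => play = yes."""
    # The rule says nothing about other temperatures, so return None for them.
    return "yes" if less_than(temperature, "hot") else None


print(less_than("mild", "hot"))
print(play("cool"), play("hot"))
```

The ranks exist only to support comparison. Adding or subtracting them (for example `RANK["hot"] - RANK["cool"]`) would treat the values as if distances were defined, which the slide rules out.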

Nominal vs. ordinal

Attribute "age" nominal

If age = young and astigmatic = no and tear production rate = normal then recommendation = soft

If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft

Attribute "age" ordinal (e.g. "young" < "pre-presbyopic" < "presbyopic")

If age <= pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft


Attribute types used in practice

Most schemes accommodate just two levels of measurement: nominal and numeric, by which we typically only mean ordinal

Nominal attributes are also called "categorical", "enumerated", or "discrete"

Ordinal attributes are also called "numeric", or "continuous"

Preparing the input

data cleaning, and reduction comprise the majority of the work of building a dataset for effective data mining


Preparing the input

Extraction: acquiring the data (more later)

Integration

Denormalization is not the only challenge

Problem: different data sources (e.g. sales department, customer billing department, ...)

Differences: styles of record keeping, conventions, time periods, primary keys, errors

External data may be required ("overlay data")

Transformation: reformat for specific data mining algorithms (we'll come back to this)

Many potential dataset problems requiring "cleaning"

Missing values

Frequently indicated by out-of-range entries

E.g. -999, "?"

Types: unknown, unrecorded, irrelevant

Reasons:

malfunctioning equipment

changes in experimental design (e.g., new survey questions)

collation of different datasets

measurement not possible

user refusal to answer survey question

Missing value may have significance in itself (e.g. missing test in a medical examination)

Most schemes assume that is not the case: "missing" may need to be coded as additional value
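
The two treatments mentioned above can be sketched as follows: numeric markers are replaced by `None` so they cannot distort statistics, and for nominal attributes "missing" is coded as an additional value. The sample readings and test results are hypothetical, and the function names are illustrative.

```python
MISSING_MARKERS = {-999, "?"}  # the out-of-range entries named in the notes


def clean_numeric(values):
    """Replace missing markers with None so they are not mistaken for data."""
    return [None if v in MISSING_MARKERS else v for v in values]


def clean_nominal(values):
    """Code 'missing' as an additional nominal value."""
    return ["missing" if v in MISSING_MARKERS else v for v in values]


temps = clean_numeric([98.6, -999, 101.3])            # hypothetical readings
tests = clean_nominal(["positive", "?", "negative"])  # hypothetical test results

# Statistics are computed over the known values only.
known = [t for t in temps if t is not None]
print(temps, tests, sum(known) / len(known))
```

Leaving -999 in place would drag the mean of the readings far below any plausible body temperature, which is why the markers have to be recognized before any analysis.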


Inaccurate values

Reason: data has not been collected for the purpose of mining

Result: errors and omissions that don't affect original purpose of data but are critical to mining

E.g. data on hobbies of university students and faculty

Typographical errors in nominal attributes: values need to be checked for consistency

Typographical, measurement, rounding errors in numeric attributes: outliers need to be identified

What facility of Weka did we learn in lab that might be useful here?

Errors may be deliberate

E.g. wrong zip codes
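
A minimal sketch of both checks, assuming the set of legal diagnosis values and the BodyTemp range given earlier in these notes. The records are hypothetical, with one planted typo and one planted out-of-range reading. The range test is a simple domain-knowledge check, not a statistical outlier detector.

```python
from collections import Counter

# Hypothetical records; the typo "Alergy" and the reading 986.0 are planted errors.
diagnoses = ["Allergy", "Cold", "Allergy", "Alergy", "Strep Throat", "Allergy"]
body_temps = [98.6, 101.3, 97.9, 986.0, 99.1]

ALLOWED_DIAGNOSES = {"Allergy", "Cold", "Strep Throat"}
TEMP_RANGE = (96.0, 106.0)  # BodyTemp range given in the notes

# Nominal attribute: count every distinct value, then flag the ones
# outside the declared value set.
counts = Counter(diagnoses)
suspect_nominal = sorted(v for v in counts if v not in ALLOWED_DIAGNOSES)

# Numeric attribute: flag values outside the plausible range.
low, high = TEMP_RANGE
suspect_numeric = [t for t in body_temps if not low <= t <= high]

print(counts.most_common())
print(suspect_nominal, suspect_numeric)
```

Listing value counts is often enough on its own: a value that occurs once next to a near-identical value that occurs many times is a likely typo.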

Unbalanced data

• Suppose the diagnosis dataset had 97 instances of allergy, 2 of cold, and 1 of strep

- Consequences?

• Another lesson about raw accuracy percentages not telling the whole story

evaluation

anything interesting about the data
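
The consequence can be worked through numerically. The sketch below uses the class counts from the slide and a deliberately useless "classifier" that always predicts the majority class; per-class recall is added as one possible complementary measure, which is an assumption rather than something the slide names.

```python
from collections import Counter

# Class distribution from the slide: 97 allergy, 2 cold, 1 strep.
actual = ["allergy"] * 97 + ["cold"] * 2 + ["strep"] * 1

# A "classifier" that ignores its input and always predicts the majority class.
majority = Counter(actual).most_common(1)[0][0]
predicted = [majority] * len(actual)

accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Per-class recall: fraction of each class's instances that were predicted correctly.
recall = {
    cls: sum(a == p == cls for a, p in zip(actual, predicted)) / n
    for cls, n in Counter(actual).items()
}

print(accuracy)
print(recall)
```

The raw accuracy is 97%, yet the recall for cold and strep is zero: the model has learned nothing about the two classes that are hardest to find.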


Other problems

Duplicate/redundant data

Instances

Outliers

Stale data

Different formats

2022-09-13 vs. Sep. 13, 2022
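
For the format problem, a small sketch that normalizes the two date styles shown above into one representation using the standard library. The list of formats is an assumption about what the data contains, and `%b` assumes English month abbreviations (the default C/English locale).

```python
from datetime import datetime

# Formats to try, in order; extend the list for other conventions in the data.
KNOWN_FORMATS = ["%Y-%m-%d", "%b. %d, %Y"]


def parse_date(text):
    """Return a date for any of the known formats, or raise ValueError."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unrecognized date format: {text!r}")


a = parse_date("2022-09-13")
b = parse_date("Sep. 13, 2022")
print(a.isoformat(), a == b)
```

Raising an error for unknown formats, instead of guessing, keeps ambiguous entries visible so they can be resolved by hand.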

Noise

• Noisy data is meaningless data

- Not useful for prediction

• The term has often been used as a synonym for corrupt data

• Its meaning has expanded to include any data that cannot be understood and interpreted correctly by machines

- unstructured text for example

• Distinguishing signal from noise is the task at the heart of data mining

Data cleaning

• Also called pre-processing, or

• Data wrangling (sometimes)

Getting to know the data

Simple visualization tools are very useful

Nominal attributes: histograms

Q: Is the distribution consistent with background knowledge?

Build hypotheses about which attributes to study closely

Numeric attributes: graphs

Q: Any obvious outliers?

2-D and 3-D plots show dependencies

Need to consult domain experts

Too much data to inspect? Take a sample!

More complex data viz tools represent an entire subdiscipline of Computer Science


The ARFF format

% ARFF file for weather data with some numeric features
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
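
To show how the header and data sections fit together, here is a minimal reader for files like the one above. It is a sketch, not Weka's parser: it assumes a dense file with only unquoted nominal and numeric attributes, and the function name `parse_arff` is illustrative.

```python
ARFF_TEXT = """\
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""


def parse_arff(text):
    """Parse a simple dense ARFF file into (attributes, rows).

    Handles only unquoted nominal and numeric attributes.
    """
    attributes, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        lower = line.lower()
        if lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            if kind.startswith("{"):  # nominal: keep the list of allowed values
                kind = [v.strip() for v in kind.strip("{}").split(",")]
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
        elif in_data:
            values = [v.strip() for v in line.split(",")]
            # Numeric attributes become floats; nominal values stay strings.
            rows.append({
                name: float(v) if kind == "numeric" else v
                for (name, kind), v in zip(attributes, values)
            })
    return attributes, rows


attributes, rows = parse_arff(ARFF_TEXT)
print(rows[0])
```

Each data row is matched to the attribute declarations by position, which is why the order of the `@attribute` lines matters.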

Additional attribute types

ARFF supports string attributes:

not pre-specified

@attribute description string

It also supports date attributes:

Uses the ISO-8601 combined date and time format yyyy-MM-dd-THH:mm:ss

@attribute today date


Sparse data

In some applications most attribute values in a dataset are zero

word counts in a text categorization problem

product counts in market basket analysis

ARFF supports sparse data

0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"

0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"
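
In ARFF's sparse notation each instance is written in braces as a list of `index value` pairs, with zero-based attribute indices and zero-valued entries omitted. The sketch below converts the two dense rows above into that notation; the helper name `to_sparse` is illustrative.

```python
def to_sparse(row):
    """Format one dense row in ARFF sparse notation: {index value, ...}.

    Indices are zero-based and zero-valued entries are left out.
    """
    pairs = []
    for index, value in enumerate(row):
        if value == 0:
            continue
        # Quote string values, as in the dense rows above.
        text = f'"{value}"' if isinstance(value, str) else str(value)
        pairs.append(f"{index} {text}")
    return "{" + ", ".join(pairs) + "}"


dense = [
    [0, 26, 0, 0, 0, 0, 63, 0, 0, 0, "class A"],
    [0, 0, 0, 42, 0, 0, 0, 0, 0, 0, "class B"],
]
for row in dense:
    print(to_sparse(row))
```

The first row shrinks from eleven values to three pairs; with thousands of word-count attributes the saving is much larger.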
