10/20/2022
Data Mining Input: Concepts, Instances, Attributes
...and Pre-Processing
Chapter 2 of Data Mining

Terminology
Components of the input:
• Concepts: kinds of things that can be learned
  - Goal: intelligible and operational concept description
  - E.g.: "Under what conditions should we play?"
  - This concept is located somewhere in the input data
• Instances: the individual, independent examples of a concept
  - Note: more complicated forms of input are possible
• Attributes: measuring characteristics/aspects of an instance
  - We will focus on nominal and numeric attributes

What is a concept?
Styles of learning:
• Classification learning: understanding/predicting a discrete class
• Association learning: detecting associations between features
• Clustering: grouping similar instances into clusters
• Numeric estimation: understanding/predicting a numeric quantity
Concept: thing to be learned
Concept description: output of learning scheme

Classification learning
• Example problems: weather data, medical diagnosis, contact lenses, irises, labor negotiations, etc.
  - Can you think of others?
• Classification learning is supervised
  - Algorithm is provided with actual outcomes
• Outcome is called the class attribute of the example
• Measure success on fresh data for which class labels are known (test data, as opposed to training data)
• In practice success is often measured subjectively
  - How acceptable the learned description is to a human user
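The idea of measuring success on fresh data can be sketched in Python as below. This is only an illustration under stated assumptions: the helper names (holdout_split, learn_one_attribute_rule, accuracy) and the toy instances are hypothetical, and the learner is a deliberately trivial stand-in for a real classification scheme.

```python
import random
from collections import Counter, defaultdict


def holdout_split(instances, test_fraction=0.3, seed=0):
    """Shuffle a copy of the instances and split it into (train, test)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def learn_one_attribute_rule(train, attribute, class_attribute):
    """Predict, for each value of one attribute, the most common class in training."""
    by_value = defaultdict(Counter)
    for inst in train:
        by_value[inst[attribute]][inst[class_attribute]] += 1
    overall = Counter(inst[class_attribute] for inst in train)
    default = overall.most_common(1)[0][0]  # fallback for unseen attribute values
    table = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    return lambda inst: table.get(inst[attribute], default)


def accuracy(predict, test, class_attribute):
    """Fraction of test instances whose known class label is predicted correctly."""
    hits = sum(1 for inst in test if predict(inst) == inst[class_attribute])
    return hits / len(test)


# Hypothetical toy data; "play" is the class attribute.
data = ([{"outlook": "sunny", "play": "no"}] * 10
        + [{"outlook": "overcast", "play": "yes"}] * 10)
train, test = holdout_split(data, test_fraction=0.3, seed=1)
model = learn_one_attribute_rule(train, "outlook", "play")
print(f"test accuracy: {accuracy(model, test, 'play'):.2f}")
```

The model never sees the test instances while learning, so the reported accuracy estimates performance on fresh data rather than on the training set.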

Association learning
• Can be applied if no class is specified and any kind of structure is considered "interesting"
• Difference from classification learning:
  - Unsupervised, i.e., not told what to learn
  - Can predict any attribute's value, not just the class, and more than one attribute's value at a time
• Hence: far more association rules than classification rules
• Thus: constraints are necessary
  - Minimum coverage and minimum accuracy
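A rough Python sketch of the two constraints follows. It assumes the usual textbook definitions: coverage is the number of instances a rule predicts correctly, and accuracy is that number as a proportion of the instances the rule applies to. The function names are hypothetical; the four instances are the weather rows shown on the numeric estimation slide, with values lowercased.

```python
def coverage_and_accuracy(instances, antecedent, consequent):
    """Return (coverage, accuracy) of the rule 'if antecedent then consequent'.

    coverage: number of instances the rule predicts correctly
    accuracy: coverage divided by the number of instances the rule applies to
    """
    def matches(inst, conditions):
        return all(inst.get(attr) == value for attr, value in conditions.items())

    applies = [inst for inst in instances if matches(inst, antecedent)]
    correct = [inst for inst in applies if matches(inst, consequent)]
    acc = len(correct) / len(applies) if applies else 0.0
    return len(correct), acc


def keep_rule(instances, antecedent, consequent, min_coverage, min_accuracy):
    """Apply the minimum-coverage and minimum-accuracy constraints to one rule."""
    cov, acc = coverage_and_accuracy(instances, antecedent, consequent)
    return cov >= min_coverage and acc >= min_accuracy


weather = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": True},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False},
    {"outlook": "rainy", "temperature": "mild", "humidity": "normal", "windy": False},
]

# Any attribute can appear on the right-hand side, not just a class.
print(coverage_and_accuracy(weather, {"temperature": "hot"}, {"humidity": "high"}))
print(keep_rule(weather, {"windy": False}, {"outlook": "sunny"}, 2, 0.9))
```

Because any attribute can be predicted, the number of candidate rules grows very quickly, which is why thresholds such as min_coverage and min_accuracy are needed to prune them.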

Clustering
• Finding groups of items that are similar
• Clustering is unsupervised
  - The class of an example is not known
• Success often measured subjectively

      Sepal length  Sepal width  Petal length  Petal width  Type
1     5.1           3.5          1.4           0.2          Iris setosa
2     4.9           3.0          1.4           0.2          Iris setosa
51    7.0           3.2          4.7           1.4          Iris versicolor
52    6.4           3.2          4.5           1.5          Iris versicolor
101   6.3           3.3          6.0           2.5          Iris virginica
102   5.8           2.7          5.1           1.9          Iris virginica

Numeric estimation
• Variant of classification learning where the output attribute is numeric (also called "regression")
• Learning is supervised
  - Algorithm is provided with target values
• Measure success on test data

Outlook   Temperature  Humidity  Windy  Play-time
Sunny     Hot          High      False  5
Sunny     Hot          High      True   0
Overcast  Hot          High      False  55
Rainy     Mild         Normal    False  40

Some input terminology
[...] or instance.
- the input attributes - everything else
• Example:
[Figure labels: input attributes (fever, swollen glands, headache, ...); model (rules or tree or ...); output attribute (diagnosis)]

What's in an example?
• Instance: specific type of example
  - Thing to be classified, associated, or clustered
  - Individual, independent example of target concept
  - Characterized by a predetermined set of attributes
• Input to learning scheme: set of independent instances (a dataset)
  - Represented as a single relation/flat file
  - Note difference from relational database
• Rather restricted form of input
  - No relationships between objects/instances
• Most common form in practical data mining
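
Example: A family tree
[Figure: family tree]
Peter (M) = Peggy (F): children Steven (M), Graham (M), Pam (F)
Grace (F) = Ray (M): children Ian (M), Pippa (F), Brian (M)
Pam (F) = Ian (M): children Anna (F), Nikki (F)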

Family tree represented as a table
Name    Gender  Parent1  Parent2
Peter   Male    ?        ?
Peggy   Female  ?        ?
Steven  Male    Peter    Peggy
Graham  Male    Peter    Peggy
Pam     Female  Peter    Peggy
Ian     Male    Grace    Ray
Pippa   Female  Grace    Ray
Brian   Male    Grace    Ray
Anna    Female  Pam      Ian
Nikki   Female  Pam      Ian

The "sister-of" relation: Two versions
First person  Second person  Sister of?
Peter         Peggy          No
Peter         Steven         No
Steven        Peter          No
Steven        Graham         No
Steven        Pam            Yes
Ian           Pippa          Yes
Anna          Nikki          Yes
Nikki         Anna           Yes

First person  Second person  Sister of?
Steven        Pam            Yes
Graham        Pam            Yes
Ian           Pippa          Yes
Brian         Pippa          Yes
Anna          Nikki          Yes
Nikki         Anna           Yes
All the rest                 No
Closed-world assumption

A full representation in one flat-file table
First person                       Second person                      Sister of?
Name    Gender  Parent1  Parent2   Name    Gender  Parent1  Parent2
Steven  Male    Peter    Peggy     Pam     Female  Peter    Peggy     Yes
Graham  Male    Peter    Peggy     Pam     Female  Peter    Peggy     Yes
Ian     Male    Grace    Ray       Pippa   Female  Grace    Ray       Yes
Brian   Male    Grace    Ray       Pippa   Female  Grace    Ray       Yes
Anna    Female  Pam      Ian       Nikki   Female  Pam      Ian       Yes
Nikki   Female  Pam      Ian       Anna    Female  Pam      Ian       Yes
All the rest                                                          No

If second person's gender = female
and first person's parent1 = second person's parent1
then sister-of = yes

Generating a flat file
• Process of flattening is called "denormalization"
  - Several relations are joined together to make one
• Possible with any finite set of finite relations
  - More on this in CSC-341
• Denormalization may produce spurious regularities that reflect structure of database
  - Examples (functional dependencies):
    "supplier" predicts "supplier address"
    "cournum" predicts "courname"
    "model" predicts "make"
• May also cause data inconsistencies and redundancies
  - From same data stored in different tables
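As an illustration of flattening, the Python sketch below joins the family-tree person table with itself to produce one flat row per ordered pair of people, then applies the sister-of rule from the flat-file slide. The function and field names are hypothetical. Two choices go beyond the slide and are assumptions: a person is never paired with themselves, and an unknown parent ("?" in the table, None here) never counts as a match.

```python
from itertools import permutations

FIELDS = ("name", "gender", "parent1", "parent2")

# The person relation from the family-tree table; None stands for "?" (unknown).
PEOPLE = [
    ("Peter", "Male", None, None),
    ("Peggy", "Female", None, None),
    ("Steven", "Male", "Peter", "Peggy"),
    ("Graham", "Male", "Peter", "Peggy"),
    ("Pam", "Female", "Peter", "Peggy"),
    ("Ian", "Male", "Grace", "Ray"),
    ("Pippa", "Female", "Grace", "Ray"),
    ("Brian", "Male", "Grace", "Ray"),
    ("Anna", "Female", "Pam", "Ian"),
    ("Nikki", "Female", "Pam", "Ian"),
]


def is_sister(row):
    """Slide rule: second person is female and both share the same parent1."""
    return (row["second_gender"] == "Female"
            and row["first_parent1"] is not None  # assumption: unknown never matches
            and row["first_parent1"] == row["second_parent1"])


def flatten(people):
    """Self-join the person relation into one flat table of ordered pairs."""
    rows = []
    for first, second in permutations(people, 2):  # distinct people only
        row = {f"first_{f}": v for f, v in zip(FIELDS, first)}
        row.update({f"second_{f}": v for f, v in zip(FIELDS, second)})
        row["sister_of"] = "yes" if is_sister(row) else "no"
        rows.append(row)
    return rows


flat = flatten(PEOPLE)
positives = [(r["first_name"], r["second_name"]) for r in flat if r["sister_of"] == "yes"]
print(len(flat), "rows;", len(positives), "positive")
for pair in positives:
    print(pair)
```

Each flat row repeats the gender and parents of both people, which is the redundancy the slide warns about: the same facts now appear many times and can be picked up by a learner as spurious regularities.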

Multi-instance concepts
• [...] individual instances are not independent
• One or more instances within an example may be responsible for its classification
• Examples
  - multi-day game activity (the weather data)
  - performance of a student over multiple classes

What's in an attribute?
• Each instance is described by a fixed predefined set of features, its "attributes"
• But: number of relevant attributes may vary
  - Example: table of baseball statistics
  - Possible solution: "irrelevant value" flag
• Related problem: value of an attribute may depend on value of another one
  - Potential impact on learning beyond prior discussion
  - Possible solution: methods of data reduction
• Possible attribute types ("levels of measurement"):
  - Nominal, ordinal, interval and ratio
  - Simplifies to nominal and numeric

Types of attributes
• Nominal attributes [...]
  - there is a small set of possible values
    attribute    possible values
    Fever        {Yes, No}
    Diagnosis    {Allergy, Cold, Strep Throat}
    Outlook      {sunny, overcast, raining}
• No ordering or distance measure
• Can only test for equality
• Numeric attributes [...]
    attribute    possible values
    BodyTemp     any value in 96.0-106.0
    Salary       any value in $15,000-250,000
    $210,000 > $125,000
    98.6 < 101.3

Types of attributes
• What about this one?
    attribute    possible values
    ProductType  {0, 1, 2, 3}
[...] such attributes.
  - example: product type 2 > product type 1 doesn't have any meaning
  - hot > mild > cool
  - young [...]

Ordinal quantities
• Impose order on values
• But no distance between values defined
• Example: attribute "temperature" in weather data
  - Values: "hot" > "mild" > "cool"
• Note: addition and subtraction don't make sense
• Example rule: temperature < hot ⇒ play = yes
• Distinction between nominal and ordinal not always clear (e.g. attribute "outlook" - is there an ordering?)
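One way to picture an ordinal attribute in code is as a list that fixes the order of its values, with comparisons done on position only. The Python sketch below is a minimal illustration; the names TEMPERATURE_ORDER, less_than and play_rule are hypothetical, and the rule is the example rule from the slide.

```python
# Values listed from lowest to highest; only the order carries meaning.
TEMPERATURE_ORDER = ["cool", "mild", "hot"]


def less_than(a, b, order):
    """Ordinal comparison: a comes before b in the declared order."""
    return order.index(a) < order.index(b)


def play_rule(instance):
    """Example rule: temperature < hot => play = yes.

    Returns "yes" when the rule fires and None when it does not apply.
    """
    if less_than(instance["temperature"], "hot", TEMPERATURE_ORDER):
        return "yes"
    return None


for value in TEMPERATURE_ORDER:
    print(value, "->", play_rule({"temperature": value}))

# The positions 0, 1, 2 are not measurements: "hot" - "mild" has no meaning,
# so this sketch deliberately offers comparison but no arithmetic.
```

A nominal attribute would support only the equality test (a == b); the ordered list is what adds the less-than comparison without implying any distance between values.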

Nominal vs. ordinal
• Attribute "age" nominal
    If age = young and astigmatic = no
    and tear production rate = normal
    then recommendation = soft
    If age = pre-presbyopic and astigmatic = no
    and tear production rate = normal
    then recommendation = soft
• Attribute "age" ordinal (e.g. "young" < "pre-presbyopic" < "presbyopic")
    If age ≤ pre-presbyopic and astigmatic = no
    and tear production rate = normal
    then recommendation = soft

Attribute types used in practice
• Most schemes accommodate just two levels of measurement: nominal and numeric, by which we typically only mean ordinal
• Nominal attributes are also called "categorical", "enumerated", or "discrete"
• Ordinal attributes are also called "numeric", or "continuous"

Preparing the input
• [...] data cleaning, and reduction comprise the majority of the work of building a dataset for effective data mining

Preparing the input
• Extraction: acquiring the data (more later)
• Integration
  - Denormalization is not the only challenge
  - Problem: different data sources (e.g. sales department, customer billing department, ...)
  - Differences: styles of record keeping, conventions, time periods, primary keys, errors
  - External data may be required ("overlay data")
• Transformation: reformat for specific data mining algorithms (we'll come back to this)
• Many potential dataset problems requiring "cleaning"

Missing values
• Frequently indicated by out-of-range entries
  - E.g. -999, "?"
• Types: unknown, unrecorded, irrelevant
• Reasons:
  - malfunctioning equipment
  - changes in experimental design (e.g., new survey questions)
  - collation of different datasets
  - measurement not possible
  - user refusal to answer survey question
• Missing value may have significance in itself (e.g. missing test in a medical examination)
  - Most schemes assume that is not the case: "missing" may need to be coded as additional value
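The two ideas above, recognizing out-of-range sentinels and coding "missing" as an additional value, can be sketched in Python as follows. The sentinel set and the function names are assumptions for illustration; a real dataset needs its own list of sentinels, agreed with whoever collected the data.

```python
# Hypothetical sentinel spellings; -999 and "?" are the examples from the slide.
SENTINELS = {"?", "-999"}


def mark_missing(raw_value, sentinels=SENTINELS):
    """Return None for a sentinel entry, otherwise the value unchanged."""
    return None if str(raw_value).strip() in sentinels else raw_value


def code_missing_as_value(values, label="missing"):
    """For a nominal attribute, treat "missing" as one more possible value."""
    return [label if v is None else v for v in values]


raw_temperatures = [98.6, -999, 101.3, "?"]
raw_tests = ["positive", "?", "negative", "?"]

temperatures = [mark_missing(v) for v in raw_temperatures]
tests = code_missing_as_value([mark_missing(v) for v in raw_tests])

print(temperatures)
print(tests)
```

Keeping None for the numeric attribute avoids treating -999 as a real temperature, while the extra "missing" value on the nominal attribute lets a learning scheme use the absence of a test as information in itself.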

Inaccurate values
• Reason: data has not been collected for the purpose of mining
• Result: errors and omissions that don't affect original purpose of data but are critical to mining
  - E.g. data on hobbies of university students and faculty
• Typographical errors in nominal attributes: values need to be checked for consistency
• Typographical, measurement, rounding errors in numeric attributes: outliers need to be identified
  - What facility of Weka did we learn in lab that might be useful here?
• Errors may be deliberate
  - E.g. wrong zip codes

Unbalanced data
• Suppose the diagnosis dataset had 97 instances of allergy, 2 of cold, and 1 of strep
  - Consequences?
• Another lesson about raw accuracy percentages not telling the whole story
  [...] evaluation [...] anything interesting about the data
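To make the consequence concrete, the Python sketch below scores a "classifier" that always predicts the most common diagnosis on the 97/2/1 class counts from the slide. The function names are hypothetical; the point is only the arithmetic.

```python
from collections import Counter


def majority_class(labels):
    """The most frequent class label."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(actual, predicted):
    """Fraction of instances predicted correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def per_class_recall(actual, predicted):
    """For each class, the fraction of its instances that were predicted correctly."""
    totals, hits = Counter(actual), Counter()
    for a, p in zip(actual, predicted):
        if a == p:
            hits[a] += 1
    return {cls: hits[cls] / totals[cls] for cls in totals}


# Class counts from the slide: 97 allergy, 2 cold, 1 strep.
actual = ["allergy"] * 97 + ["cold"] * 2 + ["strep"] * 1
predicted = [majority_class(actual)] * len(actual)

print("accuracy:", accuracy(actual, predicted))
print("recall per class:", per_class_recall(actual, predicted))
```

The raw accuracy looks excellent even though no cold or strep case is ever identified, so a per-class measure is needed before concluding that a model has learned anything interesting.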

Other problems
• Duplicate/redundant data
  - Instances
• Outliers
• Stale data
• Different formats
  - 2022-09-13 vs. Sep. 13, 2022

Noise
• Noisy data is meaningless data
  - Not useful for prediction
• The term has often been used as a synonym for corrupt data
• Its meaning has expanded to include any data that cannot be understood and interpreted correctly by machines
  - unstructured text for example
• Distinguishing signal from noise is the task at the heart of data mining

[...] data cleaning
• Also called pre-processing, or
• Data wrangling (sometimes)

Getting to know the data
• Simple visualization tools are very useful
  - Nominal attributes: histograms
    Q: Is the distribution consistent with background knowledge? Build hypotheses about which attributes to study closely
  - Numeric attributes: graphs
    Q: Any obvious outliers?
• 2-D and 3-D plots show dependencies
• Need to consult domain experts
• Too much data to inspect? Take a sample!
• More complex data viz tools represent an entire subdiscipline of Computer Science
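A first look at a dataset does not need a plotting library. The Python sketch below, using only the standard library, prints a text histogram for a nominal attribute, flags numeric values outside the common 1.5 x IQR fences, and draws a random sample. The function names, the fence rule, and the toy values (including the deliberately mistyped 850) are assumptions for illustration, not part of the course material.

```python
import random
import statistics
from collections import Counter


def text_histogram(values):
    """One line per nominal value, most frequent first."""
    counts = Counter(values)
    return "\n".join(f"{v:<10}{'#' * n} ({n})" for v, n in counts.most_common())


def iqr_outliers(values, k=1.5):
    """Values outside [Q1 - k*IQR, Q3 + k*IQR]; one common rule of thumb."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if v < low or v > high]


def take_sample(instances, k, seed=0):
    """Random sample for when there is too much data to inspect."""
    items = list(instances)
    return random.Random(seed).sample(items, min(k, len(items)))


# Hypothetical values; 850 stands in for a typographical error.
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy"]
temperature = [85, 80, 83, 70, 68, 65, 64, 72, 69, 75, 850]

print(text_histogram(outlook))
print("possible outliers:", iqr_outliers(temperature))
print("sample:", take_sample(temperature, 3))
```

Flagged values are candidates for a closer look, not automatic deletions; deciding whether 850 is an error or a real reading is where the domain expert comes in.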

The ARFF format
% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes

Additional attribute types
• ARFF supports string attributes:
    @attribute description string
  - [...] not pre-specified
• It also supports date attributes:
    @attribute today date
  - Uses the ISO-8601 combined date and time format yyyy-MM-ddTHH:mm:ss
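To show how the pieces of an ARFF file fit together, here is a small Python reader for the simple subset used on these slides. It is a sketch under stated assumptions, not Weka's parser: it handles no quoting, no sparse data, and no custom date formats, and the function names are hypothetical. Nominal values are checked against their declared set, numeric values become floats, "?" becomes None, and dates are read in the ISO-8601 form given above.

```python
from datetime import datetime

ARFF_TEXT = """% ARFF file for weather data with some numeric features
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}
@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
"""


def parse_type(type_text):
    """A nominal type becomes a list of allowed values; others a lowercase name."""
    type_text = type_text.strip()
    if type_text.startswith("{"):
        return [v.strip() for v in type_text.strip("{}").split(",")]
    return type_text.split()[0].lower()


def convert(cell, attr_type):
    """Convert one data cell according to its attribute type."""
    if cell == "?":
        return None  # ARFF marks missing values with "?"
    if isinstance(attr_type, list):
        if cell not in attr_type:
            raise ValueError(f"{cell!r} is not one of {attr_type}")
        return cell
    if attr_type == "numeric":
        return float(cell)
    if attr_type == "date":
        return datetime.strptime(cell, "%Y-%m-%dT%H:%M:%S")
    return cell  # string attributes are kept as text


def parse_simple_arff(text):
    """Return (relation name, [(attribute name, type)], data rows)."""
    relation, attributes, rows, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        keyword = line.lower()
        if in_data:
            cells = [c.strip() for c in line.split(",")]
            rows.append([convert(c, t) for c, (_, t) in zip(cells, attributes)])
        elif keyword.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif keyword.startswith("@attribute"):
            _, name, type_text = line.split(None, 2)
            attributes.append((name, parse_type(type_text)))
        elif keyword.startswith("@data"):
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_simple_arff(ARFF_TEXT)
print(relation, [name for name, _ in attributes])
print(rows[0])
```

The header does the work of a schema: because each attribute declares its type up front, a typo such as "suny" in a nominal column is caught while reading the file rather than during learning.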