
Data Preprocessing and Machine Learning with Scikit-Learn

(Computational Foundations Part 3/3)

Lecture 05

STAT 479: Machine Learning, Fall 2018

Sebastian Raschka

[Figure: Supervised machine learning workflow. Raw data and labels are split into a training dataset and a test dataset. Stages: Preprocessing, Learning, Evaluation, Prediction. Annotated steps: Feature Extraction and Scaling, Feature Selection, Dimensionality Reduction, Model Selection, Performance Metrics, Hyperparameter Optimization. The final model predicts labels for new data.]

Reading a Dataset from a Tabular Text File

[Figure: Iris-Versicolor, Iris-Virginica, Iris-Setosa]

Fisher, R.A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).


Basic Data Handling

https://pandas.pydata.org

McKinney, Wes. "Data structures for statistical computing in Python." Proceedings of the 9th Python in Science Conference. Vol. 445. 2010.
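As a minimal sketch of reading a tabular text file with pandas, the snippet below parses a few comma-separated rows laid out like the Iris data file. The inline text stands in for a file on disk, and the column names are assumptions chosen for illustration.

```python
import io

import pandas as pd

# A few rows in the layout of the Iris data file, inlined so the example
# is self-contained; normally a file path or URL is passed to read_csv.
csv_text = """\
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# Assumed column names; the text above has no header row.
columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
df = pd.read_csv(io.StringIO(csv_text), header=None, names=columns)

# Separate the table into a feature matrix X and a label vector y.
X = df[columns[:-1]].to_numpy()
y = df["species"].to_numpy()

print(df.head())
print(X.shape, y.shape)
```

Keeping the features and labels in separate arrays matches the `X`, `y` convention that scikit-learn estimators expect.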

Raschka, Sebastian. "MLxtend: Providing machine learning and data science utilities and extensions to Python's scientific computing stack." The Journal of Open Source Software 3.24 (2018).

http://rasbt.github.io/mlxtend/

Python Classes

http://scikit-learn.org

Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." Journal of Machine Learning Research 12.Oct (2011): 2825-2830.

Scikit-learn Estimator API

[Figure: ① est.fit(X_train, y_train) fits a model to the training data and training labels; ② est.predict(X_test) returns the predicted labels for the test data.]
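To make the fit/predict pattern concrete, here is a minimal sketch of a toy estimator written as a plain Python class. `MajorityClassifier` and the data are hypothetical and not part of scikit-learn; the class only mimics the two-step estimator interface.

```python
from collections import Counter


class MajorityClassifier:
    """Hypothetical toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        # Learn from the training data. The trailing underscore marks an
        # attribute set during fitting (a scikit-learn naming convention).
        self.majority_label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        # Return one predicted label per input example.
        return [self.majority_label_ for _ in X]


# Made-up data for illustration.
X_train = [[10.0], [20.0], [30.0]]
y_train = [2, 2, 1]
X_test = [[5.0], [6.0]]

est = MajorityClassifier()
est.fit(X_train, y_train)      # step 1: fit on training data and labels
y_pred = est.predict(X_test)   # step 2: predict labels for the test data
print(y_pred)
```

Returning `self` from `fit` allows chaining such as `est.fit(X, y).predict(X_new)`, which is how scikit-learn estimators behave as well.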

1.3 Resubstitution Validation and the Holdout Method

The holdout method is inarguably the simplest model evaluation technique; it can be summarized as follows. First, we take a labeled dataset and split it into two parts: a training and a test set. Then, we fit a model to the training data and predict the labels of the test set. The fraction of correct predictions, which can be computed by comparing the predicted labels to the ground truth labels of the test set, constitutes our estimate of the model's prediction accuracy. Here, it is important to note that we do not want to train and evaluate a model on the same training dataset (this is called resubstitution validation or resubstitution evaluation), since it would typically introduce a very optimistic bias due to overfitting. In other words, we cannot tell whether the model simply memorized the training data, or whether it generalizes well to new, unseen data. (On a side note, we can estimate this so-called optimism bias as the difference between the training and test accuracy.)

Typically, the splitting of a dataset into training and test sets is a simple process of random subsampling. We assume that all data points have been drawn from the same probability distribution (with respect to each class). And we randomly choose 2/3 of these samples for the training set and 1/3 of the samples for the test set. Note that there are two problems with this approach, which we will discuss in the next sections.
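The holdout procedure can be sketched with scikit-learn as follows. The choice of a 1-nearest-neighbor classifier and the random seed are assumptions made for illustration; the 100/50 split corresponds to the 2/3 and 1/3 proportions on the 150 Iris examples.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Random subsampling: 100 training and 50 test examples (2/3 vs. 1/3).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, shuffle=True, random_state=123
)

# Fit on the training set only.
model = KNeighborsClassifier(n_neighbors=1)
model.fit(X_train, y_train)

# Resubstitution accuracy (evaluated on the data used for fitting)
# versus holdout accuracy (evaluated on unseen test data).
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# Estimate of the optimism bias: training minus test accuracy.
optimism = train_acc - test_acc

print(f"training accuracy: {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
print(f"optimism estimate: {optimism:.3f}")
```

A 1-nearest-neighbor model essentially memorizes the training set, so its training accuracy says little about how well it generalizes; the test accuracy is the quantity to report.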


We have to keep in mind that a dataset represents a random sample drawn from a probability distribution, and we typically assume that this sample is representative of the true population, more or less. Now, further subsampling without replacement alters the statistic (mean, proportion, and variance) of the sample. The degree to which subsampling without replacement affects the statistic of a sample is inversely proportional to the size of the sample. Let us have a look at an example using the Iris dataset, which we randomly divide into 2/3 training data and 1/3 test data as illustrated in Figure 1. (The source code for generating this graphic is available on GitHub.)

[Figure 1 panels: All samples (n = 150); Training samples (n = 100); Test samples (n = 50)]

Figure 1: Distribution of Iris flower classes upon random subsampling into training and test sets.

Issues with Subsampling

Stratified Split
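A stratified split keeps the class proportions of the full dataset in both subsets. The sketch below uses the `stratify` argument of `train_test_split`; the seed and the 100/50 split sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y asks for class proportions in the training and test sets
# that match those of y as closely as the split sizes allow.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=123, stratify=y
)

# Count the examples per class (labels are the integers 0, 1, 2).
all_counts = np.bincount(y)
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

print("all:  ", all_counts)
print("train:", train_counts)
print("test: ", test_counts)
```

Because 50 test examples cannot be divided evenly among three classes, the per-class counts differ by at most one.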

Normalization: Min-Max Scaling

$$x^{[i]}_{norm} = \frac{x^{[i]} - x_{min}}{x_{max} - x_{min}}$$
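A minimal NumPy sketch of min-max scaling, using made-up feature values, rescales a feature to the range [0, 1]:

```python
import numpy as np

# Hypothetical feature values for illustration.
x = np.array([10.0, 20.0, 30.0, 50.0])

# Minimum and maximum of the feature.
x_min, x_max = x.min(), x.max()

# Subtract the minimum and divide by the range.
x_norm = (x - x_min) / (x_max - x_min)

print(x_norm)
```

The smallest value maps to 0 and the largest to 1. The expression is undefined for a constant feature, because the range in the denominator is then zero.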

Normalization: Standardization

$$x^{[i]}_{std} = \frac{x^{[i]} - \mu_x}{\sigma_x}$$
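Standardization can be sketched in NumPy by centering each feature column at its mean and dividing by its standard deviation. The two-feature matrix is made up for illustration, and the population standard deviation (NumPy's default) is assumed.

```python
import numpy as np

# Hypothetical data: 3 examples (rows) with 2 features (columns).
X = np.array([
    [10.0, 1.0],
    [20.0, 3.0],
    [30.0, 8.0],
])

# Per-feature mean and standard deviation (axis=0 works column-wise).
mu = X.mean(axis=0)
sigma = X.std(axis=0)  # default ddof=0: population standard deviation

# Each column of X_std has zero mean and unit standard deviation.
X_std = (X - mu) / sigma

print(X_std)
print(X_std.mean(axis=0), X_std.std(axis=0))
```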

Sample vs Population Standard Deviation

$$s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x^{[i]} - \bar{x}\right)^2}$$

$$\sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x^{[i]} - \mu_x\right)^2}$$
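The difference between the two estimators shows up in library defaults. This short sketch contrasts NumPy, whose `std` divides by n unless told otherwise, with pandas, whose `std` divides by n - 1 by default. The three values are chosen only for illustration.

```python
import numpy as np
import pandas as pd

x = [10.0, 20.0, 30.0]

# Population standard deviation: divide by n (NumPy default, ddof=0).
pop_std = np.std(x)

# Sample standard deviation: divide by n - 1 (ddof=1).
sample_std = np.std(x, ddof=1)

# pandas uses ddof=1 by default, so it matches the sample version.
pandas_std = pd.Series(x).std()

print(f"population std (ddof=0): {pop_std:.3f}")
print(f"sample std (ddof=1):     {sample_std:.3f}")
print(f"pandas default:          {pandas_std:.3f}")
```

For large n the two values are nearly identical, but for a handful of examples the gap is noticeable, so it is worth checking which convention a given tool applies.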

Scaling Validation and Test Sets

Given 3 training examples:
- example1: 10 cm -> class 2
- example2: 20 cm -> class 2
- example3: 30 cm -> class 1

Estimate:
- mean: 20 cm
- standard deviation: 8.2 cm

Standardize (z scores):
- example1: -1.22 -> class 2
- example2: 0.00 -> class 2
- example3: 1.22 -> class 1

A decision rule h(z) is learned on these z scores.

Given 3 NEW examples:
- example4: 5 cm -> class ?
- example5: 6 cm -> class ?
- example6: 7 cm -> class ?

Estimating a "new" mean and std. from the new examples gives the same z scores as the training set, although all three new examples are shorter than every training example:
- example4: -1.22 -> class 2
- example5: 0.00 -> class 2
- example6: 1.22 -> class 1

Reusing the mean and std. estimated from the training set instead gives:
- example4: -1.84
- example5: -1.71
- example6: -1.59
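The arithmetic of this example can be checked with the standard library. The sketch below standardizes the new examples in two ways: with statistics re-estimated from the new data, and with the statistics estimated from the training set. The helper `standardize` is a hypothetical name, and the population standard deviation is assumed.

```python
import statistics

train_cm = [10.0, 20.0, 30.0]
new_cm = [5.0, 6.0, 7.0]


def standardize(values, mean, std):
    """Return the z scores of values for a given mean and standard deviation."""
    return [(v - mean) / std for v in values]


# Parameters estimated from the training set only.
mu_train = statistics.mean(train_cm)
sigma_train = statistics.pstdev(train_cm)  # population standard deviation

z_train = standardize(train_cm, mu_train, sigma_train)

# Re-estimating mean and std. on the new data hides the fact that the
# new examples are much shorter than the training examples.
z_new_reestimated = standardize(
    new_cm, statistics.mean(new_cm), statistics.pstdev(new_cm)
)

# Reusing the training parameters keeps the new examples on the scale
# the model was fit on.
z_new_reused = standardize(new_cm, mu_train, sigma_train)

print([round(z, 2) for z in z_train])
print([round(z, 2) for z in z_new_reestimated])
print([round(z, 2) for z in z_new_reused])
```

With re-estimated parameters the new examples receive the same z scores as the training examples, so a rule learned on the training z scores would treat a 5 cm example like a 10 cm one.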

Scikit-Learn Transformer API

[Figure: ① est.fit(X_train) fits the transformer to the training data; ② est.transform(X_train) produces the transformed training data; ③ est.transform(X_test) produces the transformed test data.]
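The three steps of the transformer interface can be sketched with `StandardScaler`. The single-feature arrays are small illustrative values; the key point is that `fit` sees only the training data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One feature, three training and three test examples (illustrative values).
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [6.0], [7.0]])

scaler = StandardScaler()

# Step 1: estimate the mean and standard deviation from the training data.
scaler.fit(X_train)

# Step 2: transform the training data with the fitted parameters.
X_train_std = scaler.transform(X_train)

# Step 3: transform the test data with the SAME training parameters.
X_test_std = scaler.transform(X_test)

print(scaler.mean_, scaler.scale_)
print(X_train_std.ravel())
print(X_test_std.ravel())
```

Calling `fit` a second time on the test data would overwrite the training parameters, which is exactly what the three-step pattern avoids.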

Categorical: Ordinal

Categorical: Nominal

One-hot Encoding
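As a rough sketch of the two kinds of categorical features, the snippet below maps an ordinal feature to integers that preserve its order and one-hot encodes a nominal feature. The data frame, the column names, and the mapping `size_mapping` are made up for illustration.

```python
import pandas as pd

# Hypothetical data: "size" is ordinal (M < L < XL), "color" is nominal.
df = pd.DataFrame({
    "color": ["green", "red", "blue"],
    "size": ["M", "L", "XL"],
})

# Ordinal feature: encode with integers that respect the order.
size_mapping = {"M": 1, "L": 2, "XL": 3}
df["size"] = df["size"].map(size_mapping)

# Nominal feature: one-hot encoding creates one indicator column per
# category, so no artificial order is imposed on the colors.
df_onehot = pd.get_dummies(df, columns=["color"])

print(df_onehot)
```

Encoding colors as 1, 2, 3 instead would suggest to a learning algorithm that one color is "larger" than another, which is why nominal features are usually one-hot encoded.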

Scikit-Learn Pipelines

[Figure: A pipeline chains Scaling, Dimensionality Reduction, and a Learning Algorithm. (Step 1) pipeline.fit(...) on the training set calls .fit(...) & .transform(...) on the scaling and dimensionality reduction steps and .fit(...) on the learning algorithm, which yields the predictive model. (Step 2) pipeline.predict(...) on the test set calls .transform(...) on the scaling and dimensionality reduction steps and .predict(...) on the predictive model, which returns class labels.]
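A pipeline with the three stages named in the figure could look like the following sketch. The specific components (`StandardScaler`, `PCA` with two components, and a k-nearest-neighbor classifier) and the seed are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=123, stratify=y
)

# Scaling -> dimensionality reduction -> learning algorithm.
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    KNeighborsClassifier(n_neighbors=5),
)

# Step 1: fit and transform with each preprocessing step on the
# training set, then fit the final estimator on the transformed data.
pipe.fit(X_train, y_train)

# Step 2: apply the already-fitted transformations to the test set,
# then predict class labels with the final estimator.
y_pred = pipe.predict(X_test)
test_acc = pipe.score(X_test, y_test)

print(y_pred[:10])
print(f"test accuracy: {test_acc:.3f}")
```

Because the pipeline fits its preprocessing steps only inside `fit`, the test set is always scaled and projected with parameters estimated from the training set.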

Model Selection: Simple Holdout Method

[Figure: The original dataset is split into a training set, a validation set, and a test set. A machine learning algorithm is fit on the training set to produce a predictive model, which is evaluated on the validation set; the hyperparameters are changed and the process is repeated. The final performance estimate is obtained on the test set.]
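The holdout approach to model selection can be sketched as below. The split sizes, the candidate values of k, and the seed are assumptions for illustration; refitting on the combined training and validation data before the final test evaluation is one common choice, assumed here.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Set the test set aside first, then split the remainder into
# training and validation sets (90 / 30 / 30 examples).
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=30, random_state=123, stratify=y
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=30, random_state=123, stratify=y_temp
)

# Try several hyperparameter settings; compare them on the validation set.
best_k, best_valid_acc = None, -1.0
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    valid_acc = model.score(X_valid, y_valid)
    if valid_acc > best_valid_acc:
        best_k, best_valid_acc = k, valid_acc

# Refit with the selected setting on training + validation data, then
# use the test set once for the final performance estimate.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_temp, y_temp)
test_acc = final_model.score(X_test, y_test)

print(f"selected k: {best_k} (validation accuracy {best_valid_acc:.3f})")
print(f"final test accuracy: {test_acc:.3f}")
```

The test set plays no role in choosing k, so the final accuracy is not biased by the hyperparameter search.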

Reading Assignments

• Python Machine Learning, 2nd ed.: Ch04 up to "Selecting Meaningful Features" (pg 107-123)

• Python Machine Learning, 2nd ed.: Ch06 up to "Debugging Algorithms with Learning and Validation Curves" (pg 185-194)