Sebastian Raschka STAT 479: Machine Learning FS 2018

Data Preprocessing and Machine Learning with Scikit-Learn
(Computational Foundations Part 3/3)

Lecture 05

STAT 479: Machine Learning, Fall 2018
Sebastian Raschka

[Figure: machine learning workflow. Raw data and labels are split into a training dataset and a test dataset. Stages: Preprocessing, Learning, Evaluation, Prediction. Preprocessing covers feature extraction and scaling, feature selection, dimensionality reduction, and sampling. Learning covers model selection, cross-validation, performance metrics, and hyperparameter optimization. The learning algorithm yields a final model, which is evaluated on the test dataset and used to predict labels for new data.]

Reading a Dataset from a Tabular Text File

[Figure: the three Iris classes: Iris-Versicolor, Iris-Virginica, Iris-Setosa.]

Fisher, R.A. "The use of multiple measurements in taxonomic problems." Annals of Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950).

https://pandas.pydata.org

McKinney, Wes. "Data structures for statistical computing in python." Proceedings of the 9th Python in Science Conference. Vol. 445. 2010.
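A minimal sketch of reading a tabular text file with pandas. The slide code is not preserved in this text, so the column names and the three Iris rows are illustrative assumptions, and an in-memory string stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk: a header plus three Iris rows.
csv_text = """SepalLength[cm],SepalWidth[cm],PetalLength[cm],PetalWidth[cm],Species
5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)   # (rows, columns)
print(df.head())  # first rows of the table
```

With a real file, the `io.StringIO(...)` argument would be replaced by the file path.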

Basic Data Handling
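The slides under this heading did not survive extraction. As a hedged illustration of typical basic data handling, the sketch below maps string class labels to integers and extracts NumPy arrays from a DataFrame. The column names, the rows, and the `label_map` dictionary are assumptions made for this example.

```python
import pandas as pd

# Small illustrative table; values are assumptions for this sketch.
df = pd.DataFrame({
    "PetalLength[cm]": [1.4, 4.7, 6.0],
    "PetalWidth[cm]": [0.2, 1.4, 2.5],
    "Species": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Convert string class labels to integer class labels.
label_map = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
df["Label"] = df["Species"].map(label_map)

# Extract the feature matrix X and the label vector y as NumPy arrays.
X = df[["PetalLength[cm]", "PetalWidth[cm]"]].to_numpy()
y = df["Label"].to_numpy()

print(X.shape, y.shape)
```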

MLXTEND
http://rasbt.github.io/mlxtend/

Raschka, Sebastian. "MLxtend: Providing machine learning and data science utilities and extensions to Python's scientific computing stack." The Journal of Open Source Software 3.24 (2018).

Python Classes
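The content of the Python Classes slides is not preserved here. As a rough sketch of how a class bundles state and behavior, the hypothetical `MajorityClassifier` below stores what it learns in an attribute and exposes `fit` and `predict` methods. The class and its attribute name are invented for illustration.

```python
class MajorityClassifier:
    """Toy classifier that always predicts the most frequent training label."""

    def __init__(self):
        # Set during fit; None signals that the model is not fitted yet.
        self.majority_label_ = None

    def fit(self, X, y):
        # Count how often each label occurs in the training labels.
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        self.majority_label_ = max(counts, key=counts.get)
        return self

    def predict(self, X):
        if self.majority_label_ is None:
            raise RuntimeError("Call fit before predict.")
        # Ignore the features and return the majority label for every row.
        return [self.majority_label_ for _ in X]


clf = MajorityClassifier()
clf.fit([[10], [20], [30]], [2, 2, 1])
predictions = clf.predict([[5], [6]])
print(predictions)  # [2, 2]
```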

http://scikit-learn.org

Pedregosa, Fabian, et al. "Scikit-learn: Machine learning in Python." Journal of Machine Learning Research 12.Oct (2011): 2825-2830.

Scikit-learn Estimator API

[Figure: (1) est.fit(X_train, y_train) fits the model on the training data and training labels; (2) est.predict(X_test) returns predicted labels for the test data.]
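The two calls of the estimator API can be sketched as follows. The choice of `KNeighborsClassifier`, the Iris data, and the split sizes are assumptions for illustration; any scikit-learn classifier follows the same `fit` / `predict` pattern.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out 50 of the 150 examples as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=123, stratify=y
)

est = KNeighborsClassifier(n_neighbors=3)
est.fit(X_train, y_train)     # step 1: learn from training data and labels
y_pred = est.predict(X_test)  # step 2: predict labels for the test data

test_acc = (y_pred == y_test).mean()
print(f"Test accuracy: {test_acc:.2f}")
```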

1.3 Resubstitution Validation and the Holdout Method

The holdout method is inarguably the simplest model evaluation technique; it can be summarized as follows. First, we take a labeled dataset and split it into two parts: a training and a test set. Then, we fit a model to the training data and predict the labels of the test set. The fraction of correct predictions, which can be computed by comparing the predicted labels to the ground truth labels of the test set, constitutes our estimate of the model's prediction accuracy. Here, it is important to note that we do not want to train and evaluate a model on the same training dataset (this is called resubstitution validation or resubstitution evaluation), since it would typically introduce a very optimistic bias due to overfitting. In other words, we cannot tell whether the model simply memorized the training data, or whether it generalizes well to new, unseen data. (On a side note, we can estimate this so-called optimism bias as the difference between the training and test accuracy.)

Typically, the splitting of a dataset into training and test sets is a simple process of random subsampling. We assume that all data points have been drawn from the same probability distribution (with respect to each class). And we randomly choose 2/3 of these samples for the training set and 1/3 of the samples for the test set. Note that there are two problems with this approach, which we will discuss in the next sections.
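A minimal sketch of the holdout method described above, assuming the Iris dataset and an unpruned decision tree (both are illustrative choices, not taken from the text). It splits the data 2/3 to 1/3 by random subsampling and estimates the optimism bias as the difference between training and test accuracy.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Random subsampling: shuffle the indices, then take 2/3 for training.
rng = np.random.RandomState(123)
indices = rng.permutation(X.shape[0])
train_idx, test_idx = indices[:100], indices[100:]

model = DecisionTreeClassifier(random_state=123)
model.fit(X[train_idx], y[train_idx])

# Resubstitution accuracy (evaluated on the training data itself).
train_acc = np.mean(model.predict(X[train_idx]) == y[train_idx])
# Holdout accuracy (evaluated on data the model has not seen).
test_acc = np.mean(model.predict(X[test_idx]) == y[test_idx])

optimism = train_acc - test_acc
print(f"train: {train_acc:.3f}  test: {test_acc:.3f}  optimism: {optimism:.3f}")
```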

1.4 Stratification

We have to keep in mind that a dataset represents a random sample drawn from a probability distribution, and we typically assume that this sample is representative of the true population, more or less. Now, further subsampling without replacement alters the statistic (mean, proportion, and variance) of the sample. The degree to which subsampling without replacement affects the statistic of a sample is inversely proportional to the size of the sample. Let us have a look at an example using the Iris dataset, which we randomly divide into 2/3 training data and 1/3 test data as illustrated in Figure 1. (The source code for generating this graphic is available on GitHub.)

Figure 1: Distribution of Iris flower classes upon random subsampling into training and test sets. Panels: All samples (n = 150), Training samples (n = 100), Test samples (n = 50).

Issues with Subsampling

Stratified Split
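As a sketch of the difference between a plain random split and a stratified split, the example below compares class counts on the Iris dataset. Using `train_test_split` with its `stratify` argument is one way to do this; the random seed is an arbitrary assumption.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 50 examples per class

# Plain random subsampling: class proportions may drift.
_, _, y_train_rand, y_test_rand = train_test_split(
    X, y, test_size=50, random_state=123
)

# Stratified split: class proportions are preserved in both subsets.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=50, random_state=123, stratify=y
)

print("random     train/test:",
      np.bincount(y_train_rand, minlength=3),
      np.bincount(y_test_rand, minlength=3))
print("stratified train/test:",
      np.bincount(y_train_strat, minlength=3),
      np.bincount(y_test_strat, minlength=3))
```

Because 50 test examples cannot be divided evenly among three classes, the stratified counts differ by at most one example per class.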

Normalization: Min-Max Scaling

$$x^{[i]}_{norm} = \frac{x^{[i]} - x_{min}}{x_{max} - x_{min}}$$
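A short NumPy sketch of min-max scaling, using three made-up feature values. The smallest value maps to 0 and the largest to 1.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])  # illustrative feature values

# Min-max scaling: shift by the minimum, divide by the range.
x_norm = (x - x.min()) / (x.max() - x.min())

print(x_norm)  # values 0.0, 0.5, 1.0
```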

Normalization: Standardization

$$x^{[i]}_{std} = \frac{x^{[i]} - \mu_x}{\sigma_x}$$
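A brief sketch of standardization applied column by column to a small feature matrix. The matrix values are made up for illustration; after the transformation each column has mean 0 and standard deviation 1.

```python
import numpy as np

# Illustrative feature matrix: 3 examples, 2 features on different scales.
X = np.array([[10.0, 1.0],
              [20.0, 4.0],
              [30.0, 7.0]])

mu = X.mean(axis=0)    # per-feature mean
sigma = X.std(axis=0)  # per-feature population standard deviation (ddof=0)

X_std = (X - mu) / sigma

print(X_std.mean(axis=0))  # approximately 0 for each feature
print(X_std.std(axis=0))   # approximately 1 for each feature
```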

Sample vs Population Standard Deviation

$$s_x = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left(x^{[i]} - \bar{x}\right)^2} \qquad \sigma_x = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(x^{[i]} - \mu_x\right)^2}$$
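The two estimators can be compared in NumPy through the `ddof` argument of `np.std`. The three values below are illustrative. Note that NumPy defaults to `ddof=0` (population formula), whereas pandas defaults to `ddof=1` (sample formula).

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])

sigma_x = np.std(x, ddof=0)  # population standard deviation, divides by n
s_x = np.std(x, ddof=1)      # sample standard deviation, divides by n - 1

print(f"population: {sigma_x:.3f}")  # 8.165
print(f"sample:     {s_x:.3f}")      # 10.000
```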

Scaling Validation and Test Sets

Given 3 training examples:
- example1: 10 cm -> class 2
- example2: 20 cm -> class 2
- example3: 30 cm -> class 1

Estimate:
- mean: 20 cm
- standard deviation: 8.2 cm

Standardize (z scores):
- example1: -1.22 -> class 2
- example2: 0.00 -> class 2
- example3: 1.22 -> class 1

Suppose the model has learned the decision rule

$$h(z) = \begin{cases} 2 & z \le 0.6 \\ 1 & \text{otherwise} \end{cases}$$

Given 3 NEW examples:
- example4: 5 cm -> class ?
- example5: 6 cm -> class ?
- example6: 7 cm -> class ?

Estimate "new" mean and std. from the new examples (incorrect):
- example4: -1.22 -> class 2
- example5: 0.00 -> class 2
- example6: 1.22 -> class 1

Reuse the mean and std. estimated from the training examples (correct):
- example4: -1.84 -> class 2
- example5: -1.71 -> class 2
- example6: -1.59 -> class 2

All three new examples are shorter than example1, so their z scores should be more negative than that of example1, and all three should be assigned to class 2.
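The comparison above can be reproduced with a few lines of NumPy. This is a sketch under the assumption that the decision rule assigns class 2 when z is at most 0.6; the function name `h` follows the slide, the variable names are chosen for this example.

```python
import numpy as np

x_train = np.array([10.0, 20.0, 30.0])  # lengths in cm (training examples)
x_new = np.array([5.0, 6.0, 7.0])       # lengths in cm (new examples)


def h(z):
    """Decision rule: class 2 if z <= 0.6, class 1 otherwise."""
    return np.where(z <= 0.6, 2, 1)


# Parameters estimated from the training data only.
mu, sigma = x_train.mean(), x_train.std()

# Re-estimating mean and std. from the new data discards the scale
# information, so the new examples look just like the training examples.
z_wrong = (x_new - x_new.mean()) / x_new.std()

# Reusing the training parameters keeps the new data on the training scale.
z_right = (x_new - mu) / sigma

print("re-estimated parameters:", h(z_wrong))  # classes [2 2 1]
print("training parameters:    ", h(z_right))  # classes [2 2 2]
```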

Scikit-Learn Transformer API

[Figure: (1) est.fit(X_train) estimates the transformation parameters from the training data; (2) est.transform(X_train) produces the transformed training data; (3) est.transform(X_test) produces the transformed test data using the same fitted parameters.]
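The three transformer calls can be sketched with `StandardScaler`. The choice of scaler and the toy data are assumptions for illustration; other scikit-learn transformers follow the same `fit` / `transform` pattern.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[5.0], [6.0], [7.0]])

scaler = StandardScaler()
scaler.fit(X_train)                      # step 1: estimate mean and std. from training data
X_train_std = scaler.transform(X_train)  # step 2: transform the training data
X_test_std = scaler.transform(X_test)    # step 3: transform test data with the SAME parameters

print("mean:", scaler.mean_, "std:", scaler.scale_)
print(X_test_std.ravel())
```

The fitted parameters are stored on the object (`mean_` and `scale_`), which is why the test set is never passed to `fit`.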

Categorical: Ordinal

Categorical: Nominal

One-hot Encoding
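The slide content for these headings is not preserved. As a hedged sketch, ordinal features (which have an order) can be mapped to integers that respect that order, while nominal features (which have no order) can be one-hot encoded. The table, the `size_order` mapping, and the column names are assumptions for this example.

```python
import pandas as pd

# Illustrative data: "size" is ordinal, "color" is nominal.
df = pd.DataFrame({
    "size": ["M", "L", "XL"],
    "color": ["green", "red", "blue"],
})

# Ordinal: map categories to integers that preserve the order M < L < XL.
size_order = {"M": 1, "L": 2, "XL": 3}
df["size"] = df["size"].map(size_order)

# Nominal: one-hot encode, creating one 0/1 column per category.
df_encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(df_encoded)
```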

Scikit-Learn Pipelines

[Figure: a Pipeline chaining Scaling (Step 1), Dimensionality Reduction (Step 2), and a Learning Algorithm. pipeline.fit(...) takes the training set and its class labels, calls .fit(...) & .transform(...) on each of the two transformer steps and .fit(...) on the learning algorithm, and yields the predictive model. pipeline.predict(...) takes the test set, calls .transform(...) on each transformer step and .predict(...) on the predictive model, and returns class labels.]
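A minimal sketch of a pipeline with the three stages named in the figure. The concrete components (`StandardScaler`, `PCA` with two components, and a 3-nearest-neighbor classifier) and the Iris data are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=50, random_state=123, stratify=y
)

pipe = make_pipeline(
    StandardScaler(),                     # step 1: scaling
    PCA(n_components=2),                  # step 2: dimensionality reduction
    KNeighborsClassifier(n_neighbors=3),  # learning algorithm
)

# fit: fit + transform each transformer on the training set, then fit the classifier.
pipe.fit(X_train, y_train)

# predict: transform the test set with the fitted steps, then predict class labels.
y_pred = pipe.predict(X_test)
test_acc = pipe.score(X_test, y_test)
print(f"Test accuracy: {test_acc:.2f}")
```

Bundling the steps this way means the scaling and projection parameters are always estimated from the training set only.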

Model Selection: Simple Holdout Method

[Figure: the original dataset is split into a training set, a validation set, and a test set. A machine learning algorithm is fit on the training set, and the resulting predictive model is evaluated on the validation set; the hyperparameters are changed and the process is repeated. Finally, a model is fit on the training set and evaluated on the test set to obtain the final performance estimate.]
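The holdout procedure for model selection can be sketched as follows. The classifier, the candidate values for `n_neighbors`, and the split sizes are assumptions for illustration. Refitting on the combined training and validation data before the final test evaluation is one common choice, not necessarily the one shown on the slides.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Split off the test set first, then split the rest into training and validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=30, random_state=123, stratify=y
)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_temp, y_temp, test_size=30, random_state=123, stratify=y_temp
)

# Fit on the training set, evaluate on the validation set,
# change the hyperparameter, and repeat.
best_k, best_valid_acc = None, -1.0
for k in (1, 3, 5, 7):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    valid_acc = model.score(X_valid, y_valid)
    if valid_acc > best_valid_acc:
        best_k, best_valid_acc = k, valid_acc

# The test set is used exactly once, for the final performance estimate.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_temp, y_temp)
test_acc = final_model.score(X_test, y_test)
print(f"best k: {best_k}  validation acc: {best_valid_acc:.2f}  test acc: {test_acc:.2f}")
```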

Reading Assignments

• Python Machine Learning, 2nd ed.: Ch04 up to "Selecting Meaningful Features" (pg 107-123)
• Python Machine Learning, 2nd ed.: Ch06 up to "Debugging Algorithms with Learning and Validation Curves" (pg 185-194)