
Stochastic Gradient Descent

Ryan Tibshirani

Convex Optimization 10-725

Last time: proximal gradient descent

Consider the problem

min_x g(x) + h(x)

with g, h convex, g differentiable, and h "simple" in so much as

prox_{h,t}(x) = argmin_z (1/(2t)) ‖x − z‖₂² + h(z)

is computable.

Proximal gradient descent: let x^(0) ∈ ℝⁿ, repeat:

x^(k) = prox_{h,t_k}( x^(k−1) − t_k ∇g(x^(k−1)) ),  k = 1, 2, 3, ...

Step sizes t_k chosen to be fixed and small, or via backtracking. If ∇g is Lipschitz with constant L, then this has convergence rate O(1/ε). Lastly, we can accelerate this, to optimal rate O(1/√ε).

Outline

Today:

Stochastic gradient descent

Convergence rates

Mini-batches

Early stopping


Stochastic gradient descent

Consider minimizing an average of functions

min_x (1/m) Σ_{i=1}^m f_i(x)

Since ∇ Σ_{i=1}^m f_i(x) = Σ_{i=1}^m ∇f_i(x), gradient descent would repeat:

x^(k) = x^(k−1) − t_k · (1/m) Σ_{i=1}^m ∇f_i(x^(k−1)),  k = 1, 2, 3, ...

In comparison, stochastic gradient descent or SGD (or incremental gradient descent) repeats:

x^(k) = x^(k−1) − t_k · ∇f_{i_k}(x^(k−1)),  k = 1, 2, 3, ...

where i_k ∈ {1, ..., m} is some chosen index at iteration k.
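To make the contrast concrete, here is a minimal NumPy sketch of the two updates on a least-squares instance of our own choosing, with f_i(x) = (1/2)(a_iᵀx − b_i)²; the data, step sizes, and iteration counts are illustrative, not from the slides:

```python
import numpy as np

# Illustrative problem: minimize (1/m) sum_i f_i(x) with
# f_i(x) = 0.5 * (a_i^T x - b_i)^2, so grad f_i(x) = (a_i^T x - b_i) * a_i
rng = np.random.default_rng(0)
m, n = 50, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def grad_fi(x, i):
    # Gradient of a single f_i: touches one row of A, O(n) work
    return (A[i] @ x - b[i]) * A[i]

def full_grad(x):
    # Gradient of (1/m) sum_i f_i: touches all m rows, O(mn) work
    return A.T @ (A @ x - b) / m

# Gradient descent: every iteration uses the full gradient
x_gd = np.zeros(n)
for k in range(1, 501):
    x_gd = x_gd - 0.1 * full_grad(x_gd)

# SGD: every iteration uses a single randomly chosen grad f_{i_k}
x_sgd = np.zeros(n)
for k in range(1, 2001):
    i_k = rng.integers(m)
    x_sgd = x_sgd - (1.0 / k) * grad_fi(x_sgd, i_k)
```

Each SGD iteration here costs a single row of A, which is the point of the method; the 1/k step-size schedule anticipates the discussion of diminishing step sizes later in these notes.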

Two rules for choosing index i_k at iteration k:

Randomized rule: choose i_k ∈ {1, ..., m} uniformly at random

Cyclic rule: choose i_k = 1, 2, ..., m, 1, 2, ..., m, ...

Randomized rule is more common in practice. For randomized rule, note that

E[∇f_{i_k}(x)] = ∇f(x)

so we can view SGD as using an unbiased estimate of the gradient at each step

Main appeal of SGD:

Iteration cost is independent of m (number of functions)

Can also be a big savings in terms of memory usage
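The unbiasedness claim above can be checked numerically: averaging ∇f_i over i ∈ {1, ..., m} uniformly recovers ∇f exactly. The least-squares setup below is our own illustration, not from the slides:

```python
import numpy as np

# Check that the randomized rule gives an unbiased gradient estimate,
# on an illustrative least-squares instance f_i(x) = 0.5*(a_i^T x - b_i)^2
rng = np.random.default_rng(1)
m, n = 20, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x = rng.standard_normal(n)

# All m single-function gradients grad f_i(x), stacked row-wise
grads = np.array([(A[i] @ x - b[i]) * A[i] for i in range(m)])

expected = grads.mean(axis=0)    # E[grad f_{i_k}(x)] under uniform i_k
full = A.T @ (A @ x - b) / m     # grad f(x) for f = (1/m) sum_i f_i
assert np.allclose(expected, full)
```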

Example: stochastic logistic regression

Given (x_i, y_i) ∈ ℝᵖ × {0, 1}, i = 1, ..., n, recall logistic regression:

min_β (1/n) Σ_{i=1}^n ( −y_i x_iᵀβ + log(1 + exp(x_iᵀβ)) )

with the i-th summand denoted f_i(β).

Gradient computation ∇f(β) = (1/n) Σ_{i=1}^n (p_i(β) − y_i) x_i, where p_i(β) = exp(x_iᵀβ)/(1 + exp(x_iᵀβ)), is doable when n is moderate, but not when n is huge.

Full gradient (also called batch) versus stochastic gradient:

One batch update costs O(np)

One stochastic update costs O(p)

Clearly, e.g., 10K stochastic steps are much more affordable.

Small example with n = 10, p = 2 to show the "classic picture" for batch versus stochastic methods:

[Figure: contour plot of the criterion with two iterate paths. Blue: batch steps, costing O(np) each. Red: stochastic steps, costing O(p) each.]

Rule of thumb for stochastic methods:

generally thrive far from optimum

generally struggle close to optimum
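A hedged sketch of the batch-versus-stochastic comparison above, on synthetic logistic regression data; the dimensions, seed, "true" coefficients, and the t_k = 1/√k schedule are our own illustrative choices:

```python
import numpy as np

# Synthetic logistic regression data (illustrative, not from the slides)
rng = np.random.default_rng(2)
n, p = 1000, 2
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, -2.0])            # assumed "true" coefficients
y = (rng.random(n) < 1 / (1 + np.exp(-X @ beta_true))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(beta):
    # (1/n) sum_i [ -y_i x_i^T beta + log(1 + exp(x_i^T beta)) ]
    z = X @ beta
    return np.mean(-y * z + np.log1p(np.exp(z)))

def batch_grad(beta):
    # One batch update reads all n rows of X: O(np) work
    return X.T @ (sigmoid(X @ beta) - y) / n

def stoch_grad(beta, i):
    # One stochastic update reads a single row of X: O(p) work
    return (sigmoid(X[i] @ beta) - y[i]) * X[i]

beta = np.zeros(p)
for k in range(1, 5001):
    i_k = rng.integers(n)
    beta = beta - (1.0 / np.sqrt(k)) * stoch_grad(beta, i_k)
```

5000 stochastic steps here read 5000 rows in total, which is what a mere 5 batch updates would cost; this is the "10K stochastic steps are much more affordable" point in miniature.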

Step sizes

Standard in SGD is to use

diminishing step sizes , e.g.,tk= 1=k Why not xed step sizes? Here's some intuition. Suppose we take cyclic rule for simplicity. Settk=tformupdates in a row, we get: x (k+m)=x(k)tmX i=1rfi(x(k+i1)) Meanwhile, full gradient with step sizemtwould give: x (k+1)=x(k)tmX i=1rfi(x(k))

The dierence here:tPm

i=1[rfi(x(k+i1)) rfi(x(k))], and if we holdtconstant, this dierence will not generally be going to zero 8
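The identity above can be verified numerically: m cyclic steps with fixed step t differ from one full-gradient step of size mt by exactly t Σ_i [∇f_i(x^(k+i−1)) − ∇f_i(x^(k))]. The small quadratic instance below is our own illustration:

```python
import numpy as np

# Illustrative setup: f_i(x) = 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(3)
m, n = 5, 2
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

def grad_fi(x, i):
    return (A[i] @ x - b[i]) * A[i]

t = 0.05
x0 = rng.standard_normal(n)

# m cyclic SGD steps with fixed step size t, recording each iterate
x = x0.copy()
iterates = [x0.copy()]
for i in range(m):
    x = x - t * grad_fi(x, i)
    iterates.append(x.copy())
x_cyclic = x

# One full-gradient step with step size m*t: since the gradient of
# (1/m) sum_i f_i is (1/m) sum_i grad f_i, this equals t * sum_i grad f_i(x0)
x_full = x0 - t * sum(grad_fi(x0, i) for i in range(m))

# The gap is exactly t * sum_i [grad f_i(x^(i-1)) - grad f_i(x^(0))]
gap = t * sum(grad_fi(iterates[i], i) - grad_fi(x0, i) for i in range(m))
assert np.allclose(x_full - x_cyclic, gap)
```

The gap is nonzero because the gradients move as the cyclic iterates move, and with constant t it has no reason to shrink, which is the intuition for diminishing step sizes.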

Convergence rates

Recall: for convex f, gradient descent with diminishing step sizes satisfies

f(x^(k)) − f* = O(1/√k)

When f is differentiable with Lipschitz gradient, we get, for suitable fixed step sizes,

f(x^(k)) − f* = O(1/k)

What about SGD? For convex f, SGD with diminishing step sizes satisfies¹

E[f(x^(k))] − f* = O(1/√k)

Unfortunately this does not improve when we further assume f has Lipschitz gradient.

¹ For example, Nemirovski et al. (2009), "Robust stochastic approximation approach to stochastic programming"

Even worse is the following discrepancy!

When f is strongly convex and has a Lipschitz gradient, gradient descent satisfies

f(x^(k)) − f* = O(γᵏ)

where 0 < γ < 1. But under the same conditions, SGD gives us

E[f(x^(k))] − f* = O(1/k)

So stochastic methods do not enjoy the linear convergence rate of gradient descent under strong convexity.
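This discrepancy is easy to see empirically. The sketch below runs both methods on a strongly convex least-squares problem of our own construction (dimensions, seed, step sizes, and iteration count are illustrative): gradient descent contracts geometrically, while SGD with t_k = 1/k is left with a much larger gap after the same number of iterations.

```python
import numpy as np

# Strongly convex least-squares instance (illustrative, not from the slides)
rng = np.random.default_rng(4)
m, n = 100, 5
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)   # exact minimizer

def f(x):
    return 0.5 * np.mean((A @ x - b) ** 2)

def full_grad(x):
    return A.T @ (A @ x - b) / m

x_gd = np.zeros(n)
x_sgd = np.zeros(n)
for k in range(1, 501):
    # Gradient descent with fixed step size: linear (geometric) convergence
    x_gd = x_gd - 0.1 * full_grad(x_gd)
    # SGD with diminishing step size t_k = 1/k: only O(1/k) in expectation
    i_k = rng.integers(m)
    x_sgd = x_sgd - (1.0 / k) * (A[i_k] @ x_sgd - b[i_k]) * A[i_k]

gap_gd = f(x_gd) - f(x_star)
gap_sgd = f(x_sgd) - f(x_star)
```

After 500 iterations the batch gap is at numerical-precision scale while the stochastic gap is still visibly nonzero, matching the O(γᵏ) versus O(1/k) rates above.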

quotesdbs_dbs21.pdfusesText_27