Adagrad, Adam and Online-to-Batch
Raunak Kumar
University of British Columbia
MLRG, 2017 Summer
July 4, 2017
Overview
1. Adagrad
   - Motivation
   - Algorithm
   - Regret Bounds
2. Adam
   - Motivation
   - Algorithm
   - Regret Bounds
   - Comparison with Adagrad
3. Online-To-Batch
   - Brief Overview
Adagrad
Motivation: Introduction

Let's begin by motivating Adagrad from two different viewpoints:
- Stochastic optimization (brief).
- Online convex optimization.
Motivation: Stochastic Optimization

We usually deal with sparse feature vectors.
Infrequently occurring features are highly informative.
Examples:
- "blue sky" vs. "orange sky".
- TF-IDF: word $w$ is important in document $d$ if it occurs frequently in $d$ but not in the entire corpus.
Motivation: Stochastic Optimization

Standard stochastic gradient algorithms follow a predetermined scheme.
- They ignore the characteristics of the observed data.
Idea: use a learning rate that depends on the frequency of the features.
- Frequent feature → low learning rate.
- Infrequent feature → high learning rate.
Motivation: FTL

In online convex optimization, Follow The Leader (FTL) chooses the next decision as the best one in hindsight:
$x_{t+1} = \arg\min_{x \in K} \sum_{s=1}^{t} f_s(x)$.
We've seen that FTL can suffer $\mathrm{regret}_T = \Omega(T)$ because it is "unstable".
Idea: Can we stabilize it by adding a regularizer?
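To see the instability concretely, here is a small, illustrative simulation (not from the slides): $K = [-1, 1]$, linear losses $f_t(x) = c_t x$, and an adversary that picks the sign of $c_t$ against the FTL choice. FTL keeps jumping between the endpoints and pays roughly 1 per round, while the best fixed decision pays almost nothing.

```python
def ftl_linear_example(T=100):
    """Instability of FTL on K = [-1, 1] with linear losses f_t(x) = c_t * x.

    c_1 = 0.5; afterwards the adversary gives c_t the same sign as the FTL
    decision x_t, so FTL flips between -1 and +1 and pays ~1 per round,
    while the best fixed point in hindsight pays at most ~0.
    """
    cum = 0.0                 # running sum of the coefficients seen so far
    coeffs = []
    ftl_loss = 0.0
    for t in range(1, T + 1):
        # FTL decision: argmin_{x in [-1,1]} cum * x (x_1 is arbitrary).
        x_t = 0.0 if cum == 0 else (-1.0 if cum > 0 else 1.0)
        c_t = 0.5 if t == 1 else (1.0 if x_t > 0 else -1.0)   # adversarial loss
        ftl_loss += c_t * x_t
        cum += c_t
        coeffs.append(c_t)
    best_fixed_loss = -abs(sum(coeffs))   # best fixed x is -sign(sum of c_t)
    print("FTL loss:", ftl_loss, " best fixed loss:", best_fixed_loss)

ftl_linear_example()
```

For $T = 100$ this prints an FTL loss of about 99 against a best fixed loss of about -0.5, i.e. regret linear in $T$.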
Motivation: Regularization Functions

We will consider regularization functions $R : K \to \mathbb{R}$.
$R$ is strongly convex, smooth and twice-differentiable.
Thus, $\nabla^2 R(x) \succ 0$.
Motivation: RFTL

Algorithm 1: Regularized Follow The Leader
1: Input: stepsize $\eta > 0$, regularization function $R$, and a convex, compact decision set $K$.
2: Let $x_1 = \arg\min_{x \in K} \{R(x)\}$.
3: for $t = 1$ to $T$ do
4:   Predict $x_t$, observe $f_t$ and let $\nabla_t = \nabla f_t(x_t)$.
5:   Update
     $x_{t+1} = \arg\min_{x \in K} \big\{ \eta \sum_{s=1}^{t} \nabla_s^\top x + R(x) \big\}$.
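As a concrete, hypothetical instance (not from the slides), here is a short sketch of the RFTL loop for the Euclidean regularizer $R(x) = \frac{1}{2}\|x\|_2^2$ with $K$ the unit $\ell_2$ ball, where the argmin in step 5 has a closed form: project $-\eta \sum_{s \le t} \nabla_s$ onto $K$. The `gradient_oracle` callable is an illustrative placeholder.

```python
import numpy as np

def project_unit_ball(x):
    """Euclidean projection onto the unit l2 ball."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

def rftl_euclidean(gradient_oracle, dim, T, eta=0.1):
    """RFTL with R(x) = 0.5*||x||_2^2 and K the unit l2 ball.

    gradient_oracle(t, x) is assumed to return the (sub)gradient of f_t at x.
    For this regularizer, step 5 reduces to x_{t+1} = Pi_K(-eta * sum_s grad_s).
    """
    x = np.zeros(dim)               # x_1 = argmin_{x in K} R(x) = 0
    grad_sum = np.zeros(dim)        # running sum of observed gradients
    iterates = [x.copy()]
    for t in range(T):
        g = gradient_oracle(t, x)   # predict x_t, observe f_t, take gradient
        grad_sum += g
        x = project_unit_ball(-eta * grad_sum)
        iterates.append(x.copy())
    return iterates
```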
Motivation: RFTL

Theorem (RFTL Regret)
1. For all $u \in K$:
   $\mathrm{regret}_T \le 2\eta \sum_{t=1}^{T} \|\nabla_t\|_t^2 + \frac{R(u) - R(x_1)}{\eta}$.
2. If $\|\nabla_t\|_t \le G_R$ for all $t$, optimize $\eta$ to get
   $\mathrm{regret}_T \le 2 D_R G_R \sqrt{2T}$.
3. Can also write
   $\mathrm{regret}_T \le \max_{u \in K} \sqrt{2 \sum_t \|\nabla_t\|_t^2 \, B_R(u \| x_t)}$.
Motivation: Online Mirror Descent

Recall OMD from 3 weeks ago:

Algorithm 2: Online Mirror Descent (agile version)
1: Input: stepsize $\eta > 0$, regularization function $R$.
2: Let $y_1$ be such that $\nabla R(y_1) = 0$.
3: Let $x_1 = \arg\min_{x \in K} B_R(x \| y_1)$.
4: for $t = 1$ to $T$ do
5:   Predict $x_t$, observe $f_t$ and let $\nabla_t = \nabla f_t(x_t)$.
6:   Update $y_{t+1}$ via $\nabla R(y_{t+1}) = \nabla R(x_t) - \eta \nabla_t$.
7:   Project according to $B_R$:
     $x_{t+1} = \arg\min_{x \in K} B_R(x \| y_{t+1})$.
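A minimal sketch of the agile OMD loop; the regularizer enters only through $\nabla R$, its inverse map, and the Bregman projection, all supplied by the caller (the names and signatures here are illustrative, not from the slides). Plugging in $R(x) = \frac{1}{2}\|x - x_0\|_2^2$, where $\nabla R(x) = x - x_0$, the inverse map is $z \mapsto z + x_0$, and the Bregman projection is the Euclidean one, recovers OGD, as the next slides show.

```python
def online_mirror_descent(gradient_oracle, grad_R, grad_R_inv, bregman_project,
                          y1, T, eta=0.1):
    """Agile OMD with a user-supplied regularizer R.

    grad_R(x)          : gradient of R at x
    grad_R_inv(z)      : the point y with grad_R(y) = z
    bregman_project(y) : argmin_{x in K} B_R(x || y)
    y1                 : a point with grad_R(y1) = 0
    """
    x = bregman_project(y1)                 # x_1 = argmin_{x in K} B_R(x || y_1)
    iterates = [x]
    for t in range(T):
        g = gradient_oracle(t, x)           # predict x_t, observe f_t, take gradient
        y = grad_R_inv(grad_R(x) - eta * g) # mirror step: grad_R(y_{t+1}) = grad_R(x_t) - eta*grad_t
        x = bregman_project(y)              # project back onto K w.r.t. B_R
        iterates.append(x)
    return iterates
```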
Motivation: Online Mirror Descent

Suppose we choose $R(x) = \frac{1}{2}\|x - x_0\|_2^2$ for an arbitrary $x_0 \in K$. What do we get?
Online Gradient Descent!
Regret bound:
$\mathrm{regret}_T \le \frac{1}{\eta} D_R^2 + 2\eta \sum_t \|\nabla_t\|_t^2 \le 2 G D \sqrt{T}$.
Motivation: Online Mirror Descent

$R(x) = \frac{1}{2}\|x - x_0\|_2^2$, so $\nabla R(x) = x - x_0$.
Update:
$y_t = x_{t-1} - \eta \nabla_{t-1}$, $\quad x_t = \Pi_K(y_t)$.
Projection with respect to $B_R$ (exercises 2, 3):
$x_{t+1} = \arg\min_{x \in K} B_R(x \| y_{t+1}) = \arg\min_{x \in K} \frac{1}{2}(x - y_{t+1})^\top \nabla^2 R(z)(x - y_{t+1}) = \arg\min_{x \in K} \frac{1}{2}\|x - y_{t+1}\|_2^2$.
Thus, OMD is equivalent to OGD with this choice of $R$.
Motivation: Optimal Regularization

What have we seen so far?
- Using a regularization function to make FTL stable.
- OGD is a special case of OMD with $R(x) = \frac{1}{2}\|x - x_0\|_2^2$.
- The regularization function $R$ affects our regret bound.
Question: Which regularization function should we choose to minimize regret?
It depends on $K$ and the cost functions, so we'll learn the optimal regularizer online!
Algorithm: Preliminaries

Goal: learn a regularizer that adapts to the sequence of cost functions; in some sense, the optimal regularizer to use in hindsight.
We will consider all strongly convex regularizers with a fixed and bounded Hessian in $\mathcal{H}$:
$\forall x \in K: \nabla^2 R(x) = \nabla^2 \in \mathcal{H}$,
where
$\mathcal{H} = \{X \in \mathbb{R}^{n \times n} : \mathrm{Tr}(X) \le 1, \; X \succeq 0\}$.
Algorithm: Adagrad

Algorithm 3: AdaGrad, Adaptive (sub)Gradient Method (full-matrix version)
1: Input: stepsize $\eta > 0$, $\delta > 0$, $x_1 \in K$.
2: Initialize: $S_0 = G_0 = 0$.
3: for $t = 1$ to $T$ do
4:   Predict $x_t$, observe loss $f_t$ and let $\nabla_t = \nabla f_t(x_t)$.
5:   Update
     $S_t = S_{t-1} + \nabla_t \nabla_t^\top$
     $G_t = S_t^{1/2} + \delta I$
     $y_{t+1} = x_t - \eta G_t^{-1} \nabla_t$
     $x_{t+1} = \arg\min_{x \in K} \|x - y_{t+1}\|_{G_t}^2$
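A minimal sketch of the full-matrix update, assuming an unconstrained problem so the final projection is the identity; the `gradient_oracle` and `project` callables are illustrative placeholders, not part of the slides.

```python
import numpy as np
from scipy.linalg import sqrtm

def adagrad_full_matrix(gradient_oracle, dim, T, eta=1.0, delta=1e-8,
                        project=lambda y, G: y):
    """Full-matrix AdaGrad (Algorithm 3); `project` should implement
    argmin_{x in K} ||x - y||_{G_t}^2 when K is a proper subset of R^n."""
    x = np.zeros(dim)                          # x_1, assumed to lie in K
    S = np.zeros((dim, dim))                   # S_0 = 0
    iterates = [x.copy()]
    for t in range(T):
        g = gradient_oracle(t, x)              # predict x_t, observe f_t, take gradient
        S += np.outer(g, g)                    # S_t = S_{t-1} + grad_t grad_t^T
        G = np.real(sqrtm(S)) + delta * np.eye(dim)   # G_t = S_t^{1/2} + delta*I
        y = x - eta * np.linalg.solve(G, g)    # y_{t+1} = x_t - eta * G_t^{-1} grad_t
        x = project(y, G)                      # Mahalanobis projection onto K
        iterates.append(x.copy())
    return iterates
```

Note that maintaining and inverting $G_t$ costs $O(n^2)$ memory and up to $O(n^3)$ time per step, which is one reason the diagonal version that follows is the one usually used in practice.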
Algorithm: Adagrad

Algorithm 4: AdaGrad, Adaptive (sub)Gradient Method (diagonal version)
1: Input: stepsize $\eta > 0$, $\delta > 0$, $x_1 \in K$.
2: Initialize: $S_{1:0} = 0$, $G_0 = 0$.
3: for $t = 1$ to $T$ do
4:   Predict $x_t$, observe loss $f_t$ and let $\nabla_t = \nabla f_t(x_t)$.
5:   Update
     $S_{1:t} = [S_{1:t-1} \; \nabla_t]$   (append $\nabla_t$ as a new column)
     $H_{t,i} = \|S_{1:t,i}\|_2$   (norm of the $i$-th row of $S_{1:t}$)
     $G_t = \mathrm{diag}(H_t) + \delta I$
     $y_{t+1} = x_t - \eta G_t^{-1} \nabla_t$
     $x_{t+1} = \arg\min_{x \in K} \|x - y_{t+1}\|_{G_t}^2$
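A corresponding sketch of the diagonal version, again assuming an unconstrained problem; rather than stacking gradients into $S_{1:t}$, it keeps the running sum of squared gradient coordinates, which yields the same $H_{t,i} = \|S_{1:t,i}\|_2$. The callables are illustrative placeholders.

```python
import numpy as np

def adagrad_diagonal(gradient_oracle, dim, T, eta=1.0, delta=1e-8,
                     project=lambda y, h: y):
    """Diagonal AdaGrad (Algorithm 4); `project` should implement the
    ||.||_{G_t}-projection onto K when the problem is constrained."""
    x = np.zeros(dim)                    # x_1, assumed to lie in K
    sq_sum = np.zeros(dim)               # per-coordinate sum of squared gradients
    iterates = [x.copy()]
    for t in range(T):
        g = gradient_oracle(t, x)        # predict x_t, observe f_t, take gradient
        sq_sum += g ** 2
        h = np.sqrt(sq_sum) + delta      # diagonal of G_t = diag(H_t) + delta*I
        y = x - eta * g / h              # y_{t+1} = x_t - eta * G_t^{-1} grad_t
        x = project(y, h)
        iterates.append(x.copy())
    return iterates
```

Coordinates with frequently large gradients accumulate a large $h$ and therefore get a small effective stepsize, while rarely active coordinates keep a large one: exactly the frequency-dependent learning rate from the motivation.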
Regret Bounds: Regret

We will show the following regret bound.

Theorem
Let $\{x_t\}$ be the sequence of decisions defined by AdaGrad.
Let $D = \max_{u \in K} \|u - x_1\|_2$. Choose $\delta = \frac{1}{2nD}$ and $\eta = D$.
Then, for any $x^\star \in K$,
$\mathrm{regret}_T(\mathrm{AdaGrad}) \le 2D \sqrt{\min_{H \in \mathcal{H}} \sum_t \|\nabla_t\|_H^2} + 1$.
Regret Bounds: Optimal Regularizer

Proposition (Exercise 11)
Let $A \succeq 0$. Consider the following problem:
  minimize $\mathrm{Tr}(X^{-1} A)$
  subject to $X \succeq 0$, $\mathrm{Tr}(X) \le 1$.
Then the minimizer is $X^\star = A^{1/2} / \mathrm{Tr}(A^{1/2})$ and the minimum objective value is $\mathrm{Tr}^2(A^{1/2})$.
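The closed form is easy to sanity-check numerically; the snippet below (illustrative, not from the slides) compares the claimed minimizer against random feasible candidates for a random positive definite $A$.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + np.eye(n)                     # a random positive definite A

A_sqrt = np.real(sqrtm(A))
X_star = A_sqrt / np.trace(A_sqrt)          # claimed minimizer, Tr(X_star) = 1
value_star = np.trace(np.linalg.solve(X_star, A))
print(value_star, np.trace(A_sqrt) ** 2)    # both equal Tr(A^{1/2})^2

for _ in range(1000):                       # random feasible X never do better
    C = rng.standard_normal((n, n))
    X = C @ C.T
    X /= np.trace(X)                        # X >= 0, Tr(X) = 1
    assert np.trace(np.linalg.solve(X, A)) >= value_star - 1e-8
```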
Regret Bounds: Optimal Regularizer

Corollary
1. $\sqrt{\min_{H \in \mathcal{H}} \sum_t \|\nabla_t\|_H^2} = \mathrm{Tr}(G_T - \delta I) = \mathrm{Tr}(G_T) - \delta n$.
2. Optimal regularizer: $H^\star = (G_T - \delta I) / \mathrm{Tr}(G_T - \delta I)$.
Regret Bounds: Optimal Regularizer

Proof.
$\|\nabla_t\|_H^2 = \nabla_t^\top H^{-1} \nabla_t$, so
$\sum_t \|\nabla_t\|_H^2 = \sum_t \nabla_t^\top H^{-1} \nabla_t = \sum_t \mathrm{Tr}(\nabla_t^\top H^{-1} \nabla_t) = \sum_t \mathrm{Tr}(H^{-1} \nabla_t \nabla_t^\top) = \mathrm{Tr}\big(H^{-1} \sum_t \nabla_t \nabla_t^\top\big) = \mathrm{Tr}\big(H^{-1} (G_T - \delta I)^2\big)$.
Minimizing over $H \in \mathcal{H}$ and applying the proposition with $A = (G_T - \delta I)^2$ gives the corollary.