Adagrad, Adam and Online-to-Batch
Raunak Kumar
University of British Columbia
MLRG, 2017 Summer
July 4, 2017
Overview
1. Adagrad
   - Motivation
   - Algorithm
   - Regret Bounds
2. Adam
   - Motivation
   - Algorithm
   - Regret Bounds
   - Comparison with Adagrad
3. Online-To-Batch
   - Brief Overview
Adagrad
Motivation: Introduction

Let's begin by motivating Adagrad from two different viewpoints:
- Stochastic optimization (brief).
- Online convex optimization.
Motivation: Stochastic Optimization

We usually deal with sparse feature vectors.
Infrequently occurring features are highly informative.
Examples:
- "blue sky" vs. "orange sky".
- TF-IDF: word $w$ is important in document $d$ if it occurs frequently in $d$ but not in the entire corpus.
Motivation: Stochastic Optimization

Standard stochastic gradient algorithms follow a predetermined scheme.
- They ignore the characteristics of the observed data.
Idea: use a learning rate that depends on the frequency of the features.
- Frequent feature → low learning rate.
- Infrequent feature → high learning rate.
Motivation: FTL

In online convex optimization, Follow The Leader (FTL) chooses the next decision as the best one in hindsight:
$x_{t+1} = \arg\min_{x \in K} \sum_{s=1}^{t} f_s(x)$.
We've seen that FTL can suffer $\mathrm{regret}_T = \Omega(T)$ because it is "unstable".
Idea: Can we stabilize it by adding a regularizer?
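To see the instability concretely, here is a small, illustrative simulation (not from the slides): $K = [-1, 1]$, linear losses $f_t(x) = c_t x$, and an adversary that picks the sign of $c_t$ against the FTL choice. FTL keeps jumping between the endpoints and pays roughly 1 per round, while the best fixed decision pays almost nothing.

```python
def ftl_linear_example(T=100):
    """Instability of FTL on K = [-1, 1] with linear losses f_t(x) = c_t * x.

    c_1 = 0.5; afterwards the adversary gives c_t the same sign as the FTL
    decision x_t, so FTL flips between -1 and +1 and pays ~1 per round,
    while the best fixed point in hindsight pays at most ~0.
    """
    cum = 0.0                 # running sum of the coefficients seen so far
    coeffs = []
    ftl_loss = 0.0
    for t in range(1, T + 1):
        # FTL decision: argmin_{x in [-1,1]} cum * x (x_1 is arbitrary).
        x_t = 0.0 if cum == 0 else (-1.0 if cum > 0 else 1.0)
        c_t = 0.5 if t == 1 else (1.0 if x_t > 0 else -1.0)   # adversarial loss
        ftl_loss += c_t * x_t
        cum += c_t
        coeffs.append(c_t)
    best_fixed_loss = -abs(sum(coeffs))   # best fixed x is -sign(sum of c_t)
    print("FTL loss:", ftl_loss, " best fixed loss:", best_fixed_loss)

ftl_linear_example()
```

For $T = 100$ this prints an FTL loss of about 99 against a best fixed loss of about -0.5, i.e. regret linear in $T$.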
Motivation: Regularization Functions

We will consider regularization functions $R : K \to \mathbb{R}$.
$R$ is strongly convex, smooth and twice-differentiable.
Thus, $\nabla^2 R(x) \succ 0$.
Motivation: RFTL

Algorithm 1: Regularized Follow The Leader
1: Input: stepsize $\eta > 0$, regularization function $R$, and a convex, compact decision set $K$.
2: Let $x_1 = \arg\min_{x \in K} \{R(x)\}$.
3: for $t = 1$ to $T$ do
4:   Predict $x_t$, observe $f_t$ and let $\nabla_t = \nabla f_t(x_t)$.
5:   Update
     $x_{t+1} = \arg\min_{x \in K} \big\{ \eta \sum_{s=1}^{t} \nabla_s^\top x + R(x) \big\}$.
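As a concrete, hypothetical instance (not from the slides), here is a short sketch of the RFTL loop for the Euclidean regularizer $R(x) = \frac{1}{2}\|x\|_2^2$ with $K$ the unit $\ell_2$ ball, where the argmin in step 5 has a closed form: project $-\eta \sum_{s \le t} \nabla_s$ onto $K$. The `gradient_oracle` callable is an illustrative placeholder.

```python
import numpy as np

def project_unit_ball(x):
    """Euclidean projection onto the unit l2 ball."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm

def rftl_euclidean(gradient_oracle, dim, T, eta=0.1):
    """RFTL with R(x) = 0.5*||x||_2^2 and K the unit l2 ball.

    gradient_oracle(t, x) is assumed to return the (sub)gradient of f_t at x.
    For this regularizer, step 5 reduces to x_{t+1} = Pi_K(-eta * sum_s grad_s).
    """
    x = np.zeros(dim)               # x_1 = argmin_{x in K} R(x) = 0
    grad_sum = np.zeros(dim)        # running sum of observed gradients
    iterates = [x.copy()]
    for t in range(T):
        g = gradient_oracle(t, x)   # predict x_t, observe f_t, take gradient
        grad_sum += g
        x = project_unit_ball(-eta * grad_sum)
        iterates.append(x.copy())
    return iterates
```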
Motivation: RFTL

Theorem (RFTL Regret)
1. For all $u \in K$:
   $\mathrm{regret}_T \le 2\eta \sum_{t=1}^{T} \|\nabla_t\|_t^2 + \frac{R(u) - R(x_1)}{\eta}$.
2. If $\|\nabla_t\|_t \le G_R$ for all $t$, optimize $\eta$ to get
   $\mathrm{regret}_T \le 2 D_R G_R \sqrt{2T}$.
3. Can also write
   $\mathrm{regret}_T \le \max_{u \in K} \sqrt{2 \sum_t \|\nabla_t\|_t^2 \, B_R(u \| x_t)}$.
Motivation: Online Mirror Descent

Recall OMD from 3 weeks ago:

Algorithm 2: Online Mirror Descent (agile version)
1: Input: stepsize $\eta > 0$, regularization function $R$.
2: Let $y_1$ be such that $\nabla R(y_1) = 0$.
3: Let $x_1 = \arg\min_{x \in K} B_R(x \| y_1)$.
4: for $t = 1$ to $T$ do
5:   Predict $x_t$, observe $f_t$ and let $\nabla_t = \nabla f_t(x_t)$.
6:   Update $y_{t+1}$ via $\nabla R(y_{t+1}) = \nabla R(x_t) - \eta \nabla_t$.
7:   Project according to $B_R$:
     $x_{t+1} = \arg\min_{x \in K} B_R(x \| y_{t+1})$.
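A minimal sketch of the agile OMD loop; the regularizer enters only through $\nabla R$, its inverse map, and the Bregman projection, all supplied by the caller (the names and signatures here are illustrative, not from the slides). Plugging in $R(x) = \frac{1}{2}\|x - x_0\|_2^2$, where $\nabla R(x) = x - x_0$, the inverse map is $z \mapsto z + x_0$, and the Bregman projection is the Euclidean one, recovers OGD, as the next slides show.

```python
def online_mirror_descent(gradient_oracle, grad_R, grad_R_inv, bregman_project,
                          y1, T, eta=0.1):
    """Agile OMD with a user-supplied regularizer R.

    grad_R(x)          : gradient of R at x
    grad_R_inv(z)      : the point y with grad_R(y) = z
    bregman_project(y) : argmin_{x in K} B_R(x || y)
    y1                 : a point with grad_R(y1) = 0
    """
    x = bregman_project(y1)                 # x_1 = argmin_{x in K} B_R(x || y_1)
    iterates = [x]
    for t in range(T):
        g = gradient_oracle(t, x)           # predict x_t, observe f_t, take gradient
        y = grad_R_inv(grad_R(x) - eta * g) # mirror step: grad_R(y_{t+1}) = grad_R(x_t) - eta*grad_t
        x = bregman_project(y)              # project back onto K w.r.t. B_R
        iterates.append(x)
    return iterates
```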
Motivation: Online Mirror Descent

Suppose we choose $R(x) = \frac{1}{2}\|x - x_0\|_2^2$ for an arbitrary $x_0 \in K$. What do we get?
Online Gradient Descent!
Regret bound:
$\mathrm{regret}_T \le \frac{1}{\eta} D_R^2 + 2\eta \sum_t \|\nabla_t\|_t^2 \le 2 G D \sqrt{T}$.
Motivation: Online Mirror Descent

$R(x) = \frac{1}{2}\|x - x_0\|_2^2$, so $\nabla R(x) = x - x_0$.
Update:
$y_t = x_{t-1} - \eta \nabla_{t-1}$, $\quad x_t = \Pi_K(y_t)$.
Projection with respect to $B_R$ (exercises 2, 3):
$x_{t+1} = \arg\min_{x \in K} B_R(x \| y_{t+1}) = \arg\min_{x \in K} \frac{1}{2}(x - y_{t+1})^\top \nabla^2 R(z)(x - y_{t+1}) = \arg\min_{x \in K} \frac{1}{2}\|x - y_{t+1}\|_2^2$.
Thus, OMD is equivalent to OGD with this choice of $R$.
Motivation: Optimal Regularization

What have we seen so far?
- Using a regularization function to make FTL stable.
- OGD is a special case of OMD with $R(x) = \frac{1}{2}\|x - x_0\|_2^2$.
- The regularization function $R$ affects our regret bound.
Question: Which regularization function should we choose to minimize regret?
It depends on $K$ and the cost functions, so we'll learn the optimal regularizer online!
Algorithm: Preliminaries

Goal: learn a regularizer that adapts to the sequence of cost functions; in some sense, the optimal regularizer to use in hindsight.
We will consider all strongly convex regularizers with a fixed and bounded Hessian in $\mathcal{H}$:
$\forall x \in K: \nabla^2 R(x) = \nabla^2 \in \mathcal{H}$,
where
$\mathcal{H} = \{X \in \mathbb{R}^{n \times n} : \mathrm{Tr}(X) \le 1, \; X \succeq 0\}$.
Algorithm: Adagrad

Algorithm 3: AdaGrad, Adaptive (sub)Gradient Method (full-matrix version)
1: Input: stepsize $\eta > 0$, $\delta > 0$, $x_1 \in K$.
2: Initialize: $S_0 = G_0 = 0$.
3: for $t = 1$ to $T$ do
4:   Predict $x_t$, observe loss $f_t$ and let $\nabla_t = \nabla f_t(x_t)$.
5:   Update
     $S_t = S_{t-1} + \nabla_t \nabla_t^\top$
     $G_t = S_t^{1/2} + \delta I$
     $y_{t+1} = x_t - \eta G_t^{-1} \nabla_t$
     $x_{t+1} = \arg\min_{x \in K} \|x - y_{t+1}\|_{G_t}^2$
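A minimal sketch of the full-matrix update, assuming an unconstrained problem so the final projection is the identity; the `gradient_oracle` and `project` callables are illustrative placeholders, not part of the slides.

```python
import numpy as np
from scipy.linalg import sqrtm

def adagrad_full_matrix(gradient_oracle, dim, T, eta=1.0, delta=1e-8,
                        project=lambda y, G: y):
    """Full-matrix AdaGrad (Algorithm 3); `project` should implement
    argmin_{x in K} ||x - y||_{G_t}^2 when K is a proper subset of R^n."""
    x = np.zeros(dim)                          # x_1, assumed to lie in K
    S = np.zeros((dim, dim))                   # S_0 = 0
    iterates = [x.copy()]
    for t in range(T):
        g = gradient_oracle(t, x)              # predict x_t, observe f_t, take gradient
        S += np.outer(g, g)                    # S_t = S_{t-1} + grad_t grad_t^T
        G = np.real(sqrtm(S)) + delta * np.eye(dim)   # G_t = S_t^{1/2} + delta*I
        y = x - eta * np.linalg.solve(G, g)    # y_{t+1} = x_t - eta * G_t^{-1} grad_t
        x = project(y, G)                      # Mahalanobis projection onto K
        iterates.append(x.copy())
    return iterates
```

Note that maintaining and inverting $G_t$ costs $O(n^2)$ memory and up to $O(n^3)$ time per step, which is one reason the diagonal version that follows is the one usually used in practice.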
Algorithm: Adagrad

Algorithm 4: AdaGrad, Adaptive (sub)Gradient Method (diagonal version)
1: Input: stepsize $\eta > 0$, $\delta > 0$, $x_1 \in K$.
2: Initialize: $S_{1:0} = 0$, $G_0 = 0$.
3: for $t = 1$ to $T$ do
4:   Predict $x_t$, observe loss $f_t$ and let $\nabla_t = \nabla f_t(x_t)$.
5:   Update
     $S_{1:t} = [S_{1:t-1} \; \nabla_t]$   (append $\nabla_t$ as a new column)
     $H_{t,i} = \|S_{1:t,i}\|_2$   (norm of the $i$-th row of $S_{1:t}$)
     $G_t = \mathrm{diag}(H_t) + \delta I$
     $y_{t+1} = x_t - \eta G_t^{-1} \nabla_t$
     $x_{t+1} = \arg\min_{x \in K} \|x - y_{t+1}\|_{G_t}^2$
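A corresponding sketch of the diagonal version, again assuming an unconstrained problem; rather than stacking gradients into $S_{1:t}$, it keeps the running sum of squared gradient coordinates, which yields the same $H_{t,i} = \|S_{1:t,i}\|_2$. The callables are illustrative placeholders.

```python
import numpy as np

def adagrad_diagonal(gradient_oracle, dim, T, eta=1.0, delta=1e-8,
                     project=lambda y, h: y):
    """Diagonal AdaGrad (Algorithm 4); `project` should implement the
    ||.||_{G_t}-projection onto K when the problem is constrained."""
    x = np.zeros(dim)                    # x_1, assumed to lie in K
    sq_sum = np.zeros(dim)               # per-coordinate sum of squared gradients
    iterates = [x.copy()]
    for t in range(T):
        g = gradient_oracle(t, x)        # predict x_t, observe f_t, take gradient
        sq_sum += g ** 2
        h = np.sqrt(sq_sum) + delta      # diagonal of G_t = diag(H_t) + delta*I
        y = x - eta * g / h              # y_{t+1} = x_t - eta * G_t^{-1} grad_t
        x = project(y, h)
        iterates.append(x.copy())
    return iterates
```

Coordinates with frequently large gradients accumulate a large $h$ and therefore get a small effective stepsize, while rarely active coordinates keep a large one: exactly the frequency-dependent learning rate from the motivation.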
Regret Bounds: Regret

We will show the following regret bound.

Theorem
Let $\{x_t\}$ be the sequence of decisions defined by AdaGrad.
Let $D = \max_{u \in K} \|u - x_1\|_2$. Choose $\delta = \frac{1}{2nD}$ and $\eta = D$.
Then, for any $x^\star \in K$,
$\mathrm{regret}_T(\mathrm{AdaGrad}) \le 2D \sqrt{\min_{H \in \mathcal{H}} \sum_t \|\nabla_t\|_H^2} + 1$.
Regret Bounds: Optimal Regularizer

Proposition (Exercise 11)
Let $A \succeq 0$. Consider the following problem:
  minimize $\mathrm{Tr}(X^{-1} A)$
  subject to $X \succeq 0$, $\mathrm{Tr}(X) \le 1$.
Then the minimizer is $X^\star = A^{1/2} / \mathrm{Tr}(A^{1/2})$ and the minimum objective value is $\mathrm{Tr}^2(A^{1/2})$.
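The closed form is easy to sanity-check numerically; the snippet below (illustrative, not from the slides) compares the claimed minimizer against random feasible candidates for a random positive definite $A$.

```python
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + np.eye(n)                     # a random positive definite A

A_sqrt = np.real(sqrtm(A))
X_star = A_sqrt / np.trace(A_sqrt)          # claimed minimizer, Tr(X_star) = 1
value_star = np.trace(np.linalg.solve(X_star, A))
print(value_star, np.trace(A_sqrt) ** 2)    # both equal Tr(A^{1/2})^2

for _ in range(1000):                       # random feasible X never do better
    C = rng.standard_normal((n, n))
    X = C @ C.T
    X /= np.trace(X)                        # X >= 0, Tr(X) = 1
    assert np.trace(np.linalg.solve(X, A)) >= value_star - 1e-8
```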
Regret Bounds: Optimal Regularizer

Corollary
1. $\sqrt{\min_{H \in \mathcal{H}} \sum_t \|\nabla_t\|_H^2} = \mathrm{Tr}(G_T - \delta I) = \mathrm{Tr}(G_T) - \delta n$.
2. Optimal regularizer: $H^\star = (G_T - \delta I) / \mathrm{Tr}(G_T - \delta I)$.
Regret Bounds: Optimal Regularizer

Proof.
$\|\nabla_t\|_H^2 = \nabla_t^\top H^{-1} \nabla_t$, so
$\sum_t \|\nabla_t\|_H^2 = \sum_t \nabla_t^\top H^{-1} \nabla_t = \sum_t \mathrm{Tr}(\nabla_t^\top H^{-1} \nabla_t) = \sum_t \mathrm{Tr}(H^{-1} \nabla_t \nabla_t^\top) = \mathrm{Tr}\big(H^{-1} \sum_t \nabla_t \nabla_t^\top\big) = \mathrm{Tr}\big(H^{-1} (G_T - \delta I)^2\big)$.
Minimizing over $H \in \mathcal{H}$ and applying the proposition with $A = (G_T - \delta I)^2$ gives the corollary.