
Discrete Variational Calculus for Accelerated Optimization

Cédric M. Campos (cedric.mcampos@urjc.es)

Departamento de Matemática Aplicada, Ciencia e Ingeniería de los Materiales y Tecnología Electrónica

Universidad Rey Juan Carlos

Calle Tulipán s/n, 28933 Móstoles, Spain

Alejandro Mahillo (almahill@unizar.es)

Departamento de Matemáticas

Instituto Universitario de Matemáticas y Aplicaciones

Universidad de Zaragoza

C. de Pedro Cerbuna 12, 50009 Zaragoza, Spain

David Martín de Diego (david.martin@icmat.es)

Instituto de Ciencias Matemáticas (CSIC-UAM-UC3M-UCM)

Calle Nicolás Cabrera 13-15, 28049 Madrid, Spain

Abstract

Many of the new developments in machine learning are connected with gradient-based optimization methods. Recently, these methods have been studied using a variational perspective (Betancourt et al., 2018). This has opened up the possibility of introducing variational and symplectic methods using geometric integration. In particular, in this paper, we introduce variational integrators (Marsden and West, 2001) which allow us to derive different methods for optimization. Using both Hamilton's and Lagrange-d'Alembert's principles, we derive two families of optimization methods in one-to-one correspondence that generalize Polyak's heavy ball (Polyak, 1964) and Nesterov's accelerated gradient (Nesterov, 1983), the second of which mimics the behavior of the latter, reducing the oscillations of classical momentum methods. However, since the systems considered are explicitly time-dependent, the preservation of symplecticity of autonomous systems occurs here solely on the fibers.

Several experiments exemplify the result.

Keywords: Polyak's heavy ball, Nesterov's accelerated gradient, momentum methods, variational integrators, Bregman Lagrangians

©2022 Cédric M. Campos, Alejandro Mahillo and David Martín de Diego. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. arXiv:2106.02700v3 [math.OC] 20 Nov 2022.

1. Introduction

Much of the literature on machine learning and data analysis is connected with gradient-based optimization methods (see Polak, 1997; Nesterov, 2018; and references therein). The computations often involve large data and parameter sets and then not only is computational efficiency a crucial point, but the optimization theory also plays a fundamental role. A typical optimization problem is:

$\operatorname{argmin}\, f(x), \quad x \in Q, \qquad (1.1)$

where we assume that $Q$ is a convex set in $\mathbb{R}^n$ and $f$ is a continuously differentiable convex function with Lipschitzian gradient. In this case one of the most extended algorithms for (1.1) is Nesterov's accelerated gradient (Nesterov, 1983; Su et al., 2016), which may take the following form:

$y_{k+1} = x_k - \eta\,\nabla f(x_k)$
$x_{k+1} = y_{k+1} + \tfrac{k}{k+3}\,(y_{k+1} - y_k)$

starting from an initial condition $x_0$ (see more details in Sections 2 and 7). An important observation was made by Su et al. (2016) showing that the continuous limit of Nesterov's method is a time-dependent second order differential equation. Moreover, Wibisono et al. (2016) show that this system of differential equations has a variational origin (see also Wibisono, 2016). In particular, they take as point of departure this variational approach that captures acceleration in continuous time, considering a particular type of time-dependent Lagrangian functions, called Bregman Lagrangians (see Section 3).

In a recent paper, Betancourt et al. (2018) introduce symplectic (and presymplectic) integrators for the differential equations associated with accelerated optimization methods (see Sanz-Serna and Calvo, 1994; Hairer et al., 2010; Blanes and Casas, 2016, for an introduction to symplectic integration). They use the Hamiltonian formalism since it is possible to extend the phase space to turn the system into a time-independent Hamiltonian system and apply there standard symplectic techniques (see Marthinsen and Owren, 2016; Celledoni et al., 2020). For recent improvements of this approach using adaptive Hamiltonian variational integrators, see Duruisseaux et al. (2021).

In our paper we set an alternative route: the idea is to use variational integrators adapted to an explicitly time-dependent framework and external forces (see Marsden and West, 2001, and references therein) to derive a whole family of optimization methods. The theory of discrete variational mechanics has reached maturity in recent years by combining results of differential geometry, classical mechanics and numerical integration. Roughly speaking, the continuous Lagrangian $L : TQ \to \mathbb{R}$ is substituted by a discrete Lagrangian $L_d : Q \times Q \to \mathbb{R}$. Observe that, by replacing the standard velocity phase space $TQ$ with $Q \times Q$, we are discretizing a velocity vector by two (in principle) close points. With the unique information of the discrete Lagrangian we can define the discrete action sum and, applying standard variational techniques, we derive a system of second order difference equations known as discrete Euler-Lagrange equations. The numerical order of the methods is obtained using variational error analysis (see Marsden and West, 2001; Patrick and Cuell, 2009). Moreover, it is possible to derive a discrete version of Noether's theorem relating the symmetries of the discrete Lagrangian with conserved quantities. The derived methods are automatically symplectic and, perhaps more importantly, easily adapted to other situations such as, for instance, Lie group integrators, time-dependent Lagrangians, forced systems, optimal control theory, holonomic and nonholonomic mechanics, field theories, etc.

The Lagrangian functions depicted in Section 3, Bregman Lagrangians, are those explicitly time-dependent Lagrangians that typically arise in accelerated optimization. The geometry for time-dependent systems is different from symplectic geometry; in particular, the phase space is odd dimensional. In this case, an appropriate geometric framework is given by cosymplectic geometry (see Libermann, 1959; Cappelletti-Montano et al., 2013; and references therein). In Section 4 we introduce the cosymplectic structure associated to a time-dependent Hamiltonian system (induced by a time-dependent Lagrangian) and also an interesting symplectic preservation property associated to the restriction of the Hamiltonian flow to the fibres of the projection onto the time variable (Theorem 1). Having in mind this geometrical framework, we introduce in Section 5 discrete variational mechanics for time-dependent Lagrangians with fixed time step (compare with Marsden and West, 2001, for variable time step). Moreover, we recover the symplectic character on fibres of the continuous Hamiltonian flow. We show the possibility to construct variational integrators using techniques similar to those developed for the autonomous case that, in some interesting cases, are in addition explicit and, consequently, reduce the computational cost. An example of such methods is the second-order difference equation

$x_{k+1} = x_k - \eta_k\,\nabla f(x_k) + \mu_k\,(x_k - x_{k-1}),$

a type of momentum-descent method widely studied in the literature and whose origin goes back to Polyak (1964). Momentum methods allow us to accelerate gradient descent by taking into account the "speed" achieved by the method at the last update. However, because of that speed, momentum methods can overpass the minimum. Nesterov's method tries to anticipate future information, reducing the typical oscillations of classical momentum methods towards the minimum. In Section 6 we adapt our construction of variational integrators to add external forces using the discrete Lagrange-d'Alembert principle (see Marsden and West, 2001). Upon this machinery, we derive in Section 7 two families of momentum methods in mutual bijective correspondence, one of which corresponds to Nesterov's method (see Theorem 6). Finally, for Section 8, many methods and numerical simulations have been implemented in Julia v1.8.2. We optimize several test functions with our methodology and other methods that appeared recently in the literature. One of the test functions is reused afterwards for a machine learning example.
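For reference, the discrete objects mentioned above are standard (Marsden and West, 2001): given a discrete Lagrangian $L_d : Q \times Q \to \mathbb{R}$, the discrete action sum over a trajectory $\{x_k\}_{k=0}^{N}$ is

$S_d(x_0, \ldots, x_N) = \sum_{k=0}^{N-1} L_d(x_k, x_{k+1}),$

and requiring $S_d$ to be stationary with respect to variations of the interior points yields the discrete Euler-Lagrange equations

$D_1 L_d(x_k, x_{k+1}) + D_2 L_d(x_{k-1}, x_k) = 0, \qquad k = 1, \ldots, N-1,$

where $D_1, D_2$ denote derivatives with respect to the first and second argument. These are the second order difference equations referred to above; the specific discrete Lagrangians used in this paper appear in later sections.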

2. From Gradient Descent to Nesterov's Accelerated Gradient

In this section we give a historical perspective of Nesterov's accelerated gradient starting from gradient descent, with a threefold objective: first, properly introduce the methods of interest and their properties; second, give an overall view of the elements to take under consideration; and, third, set some of the notation.

Although the first method that comes to mind to solve the optimization problem (1.1) is Newton-Raphson, the first "dynamical" one is due to Cauchy (1847). His method, known as Gradient Descent (GD), is the one-step method

$x_{k+1} = x_k - \eta\,\nabla f(x_k), \qquad (2.1)$

where $\eta$ is the step size parameter or, as it is referred to in the machine learning community, the learning rate. It is readily seen that this method is a simple discretization of the first order ODE

$\dot{x} = -\nabla f(x), \qquad (2.2)$

from which it takes its dynamical nature. What is perhaps not so readily seen is that, given an initial condition $x_0$, the trajectories obtained from both equations, $x_k$ and $x(t)$, converge to the argument of minima $x^*$. In particular, $x_k$ converges linearly to $x^*$ while the function values $f(x_k)$ do so to the global minimum $f(x^*)$ at a rate of $O(1/k)$ (Polyak, 1964, 1987).
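To make (2.1) concrete, here is a minimal Julia sketch of gradient descent (Julia is the language used later in the paper for the experiments). The objective, its gradient, the step size and the iteration count below are illustrative choices, not taken from the paper.

```julia
# Gradient descent (2.1): x_{k+1} = x_k - η ∇f(x_k).
f(x)  = 0.5 * sum(abs2, x)      # an illustrative convex objective, f(x) = ½‖x‖²
∇f(x) = x                       # its gradient

function gradient_descent(x0; η = 0.1, iters = 100)
    x = copy(x0)
    for _ in 1:iters
        x = x - η * ∇f(x)       # one GD update
    end
    return x
end

gradient_descent([1.0, -2.0])   # approaches the minimizer x* = 0
```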

An initial improvement over GD was given by Polyak (1964): he introduced a novel two-step method, Polyak's Heavy Ball (PHB), also known as Classical Momentum (CM) after Sutskever et al. (2013). As it was originally presented, PHB/CM takes the form of the two-step method

$x_{k+1} = x_k - \eta\,P(x_k) + \mu\,(x_k - x_{k-1}), \qquad (2.3)$

where $P$ is a functional operator for which a root is sought and $\eta, \mu$ are "small" positive constants that condition the convergence of the method. In comparison with (2.1), (2.3) adds a new term, $x_k - x_{k-1}$, the momentum of the discrete motion, which incorporates past information in an amount controlled by $\mu$, the so-called momentum coefficient. When $P$ is conservative, that is, when $P = \nabla f$, Polyak showed that, although the method's trajectory still converges linearly as GD's does, it does so faster, that is, with a smaller geometric ratio (Polyak, 1964, 1987). The continuous analogue of (2.3) is the second order ODE

$\ddot{x} + \mu(t)\,\dot{x} + \eta(t)\,P(x) = 0, \qquad (2.4)$

which turns out to be the equation of motion of a Lagrangian system when $P = \nabla f$ (Lemma 4). Then $x(t)$ traces the motion of a point mass in a well given by $f$. We therefore drop $P$ and stick from here on with $\nabla f$.

A further and crucial step towards improving GD (and PHB/CM) was given in 1983 by Nesterov, a former student of Polyak. He presented a new method, coined after him as Nesterov's Accelerated Gradient (NAG), similar to PHB/CM but with a slight change of unexpected consequences. A naive derivation from (2.3) is almost immediate: introduce a new variable $y_k$ in (2.3) so it can be easily rewritten as the equivalent method

$y_{k+1} = x_k - \eta_k\,\nabla f(x_k), \qquad (2.5a)$
$x_{k+1} = y_{k+1} + \mu_k\,(x_k - x_{k-1}), \qquad (2.5b)$

where discrete-time dependence has been added to the coefficients $\eta, \mu$ for convenience. Replace the $x$'s of the momentum term (right hand side of the second equation (2.5b)) by $y$'s to get the new and non-equivalent method

$\bar{y}_{k+1} = \bar{x}_k - \eta_k\,\nabla f(\bar{x}_k), \qquad (2.6a)$
$\bar{x}_{k+1} = \bar{y}_{k+1} + \mu_k\,(\bar{y}_{k+1} - \bar{y}_k), \qquad (2.6b)$

where the bars are added to distinguish both methods more easily and to underline that the sequences of points that they define are in fact different. This latter method (2.6) is NAG as it is usually presented. Nesterov showed in turn that his method accelerates the convergence rate of the function values down to $O(1/k^2)$ (see Nesterov, 1983, 2018). The original values of $\eta_k, \mu_k$ given by Nesterov are rather intricate; a simpler and commonly used version is

$\bar{y}_{k+1} = \bar{x}_k - \eta\,\nabla f(\bar{x}_k), \qquad (2.7a)$
$\bar{x}_{k+1} = \bar{y}_{k+1} + \tfrac{k}{k+3}\,(\bar{y}_{k+1} - \bar{y}_k), \qquad (2.7b)$

with $\eta > 0$ constant. As is shown in Su et al. (2016), a continuous analogue of (2.7) is

$\ddot{x} + \tfrac{3}{t}\,\dot{x} + \nabla f(x) = 0, \qquad (2.8)$

which is but a particular case of PHB/CM's continuous analogue (2.4). Besides, Su et al. (2016) also show that the function values converge to the minimum at an inverse quadratic rate, that is, $f(x(t)) = f(x^*) + O(1/t^2)$. More generally (Remark 12), (2.6) is a natural discretization of a perturbed ODE of the form

$\ddot{x} + \mu(t)\,\dot{x} + \eta(t)\,\nabla f(x) = \varepsilon F(x, \dot{x}, t), \qquad (2.9)$

which is also the equation of motion of a Lagrangian system (Lemma 4). In fact, it is this variational origin that Wibisono et al. (2016) take as point of departure. Once a particular type of time-dependent Lagrangian functions is considered, a subfamily of the so-called Bregman Lagrangians, the variational approach captures acceleration in continuous time into the derived discrete schemes, achieving, in this case, a function value convergence rate of $O(t^{1-n})$ with $n \ge 3$ (see also Wibisono, 2016).
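The following Julia sketch implements the PHB/CM update (2.3) with $P = \nabla f$ and the common NAG variant (2.7) side by side. The objective, the step size $\eta$ and the momentum coefficient $\mu$ are illustrative choices, not values from the paper.

```julia
# Polyak's heavy ball (2.3) with P = ∇f, and Nesterov's accelerated gradient (2.7).
f(x)  = 0.5 * sum(abs2, x)      # illustrative convex objective
∇f(x) = x

function heavy_ball(x0; η = 0.05, μ = 0.9, iters = 200)
    x_prev, x = copy(x0), copy(x0)
    for _ in 1:iters
        x, x_prev = x - η * ∇f(x) + μ * (x - x_prev), x    # update (2.3)
    end
    return x
end

function nesterov(x0; η = 0.05, iters = 200)
    y_prev, x = copy(x0), copy(x0)
    for k in 1:iters
        y = x - η * ∇f(x)                                  # gradient step (2.7a)
        x = y + (k / (k + 3)) * (y - y_prev)               # look-ahead momentum (2.7b)
        y_prev = y
    end
    return x
end

heavy_ball([1.0, -2.0]), nesterov([1.0, -2.0])             # both approach x* = 0
```

As discussed above, the only structural difference is that NAG applies the momentum term to the gradient-stepped points $\bar{y}$ rather than to the $x$'s.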

3. Bregman Lagrangians

A Bregman Lagrangian is, roughly speaking, a time-dependent mechanical Lagrangian whose kinetic part is close to being a metric. They are built upon Bregman divergences (Bregman, 1967), a particular case of divergence functions. Bregman Lagrangians allow us to define variational problems whose solutions minimize an objective function at an exponential rate (Betancourt et al., 2018).

A divergence function over a manifold $Q$ is a twice differentiable function $B : Q \times Q \to \mathbb{R}_+$ such that for all $x, y \in Q$ we have:

• $B(x, y) \ge 0$ and $B(x, x) = 0$;
• $\partial_x B(x, x) = \partial_y B(x, x)$; and
• $\partial^2_{xy} B(x, x)$ is negative-definite.

Divergence functions appear as pseudo-distances that are non-negative but are not, in general, symmetric. A typical divergence function over $Q = \mathbb{R}^n$ associated to a differentiable strictly convex function $\psi : \mathbb{R}^n \to \mathbb{R}$ is the Bregman divergence:

$B_\psi(x, y) = \psi(x) - \psi(y) - \langle d\psi(y),\ x - y\rangle.$

Observe that it is the remainder of the first order Taylor expansion of $\psi$ around $y$ evaluated at $x$, a sort of Hessian metric. Given a Bregman divergence over $\mathbb{R}^n$, let us consider the time-dependent kinetic energy

$K(x, \dot{x}, t) = B_\psi\bigl(x + e^{-\alpha(t)}\,\dot{x},\; x\bigr)$

and the time-dependent potential energy

$U(x, t) = e^{\beta(t)}\, f(x),$

from which we define the Bregman Lagrangian $L : T\mathbb{R}^n \times \mathbb{R} \to \mathbb{R}$ by

$L(x, \dot{x}, t) = e^{\alpha(t)+\gamma(t)}\bigl(K(x, \dot{x}, t) - U(x, t)\bigr) = e^{\alpha(t)+\gamma(t)}\Bigl(\psi\bigl(x + e^{-\alpha(t)}\dot{x}\bigr) - \psi(x) - e^{-\alpha(t)}\langle d\psi(x), \dot{x}\rangle - e^{\beta(t)} f(x)\Bigr),$

where the time-dependent functions $\alpha(t), \beta(t), \gamma(t)$ are chosen to produce different algorithms. These functions verify what Wibisono et al. (2016) refer to as ideal scaling conditions, namely,

$\dot{\gamma}(t) = e^{\alpha(t)} \quad \text{and} \quad \dot{\beta}(t) \le e^{\alpha(t)}. \qquad (3.1)$

The first condition greatly simplifies several expressions that can be derived from the Bregman Lagrangian. For instance, when $\dot{\gamma}(t) = e^{\alpha(t)}$ is met, the associated Euler-Lagrange equations reduce to

$\nabla^2\psi\bigl(x + e^{-\alpha(t)}\dot{x}\bigr)\,\dfrac{d}{dt}\bigl(x + e^{-\alpha(t)}\dot{x}\bigr) + e^{\alpha(t)+\beta(t)}\,\nabla f(x) = 0.$

The second condition ensures convergence of the underlying trajectories to the minimum at a rate no slower than $O(e^{-\beta(t)})$.

In the particular case where $\psi(x) = \tfrac{1}{2}\|x\|^2$, for which $B_\psi(x, y) = \tfrac{1}{2}\|x - y\|^2$, the Bregman Lagrangian takes the simple form

$L(x, \dot{x}, t) = a(t)\,\tfrac{1}{2}\|\dot{x}\|^2 - b(t)\, f(x), \qquad (3.2)$

with $a(t) = e^{\gamma(t) - \alpha(t)}$ and $b(t) = e^{\alpha(t) + \beta(t) + \gamma(t)}$.
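As a quick numerical check of the definitions above, the following Julia sketch evaluates a Bregman divergence from a kernel $\psi$ and its gradient, and verifies the quadratic case used in (3.2); the kernel and test points are illustrative.

```julia
using LinearAlgebra: dot

# Bregman divergence B_ψ(x, y) = ψ(x) - ψ(y) - ⟨dψ(y), x - y⟩,
# given the kernel ψ and its gradient dψ.
bregman(ψ, dψ, x, y) = ψ(x) - ψ(y) - dot(dψ(y), x - y)

ψ(x)  = 0.5 * sum(abs2, x)      # the quadratic kernel behind (3.2)
dψ(x) = x

x0, y0 = [1.0, 2.0], [0.5, -1.0]
bregman(ψ, dψ, x0, y0) ≈ 0.5 * sum(abs2, x0 - y0)   # true: B_ψ(x, y) = ½‖x - y‖²
```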

4. Geometry of the Time-Dependent Lagrangian and Hamiltonian Formalisms

Since Bregman Lagrangians are time-dependent, in this section we introduce some needed geometric ingredients about non-autonomous mechanics and highlight some of their main invariance properties (see Abraham and Marsden, 1978; Libermann and Marle, 1987; de León and Rodrigues, 1987).

Let $Q$ be a manifold and $TQ$ its tangent bundle. Coordinates $(x^i)$ on $Q$ induce coordinates $(x^i, \dot{x}^i)$ on $TQ$. Therefore we have natural coordinates $(x^i, \dot{x}^i, t)$ on $TQ \times \mathbb{R}$, which is the velocity phase space for time-dependent systems.

Given two instants (time values) $a, b \in \mathbb{R}$, with $a < b$, and corresponding positions $x_a, x_b \in Q$, consider the set of curves

$C^2_{a,b} = C^2([a,b]; x_a, x_b) = \{\gamma : [a,b] \to Q \mid \gamma \in C^2 \text{ with } \gamma(a) = x_a,\ \gamma(b) = x_b\}.$

Given a time-dependent Lagrangian function $L : TQ \times \mathbb{R} \to \mathbb{R}$, define the action functional $J_L : C^2_{a,b} \to \mathbb{R}$,

$J_L(\gamma) = \int_a^b L(\gamma'(t), t)\, dt, \qquad (4.1)$

where $\gamma' : [a,b] \to TQ$.

Using variational calculus, the critical points of $J_L$ are locally characterized by the solutions of the Euler-Lagrange equations:

$\dfrac{d}{dt}\left(\dfrac{\partial L}{\partial \dot{x}^i}\right) - \dfrac{\partial L}{\partial x^i} = 0, \qquad 1 \le i \le n = \dim Q. \qquad (4.2)$

For time-dependent Lagrangians it is possible to check that the energy $E_L : TQ \times \mathbb{R} \to \mathbb{R}$,

$E_L = \Delta L - L = \dot{x}^i\,\dfrac{\partial L}{\partial \dot{x}^i} - L,$

where $\Delta$ is the Liouville vector field on $TQ$ (Libermann and Marle, 1987), is not, in general, preserved, since $\dfrac{dE_L}{dt} = -\dfrac{\partial L}{\partial t}$.
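As a consistency check with Section 2, one can apply (4.2) to the simple Bregman Lagrangian (3.2), $L(x, \dot{x}, t) = a(t)\,\tfrac{1}{2}\|\dot{x}\|^2 - b(t) f(x)$. The Euler-Lagrange equations read

$\dfrac{d}{dt}\bigl(a(t)\,\dot{x}\bigr) + b(t)\,\nabla f(x) = 0, \qquad \text{that is,} \qquad \ddot{x} + \dfrac{\dot{a}(t)}{a(t)}\,\dot{x} + \dfrac{b(t)}{a(t)}\,\nabla f(x) = 0,$

which is precisely of the heavy-ball form (2.4); the choice $a(t) = b(t) = t^3$, for instance, recovers the continuous NAG limit (2.8).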

We now pass to the Hamiltonian formalism using the Legendre transformation

$FL : TQ \times \mathbb{R} \to T^*Q \times \mathbb{R},$

where $T^*Q$ is the cotangent bundle of $Q$, whose natural coordinates are $(x^i, p_i)$. The Legendre transformation is locally given by

$FL(x^i, \dot{x}^i, t) = \left(x^i,\ \dfrac{\partial L}{\partial \dot{x}^i},\ t\right).$

We assume that the Legendre transformation is a diffeomorphism (that is, the Lagrangian is hyperregular) and define the Hamiltonian function $H : T^*Q \times \mathbb{R} \to \mathbb{R}$ by

$H = E_L \circ (FL)^{-1},$

which induces the cosymplectic structure $(\Omega_H, \eta)$ on $T^*Q \times \mathbb{R}$ with

$\eta := \mathrm{pr}_2^*\,dt, \qquad \Omega_H = -d(\mathrm{pr}_1^*\,\theta_Q - H\,dt) = \Omega_Q + dH \wedge dt,$

where $\mathrm{pr}_i$, $i = 1, 2$, are the projections to each Cartesian factor and $\theta_Q$ denotes the Liouville 1-form on $T^*Q$ (Abraham and Marsden, 1978), given in induced coordinates by $\theta_Q = p_i\,dx^i$. We also denote by $\Omega_Q = -d\,\mathrm{pr}_1^*\,\theta_Q$ the pullback of the canonical symplectic 2-form $\omega_Q = -d\theta_Q$ on $T^*Q$. In coordinates, $\Omega_Q = dx^i \wedge dp_i$. (Observe that now $\Omega_Q$ is presymplectic since $\ker \Omega_Q = \mathrm{span}\{\partial/\partial t\}$.) Therefore, in induced coordinates $(x^i, p_i, t)$:

$\Omega_H = dx^i \wedge dp_i + dH \wedge dt, \qquad \eta = dt.$

We define the evolution vector field $E_H \in \mathfrak{X}(T^*Q \times \mathbb{R})$ by

$i_{E_H}\,\Omega_H = 0, \qquad i_{E_H}\,dt = 1. \qquad (4.3)$

In local coordinates the evolution vector field is:

$E_H = \dfrac{\partial}{\partial t} + \dfrac{\partial H}{\partial p_i}\,\dfrac{\partial}{\partial x^i} - \dfrac{\partial H}{\partial x^i}\,\dfrac{\partial}{\partial p_i}.$

The integral curves of $E_H$ are given by:

$\dot{t} = 1, \qquad \dot{x}^i = \dfrac{\partial H}{\partial p_i}, \qquad \dot{p}_i = -\dfrac{\partial H}{\partial x^i}. \qquad (4.4)$

The integral curves of $E_H$ are precisely the curves of the form $t \mapsto FL(\gamma'(t), t)$, where $\gamma : I \to Q$ is a solution of the Euler-Lagrange equations for $L : TQ \times \mathbb{R} \to \mathbb{R}$.
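To connect (4.4) back to the optimization setting, note that for the simple Bregman Lagrangian (3.2) the Legendre transformation gives $p = a(t)\,\dot{x}$ and $H(x, p, t) = \|p\|^2/(2a(t)) + b(t) f(x)$, so (4.4) becomes $\dot{x} = p/a(t)$, $\dot{p} = -b(t)\,\nabla f(x)$. The Julia sketch below integrates these equations with a naive splitting scheme; it is only an illustration of the continuous dynamics, not one of the variational integrators derived later in the paper, and the coefficients, objective and step size are illustrative choices (with $a(t) = b(t) = t^3$ the underlying ODE is the continuous NAG limit (2.8)).

```julia
# Naive integration of Hamilton's equations (4.4) for
# H(x, p, t) = ‖p‖² / (2 a(t)) + b(t) f(x).
f(x)  = 0.5 * sum(abs2, x)      # illustrative convex objective
∇f(x) = x
a(t)  = t^3                     # illustrative coefficients; with a = b = t³ the
b(t)  = t^3                     # continuous limit is (2.8): ẍ + (3/t) ẋ + ∇f(x) = 0

function evolve(x0, p0; h = 1e-3, steps = 5_000, t0 = 1.0)
    x, p, t = copy(x0), copy(p0), t0
    for _ in 1:steps
        p = p - h * b(t) * ∇f(x)    # ṗ = -∂H/∂x = -b(t) ∇f(x)
        x = x + h * p / a(t)        # ẋ =  ∂H/∂p =  p / a(t)
        t = t + h                   # ṫ = 1
    end
    return x, p, t
end

evolve([1.0, -2.0], [0.0, 0.0])     # x drifts towards the minimizer x* = 0
```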

From Equation (4.3) we deduce that the flow of $E_H$ verifies the following preservation properties:

$\mathcal{L}_{E_H}\,\Omega_H = \mathcal{L}_{E_H}(\Omega_Q + dH \wedge dt) = 0, \qquad \mathcal{L}_{E_H}\,dt = 0. \qquad (4.5)$

Denote by $\Phi_s : \mathcal{U} \subset T^*Q \times \mathbb{R} \to T^*Q \times \mathbb{R}$ the flow of the evolution vector field $E_H$, where $\mathcal{U}$ is an open subset of $T^*Q \times \mathbb{R}$. Observe that

$\Phi_s(\alpha_q, t) = (\phi_{t,s}(\alpha_q),\ t + s), \qquad \alpha_q \in T^*_qQ,$

where $\phi_{t,s}(\alpha_q) = \mathrm{pr}_1(\Phi_s(\alpha_q, t))$. Therefore from the flow of $E_H$ we induce a map $\phi_{t,s} : \mathcal{U}_t \subset T^*Q \to T^*Q$, where $\mathcal{U}_t = \{\alpha_q \in T^*Q \mid (\alpha_q, t) \in \mathcal{U}\}$. Observe that if we know $\phi_{t,s}$ for all $t$, we can recover the flow $\Phi_s$ of $E_H$.

From Equations (4.5) we deduce that

$\Phi_s^*(\Omega_Q + dH \wedge dt) = \Omega_Q + dH \wedge dt, \qquad \Phi_s^*(dt) = dt. \qquad (4.6)$

The following theorem relates the preservation properties (4.6) with the symplecticity of the map family $\{\phi_{t,s} : T^*Q \to T^*Q\}$.

Theorem 1 We have that $\phi_{t,s} : \mathcal{U}_t \subset T^*Q \to T^*Q$ is a symplectomorphism, that is, $\phi_{t,s}^*\,\omega_Q = \omega_Q$.

Proof First, observe that any vector $Y_{(\alpha_q, t)} \in T_{(\alpha_q, t)}(T^*Q \times \mathbb{R})$ admits a unique decomposition:

$Y_{(\alpha_q, t)} = Y_{\alpha_q}(t) + Y_t(\alpha_q),$

where $Y_{\alpha_q}(t) \in T_{\alpha_q}T^*Q$ and $Y_t(\alpha_q) \in T_t\mathbb{R}$. Moreover, we have that $\langle dt, Y_{\alpha_q}(t)\rangle = 0$. Therefore, if we restrict ourselves to vectors tangent to the $\mathrm{pr}_2$-fibers, $Y_{(\alpha_q, t)} \in T_{(\alpha_q, t)}\,\mathrm{pr}_2^{-1}(t) = V_{(\alpha_q, t)}\,\mathrm{pr}_2$, then we have the decomposition

$Y_{(\alpha_q, t)} = Y_{\alpha_q}(t) + 0_t = Y_{\alpha_q}(t) \in V_{(\alpha_q, t)}\,\mathrm{pr}_2 \cong T_{\alpha_q}T^*Q.$

From the second preservation property given in (4.5) we deduce that

$0 = \langle (dt)_{(\alpha_q, t)},\ Y_{\alpha_q}(t)\rangle = \langle (\Phi_s^*\,dt)_{(\alpha_q, t)},\ Y_{\alpha_q}(t)\rangle = \langle (dt)_{\Phi_s(\alpha_q, t)},\ T\Phi_s(Y_{\alpha_q}(t))\rangle,$

so that

$T\Phi_s(Y_{\alpha_q}(t)) = T\phi_{t,s}(Y_{\alpha_q}(t)) + 0_{t+s} \cong T\phi_{t,s}(Y_{\alpha_q}(t)).$