# Chapter 2

## Optimization: Gradients, convexity, and ALS

DMM, summer 2017, Pauli Miettinen

### Contents

- Background
- Gradient descent
- Stochastic gradient descent
- Newton's method
- Alternating least squares
- KKT conditions

### Motivation

- We can solve basic least-squares linear systems using SVD
- But what if we have
  - missing values in the data
  - extra constraints for feasible solutions
  - more complex optimization problems (e.g. regularizers)
  - etc.

## Gradients, Hessians, and convexity

### Derivatives and local optima

- The derivative of a function f: ℝ → ℝ, denoted f', describes its rate of change
  - If it exists
- The second derivative f'' is the rate of change of the rate of change

$$f'(x) = \lim_{h \to 0^+} \frac{f(x+h) - f(x)}{h}$$
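The difference quotient in the derivative definition can be evaluated numerically for a small but finite h. The sketch below is only an illustration: the helper names, the step sizes, and the test function f(x) = x² are assumptions, not part of the slides.

```python
def forward_difference(f, x, h=1e-6):
    """Approximate f'(x) by the difference quotient (f(x + h) - f(x)) / h."""
    return (f(x + h) - f(x)) / h


def second_difference(f, x, h=1e-4):
    """Approximate f''(x) by the central second difference."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)


def square(x):
    # Illustrative function (an assumption): f(x) = x^2, so f'(x) = 2x, f''(x) = 2.
    return x * x


d1 = forward_difference(square, 3.0)  # approximates f'(3) = 6
d2 = second_difference(square, 3.0)   # approximates f''(3) = 2
print(d1, d2)
```

The results are only approximate: a smaller h reduces the discretization error but amplifies floating-point rounding error, so the step sizes here are a compromise rather than a recommendation.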

- A stationary point of a differentiable f is a point x such that f'(x) = 0
- f attains its extrema at stationary points, at points where the derivative does not exist, or at infinity (Fermat's theorem)
- Whether a stationary point is a (local) maximum or minimum can be seen from the second derivative (if it exists)
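As a small illustration of the second-derivative test, the sketch below classifies the stationary points of f(x) = x³ − 3x. The function, the helper `classify_stationary_point`, and its tolerance are assumptions chosen for the example; they do not come from the slides.

```python
def classify_stationary_point(second_derivative_value, tol=1e-12):
    """Second-derivative test for a point x where f'(x) = 0."""
    if second_derivative_value > tol:
        return "local minimum"
    if second_derivative_value < -tol:
        return "local maximum"
    # f''(x) = 0: the test gives no information
    return "inconclusive"


# Illustrative function (an assumption): f(x) = x^3 - 3x.
def f_prime(x):
    return 3.0 * x * x - 3.0


def f_second(x):
    return 6.0 * x


stationary_points = [-1.0, 1.0]  # the roots of f'(x) = 3x^2 - 3
labels = {x: classify_stationary_point(f_second(x)) for x in stationary_points}
print(labels)
```

When the second derivative vanishes at a stationary point (for instance f(x) = x³ at x = 0), the test is inconclusive and higher-order information is needed.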

### Partial derivative

- If f is multivariate (e.g. f: ℝ³ → ℝ), we can consider it as a family of functions
  - E.g. f(x, y) = x² + y has functions f_x(y) = x² + y and f_y(x) = x² + y
- The partial derivative w.r.t. one variable keeps the other variables constant

$$\frac{\partial f}{\partial x}(x, y) = \frac{\mathrm{d} f_y}{\mathrm{d} x}(x) = 2x$$

### Gradient

- The gradient is the derivative for multivariate functions f: ℝⁿ → ℝ

$$\nabla f = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_n} \right)$$

- Here (and later), we assume that the derivatives exist
- The gradient is a function ∇f: ℝⁿ → ℝⁿ
- ∇f(x) points "up" in the function at point x
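The gradient collects one partial derivative per coordinate, each computed while the other coordinates are held constant. A minimal numerical sketch of that idea follows; the helper `numerical_gradient`, the central-difference scheme, and the evaluation point are assumptions made for illustration, applied to the function f(x, y) = x² + y from the partial-derivative example.

```python
def numerical_gradient(f, x, h=1e-6):
    """Approximate the gradient of f: R^n -> R at x by central differences."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        # Perturb only coordinate i; all other coordinates stay constant.
        forward[i] += h
        backward[i] -= h
        grad.append((f(forward) - f(backward)) / (2.0 * h))
    return grad


def f(v):
    # f(x, y) = x^2 + y, whose analytic gradient is (2x, 1)
    x, y = v
    return x * x + y


g = numerical_gradient(f, [3.0, -1.0])  # analytic gradient at (3, -1) is (6, 1)
print(g)
```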

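### Hessian

- The Hessian is a square matrix of all second-order partial derivatives of a function f: ℝⁿ → ℝ
- As usual, we assume the derivatives exist

$$H(f) = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{pmatrix}$$

### Jacobian matrix

- If f: ℝᵐ → ℝⁿ, then its Jacobian (matrix) is an n×m matrix of partial derivatives of the form

$$J = \begin{pmatrix}
\frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_m} \\
\frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_m} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial f_n}{\partial x_1} & \frac{\partial f_n}{\partial x_2} & \cdots & \frac{\partial f_n}{\partial x_m}
\end{pmatrix}$$

- The Jacobian is the best linear approximation of f
- H(f(x)) = J(∇f(x))ᵀ

### Examples

$$f(x, y) = x^2 + 2xy + y$$

$$\frac{\partial f}{\partial x}(x, y) = 2x + 2y, \qquad \frac{\partial f}{\partial y}(x, y) = 2x + 1$$

$$H(f) = \begin{pmatrix} 2 & 2 \\ 2 & 0 \end{pmatrix}$$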
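The relation H(f(x)) = J(∇f(x))ᵀ can be checked numerically. The sketch below assumes the example function is f(x, y) = x² + 2xy + y, with gradient (2x + 2y, 2x + 1), and differentiates that gradient with a central-difference Jacobian. The helpers `numerical_jacobian` and `transpose` and the evaluation point are hypothetical names and choices made for illustration.

```python
def gradient(v):
    """Analytic gradient of f(x, y) = x^2 + 2xy + y."""
    x, y = v
    return [2.0 * x + 2.0 * y, 2.0 * x + 1.0]


def numerical_jacobian(func, x, h=1e-6):
    """Approximate the n-by-m Jacobian of func: R^m -> R^n by central differences."""
    n = len(func(x))
    m = len(x)
    jac = [[0.0] * m for _ in range(n)]
    for j in range(m):
        forward = list(x)
        backward = list(x)
        forward[j] += h
        backward[j] -= h
        f_plus = func(forward)
        f_minus = func(backward)
        for i in range(n):
            # Entry (i, j) is the partial derivative of output i w.r.t. input j.
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * h)
    return jac


def transpose(matrix):
    return [list(row) for row in zip(*matrix)]


# Hessian of f as the transposed Jacobian of its gradient.
hessian = transpose(numerical_jacobian(gradient, [0.5, -2.0]))
print(hessian)  # approximately [[2, 2], [2, 0]]
```

Because this f is quadratic, its Hessian is constant, so the evaluation point does not matter here. For a general function the Hessian depends on x.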

### Gradient's properties

- Linearity: ∇(αf + βg)(x) = α∇f(x) + β∇g(x)
- Product rule: ∇(fg)(x) = f(x)∇g(x) + g(x)∇f(x)
- Chain rule:
  - If f: ℝⁿ → ℝ and g: ℝᵐ → ℝⁿ, then ∇(f∘g)(x) = J(g(x))ᵀ ∇f(y), where y = g(x)
  - If f is as above and h: ℝ → ℝ, then ∇(h∘f)(x) = h'(f(x))∇f(x)
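The multivariate chain rule ∇(f∘g)(x) = J(g(x))ᵀ ∇f(y), with y = g(x), can be sanity-checked against a finite-difference gradient of the composition. The functions g(x₁, x₂) = (x₁x₂, x₁ + x₂) and f(y₁, y₂) = y₁² + y₂ below are assumptions chosen for illustration, as are all helper names.

```python
def g(x):
    # Illustrative inner function g: R^2 -> R^2 (an assumption)
    x1, x2 = x
    return [x1 * x2, x1 + x2]


def jacobian_g(x):
    # Analytic Jacobian of g: row i holds the partial derivatives of g_i
    x1, x2 = x
    return [[x2, x1], [1.0, 1.0]]


def f(y):
    # Illustrative outer function f: R^2 -> R (an assumption)
    y1, y2 = y
    return y1 * y1 + y2


def grad_f(y):
    y1, _ = y
    return [2.0 * y1, 1.0]


def chain_rule_gradient(x):
    """Gradient of f(g(x)) computed as J(g(x))^T grad f(y), with y = g(x)."""
    jac = jacobian_g(x)
    gf = grad_f(g(x))
    # Entry j of J^T v sums jac[i][j] * v[i] over the rows i.
    return [sum(jac[i][j] * gf[i] for i in range(len(gf))) for j in range(len(x))]


def numerical_gradient(func, x, h=1e-6):
    """Central-difference gradient, used only as an independent check."""
    grad = []
    for i in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[i] += h
        backward[i] -= h
        grad.append((func(forward) - func(backward)) / (2.0 * h))
    return grad


x0 = [1.5, -0.5]
analytic = chain_rule_gradient(x0)
numeric = numerical_gradient(lambda x: f(g(x)), x0)
print(analytic, numeric)
```

Here (f∘g)(x) = (x₁x₂)² + x₁ + x₂, so the gradient is (2x₁x₂² + 1, 2x₁²x₂ + 1), which is what both computations should agree on up to discretization error.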
