Convex optimization brief overview

Introduction

Mathematical optimization problem

$$
\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le b_i, \quad i = 1, \ldots, m.
\end{aligned}
$$

Quadratic programming

$$
\begin{aligned}
\text{minimize} \quad & \tfrac{1}{2}x^T P x + q^T x + r, \qquad P \in S_+^n \\
\text{subject to} \quad & Ax \preceq b
\end{aligned}
$$
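
As a small numerical sketch (not part of the original notes), this QP form maps directly onto the cvxpy modeling library, assuming it is installed; the data P, q, A, b below are made up for illustration, and the constant r is dropped since it does not affect the minimizer.

```python
import numpy as np
import cvxpy as cp

# Made-up data for: minimize (1/2) x^T P x + q^T x  subject to  A x <= b
np.random.seed(0)
n, m = 3, 5
M = np.random.randn(n, n)
P = M.T @ M + np.eye(n)        # P is positive definite, hence in S_+^n
q = np.random.randn(n)
A = np.random.randn(m, n)
b = np.random.randn(m) + 1.0

x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(0.5 * cp.quad_form(x, P) + q.T @ x),
                  [A @ x <= b])
prob.solve()
print("optimal value:", prob.value)
print("optimal x:", x.value)
```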

Standard and inequality form semidefinite programs

$$
\begin{aligned}
\text{minimize} \quad & \mathbf{tr}(CX) \\
\text{subject to} \quad & \mathbf{tr}(A_i X) = b_i, \quad i = 1, \ldots, p \\
& X \succeq 0
\end{aligned}
$$

where $X \in S_+^n$, $C \in \mathbb{R}^{n \times n}$, $A_i \in \mathbb{R}^{n \times n}$, $b_i \in \mathbb{R}$.
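
A hedged sketch of the standard-form SDP in cvxpy (my own illustration, assuming cvxpy and its bundled SDP solver are installed); the data C, A_i, b_i below are constructed so that the problem is feasible and bounded.

```python
import numpy as np
import cvxpy as cp

# Made-up data for: minimize tr(C X)  s.t.  tr(A_i X) = b_i,  X >= 0 (PSD)
np.random.seed(1)
n, p = 3, 2
M = np.random.randn(n, n)
C = M.T @ M                                   # C PSD, so the objective is bounded below over PSD X
A_list = []
for _ in range(p):
    B = np.random.randn(n, n)
    A_list.append((B + B.T) / 2)              # symmetric A_i
X0 = np.eye(n)                                # a known feasible point
b = [float(np.trace(A_list[i] @ X0)) for i in range(p)]

X = cp.Variable((n, n), symmetric=True)
constraints = [X >> 0]                        # X positive semidefinite
constraints += [cp.trace(A_list[i] @ X) == b[i] for i in range(p)]
prob = cp.Problem(cp.Minimize(cp.trace(C @ X)), constraints)
prob.solve()
print("optimal value:", prob.value)
```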

Inequality form semidefinite programs

$$
\begin{aligned}
\text{minimize} \quad & c^T x \\
\text{subject to} \quad & x_1 F_1 + \cdots + x_n F_n + G \preceq 0 \\
& Ax = b
\end{aligned}
$$

Convex function

$$
\forall\, \theta \in (0,1): \quad f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta) f(y)
$$
  1. The sum of two convex functions is also a convex function.

  2. The difference of two convex functions may not be a convex function.

Suppose $f(x)$ and $g(x)$ are convex and let $h(x) = f(x) - g(x)$. Then

$$
h(\theta x + (1-\theta)y) = f(\theta x + (1-\theta)y) - g(\theta x + (1-\theta)y).
$$

One inequality cannot be subtracted from another, so the convexity inequality for $h$ does not follow from those for $f$ and $g$. For example:

$$
f(x) = x^2, \qquad g(x) = x^3, \qquad x \in [0,1]
$$

$$
h(x) = f(x) - g(x) = x^2 - x^3, \qquad x \in [0,1]
$$

$$
h'(x) = 2x - 3x^2 = 0 \;\Longrightarrow\; x = 0, \ \tfrac{2}{3}
$$

The critical point $x = 0$ is a local minimum and $x = \tfrac{2}{3}$ is a local maximum; since $h''(x) = 2 - 6x < 0$ for $x > \tfrac{1}{3}$, $h$ is not a convex function on $[0,1]$.
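
A quick numerical check of the failed convexity inequality (my own snippet, not from the notes): taking $x = 0$, $y = 1$, $\theta = 0.5$, the midpoint value of $h$ exceeds the chord value.

```python
# h(x) = x^2 - x^3 on [0, 1]; convexity would require
# h(t*x + (1-t)*y) <= t*h(x) + (1-t)*h(y) for all x, y in [0, 1] and t in (0, 1).
def h(x):
    return x**2 - x**3

x, y, t = 0.0, 1.0, 0.5
lhs = h(t * x + (1 - t) * y)     # h(0.5) = 0.125
rhs = t * h(x) + (1 - t) * h(y)  # 0.5*h(0) + 0.5*h(1) = 0.0
print(lhs, rhs, lhs <= rhs)      # 0.125 0.0 False  -> h is not convex
```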

Duality

Lagrange function

We consider an optimization problem in the standard form:

$$
\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& h_i(x) = 0, \quad i = 1, \ldots, p
\end{aligned}
$$

Lagrange function

$$
L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x)
$$

Lagrange dual function

$$
g(\lambda, \nu) = \inf_{x \in D} L(x, \lambda, \nu) = \inf_{x \in D} \left( f_0(x) + \sum_{i=1}^{m} \lambda_i f_i(x) + \sum_{i=1}^{p} \nu_i h_i(x) \right)
$$

Lagrange dual problem

$$
\begin{aligned}
\text{maximize} \quad & g(\lambda, \nu) \\
\text{subject to} \quad & \lambda \succeq 0
\end{aligned}
$$
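
As a standard worked illustration (added here, not in the original notes), consider the equality-constrained least-norm problem

$$
\begin{aligned}
\text{minimize} \quad & x^T x \\
\text{subject to} \quad & Ax = b.
\end{aligned}
$$

The Lagrangian is $L(x, \nu) = x^T x + \nu^T (Ax - b)$; setting $\nabla_x L = 2x + A^T \nu = 0$ gives $x = -\tfrac{1}{2} A^T \nu$, so the dual function is

$$
g(\nu) = -\tfrac{1}{4} \nu^T A A^T \nu - b^T \nu,
$$

a concave function of $\nu$, and the dual problem is to maximize it (there is no $\lambda \succeq 0$ constraint because there are no inequality constraints).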

Convex conjugate

Let $X$ be a real topological vector space and let $X^*$ be the dual space to $X$. Denote by

$$
\langle \cdot, \cdot \rangle : X^* \times X \to \mathbb{R}
$$

the canonical dual pairing, which is defined by

$$
(x^*, x) \mapsto x^*(x).
$$

For a function

$$
f : X \to \mathbb{R} \cup \{-\infty, +\infty\}
$$

taking values on the extended real number line, its convex conjugate is the function

$$
f^* : X^* \to \mathbb{R} \cup \{-\infty, +\infty\}
$$

whose value at

$$
x^* \in X^*
$$

is defined to be the supremum:

$$
f^*(x^*) := \sup\{\langle x^*, x \rangle - f(x) \;:\; x \in X\},
$$

or, equivalently, in terms of the infimum:

$$
f^*(x^*) := -\inf\{f(x) - \langle x^*, x \rangle \;:\; x \in X\}.
$$

This definition can be interpreted as an encoding of the convex hull of the function's epigraph in terms of its supporting hyperplanes.
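
For example (a standard result, added for illustration): take $f(x) = \tfrac{1}{2}\|x\|_2^2$ on $X = \mathbb{R}^n$, identifying $(\mathbb{R}^n)^*$ with $\mathbb{R}^n$. Then

$$
f^*(y) = \sup_{x} \left( y^T x - \tfrac{1}{2}\|x\|_2^2 \right) = \tfrac{1}{2}\|y\|_2^2,
$$

since the supremum is attained at $x = y$; this $f$ equals its own conjugate.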

Weak duality

The optimal value of the Lagrange dual problem, which we denote $d^\star$, is, by definition, the best lower bound on $p^\star$ that can be obtained from the Lagrange dual function. In particular, we have the simple but important inequality

$$
d^\star \le p^\star,
$$

which holds even if the original problem is not convex. This property is called weak duality.

A starved camel is bigger than a horse.

Strong duality

$$
d^\star = p^\star
$$

Slater’s condition

For a convex problem

$$
\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le 0, \quad i = 1, \ldots, m \\
& Ax = b \quad (h_i(x) = 0, \ i = 1, \ldots, p),
\end{aligned}
$$

Slater's condition requires a strictly feasible point:

$$
\exists\, x \in \mathbf{relint}\, D : \quad f_i(x) < 0, \ i = 1, \ldots, m, \qquad Ax = b.
$$

If the problem is convex and Slater's condition holds, then strong duality holds, i.e. $d^\star = p^\star$.

Saddle-point interpretation

$$
\sup_{\lambda \succeq 0} \inf_{x} L(x, \lambda) = \inf_{x} \sup_{\lambda \succeq 0} L(x, \lambda)
$$

The point $(x^\star, \lambda^\star, \nu^\star)$ is a saddle point of the Lagrangian $L$ if and only if $x^\star$ and $(\lambda^\star, \nu^\star)$ are primal and dual optimal points with zero duality gap, in other words $p^\star - d^\star = 0$.

Complementary slackness

Assume strong duality holds, $x^\star$ is primal optimal, and $(\lambda^\star, \nu^\star)$ is dual optimal. Then

$$
\begin{aligned}
f_0(x^\star) = g(\lambda^\star, \nu^\star) &= \inf_{x} \left( f_0(x) + \sum_{i=1}^{m} \lambda_i^\star f_i(x) + \sum_{i=1}^{p} \nu_i^\star h_i(x) \right) \\
&\le f_0(x^\star) + \sum_{i=1}^{m} \lambda_i^\star f_i(x^\star) + \sum_{i=1}^{p} \nu_i^\star h_i(x^\star) \\
&\le f_0(x^\star).
\end{aligned}
$$

The important conclusion is that

$$
\sum_{i=1}^{m} \lambda_i^\star f_i(x^\star) = 0,
$$

and since each term $\lambda_i^\star f_i(x^\star)$ is nonpositive, every term must be zero.

We can express the complementary slackness condition as

$$
\lambda_i^\star > 0 \;\Longrightarrow\; f_i(x^\star) = 0, \qquad f_i(x^\star) < 0 \;\Longrightarrow\; \lambda_i^\star = 0
$$

KKT optimality conditions

$$
\begin{aligned}
f_i(x^\star) &\le 0, \quad i = 1, \ldots, m \\
h_i(x^\star) &= 0, \quad i = 1, \ldots, p \\
\lambda_i^\star &\ge 0, \quad i = 1, \ldots, m \\
\lambda_i^\star f_i(x^\star) &= 0, \quad i = 1, \ldots, m \\
\nabla f_0(x^\star) + \sum_{i=1}^{m} \lambda_i^\star \nabla f_i(x^\star) + \sum_{i=1}^{p} \nu_i^\star \nabla h_i(x^\star) &= 0,
\end{aligned}
$$

which are called the Karush-Kuhn-Tucker (KKT) conditions.

For any optimization problem with differentiable objective and constraint functions for which strong duality obtains, any pair of primal and dual optimal points must satisfy the KKT conditions (necessary condition).

Necessary condition: when $(x^\star, \lambda^\star, \nu^\star)$ are primal and dual optimal points of a problem with differentiable objective and constraint functions for which strong duality obtains, then $(x^\star, \lambda^\star, \nu^\star)$ must satisfy the KKT conditions.

When the primal problem is convex, the KKT conditions are also sufficient for the points to be primal and dual optimal.

Sufficient condition: when $(x, \lambda, \nu)$ satisfies the KKT conditions and the primal problem is convex, then $x$ and $(\lambda, \nu)$ are primal and dual optimal points.
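
A minimal worked example (my own, not from the original notes): minimize $x^2$ subject to $x \ge 1$, written with $f_1(x) = 1 - x \le 0$. The KKT conditions are

$$
1 - x^\star \le 0, \qquad \lambda^\star \ge 0, \qquad \lambda^\star (1 - x^\star) = 0, \qquad 2x^\star - \lambda^\star = 0.
$$

If $\lambda^\star = 0$ then $x^\star = 0$, which is infeasible, so complementary slackness forces $1 - x^\star = 0$; hence $x^\star = 1$, $\lambda^\star = 2 \ge 0$, and since the problem is convex these points are primal and dual optimal with $p^\star = d^\star = 1$.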

Perturbation and sensitivity analysis

Perturbed version

$$
\begin{aligned}
\text{minimize} \quad & f_0(x) \\
\text{subject to} \quad & f_i(x) \le u_i, \quad i = 1, \ldots, m \\
& h_i(x) = w_i, \quad i = 1, \ldots, p
\end{aligned}
$$

We define $p^\star(u, w)$ as the optimal value of the perturbed problem:

$$
p^\star(u, w) = \inf_{x} \left\{ f_0(x) \;:\; x \in D, \ f_i(x) \le u_i, \ i = 1, \ldots, m, \ h_i(x) = w_i, \ i = 1, \ldots, p \right\}, \qquad p^\star(0, 0) = p^\star
$$

When the original problem is convex, the function $p^\star(u, w)$ is a convex function of $u$ and $w$.


Now we assume that strong duality holds and that the dual optimum is attained. This is the case if the original problem is convex and Slater's condition is satisfied. Let $(\lambda^\star, \nu^\star)$ be optimal for the dual of the unperturbed problem. Then for all $u$ and $w$ we have

$$
p^\star(u, w) \ge p^\star(0, 0) - \lambda^{\star T} u - \nu^{\star T} w
$$

Algorithm

Steepest descent method

$$
\min_{x} f(x)
$$

First, we make an equivalent substitution around the current iterate:

$$
\min_{v} f(x^{(k)} + v)
$$

Taylor expansion:

$$
f(x^{(k)} + v) \approx f(x^{(k)}) + \nabla f(x^{(k)})^T v
$$

Then we get the direction:

$$
d^{(k+1)} = \arg\min_{v} \left\{ f(x^{(k)}) + \nabla f(x^{(k)})^T v \;:\; \|v\| = 1 \right\}
$$

Dual norm

$$
\|z\|_* = \sup\{ z^T x \;:\; \|x\| \le 1 \}
$$

We just need the direction, that's all.
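
A minimal sketch of the idea in code (my own, assuming the Euclidean norm, under which the steepest-descent direction is the normalized negative gradient); with a fixed step length the iterates only reach a neighbourhood of the minimizer, and in practice a line search would choose the step.

```python
import numpy as np

def steepest_descent(grad, x0, step=0.05, tol=1e-6, max_iter=1000):
    """Steepest descent under the Euclidean norm: d = -grad f(x) / ||grad f(x)||."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        norm_g = np.linalg.norm(g)
        if norm_g < tol:
            break
        d = -g / norm_g          # unit-norm descent direction (only the direction matters)
        x = x + step * d
    return x

# Example objective: f(x) = (x1 - 1)^2 + 2*(x2 + 2)^2, with minimizer (1, -2)
grad_f = lambda x: np.array([2 * (x[0] - 1), 4 * (x[1] + 2)])
print(steepest_descent(grad_f, x0=[5.0, 5.0]))   # lands close to (1, -2)
```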

Newton's method

The second-order Taylor approximation $\hat{f}$ of the real $f$ at the point $x^{(k)}$ is

$$
f(x) \approx \hat{f}(x) = f(x^{(k)}) + \nabla f(x^{(k)})^T (x - x^{(k)}) + \tfrac{1}{2}(x - x^{(k)})^T \nabla^2 f(x^{(k)}) (x - x^{(k)}),
$$

$$
\nabla^2 f(x) \in S_{++}^n, \quad x \in D
$$

The function $\hat{f}(x)$ is a convex quadratic function of $x$.

The necessary and sufficient optimality condition is:

$$
\nabla \hat{f}(x) = \frac{\partial \hat{f}(x)}{\partial x} = 0 \;\Longrightarrow\; \nabla f(x^{(k)}) + \nabla^2 f(x^{(k)}) (x - x^{(k)}) = 0
$$

The Hessian matrix is invertible because it is positive definite.

$$
x = x^{(k)} - \left[\nabla^2 f(x^{(k)})\right]^{-1} \nabla f(x^{(k)}), \qquad x_{\text{new}} = x^{(k+1)} = x
$$

The vector

$$
\Delta x_{\text{nt}} = -\left[\nabla^2 f(x^{(k)})\right]^{-1} \nabla f(x^{(k)})
$$

is called the Newton step.

Interpretation

We can find the exact minimizer of $\hat{f}$, but $\hat{f}$ is only an approximation of the real $f$.

If the real $f(x)$ is quadratic, then $x_{\text{new}}$ is the exact minimizer of $f$. If $f$ is nearly quadratic, intuition suggests that $x_{\text{new}}$ should be a very good estimate of the minimizer of $f$.

Since the real $f$ is twice differentiable, the quadratic model $\hat{f}$ will be very accurate when $x^{(k)}$ is near $x^\star$. It follows that when $x^{(k)}$ is near $x^\star$, the point $x_{\text{new}}$ should be a very good estimate of $x^\star$.

Interpretation

The Newton step can also be obtained by linearizing the optimality condition $\nabla f(x) = 0$ around $x^{(k)}$: $\nabla f(x^{(k)} + v) \approx \nabla f(x^{(k)}) + \nabla^2 f(x^{(k)}) v = 0$, whose solution is $v = \Delta x_{\text{nt}}$. Even so, this direction approximately points towards the true optimum of $f$.
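
A minimal sketch of the pure Newton iteration in code (my own illustration; a practical implementation would add a backtracking line search and a stopping test based on the Newton decrement).

```python
import numpy as np

def newton_method(grad, hess, x0, tol=1e-10, max_iter=50):
    """Pure Newton iteration: x <- x + dx_nt with dx_nt = -[hess f(x)]^{-1} grad f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        dx_nt = -np.linalg.solve(hess(x), g)   # solve H dx = -g rather than inverting H
        x = x + dx_nt
    return x

# Example: f(x) = exp(x1 + x2 - 1) + x1^2 + x2^2, whose Hessian is positive definite everywhere.
def grad_f(x):
    e = np.exp(x[0] + x[1] - 1)
    return np.array([e + 2 * x[0], e + 2 * x[1]])

def hess_f(x):
    e = np.exp(x[0] + x[1] - 1)
    return np.array([[e + 2.0, e], [e, e + 2.0]])

print(newton_method(grad_f, hess_f, x0=[2.0, -1.0]))
```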


Appendices

The minimization over x:

$$
g(\lambda) = \inf_{x \in D} L(x, \lambda)
$$

The maximization over $\lambda \succeq 0$:

$$
h(x) = \sup_{\lambda \succeq 0} L(x, \lambda)
$$

Matrix split

$$
x = x^+ - x^-, \qquad x^+ \succeq 0, \quad x^- \succeq 0
$$
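
As a standard illustration of why this split is useful (added here, not from the original notes), a linear program with a free variable $x$,

$$
\text{minimize } c^T x \quad \text{subject to } Ax \le b,
$$

is equivalent to a problem with only nonnegative variables:

$$
\text{minimize } c^T (x^+ - x^-) \quad \text{subject to } A(x^+ - x^-) \le b, \quad x^+ \succeq 0, \quad x^- \succeq 0.
$$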

Dual norm

$$
\|z\|_* = \sup\{ z^T x \;:\; \|x\| \le 1 \}
$$