Lecture 4 Model Fitting and Optimization¶
Optimization¶
Target¶
The general form of an optimization problem is
\[
\begin{aligned}
\min_x \quad & f_0(x) \\
\text{s.t.} \quad & f_i(x) \le 0, \quad i = 1, \dots, m, \\
& g_j(x) = 0, \quad j = 1, \dots, p,
\end{aligned}
\]
where
- \(x \in \mathbb{R}^n\) is a vector variable to be chosen.
- \(f_0\) is the objective function to be minimized.
- \(f_1, \dots, f_m\) are the inequality constraint functions.
- \(g_1, \dots, g_p\) are the equality constraint functions.
Model Fitting¶
A typical approach: minimize the Mean Square Error (MSE)
\[
\hat x = \mathop{\arg\min}\limits_x \frac{1}{n}\sum_{i=1}^{n}(a_i^Tx - b_i)^2 = \mathop{\arg\min}\limits_x \frac{1}{n}||Ax - b||^2_2.
\]
Reasons to choose MSE¶
Key assumption: MSE is the MLE under Gaussian noise.
From Maximum Likelihood Estimation (MLE), the data is assumed to be corrupted by Gaussian noise: \(b_i = a_i^Tx + n_i\) with \(n_i \sim \mathcal{N}(0, \sigma^2)\).
The likelihood of observing \((a_i, b_i)\) is
\[
P(b_i \mid x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(a_i^Tx - b_i)^2}{2\sigma^2}\right).
\]
If the data points are independent, the joint likelihood is the product
\[
\prod_i P(b_i \mid x) \propto \exp\left(-\frac{||Ax - b||^2_2}{2\sigma^2}\right),
\]
so maximizing the (log-)likelihood is equivalent to minimizing \(||Ax - b||^2_2\), i.e., the MSE.
Numerical Methods¶
Analytical Solution 解析解¶
The derivative of \(||Ax-b||^2_2\) is \(2A^T(Ax - b)\); setting it to \(0\), we get \(\hat x\) satisfying the normal equation \(A^TA\hat x = A^Tb\), i.e., \(\hat x = (A^TA)^{-1}A^Tb\) when \(A^TA\) is invertible.
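A minimal numerical sketch of this closed-form solution, assuming NumPy; the helper name `solve_least_squares` and the toy line-fitting data are illustrative. `np.linalg.lstsq` is used rather than inverting \(A^TA\) explicitly, which is numerically safer:

```python
import numpy as np

def solve_least_squares(A: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Solve min_x ||Ax - b||_2^2 (equivalent to the normal equation)."""
    # lstsq handles rank-deficient A more gracefully than inv(A.T @ A) @ A.T @ b.
    x_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x_hat

# Example: fit a line b = k*a + c to noisy data.
rng = np.random.default_rng(0)
a = rng.uniform(0, 10, size=50)
b = 2.0 * a + 1.0 + rng.normal(0, 0.1, size=50)
A = np.stack([a, np.ones_like(a)], axis=1)   # design matrix [a, 1]
print(solve_least_squares(A, b))             # approximately [2.0, 1.0]
```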
If no analytical solution exists, or it is too expensive to compute, we have to settle for an approximate (iterative) solution.
Approximate Solution 近似解¶
Method (sketched in code after this list):
- \(x \leftarrow x_0\) (initialization);
- while not converged:
    - \(p \leftarrow \text{descending\_direction}(x)\),
    - \(\alpha \leftarrow \text{descending\_step}(x)\),
    - \(x \leftarrow x + \alpha p\).
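A minimal Python sketch of this generic scheme, assuming NumPy; `descending_direction` and `descending_step` are placeholder callables that each method below (gradient descent, Newton, Gauss-Newton, LM) instantiates differently. Unlike the pseudocode, the step function here is also given the direction \(p\) for convenience:

```python
import numpy as np

def iterative_descent(descending_direction, descending_step,
                      x0, tol=1e-8, max_iter=1000):
    """Generic template: x <- x + alpha * p until convergence."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        p = descending_direction(x)      # e.g. -gradient for gradient descent
        if np.linalg.norm(p) < tol:      # converged: no descent direction left
            break
        alpha = descending_step(x, p)    # step size (line search, fixed, ...)
        x = x + alpha * p
    return x
```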
Gradient Descent (GD)¶
Steepest Descent Method¶
- Direction: \(\Delta x = -J^T_F\), i.e., the negative gradient direction.
- Step: choose \(\alpha\) by line search to minimize \(\phi(\alpha) = F(x + \alpha\Delta x)\), i.e., \(\alpha = \mathop{\arg\min}\limits_{\alpha > 0}\phi(\alpha)\) (a backtracking variant is sketched below).
- Advantage
    - Easy to implement.
    - Performs well when far from the minimum.
- Disadvantage
    - Converges slowly when near the minimum, which wastes a lot of computation.
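A minimal sketch of steepest descent, assuming NumPy; the exact line search above is replaced by a backtracking (Armijo) line search, a common practical substitute, and the quadratic test problem is illustrative:

```python
import numpy as np

def backtracking_step(F, grad, x, p, alpha0=1.0, beta=0.5, c=1e-4):
    """Shrink alpha until F decreases enough (Armijo condition)."""
    alpha = alpha0
    while F(x + alpha * p) > F(x) + c * alpha * grad(x) @ p:
        alpha *= beta
    return alpha

def gradient_descent(F, grad, x0, tol=1e-8, max_iter=10000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        p = -grad(x)                       # steepest-descent direction
        if np.linalg.norm(p) < tol:
            break
        x = x + backtracking_step(F, grad, x, p) * p
    return x

# Illustrative quadratic: F(x) = ||Ax - b||_2^2
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
F = lambda x: np.sum((A @ x - b) ** 2)
grad = lambda x: 2 * A.T @ (A @ x - b)
print(gradient_descent(F, grad, np.zeros(2)))
```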
Newton Method¶
- Direction \(\Delta x = -H_F^{-1}J^T_F\) (sketched below).
- Advantage
    - Faster convergence.
- Disadvantage
    - The Hessian matrix \(H_F\) is expensive to compute.
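A minimal sketch of the Newton iteration, assuming the gradient and Hessian of \(F\) are available as callables (`grad` and `hess` are illustrative names); the linear system is solved instead of forming \(H_F^{-1}\) explicitly:

```python
import numpy as np

def newton_method(grad, hess, x0, tol=1e-10, max_iter=100):
    """Newton iteration: x <- x - H_F^{-1} J_F^T."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H * dx = -g instead of computing the inverse of H.
        dx = np.linalg.solve(hess(x), -g)
        x = x + dx
    return x

# Same illustrative quadratic as above: Newton converges in one step.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 0.0])
grad = lambda x: 2 * A.T @ (A @ x - b)
hess = lambda x: 2 * A.T @ A
print(newton_method(grad, hess, np.zeros(2)))
```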
Gauss-Newton Method¶
For \(\hat x = \mathop{\arg\min}\limits_x ||Ax-b||^2_2 \overset{\Delta}{=\!=}\mathop{\arg\min}\limits_x||R(x)||^2_2\), expand \(R(x)\) to first order, \(R(x_k + \Delta x) \approx R(x_k) + J_R\Delta x\), and minimize \(||R(x_k) + J_R\Delta x||^2_2\) over \(\Delta x\).

- Direction \(\Delta x = -(J_R^TJ_R)^{-1}J_R^TR(x_k) = -(J_R^TJ_R)^{-1}J_F^T\) (sketched after this list).
    - Compared to the Newton method, \(J_R^TJ_R\) is used to approximate \(H_F\).
- Advantage
    - No need to compute the Hessian matrix.
    - Faster convergence.
- Disadvantage
    - \(J^T_RJ_R\) may be singular.
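A minimal sketch of Gauss-Newton on an illustrative nonlinear least-squares problem, fitting \(b = \exp(ma + c)\); the model, toy data, and names are assumptions made for this example:

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, tol=1e-10, max_iter=100):
    """Gauss-Newton: dx = -(J^T J)^{-1} J^T R(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        R, J = residual(x), jacobian(x)
        # Solve the normal equations (J^T J) dx = -J^T R.
        dx = np.linalg.solve(J.T @ J, -J.T @ R)
        x = x + dx
        if np.linalg.norm(dx) < tol:
            break
    return x

# Toy data drawn from b = exp(0.3*a + 0.1) plus noise.
rng = np.random.default_rng(0)
a = np.linspace(0, 1, 50)
b = np.exp(0.3 * a + 0.1) + rng.normal(0, 0.01, size=a.size)

residual = lambda x: np.exp(x[0] * a + x[1]) - b           # R(x)
jacobian = lambda x: np.stack(                              # J_R(x)
    [a * np.exp(x[0] * a + x[1]), np.exp(x[0] * a + x[1])], axis=1)

print(gauss_newton(residual, jacobian, np.zeros(2)))        # roughly [0.3, 0.1]
```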
Levenberg-Marquardt (LM)¶
- Direction \(\Delta x = -(J_R^TJ_R + \lambda I)^{-1}J_R^TR(x_k)\) (sketched below).
    - \(\lambda \rightarrow \infty\): behaves like Steepest Descent.
    - \(\lambda \rightarrow 0\): behaves like Gauss-Newton.
- Advantage
    - Starts quickly when far from the minimum (steepest-descent behavior) and converges quickly when near it (Gauss-Newton behavior); the damping term \(\lambda I\) also keeps \(J_R^TJ_R + \lambda I\) non-singular.
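A minimal Levenberg-Marquardt sketch built on the same `residual`/`jacobian` interface as the Gauss-Newton example; the rule for adapting \(\lambda\) (halve it after a successful step, double it after a failed one) is one common heuristic, not the only choice:

```python
import numpy as np

def levenberg_marquardt(residual, jacobian, x0, lam=1e-3,
                        tol=1e-10, max_iter=200):
    """LM: dx = -(J^T J + lam*I)^{-1} J^T R, with adaptive lam."""
    x = np.asarray(x0, dtype=float)
    cost = lambda x: np.sum(residual(x) ** 2)
    for _ in range(max_iter):
        R, J = residual(x), jacobian(x)
        dx = np.linalg.solve(J.T @ J + lam * np.eye(x.size), -J.T @ R)
        if np.linalg.norm(dx) < tol:
            break
        if cost(x + dx) < cost(x):
            x, lam = x + dx, lam * 0.5   # good step: trust the Gauss-Newton model more
        else:
            lam *= 2.0                   # bad step: lean toward steepest descent
    return x
```

With the exponential-fit `residual` and `jacobian` from the previous sketch, `levenberg_marquardt(residual, jacobian, np.zeros(2))` should return essentially the same estimate as Gauss-Newton, while remaining stable even when \(J_R^TJ_R\) is near-singular.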
Convex Optimization¶
(Remaining topic; not expanded in these notes.)
Robust Estimation¶
Inliers obey the model assumption; outliers differ significantly from it.
Outliers make MSE fail. To reduce their effect, we can use other loss functions, called robust functions (one common example is sketched below).
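One widely used robust function (an illustrative choice; the lecture does not commit to a specific one here) is the Huber loss, which is quadratic for small residuals and linear for large ones, so outliers contribute far less than under MSE:

```python
import numpy as np

def huber_loss(r: np.ndarray, delta: float = 1.0) -> np.ndarray:
    """Quadratic for |r| <= delta, linear beyond: large residuals
    (likely outliers) are penalized much less than by the squared loss."""
    quad = 0.5 * r**2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)
```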
RANSAC (RANdom SAmple Consensus)¶
One of the most powerful methods to handle outliers.
Key ideas (see the sketch after this list):
- The distribution of inliers is similar, while outliers differ a lot.
- Use pairs of data points to propose candidate models, and let all data points vote for the best one.
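A minimal RANSAC sketch for fitting a 2D line; the sample size of two points, the iteration count, and the inlier threshold are illustrative parameters:

```python
import numpy as np

def ransac_line(points, n_iters=200, threshold=0.05, rng=None):
    """Find inliers of a line a*x + b*y + c = 0 among points of shape (N, 2)."""
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        # Sample a minimal set: two points define a candidate line.
        p1, p2 = points[rng.choice(len(points), size=2, replace=False)]
        a, b = p2[1] - p1[1], p1[0] - p2[0]
        norm = np.hypot(a, b)
        if norm == 0:
            continue
        a, b = a / norm, b / norm
        c = -(a * p1[0] + b * p1[1])
        # Each point "votes": is its distance to the candidate line small enough?
        dist = np.abs(points @ np.array([a, b]) + c)
        inliers = dist < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # The final model is usually refit (e.g., by least squares) on these inliers.
    return best_inliers
```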
Ill-posed Problem 病态问题 / 多解问题¶
Ill-posed problems are problems whose solution is not unique. To make the solution unique:
- L2 regularization (a closed-form sketch follows this list)
    - Suppress redundant variables.
    - \(\min_x||Ax-b||^2_2,\ \text{s.t.}\ ||x||_2 \le 1\) (equivalent to \(\min_x||Ax-b||^2_2 + \lambda||x||_2\)).
- L1 regularization
    - Make \(x\) sparse.
    - \(\min_x||Ax-b||^2_2,\ \text{s.t.}\ ||x||_1 \le 1\) (equivalent to \(\min_x||Ax-b||^2_2 + \lambda||x||_1\)).
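A minimal sketch of the penalized form of L2 regularization (ridge regression), assuming NumPy; note that it uses the squared penalty \(\lambda||x||_2^2\), the variant with a simple closed form, which differs slightly from the unsquared \(\lambda||x||_2\) written above:

```python
import numpy as np

def ridge_solution(A: np.ndarray, b: np.ndarray, lam: float) -> np.ndarray:
    """Minimize ||Ax - b||_2^2 + lam * ||x||_2^2 in closed form."""
    n = A.shape[1]
    # For lam > 0, (A^T A + lam*I) is always invertible, which removes
    # the non-uniqueness of the unregularized ill-posed problem.
    return np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ b)
```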
Graphcut and MRF¶
A key idea: Neighboring pixels tend to take the same label.
Images as Graphs
- A vertex for each pixel.
- An edge between each pair, weighted by the affinity or similarity between its two vertices.
- Pixel Dissimilarity \(S(\textbf{f}_i, \textbf{f}_j) = \sqrt{\sum_k(f_{ik} - f_{jk})^2}\).
- Pixel Affinity \(w(i, j) = A(\textbf{f}_i, \textbf{f}_j) = \exp\left(-\dfrac{1}{2\sigma^2}S(\textbf{f}_i, \textbf{f}_j)\right)\) (computed in the sketch after this list).
- Graph Notation \(G = (V, E)\).
- \(V\): a set of vertices.
- \(E\): a set of edges.
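A minimal sketch that assembles the weight matrix \(w(i, j)\) for a small image using the dissimilarity \(S\) and affinity \(A\) defined above; restricting edges to 4-connected neighbors (right/down) is an illustrative choice to keep the graph sparse:

```python
import numpy as np

def affinity_graph(img: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Weight matrix W[i, j] = exp(-S(f_i, f_j) / (2 sigma^2)) for 4-neighbors."""
    h, wid = img.shape[:2]
    feats = img.reshape(h * wid, -1).astype(float)   # one feature vector per pixel
    W = np.zeros((h * wid, h * wid))
    for r in range(h):
        for c in range(wid):
            i = r * wid + c
            for dr, dc in ((0, 1), (1, 0)):          # right and down neighbors
                rr, cc = r + dr, c + dc
                if rr < h and cc < wid:
                    j = rr * wid + cc
                    S = np.linalg.norm(feats[i] - feats[j])            # dissimilarity
                    W[i, j] = W[j, i] = np.exp(-S / (2 * sigma**2))    # affinity
    return W
```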
Graph Cut¶
Cut \(C=(V_A, V_B)\) is a partition of the vertices \(V\) of a graph \(G\) into two disjoint subsets \(V_A\) and \(V_B\).
Cost of Cut: \(\text{cut}(V_A, V_B) = \sum\limits_{u\in V_A, v\in V_B} w(u, v)\).
Solution: Normalized Cut, which normalizes the cut cost by how strongly the vertices \(V_A\) (resp. \(V_B\)) are associated with all vertices \(V\):
\[
\text{Ncut}(V_A, V_B) = \frac{\text{cut}(V_A, V_B)}{\text{assoc}(V_A, V)} + \frac{\text{cut}(V_A, V_B)}{\text{assoc}(V_B, V)},
\quad \text{where } \text{assoc}(V_A, V) = \sum\limits_{u\in V_A, t\in V} w(u, t).
\]
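A minimal sketch that evaluates the cut cost and the normalized cut for a given partition; it assumes a symmetric weight matrix `W` (e.g., from the affinity sketch above) and a boolean mask `in_A` marking the vertices in \(V_A\):

```python
import numpy as np

def cut_cost(W: np.ndarray, in_A: np.ndarray) -> float:
    """cut(V_A, V_B): total weight of edges crossing the partition."""
    return W[np.ix_(in_A, ~in_A)].sum()

def ncut(W: np.ndarray, in_A: np.ndarray) -> float:
    """Normalized cut of the partition (V_A, V_B)."""
    cut = cut_cost(W, in_A)
    assoc_A = W[in_A, :].sum()      # assoc(V_A, V)
    assoc_B = W[~in_A, :].sum()     # assoc(V_B, V)
    return cut / assoc_A + cut / assoc_B
```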
Markov Random Field (MRF)¶
Graph cut can be seen as a special case of MRF.