- Primal & Dual Problem
- How to re-formulate the Primal Problem into the Dual Problem (which is easier to solve)
Given
- Training Sample: $\mathcal{D}=\{(\mathbf{x}_{t}, y_{t})\}_{t=1}^{N}$
- Instances: $\mathbf{x}_{t}\in \mathbb{R}^m$
- Labels: $y_{t}\in\{+1,-1\}$

Do
- Train a prediction function:
$$h:\mathcal{X}\mapsto\{+1,-1\}$$

One intuition is to use a Linear Discriminant Function:
$$ f(\mathbf{x})=\mathbf{w}^\top\mathbf{x}+b $$
where the decision boundary is given by:
$$ \mathbf{w}^\top\mathbf{x}+b=0 $$
- That is, $\mathbf{w}$ is perpendicular to any vector lying within the hyperplane (for any two points $\mathbf{x}_1,\mathbf{x}_2$ on the boundary, $\mathbf{w}^\top(\mathbf{x}_1-\mathbf{x}_2)=0$);
- which means that $\mathbf{w}$ is normal to the hyperplane.
We need to know the distance from an arbitrary data point to the hyperplane.
Consider an arbitrary data point $\mathbf{x}$, where:
- $\mathbf{x}_p$ is the orthogonal projection of point $\mathbf{x}$ onto the hyperplane;
- $\frac{\mathbf{w}}{\|\mathbf{w}\|}$ is the unit vector with the same direction as $\mathbf{w}$;
- $\rho$ is the distance from the point to the hyperplane.

We could derive the distance as follows: writing $\mathbf{x}=\mathbf{x}_p+\rho\,\frac{\mathbf{w}}{\|\mathbf{w}\|}$ and using $\mathbf{w}^\top\mathbf{x}_p+b=0$ gives $f(\mathbf{x})=\mathbf{w}^\top\mathbf{x}+b=\rho\,\|\mathbf{w}\|$, i.e. $\rho=\dfrac{f(\mathbf{x})}{\|\mathbf{w}\|}$.
The prediction function is defined as:
$$ \begin{align} h(\mathbf{x}) &= \text{sign}(f(\mathbf{x})) \\ &=\text{sign}(\mathbf{w}^\top\mathbf{x}+b) \end{align} $$
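As a minimal sketch (with assumed toy values for $\mathbf{w}$, $b$, and $\mathbf{x}$, not taken from the notes), the discriminant and prediction functions can be written directly in NumPy:

```python
import numpy as np

# Hypothetical toy parameters, purely for illustration.
w = np.array([2.0, -1.0])   # weight vector (normal to the hyperplane)
b = -0.5                    # bias

def f(x):
    """Linear discriminant f(x) = w^T x + b."""
    return np.dot(w, x) + b

def h(x):
    """Prediction h(x) = sign(f(x)), mapped to {+1, -1}."""
    return 1 if f(x) >= 0 else -1

x = np.array([1.0, 0.5])
print(f(x), h(x))           # 1.0 and +1 for these toy values
```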
Given:
- An example $(\mathbf{x}_t, y_t)\in\mathbb{R}^m\times \{+1,-1\}$.

Do:
- [i] The functional margin of this example with respect to the hyperplane $\mathbf{w}^\top\mathbf{x}+b=0$ is given by:
$$ \begin{align} \rho_t &= y_t\cdot f(\mathbf{x}_t) \\ &= y_t\cdot (\mathbf{w}^\top\mathbf{x}_t+b) \end{align} $$
The functional margin can tell whether a data point is Incorrectly Classified:
- $y_t$ is the ground truth.
- If the data point $\mathbf{x}_t$ is correctly classified, $y_t$ and $\mathbf{w}^\top\mathbf{x}_t+b$ should have the same sign;
- i.e. if the data point $\mathbf{x}_t$ is correctly classified, $\rho_t\geq0$, otherwise $\rho_t <0$.
- [i] The geometric margin of this example with respect to the hyperplane $\mathbf{w}^\top\mathbf{x}+b=0$ is given by:
$$ \begin{align} \rho_t &= y_t\cdot\frac{f(\mathbf{x}_t)}{\|\mathbf{w}\|} \\ &=y_t\cdot\frac{\mathbf{w}^\top\mathbf{x}_t+b}{\|\mathbf{w}\|} \end{align} $$
This tells not only whether the data point is incorrectly classified, but also preserves the Euclidean distance from the point to the hyperplane.
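A small NumPy sketch of both margins for a single example (toy values assumed):

```python
import numpy as np

w = np.array([2.0, -1.0])            # assumed weights
b = -0.5                             # assumed bias
x_t = np.array([1.0, 0.5])           # one example
y_t = +1                             # its label

f_xt = w @ x_t + b                                   # discriminant value f(x_t)
functional_margin = y_t * f_xt                       # rho_t = y_t * f(x_t)
geometric_margin = y_t * f_xt / np.linalg.norm(w)    # rho_t / ||w||: signed Euclidean distance

print(functional_margin, geometric_margin)
# functional_margin > 0  ->  the example is correctly classified
```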
Note that, for arbitrary $c>0$, $(c\,\mathbf{w}, c\,b)$ describes the same hyperplane, so we can fix the scale and work with the canonical hyperplane.
That is, this hyperplane is defined such that:
- the minimum functional margin from an arbitrary point to the hyperplane is exactly $1$;
- i.e., the minimum Euclidean distance from an arbitrary point to the hyperplane is exactly $\frac{1}{\|\mathbf{w}\|}$ (see the rescaling sketch below).

The canonical form requires that:
- the classification should not only be correct;
- it should also be robust.
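The canonical scale can be obtained by rescaling any separating $(\mathbf{w}, b)$ so that the smallest functional margin becomes $1$; a minimal NumPy sketch with assumed toy data:

```python
import numpy as np

# Toy linearly separable data, assumed purely for illustration.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1, +1, -1, -1])

w, b = np.array([1.0, 1.0]), 0.0       # any separating hyperplane

margins = y * (X @ w + b)              # functional margins of all points
s = 1.0 / margins.min()                # rescaling factor
w_c, b_c = s * w, s * b                # canonical (w, b): same hyperplane, fixed scale

print((y * (X @ w_c + b_c)).min())     # minimum functional margin is now exactly 1
print(1.0 / np.linalg.norm(w_c))       # minimum Euclidean distance 1/||w_c||
```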
![[SVM_InifiniteOptions.png|350]]
- [I] Note that: If the two classes are sufficiently separated, we can easily find infinitely many options that satisfy the canonical form.
However, to ensure the model's robustness on unseen data, we need to search for a hyperplane that:
- Maximizes the geometric margin $\frac{1}{\|\mathbf{w}\|}$;
- while maintaining the property $y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq 1, \ \forall i$.

We need a classifier that gives us the maximum margin.
We need to find optimized weights $\mathbf{w}^*$ and bias $b^*$. Since maximizing $\frac{1}{\|\mathbf{w}\|}$ is equivalent to minimizing $\frac{1}{2}\|\mathbf{w}\|^2$, the Primal Problem is:
$$ \begin{align} \text{Minimize: }& \ \frac{1}{2}\|\mathbf{w}\|^2 \\ \text{Maintain: }& \ \forall i,\ y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq 1 \end{align} $$
We use the Primal Lagrangian to combine the objective and the constraints (a small numerical sketch follows the list below). The Primal Lagrangian is given by:
$$ \mathcal{L}(\mathbf{w}, b, \alpha)=\frac{1}{2}\|\mathbf{w}\|^{2}+\sum_{i=1}^{N}\alpha_{i}\Bigl(1-y_{i}(\mathbf{w}^\top\mathbf{x}_{i}+b)\Bigr) $$
We need to Minimize this function.
We introduce a Lagrange multiplier $\alpha_i\geq 0$ for each constraint:
- Penalizing the objective if the constraint is violated.
- If the constraint is violated, $1-y_i(\mathbf{w}^\top\mathbf{x}_i+b)$ would be larger than $0$.
    - The minimization is then penalized accordingly.
- The more a constraint is violated, the larger the term added to the Lagrangian, scaled by the penalization factor $\alpha_{i}$.
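A minimal NumPy sketch that evaluates this Lagrangian for assumed values of $\mathbf{w}$, $b$, and $\alpha$ (toy data, for illustration only):

```python
import numpy as np

def primal_lagrangian(w, b, alpha, X, y):
    """L(w, b, alpha) = 1/2 ||w||^2 + sum_i alpha_i * (1 - y_i (w^T x_i + b))."""
    constraint_terms = 1.0 - y * (X @ w + b)   # positive exactly where a constraint is violated
    return 0.5 * np.dot(w, w) + np.dot(alpha, constraint_terms)

# Assumed toy values.
X = np.array([[2.0, 2.0], [-1.0, -3.0]])
y = np.array([+1.0, -1.0])
w, b = np.array([0.5, 0.5]), 0.0
alpha = np.array([0.1, 0.1])
print(primal_lagrangian(w, b, alpha, X, y))
```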
The original Primal Lagrangian has a total of $m+1+N$ parameters:
- A weight $\mathbf{w}\in\mathbb{R}^{m}$;
- A bias $b$;
- The $N$ Lagrangian parameters $\{\alpha_{i}\}_{i=1}^{N}$ corresponding to the $N$ datapoints.

With this many parameters, minimizing this function is costly.
However, we could approximate the optimal solution by converting the original Primal Lagrangian into a Dual Problem. To solve the Dual Problem:
- First, optimize $\mathcal{L}(\mathbf{w},b,\alpha)$ w.r.t. $\mathbf{w}$ and $b$:
    - Assume that all the Lagrangian parameters $\alpha_i$ are already found.
    - Then, minimize the function w.r.t. $\mathbf{w}$ and $b$, regarding all the $\alpha_i$ as constants.
    - The original function $\mathcal{L}(\mathbf{w},b,\alpha)$ is thus converted to $\mathcal{G}(\alpha)$, a function that only contains the Lagrangian parameters.
- After this, optimize $\mathcal{G}(\alpha)$ w.r.t. $\{\alpha_{i}\}_{i=1}^{N}$.

The benefit of the Dual Problem can be observed in practice, but it is not formally proven here. Note that the solution to the Dual Problem can be close to that of the Primal Problem, but the two are not guaranteed to be exactly the same.
To optimize, we minimize $\mathcal{L}$ by setting $\dfrac{\partial\mathcal{L}}{\partial \mathbf{w}}=0$ and $\dfrac{\partial \mathcal{L}}{\partial b}=0$ .
$$
\begin{align}
&\frac{\partial\mathcal{L}}{\partial\mathbf{w}} = 0
\\
\implies& \frac{\partial}{\partial\mathbf{w}}\Bigl[\frac{1}{2}\|\mathbf{w}\|^2+\sum_{i=1}^{N}\alpha_i\Bigl(1-y_i(\mathbf{w}^\top\mathbf{x}_i+b)\Bigr)\Bigr] = 0
\\
\implies & \mathbf{w}-\sum_{i=1}^{N}\alpha_i y_i\mathbf{x}_i = 0
\\
\implies & \mathbf{w} = \sum_{i=1}^{N}(\alpha_i \cdot y_i)\cdot\mathbf{x}_i
\end{align}
$$
$$ \begin{align} & \frac{\partial\mathcal{L}}{\partial b} = 0 \\ \implies& \frac{\partial}{\partial b}\Bigl[\frac{1}{2}\|\mathbf{w}\|^2+\sum_{i=1}^{N}\alpha_i\Bigl(1-y_i(\mathbf{w}^\top\mathbf{x}_i+b)\Bigr)\Bigr] = 0 \\ \implies& 0+\frac{\partial}{\partial b}\sum_{i=1}^{N}\Bigl(\alpha_i-\alpha_i y_i\mathbf{w}^\top\mathbf{x}_i-\alpha_i y_i b\Bigr) = 0 \\ \implies& \sum_{i=1}^{N}\bigl(0-\alpha_i y_i\bigr)=0 \\ \implies& \sum_{i=1}^{N}\alpha_i\cdot y_i=0 \end{align} $$
- [i] In summary, the first optimization yields the following:
$$ \begin{align} & \mathbf{w}^{*}=\sum_{i=1}^{N}(\alpha_{i}\cdot y_{i})\cdot\mathbf{x}_{i} \\ & \sum_{i=1}^{N}\alpha_{i}\cdot y_{i} = 0 \end{align} $$
Substituting $\mathbf{w}^{*}$ into $\mathcal{L}$ yields:
$$ \begin{align} \mathcal{L}(\mathbf{w}^{*},b,\alpha) &= \frac{1}{2}\|\mathbf{w}^{*}\|^2+\sum_{i=1}^{N}\alpha_{i}\Bigl(1-y_{i}(\mathbf{w}^{*\top}\mathbf{x}_{i}+b)\Bigr) \end{align} $$
Respectively:
- First, we substitute the first term:
$$ \begin{align} \frac{1}{2}\|\mathbf{w}^{*}\|^2 &= \frac{1}{2} \Bigl[\sum_{i=1}^{N}(\alpha_{i}\cdot y_{i})\cdot\mathbf{x}_{i}\Bigr]^\top \Bigl[\sum_{j=1}^{N}(\alpha_{j}\cdot y_{j})\cdot\mathbf{x}_{j}\Bigr] \\ &= \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} (\alpha_{i}\cdot\alpha_{j}) \cdot (y_{i}\cdot y_{j}) \cdot (\mathbf{x}_{i}^\top\mathbf{x}_{j}) \end{align} $$
- Then, we substitute the second term:
$$ \begin{align} \sum_{i=1}^{N}\alpha_{i}\Bigl(1-y_{i}(\mathbf{w}^{*\top}\mathbf{x}_{i}+b)\Bigr) &= \sum_{i=1}^{N}\alpha_{i} \Bigl[1-y_{i}\Bigl(\Bigl[\sum_{j=1}^{N}\alpha_{j}\cdot y_{j} \cdot \mathbf{x}_{j}\Bigr]^\top\mathbf{x}_{i}+b\Bigr)\Bigr] \\ &= \sum_{i=1}^{N}\alpha_{i}-\sum_{i=1}^{N}\alpha_{i}\cdot y_{i}\Bigl(\Bigl[\sum_{j=1}^{N}\alpha_{j}\cdot y_{j} \cdot \mathbf{x}_{j}\Bigr]^\top\mathbf{x}_{i}+b\Bigr) \\ &= \sum_{i=1}^{N}\alpha_{i}-\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_{i}\cdot \alpha_{j})\cdot(y_{i}\cdot y_{j})\cdot(\mathbf{x}_{i}^\top\mathbf{x}_{j}) \end{align} $$
  where the $b$-term vanishes in the last step because $\sum_{i=1}^{N}\alpha_{i}y_{i}=0$.
Therefore, after optimizing with respect to the weights and bias, the function depends only on $\alpha$.
- [*] The new function is given by:
$$ \mathcal{G}(\alpha)=-\frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}(\alpha_{i}\cdot \alpha_{j})\cdot(y_{i}\cdot y_{j})\cdot(\mathbf{x}_{i}^\top\mathbf{x}_{j})+\sum_{i=1}^{N}\alpha_{i} $$
with the constraints:
$$ \sum_{i=1}^{N}\alpha_{i}y_{i}=0, \qquad \alpha_{i}\geq 0 \ \ \forall i $$
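A vectorized NumPy sketch of $\mathcal{G}(\alpha)$ (toy inputs assumed, for illustration only):

```python
import numpy as np

def dual_objective(alpha, X, y):
    """G(alpha) = sum_i alpha_i - 1/2 * sum_ij alpha_i alpha_j y_i y_j (x_i^T x_j)."""
    K = X @ X.T                          # Gram matrix of inner products x_i^T x_j
    Q = (y[:, None] * y[None, :]) * K    # y_i y_j x_i^T x_j
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

# Assumed toy data.
X = np.array([[2.0, 2.0], [-1.0, -3.0]])
y = np.array([+1.0, -1.0])
alpha = np.array([0.1, 0.1])
print(dual_objective(alpha, X, y))
```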
By the constraint, we know that:
$$
\begin{bmatrix}
\alpha_1 & \alpha_2 & \cdots & \alpha_N
\end{bmatrix}
\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_N
\end{bmatrix}=0
$$
The original $N$ Lagrangian parameters can be collected into a vector:
- $\alpha=\begin{bmatrix}\alpha_1 \\ \alpha_2 \\ \vdots \\ \alpha_N\end{bmatrix}$

To optimize $\mathcal{G}(\alpha)$, we maximize it with respect to $\alpha$ subject to these constraints, which is a quadratic programming problem.
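The notes do not prescribe a particular solver; as one possible sketch, the constrained maximization of $\mathcal{G}(\alpha)$ can be handed to a generic optimizer such as SciPy's SLSQP (an assumption here), by minimizing $-\mathcal{G}(\alpha)$:

```python
import numpy as np
from scipy.optimize import minimize

# Assumed toy, linearly separable data.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
N = len(y)
Q = (y[:, None] * y[None, :]) * (X @ X.T)      # Q_ij = y_i y_j x_i^T x_j

def neg_G(alpha):                               # minimizing -G(alpha) maximizes G(alpha)
    return 0.5 * alpha @ Q @ alpha - alpha.sum()

res = minimize(
    neg_G,
    x0=np.zeros(N),
    method="SLSQP",
    bounds=[(0.0, None)] * N,                             # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}], # sum_i alpha_i y_i = 0
)
alpha = res.x
w = (alpha * y) @ X                             # w* = sum_i alpha_i y_i x_i
print(alpha, w)
```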
Remark: The Primal Form:
$$
\mathcal{L}(\mathbf{w}, b, \alpha)=\frac{1}{2}\|\mathbf{w}\|^{2}+\sum_{i=1}^{N}\alpha_{i}\Bigl(1-y_{i}(\mathbf{w}^\top\mathbf{x}_{i}+b)\Bigr)
$$
We optimized it w.r.t. $\mathbf{w}$ and $b$, and then w.r.t. the Lagrangian parameters $\alpha$.
The optimized Lagrangian parameters will satisfy:
- $\alpha_i\neq 0$:
    - if $\mathbf{x}_i$ is a support vector,
    - i.e., $1-y_i(\mathbf{w}^\top\mathbf{x}_i+b)=0$.
- $\alpha_i=0$:
    - if $\mathbf{x}_i$ is not a support vector,
    - i.e., $1-y_i(\mathbf{w}^\top\mathbf{x}_i+b)\neq 0$.

Only support vectors influence the computation of $\mathbf{w}$.
Parametrically, this follows the KKT complementary slackness condition:
$$ \alpha_i\Bigl(1-y_i(\mathbf{w}^\top\mathbf{x}_i+b)\Bigr)=0 $$
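Given optimized multipliers, a sketch of recovering $\mathbf{w}$ and $b$ from the support vectors (the tolerance and the toy $\alpha$ values are assumptions for illustration):

```python
import numpy as np

def recover_w_b(alpha, X, y, tol=1e-8):
    """w = sum_i alpha_i y_i x_i; b from any support vector via y_s (w^T x_s + b) = 1."""
    w = (alpha * y) @ X
    sv = alpha > tol                   # support vectors: alpha_i "non-zero" up to a tolerance
    # For a support vector x_s: y_s (w^T x_s + b) = 1  =>  b = y_s - w^T x_s  (since y_s in {+1,-1})
    b = np.mean(y[sv] - X[sv] @ w)     # average over support vectors for numerical stability
    return w, b, sv

# Hypothetical multipliers, e.g. the output of a dual solver.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([+1.0, +1.0, -1.0, -1.0])
alpha = np.array([0.12, 0.0, 0.12, 0.0])
print(recover_w_b(alpha, X, y))
```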
Remark: In the Primal Optimization Problem, we want to find a hyperplane such that:
$$ \begin{align} \frac{1}{\|\mathbf{w}\|} &\text{ is maximized} \\ \forall i,\ y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq 1 &\text{ is maintained} \end{align} $$
Problem: Such a hyperplane may not always exist. ![[SVM_SoftMargin_Intuition.png|300]]
We assumed in 10.2 that the data is always sufficiently separated.
- However, there may be cases where the data points are slightly mixed together;
- i.e., not linearly separable.
The data samples are divided into 3 categories:
$$ \begin{align} \text{Correctly Classified: }& \ y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq 1 \\ \text{Correct but violated margin: }& \ y_i(\mathbf{w}^\top\mathbf{x}_i+b)\in [0,1] \\ \text{Incorrectly Classified: }& \ y_i(\mathbf{w}^\top\mathbf{x}_i+b)\leq 0 \end{align} $$
Since a single unified limit can't cover all the points, we vary the limit for each point.
- [i] We introduce slack variables $\xi_1,\xi_2,\cdots,\xi_N \geq 0$ for all data samples in $\mathcal{X}$.
    - We allow the property to be violated in a sample-wise manner.
    - We allow this property to be violated to different degrees; effectively, some data points are given a "back door".

$$ y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq1-\xi_i $$
Where we assign:
$$ \begin{align} \text{Correctly Classified: }& \ \xi_i=0 \\ \text{Correct but violated margin: }& \ \xi_i \in [0,1] \\ \text{Incorrectly Classified: }& \ \xi_i \geq 1 \end{align} $$
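A small sketch that computes, for each sample, the smallest slack $\xi_i$ satisfying $y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq 1-\xi_i$ and the resulting category (hyperplane and data are assumed toy values):

```python
import numpy as np

w, b = np.array([0.5, 0.5]), 0.0                 # assumed hyperplane
X = np.array([[2.0, 2.0], [0.5, 0.5], [1.0, -2.0]])
y = np.array([+1.0, +1.0, +1.0])

margins = y * (X @ w + b)                        # y_i (w^T x_i + b)
xi = np.maximum(0.0, 1.0 - margins)              # smallest slack satisfying the relaxed constraint

for m, s in zip(margins, xi):
    if s == 0:
        label = "correctly classified"           # margin >= 1
    elif s < 1:
        label = "correct but violates the margin"
    else:
        label = "incorrectly classified"         # margin <= 0
    print(f"margin={m:+.2f}  xi={s:.2f}  {label}")
```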
Given such allowances, even if some datapoints violate the hard margin, we can still find a hyperplane where:
$$ \begin{align} \text{Minimize: }& \ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i^2 \\ \text{Maintain: }& \ \forall i,\ y_i(\mathbf{w}^\top\mathbf{x}_i+b)\geq 1 -\xi_i \end{align} $$
- [i] Here, parameter $C$ is a user-selected regularization parameter.
    - It is a tradeoff between:
        - A larger margin, and
        - A smaller classification error.
- Effects of parameter $C$ (see the sketch after the figure below):
    - Smaller $C$: Larger Margin, More Errors
    - Larger $C$: Smaller Margin, Fewer Errors
![[SVM_Parameter_C.png]]
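As an illustrative sketch of this tradeoff, using scikit-learn's `SVC` (an assumption, not part of these notes; it uses a linear slack penalty) on assumed toy data, one can compare how the margin width $2/\|\mathbf{w}\|$ and the number of training errors change with $C$ (smaller $C$ typically gives a wider margin):

```python
import numpy as np
from sklearn.svm import SVC

# Toy, slightly overlapping 2-D data (assumed for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(+1.0, 1.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    w = clf.coef_[0]
    margin_width = 2.0 / np.linalg.norm(w)       # total width of the geometric margin
    errors = int((clf.predict(X) != y).sum())
    print(f"C={C:>6}: margin width={margin_width:.2f}, training errors={errors}")
```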
- [b] Optimization Problem
- [b] Primal Lagrangian
$$ \mathcal{L}^{(L_2)} = \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{2}\sum_{i=1}^{N}\xi_i^2 + \sum_{i=1}^{N}\alpha_i\Bigl[(1-\xi_i)-y_i(\mathbf{w}^\top\mathbf{x}_i+b)\Bigr] $$
Optimization of the primal Lagrangian (setting the partial derivatives of $\mathcal{L}^{(L_2)}$ to zero):
$$ \begin{align} \frac{\partial\mathcal{L}^{(L_2)}}{\partial\mathbf{w}}=0 &\implies \mathbf{w}=\sum_{i=1}^{N}\alpha_i y_i\mathbf{x}_i \\ \frac{\partial\mathcal{L}^{(L_2)}}{\partial b}=0 &\implies \sum_{i=1}^{N}\alpha_i y_i=0 \\ \frac{\partial\mathcal{L}^{(L_2)}}{\partial \xi_i}=0 &\implies C\,\xi_i-\alpha_i=0 \implies \xi_i=\frac{\alpha_i}{C} \end{align} $$
- [b] Dual Problem
where:
- [b] Optimization Problem
- [b] Primal Lagrangian
$$ \begin{align} \mathcal{L}^{(L_1)} & = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{N}\xi_i + \sum_{i=1}^{N}\alpha_i\Bigl[(1-\xi_i)-y_i(\mathbf{w}^\top\mathbf{x}_i+b)\Bigr] - \sum_{i=1}^{N}\beta_i\xi_i \end{align} $$
That is, compared with $\mathcal{L}^{(L_2)}$, the quadratic slack penalty is replaced by a linear one, and the extra term $-\sum_{i=1}^{N}\beta_i\xi_i$ is added.
Here, a second set of Lagrangian multipliers $\beta_i\geq 0$ is introduced to enforce the constraints $\xi_i\geq 0$.
- [b] Dual Problem