
Optimization: Modeling, Algorithms, and Theory | 02 Optimization Modeling and Typical Examples (1)


Table of Contents

  • Optimization: Modeling, Algorithms, and Theory | 02
    • The Technology Of Modeling
      • Design Of Objective Function
      • Design of Constraints
    • Regression Analysis
      • General form
      • Linear Regression Model
      • Regularized linear regression model
      • Logistic regression
      • SVM

The Technology Of Modeling

Design Of Objective Function

  1. Least Squares Method

The least squares method is commonly used when modeling linear (or nonlinear) systems of equations. Let $\phi_i(x):\mathbb{R}^n\rightarrow\mathbb{R},\ i = 1,2,\cdots,m$ be functions of $n$ variables, and consider the following system of equations:
$$b_i=\phi_i(x),\quad i = 1,2,\cdots,m$$
where $b_i\in\mathbb{R}$ are known real numbers.

Solving systems of equations is widely needed in practice, but such a system is not always solvable. First, the number of equations $m$ may exceed the number of unknowns $n$, so an exact solution may not exist; second, due to factors such as measurement error, the equalities may not hold precisely.

To determine $x$ under such practical circumstances, the idea of the least squares method is to minimize the squared $\ell_2$-norm of the errors, that is:

$$\min_{x\in\mathbb{R}^n}\sum_{i = 1}^m\left(b_i-\phi_i(x)\right)^2$$

In particular, when all $\phi_i$ are linear functions, the problem is called a linear least squares problem; otherwise, it is called a nonlinear least squares problem.

However, the least squares criterion is not always the most reasonable choice. If we want to minimize the sum of the absolute values of the deviations, the corresponding model is:
$$\min_{x\in\mathbb{R}^n}\sum_{i = 1}^m\left|b_i-\phi_i(x)\right|$$

If we want to minimize the maximum deviation, the corresponding optimization model is:
$$\min_{x\in\mathbb{R}^n}\max_i\left|b_i - \phi_i(x)\right|$$
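As a concrete illustration, here is a minimal sketch of a nonlinear least squares fit using `scipy.optimize.least_squares`; the exponential model $\phi_i(x) = x_1 e^{x_2 t_i}$ and the synthetic data are assumptions made for this example (the $\ell_1$ and min-max criteria above would instead require a general-purpose or linear-programming solver).

```python
import numpy as np
from scipy.optimize import least_squares

# synthetic data for an assumed model phi_i(x) = x1 * exp(x2 * t_i)
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
b = 2.0 * np.exp(-1.0 * t) + 0.05 * rng.normal(size=t.size)

def residuals(x):
    # r_i(x) = b_i - phi_i(x); least_squares minimizes (1/2) * sum_i r_i(x)^2
    return b - x[0] * np.exp(x[1] * t)

sol = least_squares(residuals, x0=np.array([1.0, 0.0]))
print(sol.x)  # estimated (x1, x2), close to (2, -1)
```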

  2. Regularization

To make the solution smoother and to overcome the ill-posedness of the problem, an improved model is:

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \big(b_i - \phi_i(x)\big)^2 + \mu \|x\|_2^2$$

Here, $\mu > 0$ is a balancing parameter. To obtain a sparse solution, we can use the $\ell_0$ norm to construct:

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \big(b_i - \phi_i(x)\big)^2 + \mu \|x\|_0$$

Since the $\ell_0$ norm is nonconvex and hard to handle computationally, it is often replaced by the $\ell_1$ norm to promote sparsity:

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \big(b_i - \phi_i(x)\big)^2 + \mu \|x\|_1$$

In image processing, $x$ itself may not be sparse, but its transformed version often is. We then apply a transformation $W: \mathbb{R}^n \to \mathbb{R}^p$ (e.g., wavelets or total variation) and model sparsity in the transformed space:

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \big(b_i - \phi_i(x)\big)^2 + \mu \|W(x)\|_0$$

and

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \big(b_i - \phi_i(x)\big)^2 + \mu \|W(x)\|_1$$

Regularization thus controls the sparsity of the solution and the magnitude of the parameters.
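For the linear case $\phi_i(x) = a_i^T x$, the Tikhonov-regularized problem has a closed-form solution obtained by setting the gradient to zero; a minimal sketch (variable names are illustrative):

```python
import numpy as np

def tikhonov_solve(A, b, mu):
    """Minimizer of ||Ax - b||_2^2 + mu * ||x||_2^2 for linear phi_i(x) = a_i^T x.

    Setting the gradient to zero gives the normal equations (A^T A + mu I) x = A^T b.
    """
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + mu * np.eye(n), A.T @ b)

# usage: x = tikhonov_solve(A, b, mu=0.1)
```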

  3. Maximum Likelihood Estimation

Consider a simple case: suppose we know the data come from a specific distribution but do not know its parameters. Let $p(a; x)$ denote the probability mass or density function, where $x$ is the unknown parameter. To estimate $x$, we draw i.i.d. samples $\{a_i,\ i = 1,2,\dots,n\}$. The likelihood function of the parameter $x$ is defined as the joint probability (or density) of observing $\{a_i\}$:

$$L(x) = \prod_{i=1}^n p(a_i; x)$$

Here, $L(x)$ represents the joint probability of the sample under parameter $x$, with $x$ now being the variable to optimize.

The maximum likelihood estimate $\hat{x}$ is a parameter that maximizes $L(x)$, or equivalently its logarithm:

$$\hat{x} \in \operatorname{argmax}_{x \in \mathcal{X}} \ln L(x)$$

For most models, $\ln L(x)$ has a known analytical form, but finding its global maximizer may still require numerical methods.
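As a small worked example (not taken from the text), the sketch below numerically maximizes the log-likelihood of i.i.d. Gaussian samples over the parameters $(\mu, \sigma)$ with SciPy; the chosen distribution and sample size are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=500)   # i.i.d. draws a_i

def neg_log_likelihood(params):
    mu, log_sigma = params                           # parameterize sigma > 0 via its logarithm
    return -norm.logpdf(samples, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))  # minimize -ln L(x)
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # close to (2.0, 1.5)
```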

  4. Cost, Loss, and Reward Functions

Many operations research problems reduce to minimizing cost (or loss) or maximizing reward. For example, in portfolio optimization, we minimize risk (loss) or maximize return (reward). These functions directly define the objective in optimization models.

  5. Functionals and Variational Methods

Many problems in physics and chemistry can be formulated as minimizing an energy functional.

A functional is a mapping from a set of functions to the real numbers (or complex numbers).

Example 1: The integral of a function over an interval, e.g., $J[f] = \int_a^b f(x)\,\mathrm{d}x$, where $f(x)$ is a function in some function space (e.g., continuous functions on $[a,b]$) and $J[f]$ is a scalar.

Example 2: The energy of a physical system, often expressed as a functional of the system's state function. For a string vibrating with displacement $u(x,t)$, the total energy might be
$$E[u] = \int \left( \frac{1}{2} \rho \left( \frac{\partial u}{\partial t} \right)^2 + \frac{1}{2} T \left( \frac{\partial u}{\partial x} \right)^2 \right) \mathrm{d}x$$

where $\rho$ is the mass density and $T$ is the tension. Here, $E[u]$ depends on the entire function $u(x,t)$, so it is a functional.

Variational methods are mathematical techniques used to find functions that minimize or maximize a given functional. They are rooted in the variational principle: many natural phenomena (e.g., physical systems) tend to evolve toward states that extremize (minimize or maximize) a specific functional (e.g., energy, action).

The “Variation” of a Functional

To find an extremum of a functional $J[f]$, we consider small perturbations (called variations) of the function $f$, denoted $\delta f$. A variation $\delta f$ is a small, arbitrary function that vanishes at the boundary points (to satisfy the constraints).

For $J[f]$ to have an extremum at $f = f_0$, the first variation of $J$ must vanish:

$$\delta J[f_0] = 0$$

This condition is analogous to the derivative being zero for functions (where extrema occur at critical points).

Euler-Lagrange Equation

For a functional of the form:

$$J[f] = \int_a^b L(x, f(x), f'(x)) \, \mathrm{d}x$$
where $L$ (called the Lagrangian) depends on the independent variable $x$, the function $f(x)$, and its derivative $f'(x)$, the condition $\delta J = 0$ leads to the Euler-Lagrange equation:
$$\frac{\partial L}{\partial f} - \frac{\mathrm{d}}{\mathrm{d}x} \left( \frac{\partial L}{\partial f'} \right) = 0$$

This ordinary differential equation (ODE) must be satisfied by any function $f(x)$ that extremizes $J[f]$.
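To illustrate, SymPy can derive the Euler-Lagrange equation symbolically; the sketch below uses an assumed harmonic-oscillator-style Lagrangian rather than one from the text, and the exact printed form may vary by SymPy version.

```python
from sympy import Function, Symbol, diff
from sympy.calculus.euler import euler_equations

t = Symbol('t')
u = Function('u')
# assumed Lagrangian L = (1/2) u'(t)^2 - (1/2) u(t)^2
L = diff(u(t), t)**2 / 2 - u(t)**2 / 2

# returns the Euler-Lagrange ODE, here equivalent to u''(t) + u(t) = 0
print(euler_equations(L, u(t), t))
```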

  6. Relaxation
    When the original problem is not easy to solve, a commonly used technique in optimization is relaxation. The basic idea is: while retaining some properties of the original problem, replace the hard-to-handle terms in the objective function with simpler ones, thereby making the problem easier to solve. For example, the $\ell_0$ norm is nondifferentiable and nonconvex. Since the $\ell_1$ norm approximates the $\ell_0$ norm to a certain extent, in practice the $\ell_1$ norm is often used in its place in the objective function. Because the $\ell_1$ norm is convex, the theoretical analysis and algorithm design of the corresponding model become simpler. For low-rank optimization problems, the rank equals the number of nonzero singular values of a matrix and is likewise nonconvex and nondifferentiable. A common approach is to replace it with the nuclear norm of the matrix, i.e., the $\ell_1$ norm of the vector of singular values, thus obtaining a convex relaxation term that is easier to handle.

Another relaxation strategy is to replace the objective function $f(x)$ with a lower bound $f_R(x)$, where $f_R(x)$ should satisfy:
(1) $f_R(x) \leq f(x),\ \forall x \in \mathcal{X}$;
(2) $f_R(x)$ has a simple structure.

Note that the relaxed problem and the original problem are not necessarily equivalent. As mentioned repeatedly above, replacing the $\ell_0$ norm with the $\ell_1$ norm is valid only under certain conditions, and the $\ell_2$ norm generally cannot serve as a relaxation of the $\ell_0$ norm.

Design of Constraints

  1. Physical Nature of the Problem
    Depending on the specific form of the actual problem, the decision variables of an optimization problem need to satisfy various constraints. For example, in electronic structure calculations, we assume that the orbital functions are mutually orthogonal. In a chemical reaction process, the change of each component's concentration over time is governed by a system of differential equations. In shape optimization we sometimes need to optimize the shape of an object; for instance, in aircraft wing design, the change in the wing's shape, influenced by the airflow around it, is described by a differential equation. In many cases we also have nonnegativity constraints, such as the pixel values of an image, the concentration of a substance, or the quantity of available resources. When the observations are noisy or more robustness is required, we can also relax equality constraints into inequality constraints.

  2. Equivalent Transformation

For an optimization problem whose objective function is a composition, such as
$$\min_{x} f(Ax + b),$$
we often introduce a variable $y$ together with the equality constraint $y = Ax + b$ and consider the equivalent constrained problem:
$$\min_{x,y} f(y), \quad \text{s.t. } y = Ax + b.$$

For the optimization problem
$$\min_{x} f(x)=h(x)+r(x),$$
we usually introduce $y$ and the constraint $x = y$, and transform it into
$$\min_{x,y} h(x)+r(y), \quad \text{s.t. } x = y.$$

In this way, the objective function is decomposed. For inequality constraints, we can also introduce slack variables to turn them into equality constraints plus simple nonnegativity or nonpositivity constraints. For example, for the constraint $c(x)\leq 0$, introducing $y\geq 0$ gives the equivalent form
$$c(x)+y = 0, \quad y\geq 0.$$

In addition, we can use the definition of the epigraph to obtain an equivalent form of the problem. For the optimization problem
$$\min_{x} f(x),$$
the definition of the epigraph shows that it is equivalent to
$$\min_{x,t}\ t, \quad \text{s.t. } f(x)\leq t.$$

Performing equivalent transformations on the constraints of an optimization problem helps us study its mathematical properties more conveniently. Problems that look dissimilar on the surface may become optimization problems of the same type after equivalent transformation, ultimately enabling us to find a unified algorithm.
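As an illustration of such equivalences, the sketch below (using CVXPY, with random data as an assumption) solves $\min_x \|Ax-b\|_2$ both directly and in its epigraph form; the two optimal values should agree up to solver tolerance.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A, b = rng.normal(size=(30, 5)), rng.normal(size=30)

# original form: minimize f(x) = ||Ax - b||_2
x = cp.Variable(5)
p1 = cp.Problem(cp.Minimize(cp.norm(A @ x - b, 2)))
p1.solve()

# epigraph form: minimize t subject to f(x) <= t
x2, t = cp.Variable(5), cp.Variable()
p2 = cp.Problem(cp.Minimize(t), [cp.norm(A @ x2 - b, 2) <= t])
p2.solve()

print(p1.value, p2.value)  # equal up to solver tolerance
```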

  3. Relaxation

In the context of optimization, especially in relation to constraint design and problem reformulation, relaxation is a powerful technique.

The core idea is: when the original optimization problem is difficult to solve (for example, due to nonconvexity, nondifferentiability, or complicated constraints), we modify the problem in a controlled way, replacing the hard-to-handle parts (such as complex constraints or nonconvex objectives) with softer, more manageable constructs, while retaining key properties of the original problem. The relaxed problem is then easier to analyze and solve, and it provides useful information (such as bounds) about the original problem.

Regression Analysis

In real life, we often need to make predictions about unknown quantities based on existing information or knowledge. In machine learning, the task of supervised learning is to learn a model from a given dataset so that the model makes good predictions on new input data. Depending on whether the output variable is continuous or discrete, classical supervised learning covers two types of problems: regression and classification. Below we introduce how to build models for regression problems.

General form

In general, a regression model can be written as follows:

$$b=f(a)+\epsilon$$

Here, $a \in \mathbb{R}^d$ is the independent variable, $b \in \mathbb{R}$ is the response variable, and $\epsilon \in \mathbb{R}$ is the error (or noise) of the model.

To narrow the search range of $f$ in function space, we parameterize $f$ as
$$b = f(a; x)+\varepsilon$$
where $x\in\mathbb{R}^n$ is a parameter. This restricts the choice of $f$ to a finite-dimensional space $\mathbb{R}^n$, and solving for $f$ becomes solving for the parameter $x$.

Linear Regression Model

Among parameterized regression models, the simplest is the linear regression model. Let $(w_i, b_i)$ be the observed independent variables and response variables. Then for each data point,
$$b_i = w_{i1}x_1 + w_{i2}x_2 + \cdots + w_{i,n-1}x_{n-1} + x_n + \varepsilon_i, \quad i = 1,2,\cdots, m$$
where $x_i$ are the parameters to be determined, $\varepsilon_i$ is noise, and different data points are independent of each other. Writing the input features in the training set together with the constant term 1 as $a_i=(w_i^T, 1)^T$, and letting $x=(x_1, x_2, \cdots, x_n)^T \in \mathbb{R}^n$, the linear regression model can be abbreviated as
$$b_i = a_i^T x + \varepsilon_i \tag{3.2.2}$$
In this subsection, the constant term 1 in the linear regression model corresponds to the element $x_n$. Below, we construct the corresponding optimization models under different conditions.
Writing the input features in the training set as an $m\times n$ matrix $A$, and the labels $b_i$ and noises $\varepsilon_i$ in vector form, we have

$$A = \begin{bmatrix}a_1^T\\a_2^T\\\vdots\\a_m^T\end{bmatrix}, \quad b = \begin{bmatrix}b_1\\b_2\\\vdots\\b_m\end{bmatrix}, \quad \varepsilon = \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_m\end{bmatrix}$$

Then we obtain the matrix form of the model:
$$b = Ax + \varepsilon \tag{3.2.3}$$
Assume that $\varepsilon_i$ is white Gaussian noise, that is, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$; then
$$p(b_i \mid a_i; x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(b_i - a_i^T x)^2}{2\sigma^2}\right)$$
The log-likelihood function is then
$$\ell(x) = \ln \prod_{i = 1}^m p(b_i \mid a_i; x) = -\frac{m}{2} \ln(2\pi) - m \ln \sigma - \sum_{i = 1}^m \frac{(b_i - a_i^T x)^2}{2\sigma^2}$$
The maximum likelihood estimate maximizes the log-likelihood function. After removing the constant terms, we obtain the following least squares problem:
$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 \tag{3.2.4}$$
Note that when constructing the maximum likelihood estimate, there is no need to know the variance $\sigma^2$ of $\varepsilon_i$. The most important significance of the above derivation is that it establishes the connection between the least squares method and regression analysis: when the error is assumed to be white Gaussian noise, the least squares solution is the maximum likelihood solution of the linear regression model.
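Under this white-Gaussian-noise assumption, solving (3.2.4) is a plain linear least squares computation; a minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 4
A = np.column_stack([rng.normal(size=(m, n - 1)), np.ones(m)])  # last column is the constant term
x_true = rng.normal(size=n)
b = A @ x_true + 0.1 * rng.normal(size=m)                       # white Gaussian noise

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)                    # solves min ||Ax - b||_2^2
print(np.linalg.norm(x_ls - x_true))                            # small recovery error
```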

When $\varepsilon_i$ is not white Gaussian noise, solving the linear regression model is no longer equivalent to solving the least squares model, so we need to use the likelihood function to construct objective functions corresponding to other noise distributions. For example, the model constructed under certain noise distributions is in fact the least absolute deviation problem:
$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_1$$

Regularized linear regression model

  1. Tikhonov Regularization

To balance the fit of the model against the smoothness of the solution, Tikhonov regularization (ridge regression) adds a squared $\ell_2$-norm regularization term. Assuming $\varepsilon_i$ is white Gaussian noise, the linear regression model with this regularization term amounts to solving the following problem:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} \| Ax - b \|_2^2 + \mu \| x \|_2^2$$

Thanks to the regularization term, the objective function of this problem is strongly convex, which improves the properties of the solution. In the ridge regression problem, the regularization term $\|x\|_2^2$ penalizes $x$ with large norm. Therefore, another common variant, given a parameter $\sigma > 0$, is to solve:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} \| Ax - b \|_2^2 \quad \text{s.t.} \quad \| x \|_2 \leq \sigma$$
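In practice the penalized form can be handed to an off-the-shelf solver; for instance, scikit-learn's `Ridge` minimizes $\|Ax-b\|_2^2 + \alpha\|x\|_2^2$, which matches the penalized model above with $\alpha = 2\mu$ (an overall factor of 2 does not change the minimizer). The data below are synthetic, and `fit_intercept=False` keeps the pure form.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
b = A @ rng.normal(size=10) + 0.1 * rng.normal(size=100)

mu = 0.5
# Ridge minimizes ||b - Ax||_2^2 + alpha * ||x||_2^2; alpha = 2*mu matches
# (1/2)||Ax - b||_2^2 + mu * ||x||_2^2 up to an overall factor of 2
model = Ridge(alpha=2 * mu, fit_intercept=False).fit(A, b)
x_ridge = model.coef_
print(x_ridge[:3])
```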

  2. LASSO Problem and Its Variants

If the solution $x$ we expect to obtain is sparse, we can consider adding the $\ell_1$ norm as the regularization term. The LASSO problem is:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} \| Ax - b \|_2^2 + \mu \| x \|_1$$

where $\mu > 0$ is a known real number and $x$ is the parameter to be estimated. The LASSO problem promotes sparsity of the solution through the $\ell_1$ norm. If $x$ is sparse, then the predicted value $b_i$ depends only on the components of $a_i$ corresponding to the nonzero entries of $x$. Thus, among the original $n$ features of the data points, the features that actually contribute to the prediction correspond to the nonzero components of $x$. Therefore, **the LASSO model serves the purpose of feature selection.**
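One standard way to solve the LASSO problem is the proximal gradient (ISTA) iteration, which alternates a gradient step on the smooth term with soft-thresholding; a minimal numpy sketch, with the step size chosen from the spectral norm of $A$:

```python
import numpy as np

def ista_lasso(A, b, mu, iters=500):
    """Proximal-gradient (ISTA) sketch for min 0.5*||Ax - b||_2^2 + mu*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant of the smooth part's gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)               # gradient of 0.5*||Ax - b||_2^2
        z = x - step * grad
        # soft-thresholding: proximal operator of step*mu*||.||_1
        x = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)
    return x
```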

Similarly, we can also consider the constrained problem:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} \| Ax - b \|_2^2 \quad \text{s.t.} \quad \| x \|_1 \leq \sigma$$

Taking the noise $\varepsilon$ into account, we can also, given $\nu > 0$, consider the model:

$$\min_{x \in \mathbb{R}^n} \| x \|_1 \quad \text{s.t.} \quad \| Ax - b \|_2 \leq \nu$$

The traditional LASSO problem requires $x$ to be sparse. However, sparsity can take many forms in practice. If the parameter $x$ is group-sparse, that is, the components of $x$ can be divided into $G$ groups and the parameters within each group are either all nonzero or all zero, then the solution of the traditional LASSO problem cannot meet this requirement. For this purpose, the group LASSO model was proposed:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 + \mu \sum_{\ell = 1}^G \sqrt{n_\ell}\,\|x_{\mathcal{I}_\ell}\|_2$$

where $\mathcal{I}_\ell$ is the index set of the $\ell$-th group of variables, $|\mathcal{I}_\ell| = n_\ell$, and $\sum_{\ell = 1}^G n_\ell = n$.

If we need both group-level and individual-feature-level sparsity, we can combine the two regularization terms, yielding the so-called sparse group LASSO model:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 + \mu_1 \sum_{\ell = 1}^G \sqrt{n_\ell}\,\|x_{\mathcal{I}_\ell}\|_2 + \mu_2\|x\|_1$$

where $\mu_1, \mu_2 > 0$ are given constants. The regularization term of the sparse group LASSO model consists of two parts: $\|x\|_1$ ensures the sparsity of $x$ itself, and $\sum_{\ell = 1}^G \sqrt{n_\ell}\,\|x_{\mathcal{I}_\ell}\|_2$ ensures the group-sparsity of $x$.

In some practical problems, the parameter $x$ itself is not sparse but is sparse under some transformation, so the regularization term needs to be adjusted accordingly. The general form of the problem is:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 + \mu\|Fx\|_1$$

Logistic regression

In classification problems, the output variable takes values in a discrete space. For binary classification, the predicted variable has only two values, namely $-1$ and $1$. We briefly introduce a classic classification model in machine learning: logistic regression. Given a feature vector $a$, logistic regression assumes that the probability that the sample belongs to class $1$ is:
$$p(1 \mid a; x) = P(t = 1 \mid a) = \theta(a^T x),$$
where $\theta$ is the sigmoid function:
$$\theta(z) = \frac{1}{1 + \exp(-z)}.$$
Then the probability of belonging to class $-1$ is:
$$p(-1 \mid a; x) = 1 - p(1 \mid a; x) = \theta(-a^T x).$$
Thus, for $b \in \{-1, 1\}$, the above probabilities can be concisely written as:
$$p(b \mid a; x) = \theta(b \cdot a^T x).$$
Assume that the data pairs $\{a_i, b_i\},\ i = 1, 2, \ldots, m$ are independently and identically distributed. Then, given $a_1, a_2, \ldots, a_m$, the joint probability density of $b_1, b_2, \ldots, b_m$ is:
$$p(b_1, b_2, \ldots, b_m \mid a_1, a_2, \ldots, a_m) = \prod_{i=1}^m p(b_i \mid a_i).$$
Substituting the form of $p(b_i \mid a_i)$, we get:
$$p(b_1, b_2, \ldots, b_m \mid a_1, a_2, \ldots, a_m) = \frac{1}{\prod_{i=1}^m \left(1 + \exp(-b_i \cdot a_i^T x)\right)}.$$
Then the maximum likelihood estimation amounts to solving the following optimization problem:
$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \ln\left(1 + \exp(-b_i \cdot a_i^T x)\right).$$

As in sparse optimization, common models also take the properties of the parameter $x$ into account by adding a regularization term. For example:

  1. Tikhonov Regularization Model
    $$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \ln\left(1 + \exp(-b_i \cdot a_i^T x)\right) + \lambda \|x\|_2^2$$
  2. $\ell_1$ Norm Regularization Model
    $$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \ln\left(1 + \exp(-b_i \cdot a_i^T x)\right) + \lambda \|x\|_1$$

Assume that the data pairs $\{a_i, b_i\}$ are generated by the random variable pair $\{\alpha, \beta\}$. Then the loss function can be written in expectation form:
$$\mathbb{E}\left[ \ln\left(1 + \exp(-\beta \cdot \alpha^T x)\right) \right].$$

The Tikhonov-regularized and $\ell_1$-regularized problems above can be viewed as sampled versions of the following stochastic optimization problem:
$$\min_{x \in \mathbb{R}^n} \mathbb{E}\left[ \ln\left(1 + \exp(-\beta \cdot \alpha^T x)\right) \right] + \lambda r(x),$$
where $r(x)$ is the regularization term and $\lambda$ is the regularization parameter. Many machine learning problems can be written as the more general stochastic optimization problem or its discrete (finite-sum) version:
$$\min_{x \in \mathbb{R}^n} f(x) + \lambda r(x),$$
where
$$f(x) \stackrel{\text{def}}{=} \mathbb{E}[F(x, \xi)] = \int_\Omega F(x, \xi(\omega))\, \mathrm{d}P(\omega), \quad \text{or} \quad f(x) \stackrel{\text{def}}{=} \frac{1}{m} \sum_{i=1}^m f_i(x),$$
$\xi: \Omega \to W$ is a random variable defined on a given probability space $(\Omega, \mathcal{F}, P)$, $W$ is a measurable space, $F: \mathbb{R}^n \times W \to \mathbb{R}$, and $f_i: \mathbb{R}^n \to \mathbb{R},\ i = 1, 2, \ldots, m$ correspond to certain loss functions.

SVM

Support vector machine (SVM) is another widely used binary classification model. Given the sample points $(a_i, b_i)$ in the training dataset $D$, where $a_i \in \mathbb{R}^n$ and $b_i \in \{-1, 1\}$, the basic idea of SVM is to find a hyperplane that separates the sample points in $\mathbb{R}^n$ into two classes. We first consider a simple case: assume the training dataset is linearly separable, that is, the positive and negative classes lie exactly on the two sides of the hyperplane $x^Tw + y = 0$.
For a linearly separable dataset, the hyperplanes that satisfy this separation requirement are generally not unique. An ideal hyperplane should have the following property: the distances from the data points to the plane are relatively large. Elementary geometry tells us that the distance from a point $w$ in the sample space to the hyperplane $x^Tw + y = 0$ is
$$d = \frac{|x^Tw + y|}{\|x\|_2}$$
If this hyperplane classifies the sample point $(a_i, b_i)$ correctly, then
$$b_i(a_i^Tx + y) > 0, \quad i = 1, 2, \ldots, m$$
To find an ideal hyperplane $x^Tw + y = 0$ separating the data points, we require that the minimum distance from the two classes of data points to the hyperplane be as large as possible. This gives the following original model:
$$\max_{x, y, \gamma}\ \gamma, \quad \text{s.t.} \quad \frac{b_i(a_i^Tx + y)}{\|x\|_2} \geq \gamma, \quad i = 1, 2, \ldots, m$$
The meaning of $\gamma$ in this problem is clear: it is the minimum distance from all sample points to the hyperplane $x^Tw + y = 0$, and the goal is to maximize it. Next, we perform an equivalent transformation to put the problem in a more natural form. Note that the constraints are equivalent to
$$b_i(a_i^Tx + y) \geq \gamma\|x\|_2, \quad i = 1, 2, \ldots, m$$
Scaling $x$ and $y$ by the same positive factor affects neither the constraints nor the objective. Using this property, we shrink the feasible region by imposing $\|x\|_2 = \frac{1}{\gamma}$. The problem is then equivalent to
$$\begin{aligned} \max_{x, y, \gamma}\quad & \gamma, \\ \text{s.t.}\quad & b_i(a_i^Tx + y) \geq \gamma\|x\|_2, \quad i = 1, 2, \ldots, m, \\ & \|x\|_2 = \frac{1}{\gamma} \end{aligned}$$
Finally, eliminating the variable $\gamma$ and the constraint $\|x\|_2 = \frac{1}{\gamma}$, we obtain the optimization problem
$$\begin{aligned} \min_{x, y}\quad & \frac{1}{2}\|x\|_2^2, \\ \text{s.t.}\quad & b_i(a_i^Tx + y) \geq 1, \quad i = 1, 2, \ldots, m \end{aligned}$$
Solving this problem gives $x, y$; the points $a_i$ for which $b_i(a_i^Tx + y) = 1$ holds are called support vectors. It is not hard to see that the parameters $x, y$ of the hyperplane are completely determined by the support vectors; in other words, removing the data points that are not support vectors does not affect the solution of the problem.

When the linear separability assumption does not hold, we introduce a nonnegative slack variable $\xi_i$ for each data point, allowing some points to be misclassified. The constraints then become
$$\frac{b_i(a_i^Tx + y)}{\|x\|_2} \geq \gamma(1 - \xi_i), \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, m$$
Here, the relative form $\gamma(1 - \xi_i)$ is used to represent the "distance" of misclassified points. Clearly, there should not be too many misclassified points, so we control the degree of violation through the sum $\sum_{i = 1}^m \xi_i$ of the slack variables. After an equivalent transformation similar to the previous one, we finally obtain the problem
$$\begin{aligned} \min_{x, y, \xi}\quad & \frac{1}{2}\|x\|_2^2 + \mu\sum_{i = 1}^m \xi_i, \\ \text{s.t.}\quad & b_i(a_i^Tx + y) \geq 1 - \xi_i, \\ & \xi_i \geq 0, \quad i = 1, 2, \ldots, m \end{aligned}$$
where $\mu > 0$ is a penalty coefficient. Increasing $\mu$ increases the penalty for misclassification, and decreasing $\mu$ decreases it. Both optimization problems above are special cases of quadratic programming.

The above problem is also equivalent to the unconstrained optimization problem:
$$\min_{x, y} \frac{1}{2}\|x\|_2^2 + \mu\sum_{i = 1}^m \max\{1 - b_i(a_i^Tx + y), 0\}$$
Notice that $\max\{1 - b_i(a_i^Tx + y), 0\}$ penalizes the amount by which the inequality $b_i(a_i^Tx + y) \geq 1$ is violated, so this problem can also be regarded as a penalty-function formulation of the constrained problem. Although $\max\{z, 0\}$ is not differentiable, the simple form of this problem opens up many possibilities for algorithm design, and introducing different penalty functions yields different SVM models. In addition, when the training data contain redundant features, one may also replace the $\|x\|_2$ term with the $\ell_1$ norm:
$$\min_{x \in \mathbb{R}^n,\, y} \|x\|_1 + \mu\sum_{i = 1}^m \max\{1 - b_i(a_i^Tx + y), 0\}$$

.......
