
Optimization: Modeling, Algorithms, and Theory | 02 Optimization Modeling and Typical Examples (1)


Table of Contents

  • Optimization: Modeling, Algorithms, and Theory | 02
    • The Technology Of Modeling
      • Design Of Objective Function
      • Design of Constraints
    • Regression Analysis
      • General form
      • Linear Regression Model
      • Regularized linear regression model
      • Logistic regression
      • SVM

The Technology Of Modeling

Design Of Objective Function

  1. Least Squares Method

The least squares method is commonly used when modeling linear (or nonlinear) systems of equations. Let $\phi_i(x):\mathbb{R}^n\rightarrow\mathbb{R},\ i = 1,2,\cdots,m$ be functions of $n$ variables, and consider the following system of equations:
$$b_i=\phi_i(x),\quad i = 1,2,\cdots,m$$
where $b_i\in\mathbb{R}$ are known real numbers.

Solving systems of equations is widely needed in practice, but such a system is not always solvable. First, the number of equations $m$ may exceed the number of unknowns $n$, so an exact solution may not exist; second, due to factors such as measurement error, the equalities may not hold precisely.

To determine $x$ under such practical circumstances, the idea of the least squares method is to minimize the squared $\ell_2$-norm of the errors, that is:

$$\min_{x\in\mathbb{R}^n}\sum_{i = 1}^m\left(b_i-\phi_i(x)\right)^2$$

In particular, when all $\phi_i$ are linear functions, the problem is called a linear least squares problem; otherwise, it is called a nonlinear least squares problem.

However, the least squares criterion is not always the most reasonable choice. If we want to minimize the sum of the absolute values of the deviations, the corresponding model is:
$$\min_{x\in\mathbb{R}^n}\sum_{i = 1}^m\left|b_i-\phi_i(x)\right|$$

If we want to minimize the maximum deviation, the corresponding optimization model is:
$$\min_{x\in\mathbb{R}^n}\max_i\left|b_i - \phi_i(x)\right|$$
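As a concrete illustration, here is a minimal sketch of a nonlinear least squares fit using `scipy.optimize.least_squares`; the exponential model $\phi_i(x) = x_1 e^{x_2 t_i}$ and the synthetic data are assumptions made for this example (the $\ell_1$ and min-max criteria above would instead require a general-purpose or linear-programming solver).

```python
import numpy as np
from scipy.optimize import least_squares

# synthetic data for an assumed model phi_i(x) = x1 * exp(x2 * t_i)
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)
b = 2.0 * np.exp(-1.0 * t) + 0.05 * rng.normal(size=t.size)

def residuals(x):
    # r_i(x) = b_i - phi_i(x); least_squares minimizes (1/2) * sum_i r_i(x)^2
    return b - x[0] * np.exp(x[1] * t)

sol = least_squares(residuals, x0=np.array([1.0, 0.0]))
print(sol.x)  # estimated (x1, x2), close to (2, -1)
```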

  2. Regularization

To make the solution smoother and to overcome the ill-posedness of the problem, an improved model is:

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \big(b_i - \phi_i(x)\big)^2 + \mu \|x\|_2^2$$

Here, $\mu > 0$ is a balancing parameter. To obtain a sparse solution, we can use the $\ell_0$ norm to construct:

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \big(b_i - \phi_i(x)\big)^2 + \mu \|x\|_0$$

Since the $\ell_0$ norm is nonconvex and hard to handle computationally, it is often replaced by the $\ell_1$ norm to promote sparsity:

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \big(b_i - \phi_i(x)\big)^2 + \mu \|x\|_1$$

In image processing, $x$ itself may not be sparse, but its transformed version often is. We then apply a transformation $W: \mathbb{R}^n \to \mathbb{R}^p$ (e.g., wavelets or total variation) and model sparsity in the transformed space:

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \big(b_i - \phi_i(x)\big)^2 + \mu \|W(x)\|_0$$

and

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \big(b_i - \phi_i(x)\big)^2 + \mu \|W(x)\|_1$$

Regularization thus controls the sparsity of the solution and the magnitude of the parameters.
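For the linear case $\phi_i(x) = a_i^T x$, the Tikhonov-regularized problem has a closed-form solution obtained by setting the gradient to zero; a minimal sketch (variable names are illustrative):

```python
import numpy as np

def tikhonov_solve(A, b, mu):
    """Minimizer of ||Ax - b||_2^2 + mu * ||x||_2^2 for linear phi_i(x) = a_i^T x.

    Setting the gradient to zero gives the normal equations (A^T A + mu I) x = A^T b.
    """
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + mu * np.eye(n), A.T @ b)

# usage: x = tikhonov_solve(A, b, mu=0.1)
```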

  3. Maximum Likelihood Estimation

Consider a simple case: suppose we know the data come from a specific distribution but do not know its parameters. Let $p(a; x)$ denote the probability mass or density function, where $x$ is the unknown parameter. To estimate $x$, we draw i.i.d. samples $\{a_i,\ i = 1,2,\dots,n\}$. The likelihood function of the parameter $x$ is defined as the joint probability (or density) of observing $\{a_i\}$:

$$L(x) = \prod_{i=1}^n p(a_i; x)$$

Here, $L(x)$ represents the joint probability of the sample under parameter $x$, with $x$ now being the variable to optimize.

The maximum likelihood estimate $\hat{x}$ is a parameter that maximizes $L(x)$, or equivalently its logarithm:

$$\hat{x} \in \operatorname{argmax}_{x \in \mathcal{X}} \ln L(x)$$

For most models, $\ln L(x)$ has a known analytical form, but finding its global maximizer may still require numerical methods.
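As a small worked example (not taken from the text), the sketch below numerically maximizes the log-likelihood of i.i.d. Gaussian samples over the parameters $(\mu, \sigma)$ with SciPy; the chosen distribution and sample size are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=1.5, size=500)   # i.i.d. draws a_i

def neg_log_likelihood(params):
    mu, log_sigma = params                           # parameterize sigma > 0 via its logarithm
    return -norm.logpdf(samples, loc=mu, scale=np.exp(log_sigma)).sum()

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))  # minimize -ln L(x)
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # close to (2.0, 1.5)
```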

  4. Cost, Loss, and Reward Functions

Many operations research problems reduce to minimizing cost (or loss) or maximizing reward. For example, in portfolio optimization, we minimize risk (loss) or maximize return (reward). These functions directly define the objective in optimization models.

  5. Functionals and Variational Methods

Many problems in physics and chemistry can be formulated as minimizing an energy functional.

A functional is a mapping from a set of functions to the real numbers (or complex numbers).

Example 1: The integral of a function over an interval, e.g., $J[f] = \int_a^b f(x)\,\mathrm{d}x$, where $f(x)$ is a function in some function space (e.g., continuous functions on $[a,b]$) and $J[f]$ is a scalar.

Example 2: The energy of a physical system, often expressed as a functional of the system's state function. For a string vibrating with displacement $u(x,t)$, the total energy might be
$$E[u] = \int \left( \frac{1}{2} \rho \left( \frac{\partial u}{\partial t} \right)^2 + \frac{1}{2} T \left( \frac{\partial u}{\partial x} \right)^2 \right) \mathrm{d}x$$

where $\rho$ is the mass density and $T$ is the tension. Here, $E[u]$ depends on the entire function $u(x,t)$, so it is a functional.

Variational methods are mathematical techniques used to find functions that minimize or maximize a given functional. They are rooted in the variational principle: many natural phenomena (e.g., physical systems) tend to evolve toward states that extremize (minimize or maximize) a specific functional (e.g., energy, action).

The “Variation” of a Functional

To find an extremum of a functional $J[f]$, we consider small perturbations (called variations) of the function $f$, denoted $\delta f$. A variation $\delta f$ is a small, arbitrary function that vanishes at the boundary points (to satisfy the constraints).

For $J[f]$ to have an extremum at $f = f_0$, the first variation of $J$ must vanish:

$$\delta J[f_0] = 0$$

This condition is analogous to the derivative being zero for functions (where extrema occur at critical points).

Euler-Lagrange Equation

For a functional of the form:

$$J[f] = \int_a^b L(x, f(x), f'(x)) \, \mathrm{d}x$$
where $L$ (called the Lagrangian) depends on the independent variable $x$, the function $f(x)$, and its derivative $f'(x)$, the condition $\delta J = 0$ leads to the Euler-Lagrange equation:
$$\frac{\partial L}{\partial f} - \frac{\mathrm{d}}{\mathrm{d}x} \left( \frac{\partial L}{\partial f'} \right) = 0$$

This ordinary differential equation (ODE) must be satisfied by any function $f(x)$ that extremizes $J[f]$.
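To illustrate, SymPy can derive the Euler-Lagrange equation symbolically; the sketch below uses an assumed harmonic-oscillator-style Lagrangian rather than one from the text, and the exact printed form may vary by SymPy version.

```python
from sympy import Function, Symbol, diff
from sympy.calculus.euler import euler_equations

t = Symbol('t')
u = Function('u')
# assumed Lagrangian L = (1/2) u'(t)^2 - (1/2) u(t)^2
L = diff(u(t), t)**2 / 2 - u(t)**2 / 2

# returns the Euler-Lagrange ODE, here equivalent to u''(t) + u(t) = 0
print(euler_equations(L, u(t), t))
```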

  6. Relaxation
    When the original problem is not easy to solve, a commonly used technique in optimization is relaxation. The basic idea is: while retaining some properties of the original problem, replace the hard-to-handle terms in the objective function with simpler ones, thereby making the problem easier to solve. For example, the $\ell_0$ norm is nondifferentiable and nonconvex. Since the $\ell_1$ norm approximates the $\ell_0$ norm to a certain extent, in practice the $\ell_1$ norm is often used in its place in the objective function. Because the $\ell_1$ norm is convex, the theoretical analysis and algorithm design of the corresponding model become simpler. For low-rank optimization problems, the rank equals the number of nonzero singular values of a matrix and is likewise nonconvex and nondifferentiable. A common approach is to replace it with the nuclear norm of the matrix, i.e., the $\ell_1$ norm of the vector of singular values, thus obtaining a convex relaxation term that is easier to handle.

Another relaxation strategy is to replace the objective function $f(x)$ with a lower bound $f_R(x)$, where $f_R(x)$ should satisfy:
(1) $f_R(x) \leq f(x),\ \forall x \in \mathcal{X}$;
(2) $f_R(x)$ has a simple structure.

Note that the relaxed problem and the original problem are not necessarily equivalent. As mentioned repeatedly above, replacing the $\ell_0$ norm with the $\ell_1$ norm is valid only under certain conditions, and the $\ell_2$ norm generally cannot serve as a relaxation of the $\ell_0$ norm.

Design of Constraints

  1. Physical Nature of the Problem
    Depending on the specific form of the actual problem, the decision variables of an optimization problem need to satisfy various constraints. For example, in electronic structure calculations, we assume that the orbital functions are mutually orthogonal. In a chemical reaction process, the change of each component's concentration over time is governed by a system of differential equations. In shape optimization we sometimes need to optimize the shape of an object; for instance, in aircraft wing design, the change in the wing's shape, influenced by the airflow around it, is described by a differential equation. In many cases we also have nonnegativity constraints, such as the pixel values of an image, the concentration of a substance, or the quantity of available resources. When the observations are noisy or more robustness is required, we can also relax equality constraints into inequality constraints.

  2. Equivalent Transformation

For an optimization problem whose objective function is a composition, such as
$$\min_{x} f(Ax + b),$$
we often introduce a variable $y$ together with the equality constraint $y = Ax + b$ and consider the equivalent constrained problem:
$$\min_{x,y} f(y), \quad \text{s.t. } y = Ax + b.$$

For the optimization problem
$$\min_{x} f(x)=h(x)+r(x),$$
we usually introduce $y$ and the constraint $x = y$, and transform it into
$$\min_{x,y} h(x)+r(y), \quad \text{s.t. } x = y.$$

In this way, the objective function is decomposed. For inequality constraints, we can also introduce slack variables to turn them into equality constraints plus simple nonnegativity or nonpositivity constraints. For example, for the constraint $c(x)\leq 0$, introducing $y\geq 0$ gives the equivalent form
$$c(x)+y = 0, \quad y\geq 0.$$

In addition, we can use the definition of the epigraph to obtain an equivalent form of the problem. For the optimization problem
$$\min_{x} f(x),$$
the definition of the epigraph shows that it is equivalent to
$$\min_{x,t}\ t, \quad \text{s.t. } f(x)\leq t.$$

Performing equivalent transformations on the constraints of an optimization problem helps us study its mathematical properties more conveniently. Problems that look dissimilar on the surface may become optimization problems of the same type after equivalent transformation, ultimately enabling us to find a unified algorithm.
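As an illustration of such equivalences, the sketch below (using CVXPY, with random data as an assumption) solves $\min_x \|Ax-b\|_2$ both directly and in its epigraph form; the two optimal values should agree up to solver tolerance.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A, b = rng.normal(size=(30, 5)), rng.normal(size=30)

# original form: minimize f(x) = ||Ax - b||_2
x = cp.Variable(5)
p1 = cp.Problem(cp.Minimize(cp.norm(A @ x - b, 2)))
p1.solve()

# epigraph form: minimize t subject to f(x) <= t
x2, t = cp.Variable(5), cp.Variable()
p2 = cp.Problem(cp.Minimize(t), [cp.norm(A @ x2 - b, 2) <= t])
p2.solve()

print(p1.value, p2.value)  # equal up to solver tolerance
```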

  3. Relaxation

In the context of optimization, especially in relation to constraint design and problem reformulation, relaxation is a powerful technique.

The core idea is: when the original optimization problem is difficult to solve (for example, due to nonconvexity, nondifferentiability, or complicated constraints), we modify the problem in a controlled way, replacing the hard-to-handle parts (such as complex constraints or nonconvex objectives) with softer, more manageable constructs, while retaining key properties of the original problem. The relaxed problem is then easier to analyze and solve, and it provides useful information (such as bounds) about the original problem.

Regression Analysis

In real life, we often need to make predictions about unknown quantities based on existing information or knowledge. In machine learning, the task of supervised learning is to learn a model from a given dataset so that the model makes good predictions on new input data. Depending on whether the output variable is continuous or discrete, classical supervised learning covers two types of problems: regression and classification. Below we introduce how to build models for regression problems.

General form

In general, a regression model can be written as follows:

$$b=f(a)+\epsilon$$

Here, $a \in \mathbb{R}^d$ is the independent variable, $b \in \mathbb{R}$ is the response variable, and $\epsilon \in \mathbb{R}$ is the error (or noise) of the model.

To narrow the search range of $f$ in function space, we parameterize $f$ as
$$b = f(a; x)+\varepsilon$$
where $x\in\mathbb{R}^n$ is a parameter. This restricts the choice of $f$ to a finite-dimensional space $\mathbb{R}^n$, and solving for $f$ becomes solving for the parameter $x$.

Linear Regression Model

Among parameterized regression models, the simplest is the linear regression model. Let $(w_i, b_i)$ be the observed independent variables and response variables. Then for each data point,
$$b_i = w_{i1}x_1 + w_{i2}x_2 + \cdots + w_{i,n-1}x_{n-1} + x_n + \varepsilon_i, \quad i = 1,2,\cdots, m$$
where $x_i$ are the parameters to be determined, $\varepsilon_i$ is noise, and different data points are independent of each other. Writing the input features in the training set together with the constant term 1 as $a_i=(w_i^T, 1)^T$, and letting $x=(x_1, x_2, \cdots, x_n)^T \in \mathbb{R}^n$, the linear regression model can be abbreviated as
$$b_i = a_i^T x + \varepsilon_i \tag{3.2.2}$$
In this subsection, the constant term 1 in the linear regression model corresponds to the element $x_n$. Below, we construct the corresponding optimization models under different conditions.
Writing the input features in the training set as an $m\times n$ matrix $A$, and the labels $b_i$ and noises $\varepsilon_i$ in vector form, we have

$$A = \begin{bmatrix}a_1^T\\a_2^T\\\vdots\\a_m^T\end{bmatrix}, \quad b = \begin{bmatrix}b_1\\b_2\\\vdots\\b_m\end{bmatrix}, \quad \varepsilon = \begin{bmatrix}\varepsilon_1\\\varepsilon_2\\\vdots\\\varepsilon_m\end{bmatrix}$$

Then we obtain the matrix form of the model:
$$b = Ax + \varepsilon \tag{3.2.3}$$
Assume that $\varepsilon_i$ is white Gaussian noise, that is, $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$; then
$$p(b_i \mid a_i; x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left(-\frac{(b_i - a_i^T x)^2}{2\sigma^2}\right)$$
The log-likelihood function is then
$$\ell(x) = \ln \prod_{i = 1}^m p(b_i \mid a_i; x) = -\frac{m}{2} \ln(2\pi) - m \ln \sigma - \sum_{i = 1}^m \frac{(b_i - a_i^T x)^2}{2\sigma^2}$$
The maximum likelihood estimate maximizes the log-likelihood function. After removing the constant terms, we obtain the following least squares problem:
$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 \tag{3.2.4}$$
Note that when constructing the maximum likelihood estimate, there is no need to know the variance $\sigma^2$ of $\varepsilon_i$. The most important significance of the above derivation is that it establishes the connection between the least squares method and regression analysis: when the error is assumed to be white Gaussian noise, the least squares solution is the maximum likelihood solution of the linear regression model.
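Under this white-Gaussian-noise assumption, solving (3.2.4) is a plain linear least squares computation; a minimal sketch with synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 4
A = np.column_stack([rng.normal(size=(m, n - 1)), np.ones(m)])  # last column is the constant term
x_true = rng.normal(size=n)
b = A @ x_true + 0.1 * rng.normal(size=m)                       # white Gaussian noise

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)                    # solves min ||Ax - b||_2^2
print(np.linalg.norm(x_ls - x_true))                            # small recovery error
```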

When $\varepsilon_i$ is not white Gaussian noise, solving the linear regression model is no longer equivalent to solving the least squares model, so we need to use the likelihood function to construct objective functions corresponding to other noise distributions. For example, the model constructed under certain noise distributions is in fact the least absolute deviation problem:
$$\min_{x \in \mathbb{R}^n} \|Ax - b\|_1$$

Regularized linear regression model

  1. Tikhonov Regularization

To balance the fit of the model against the smoothness of the solution, Tikhonov regularization (ridge regression) adds a squared $\ell_2$-norm regularization term. Assuming $\varepsilon_i$ is white Gaussian noise, the linear regression model with this regularization term amounts to solving the following problem:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} \| Ax - b \|_2^2 + \mu \| x \|_2^2$$

Thanks to the regularization term, the objective function of this problem is strongly convex, which improves the properties of the solution. In the ridge regression problem, the regularization term $\|x\|_2^2$ penalizes $x$ with large norm. Therefore, another common variant, given a parameter $\sigma > 0$, is to solve:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} \| Ax - b \|_2^2 \quad \text{s.t.} \quad \| x \|_2 \leq \sigma$$
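In practice the penalized form can be handed to an off-the-shelf solver; for instance, scikit-learn's `Ridge` minimizes $\|Ax-b\|_2^2 + \alpha\|x\|_2^2$, which matches the penalized model above with $\alpha = 2\mu$ (an overall factor of 2 does not change the minimizer). The data below are synthetic, and `fit_intercept=False` keeps the pure form.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
A = rng.normal(size=(100, 10))
b = A @ rng.normal(size=10) + 0.1 * rng.normal(size=100)

mu = 0.5
# Ridge minimizes ||b - Ax||_2^2 + alpha * ||x||_2^2; alpha = 2*mu matches
# (1/2)||Ax - b||_2^2 + mu * ||x||_2^2 up to an overall factor of 2
model = Ridge(alpha=2 * mu, fit_intercept=False).fit(A, b)
x_ridge = model.coef_
print(x_ridge[:3])
```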

  2. LASSO Problem and Its Variants

If the solution $x$ we expect to obtain is sparse, we can consider adding the $\ell_1$ norm as the regularization term. The LASSO problem is:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} \| Ax - b \|_2^2 + \mu \| x \|_1$$

where $\mu > 0$ is a known real number and $x$ is the parameter to be estimated. The LASSO problem promotes sparsity of the solution through the $\ell_1$ norm. If $x$ is sparse, then the predicted value $b_i$ depends only on the components of $a_i$ corresponding to the nonzero entries of $x$. Thus, among the original $n$ features of the data points, the features that actually contribute to the prediction correspond to the nonzero components of $x$. Therefore, **the LASSO model serves the purpose of feature selection.**
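One standard way to solve the LASSO problem is the proximal gradient (ISTA) iteration, which alternates a gradient step on the smooth term with soft-thresholding; a minimal numpy sketch, with the step size chosen from the spectral norm of $A$:

```python
import numpy as np

def ista_lasso(A, b, mu, iters=500):
    """Proximal-gradient (ISTA) sketch for min 0.5*||Ax - b||_2^2 + mu*||x||_1."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2     # 1 / Lipschitz constant of the smooth part's gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)               # gradient of 0.5*||Ax - b||_2^2
        z = x - step * grad
        # soft-thresholding: proximal operator of step*mu*||.||_1
        x = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)
    return x
```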

Similarly, we can also consider the constrained problem:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2} \| Ax - b \|_2^2 \quad \text{s.t.} \quad \| x \|_1 \leq \sigma$$

Taking the noise $\varepsilon$ into account, we can also, given $\nu > 0$, consider the model:

$$\min_{x \in \mathbb{R}^n} \| x \|_1 \quad \text{s.t.} \quad \| Ax - b \|_2 \leq \nu$$

The traditional LASSO problem requires $x$ to be sparse. However, sparsity can take many forms in practice. If the parameter $x$ is group-sparse, that is, the components of $x$ can be divided into $G$ groups and the parameters within each group are either all nonzero or all zero, then the solution of the traditional LASSO problem cannot meet this requirement. For this purpose, the group LASSO model was proposed:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 + \mu \sum_{\ell = 1}^G \sqrt{n_\ell}\,\|x_{\mathcal{I}_\ell}\|_2$$

where $\mathcal{I}_\ell$ is the index set of the $\ell$-th group of variables, $|\mathcal{I}_\ell| = n_\ell$, and $\sum_{\ell = 1}^G n_\ell = n$.

If we need both group-level and individual-feature-level sparsity, we can combine the two regularization terms, yielding the so-called sparse group LASSO model:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 + \mu_1 \sum_{\ell = 1}^G \sqrt{n_\ell}\,\|x_{\mathcal{I}_\ell}\|_2 + \mu_2\|x\|_1$$

where $\mu_1, \mu_2 > 0$ are given constants. The regularization term of the sparse group LASSO model consists of two parts: $\|x\|_1$ ensures the sparsity of $x$ itself, and $\sum_{\ell = 1}^G \sqrt{n_\ell}\,\|x_{\mathcal{I}_\ell}\|_2$ ensures the group-sparsity of $x$.

In some practical problems, the parameter $x$ itself is not sparse but is sparse under some transformation, so the regularization term needs to be adjusted accordingly. The general form of the problem is:

$$\min_{x \in \mathbb{R}^n} \frac{1}{2}\|Ax - b\|_2^2 + \mu\|Fx\|_1$$

Logistic regression

In classification problems, the output variable takes values in a discrete space. For binary classification, the predicted variable has only two values, namely $-1$ and $1$. We briefly introduce a classic classification model in machine learning: logistic regression. Given a feature vector $a$, logistic regression assumes that the probability that the sample belongs to class $1$ is:
$$p(1 \mid a; x) = P(t = 1 \mid a) = \theta(a^T x),$$
where $\theta$ is the sigmoid function:
$$\theta(z) = \frac{1}{1 + \exp(-z)}.$$
Then the probability of belonging to class $-1$ is:
$$p(-1 \mid a; x) = 1 - p(1 \mid a; x) = \theta(-a^T x).$$
Thus, for $b \in \{-1, 1\}$, the above probabilities can be concisely written as:
$$p(b \mid a; x) = \theta(b \cdot a^T x).$$
Assume that the data pairs $\{a_i, b_i\},\ i = 1, 2, \ldots, m$ are independently and identically distributed. Then, given $a_1, a_2, \ldots, a_m$, the joint probability density of $b_1, b_2, \ldots, b_m$ is:
$$p(b_1, b_2, \ldots, b_m \mid a_1, a_2, \ldots, a_m) = \prod_{i=1}^m p(b_i \mid a_i).$$
Substituting the form of $p(b_i \mid a_i)$, we get:
$$p(b_1, b_2, \ldots, b_m \mid a_1, a_2, \ldots, a_m) = \frac{1}{\prod_{i=1}^m \left(1 + \exp(-b_i \cdot a_i^T x)\right)}.$$
Then the maximum likelihood estimation amounts to solving the following optimization problem:
$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \ln\left(1 + \exp(-b_i \cdot a_i^T x)\right).$$

As in sparse optimization, common models also take the properties of the parameter $x$ into account by adding a regularization term. For example:

  1. Tikhonov Regularization Model
    $$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \ln\left(1 + \exp(-b_i \cdot a_i^T x)\right) + \lambda \|x\|_2^2$$
  2. $\ell_1$ Norm Regularization Model
    $$\min_{x \in \mathbb{R}^n} \sum_{i=1}^m \ln\left(1 + \exp(-b_i \cdot a_i^T x)\right) + \lambda \|x\|_1$$

Assume that the data pairs $\{a_i, b_i\}$ are generated by the random variable pair $\{\alpha, \beta\}$. Then the loss function can be written in expectation form:
$$\mathbb{E}\left[ \ln\left(1 + \exp(-\beta \cdot \alpha^T x)\right) \right].$$

The Tikhonov-regularized and $\ell_1$-regularized problems above can be viewed as sampled versions of the following stochastic optimization problem:
$$\min_{x \in \mathbb{R}^n} \mathbb{E}\left[ \ln\left(1 + \exp(-\beta \cdot \alpha^T x)\right) \right] + \lambda r(x),$$
where $r(x)$ is the regularization term and $\lambda$ is the regularization parameter. Many machine learning problems can be written as the more general stochastic optimization problem or its discrete (finite-sum) version:
$$\min_{x \in \mathbb{R}^n} f(x) + \lambda r(x),$$
where
$$f(x) \stackrel{\text{def}}{=} \mathbb{E}[F(x, \xi)] = \int_\Omega F(x, \xi(\omega))\, \mathrm{d}P(\omega), \quad \text{or} \quad f(x) \stackrel{\text{def}}{=} \frac{1}{m} \sum_{i=1}^m f_i(x),$$
$\xi: \Omega \to W$ is a random variable defined on a given probability space $(\Omega, \mathcal{F}, P)$, $W$ is a measurable space, $F: \mathbb{R}^n \times W \to \mathbb{R}$, and $f_i: \mathbb{R}^n \to \mathbb{R},\ i = 1, 2, \ldots, m$ correspond to certain loss functions.

SVM

Support vector machine (SVM) is another widely used binary classification model. Given the sample points $(a_i, b_i)$ in the training dataset $D$, where $a_i \in \mathbb{R}^n$ and $b_i \in \{-1, 1\}$, the basic idea of SVM is to find a hyperplane that separates the sample points in $\mathbb{R}^n$ into two classes. We first consider a simple case: assume the training dataset is linearly separable, that is, the positive and negative classes lie exactly on the two sides of the hyperplane $x^Tw + y = 0$.
For a linearly separable dataset, the hyperplanes that satisfy this separation requirement are generally not unique. An ideal hyperplane should have the following property: the distances from the data points to the plane are relatively large. Elementary geometry tells us that the distance from a point $w$ in the sample space to the hyperplane $x^Tw + y = 0$ is
$$d = \frac{|x^Tw + y|}{\|x\|_2}$$
If this hyperplane classifies the sample point $(a_i, b_i)$ correctly, then
$$b_i(a_i^Tx + y) > 0, \quad i = 1, 2, \ldots, m$$
To find an ideal hyperplane $x^Tw + y = 0$ separating the data points, we require that the minimum distance from the two classes of data points to the hyperplane be as large as possible. This gives the following original model:
$$\max_{x, y, \gamma}\ \gamma, \quad \text{s.t.} \quad \frac{b_i(a_i^Tx + y)}{\|x\|_2} \geq \gamma, \quad i = 1, 2, \ldots, m$$
The meaning of $\gamma$ in this problem is clear: it is the minimum distance from all sample points to the hyperplane $x^Tw + y = 0$, and the goal is to maximize it. Next, we perform an equivalent transformation to put the problem in a more natural form. Note that the constraints are equivalent to
$$b_i(a_i^Tx + y) \geq \gamma\|x\|_2, \quad i = 1, 2, \ldots, m$$
Scaling $x$ and $y$ by the same positive factor affects neither the constraints nor the objective. Using this property, we shrink the feasible region by imposing $\|x\|_2 = \frac{1}{\gamma}$. The problem is then equivalent to
$$\begin{aligned} \max_{x, y, \gamma}\quad & \gamma, \\ \text{s.t.}\quad & b_i(a_i^Tx + y) \geq \gamma\|x\|_2, \quad i = 1, 2, \ldots, m, \\ & \|x\|_2 = \frac{1}{\gamma} \end{aligned}$$
Finally, eliminating the variable $\gamma$ and the constraint $\|x\|_2 = \frac{1}{\gamma}$, we obtain the optimization problem
$$\begin{aligned} \min_{x, y}\quad & \frac{1}{2}\|x\|_2^2, \\ \text{s.t.}\quad & b_i(a_i^Tx + y) \geq 1, \quad i = 1, 2, \ldots, m \end{aligned}$$
Solving this problem gives $x, y$; the points $a_i$ for which $b_i(a_i^Tx + y) = 1$ holds are called support vectors. It is not hard to see that the parameters $x, y$ of the hyperplane are completely determined by the support vectors; in other words, removing the data points that are not support vectors does not affect the solution of the problem.

When the linear separability assumption does not hold, we introduce a nonnegative slack variable $\xi_i$ for each data point, allowing some points to be misclassified. The constraints then become
$$\frac{b_i(a_i^Tx + y)}{\|x\|_2} \geq \gamma(1 - \xi_i), \quad \xi_i \geq 0, \quad i = 1, 2, \ldots, m$$
Here, the relative form $\gamma(1 - \xi_i)$ is used to represent the "distance" of misclassified points. Clearly, there should not be too many misclassified points, so we control the degree of violation through the sum $\sum_{i = 1}^m \xi_i$ of the slack variables. After an equivalent transformation similar to the previous one, we finally obtain the problem
$$\begin{aligned} \min_{x, y, \xi}\quad & \frac{1}{2}\|x\|_2^2 + \mu\sum_{i = 1}^m \xi_i, \\ \text{s.t.}\quad & b_i(a_i^Tx + y) \geq 1 - \xi_i, \\ & \xi_i \geq 0, \quad i = 1, 2, \ldots, m \end{aligned}$$
where $\mu > 0$ is a penalty coefficient. Increasing $\mu$ increases the penalty for misclassification, and decreasing $\mu$ decreases it. Both optimization problems above are special cases of quadratic programming.

The above problem is also equivalent to the unconstrained optimization problem:
$$\min_{x, y} \frac{1}{2}\|x\|_2^2 + \mu\sum_{i = 1}^m \max\{1 - b_i(a_i^Tx + y), 0\}$$
Notice that $\max\{1 - b_i(a_i^Tx + y), 0\}$ penalizes the amount by which the inequality $b_i(a_i^Tx + y) \geq 1$ is violated, so this problem can also be regarded as a penalty-function formulation of the constrained problem. Although $\max\{z, 0\}$ is not differentiable, the simple form of this problem opens up many possibilities for algorithm design, and introducing different penalty functions yields different SVM models. In addition, when the training data contain redundant features, one may also replace the $\|x\|_2$ term with the $\ell_1$ norm:
$$\min_{x \in \mathbb{R}^n,\, y} \|x\|_1 + \mu\sum_{i = 1}^m \max\{1 - b_i(a_i^Tx + y), 0\}$$

.......
