Least Squares Data Fitting

Learning Objectives

Set up a linear least-squares problem from a set of data
Use an SVD to solve the least-squares problem
Quantify the error in a least-squares problem

Links and Other Content

Nothing here yet.

Linear Regression with a Set of Data

Consider a set of \(m\) data points (where \(m>2\)), \(\{(t_1,y_1),(t_2,y_2),\dots,(t_m,y_m)\}\). Suppose we want to find a straight line that best fits these data points. Mathematically, we are finding \(a\) and \(b\) such that \[ y_i = a\,t_i + b, \quad \forall i \in [1,m]. \]

In matrix form, the resulting linear system is \[ \begin{bmatrix} t_1 & 1\\ t_2 & 1\\ \vdots & \vdots\\ t_m & 1 \end{bmatrix} \begin{bmatrix} a\\ b \end{bmatrix} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix} \]

However, it is obvious that we have more equations than unknowns, and there is usually no exact solution to the above problem.

Generally, if we have a linear system \[ A \boldsymbol{x} = \boldsymbol{b} \] where \(A\) is an \(m\times n\) matrix and \(m>n\) then we call this system overdetermined and the equality is usually not exactly satisfiable as \(\boldsymbol{b}\) may not lie in the column space of \(A\).

Therefore, an overdetermined system is better written as \[ A \boldsymbol{x} \cong \boldsymbol{b}. \]

Linear Least-squares Problem

For an overdetermined system \(A \boldsymbol{x} \cong \boldsymbol{b}\), we are typically looking for a solution \(\boldsymbol{x}\) that minimizes the squared Euclidean norm of the residual vector \(\boldsymbol{r} = \boldsymbol{b} - A \boldsymbol{x}\), \[ \min_{\boldsymbol{x}} \|\boldsymbol{r}\|_2^2 = \min_{\boldsymbol{x}} \|\boldsymbol{b} - A \boldsymbol{x}\|_2^2. \]

This problem \(A \boldsymbol{x} \cong \boldsymbol{b}\) is called a linear least-squares problem, and the solution \(\boldsymbol{x}\) is called least-squares solution. We will first focus on linear least-squares problems; in later sections, we will also talk about nonlinear least-squares problems.

Data Fitting vs Interpolation

It is important to understand that interpolation and least-squares data fitting, while somewhat similar, are fundamentally different in their goals. In both problems we have a set of data points \((t_i, y_i)\), \(i=1,\ldots,m\), and we are attempting to determine the coefficients for a linear combination of basis functions.

Recall that with interpolation, we are looking for the linear combination of basis functions such that the resulting function passes through each of the data points exactly. So, for \(m\) unique data points, to interpolate the data points we would need \(m\) linearly independent basis functions (and the resulting linear system will be square and full rank, so it will have an exact solution).

In contrast, however, with least squares data fitting we have some model that we are trying to find the parameters for such that the model best fits the data points. So, for example, with linear least squares we may have 300 noisy data points that we want to model as a quadratic function. So, we are trying represent our data as \[y = x_0 + x_1 t + x_2 t^2,\] where \(x_0, x_1,\) and \(x_2\) are the unknowns we want to determine (the coefficients to our basis functions). Because there are significantly more data points than parameters, we do not expect that the function will exactly pass through the data points. For this example, with noisy data points we would not want our function to pass through the data points exactly as we are looking to model the general trend and not capture the noise.

Normal Equations

Consider the least squares problem, \(A \boldsymbol{x} \cong \boldsymbol{b}\), where \(A\) is \(m \times n\) real matrix (with \(m > n\)). As stated above, the least squares solution minimizes the squared 2-norm of the residual. So, we want to find the \(\mathbf{x}\) that minimizes the function:

\[\phi(\mathbf{x}) = \|\mathbf{r}\|_2^2 = (\mathbf{b} - A\mathbf{x})^\top (\mathbf{b} - A\mathbf{x}) = \mathbf{b}^\top \mathbf{b} - 2\mathbf{x}^\top A^\top \mathbf{b} + \mathbf{x}^\top A^\top A \mathbf{x}.\]

From calculus, we know that the minimizer occurs when \(\nabla \phi(\mathbf{x}) = 0\): \[\nabla \phi(\mathbf{x}) = -2 A^\top \mathbf{b} + 2 A^\top A \mathbf{x} = 0.\]

The resulting square (\(n \times n\)) linear system \[A^\top A \mathbf{x} = A^\top \mathbf{b}\] is called the system of normal equations. Because the system of normal equations is symmetric, the normal equations can be solved by computing and using a Cholesky factorization.

Although the least squares problem can be solved via the normal equations, the normal equations tend to worsen the conditioning of the problem. Specifically, \[\text{cond}(A^\top A) = (\text{cond}(A))^2.\] Because of this, although solving a least-squares problem using the normal equations is less costly than other methods for finding the least squares solution, it is often not a good choice.

Solving Least-Squares Problems Using SVD

Another way to solve the least-squares problem \(A \boldsymbol{x} \cong \boldsymbol{b}\) (where we are looking for \(\boldsymbol{x}\) that minimizes \(\|\boldsymbol{b} - A \boldsymbol{x}\|_2^2\)) is to use the singular value decomposition (SVD) of \(A\), \[ A = U \Sigma V^T \] to reduce this problem.

With the SVD of \(A\), we have \[ \begin{align*} \|\boldsymbol{b} - A \boldsymbol{x}\|_2^2 &= \|\boldsymbol{b} - U \Sigma V^T \boldsymbol{x}\|_2^2 & (1)\\ &= \|U^T (\boldsymbol{b} - U \Sigma V^T \boldsymbol{x})\|_2^2 & (2)\\ &= \|U^T \boldsymbol{b} - \Sigma V^T \boldsymbol{x}\|_2^2 \end{align*} \] We can go from (1) to (2) because multiplying a vector by an orthogonal matrix does not change the 2-norm of the vector.

Now let \[ \boldsymbol{y} = V^T \boldsymbol{x} \] \[ \boldsymbol{z} = U^T \boldsymbol{b}, \] then we are looking for \(\boldsymbol{y}\) that minimizes \[ \|\boldsymbol{z} - \Sigma \boldsymbol{y}\|_2^2. \]

Note that \[ \Sigma \boldsymbol{y} = \begin{bmatrix} \sigma_1 y_1\\ \sigma_2 y_2\\ \vdots \\ \sigma_n y_n \end{bmatrix},(\sigma_i = \Sigma_{i,i}) \] so we choose \[ y_i = \begin{cases} \frac{z_i}{\sigma_i} & \sigma_i \neq 0\\ 0 & \sigma_i = 0 \end{cases} \] which will minimize \(\|\boldsymbol{z} - \Sigma \boldsymbol{y}\|_2^2\).

Finally, we compute \[ \boldsymbol{x} = V \boldsymbol{y} \] to find \(\boldsymbol{x}\).

The expression of least-squares solution is \[ \boldsymbol{x} = \sum_{\sigma_i \neq 0} \frac{\boldsymbol{u}_i^T \boldsymbol{b}}{\sigma_i} \boldsymbol{v}_i \] where \(\boldsymbol{u}_i\) represents the \(i\)th column of \(U\) and \(\boldsymbol{v}_i\) represents the \(i\)th column of \(V\).

Determining Residual in Least-Squares Problem Using SVD

We've shown above how the SVD can be used to find the least-squares solution (the solution that minimizes the squared 2-norm of the residual) to the least squares problem \(A \boldsymbol{x} \cong \boldsymbol{b}\). We can also use the SVD to determine an exact expression for the value of the residual with the least-squares solution.

Assume in the SVD of \(A\), \(A = U \Sigma V^T\), the diagonal entries of \(\Sigma\) are in descending order (\(\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n\)), and the first \(k\) diagonal entries of \(\Sigma\) are nonzeros while all others are zeros, then \[ \begin{align*} \|\boldsymbol{b} - A \boldsymbol{x}\|_2^2 &= \|\boldsymbol{z} - \Sigma \boldsymbol{y}\|_2^2\\ &= \sum_{i=1}^n (z_i - \sigma_i y_i)^2\\ &= \sum_{i=1}^k (z_i - \sigma_i \frac{z_i}{\sigma_i})^2 + \sum_{i=k+1}^n (z_i - 0)^2\\ &= \sum_{i=k+1}^n z_i^2 \end{align*} \]

Recall that \[ \boldsymbol{z} = U^T \boldsymbol{b}, \] so \[ \|\boldsymbol{b} - A \boldsymbol{x}\|_2^2 = \sum_{i=k+1}^n (\boldsymbol{u}_i^T \boldsymbol{b})^2 \] where \(\boldsymbol{u}_i\) represents the \(i\)th column of \(U\).

Example of a Least-squares Problem

Assume we have \(3\) data points, \(\{(t_i,y_i)\}=\{(1,1.2),(2,1.9),(3,1)\}\), we want to find a line that best fits these data points. The code for using SVD to solve this least-squares problem is:

import numpy as np
import numpy.linalg as la

A = np.array([[1,1],[2,1],[3,1]])
b = np.array([1.2,1.9,1])
U, s, V = la.svd(A)
V = V.T
y = np.zeros(len(A[0]))
z = np.dot(U.T,b)
k = 0
threshold = 0.01
while k < len(A[0]) and s[k] > threshold:
	y[k] = z[k]/s[k]
	k += 1
x = np.dot(V,y)
print("The function of the best line is: y = " + str(x[0]) + "x + " + str(x[1]))

Non-linear Least-squares Problem vs. Linear Least-squares Problem

The above linear least-squares problem is associated with an overdetermined linear system \(A \boldsymbol{x} \cong \boldsymbol{b}.\) This problem is called a linear one because the fitting function we are looking for is linear in the components of \(\boldsymbol{x}\). For example, if we are looking for a polynomial fitting function \[ f(t,\boldsymbol{x}) = x_1 + x_2t + x_3t^2 + \dotsb + x_nt^{n-1} \] to fit the data points \(\{(t_i,y_i)|1 \le i \le m\}\) (\(m > n\)), the problem is still a linear least-squares problem, because \(f(t,\boldsymbol{x})\) is linear in the components of \(\boldsymbol{x}\) (though \(f(t,\boldsymbol{x})\) is nonlinear in \(t\)).

If the fitting function \(f(t,\boldsymbol{x})\) for data points \(\{(t_i,y_i)|1 \le i \le m\}\) is nonlinear in the components of \(\boldsymbol{x}\), then the problem is a non-linear least-squares problem. For example, fitting sum of exponentials \[ f(t,\boldsymbol{x}) = x_1 e^{x_2 t} + x_2 e^{x_3 t} + \dotsb + x_{n-1} e^{x_n t} \] is a non-linear least-squares problem.

Non-linear Least-squares Problem Set Up

Given \(m\) data points, \(\{(t_1,y_1),(t_2,y_2),\dots,(t_m,y_m)\}\), we want to find a curve \(f(t,\boldsymbol{x})\) that best fits the data points. Mathematically, we are finding \(\boldsymbol{x}\) such that the squared Euclidean norm of the residual vector \[ \|\boldsymbol{r}(\boldsymbol{x})\|_2^2 \] is minimized, where the components of the residual vector are \[ r_i(\boldsymbol{x}) = y_i - f(t_i,\boldsymbol{x}). \]

Equivalently, we want to minimize \[ \phi(\boldsymbol{x}) = \frac{1}{2} \boldsymbol{r}^T(\boldsymbol{x})\boldsymbol{r}(\boldsymbol{x}). \]

To solve this optimization problem, we can use steepest descent, see Reference Page: Optimization. We compute the gradient vector \[ \nabla \phi(\boldsymbol{x}) = J^T(\boldsymbol{x})\boldsymbol{r}(\boldsymbol{x}) ,\] where \(J(\boldsymbol{x})\) is the Jacobian of \(\boldsymbol{r}(\boldsymbol{x})\).

Then, we use steepest descent as usual with this gradient vector and our objective function, \(\phi(\boldsymbol{x})\), iterating until we converge to a solution.

Example of a Non-linear Least Squares Problem

Consider fitting \(m\) data points, \((t_0, y_0), (t_1, y_1), \dots, (t_{m-1}, y_{m-1})\), with the curve \[f(t, \mathbf{x}) = x_0 + x_1 e^{-x_2 t}.\]

The components of the residual are given by: \[r_i(\mathbf{x}) = y_i - (x_0 + x_1 e^{-x_2 t_i}).\]

The gradient of \(\phi(\mathbf{x})\) is given by: \[\begin{bmatrix} -1 & -1 & \dots & -1 \\ -e^{-x_2 t_0} & -e^{-x_2 t_1} & \dots & -e^{-x_2 t_{m-1}} \\ x_1 t_0 e^{-x_2 t_0} & x_1 t_1 e^{-x_2 t_1} & \dots & x_1 t_{m-1} e^{-x_2 t_{m-1}} \end{bmatrix} \begin{bmatrix} y_0 - x_0 - x_1 e^{-x_2 t_0} \\ \vdots \\ y_{m-1} - x_0 - x_1 e^{-x_2 t_{m-1}} \end{bmatrix}. \]

With steepest descent, we would use a line search along the direction of the negative gradient to find the overall step and use this to find the next iterate.

Review Questions

What does the least-squares solution minimize?
For a given model and given data points, can you form the system \(A \boldsymbol{x} \cong \boldsymbol{b}\) for a least squares problem?
For a small problem, given some data points and a model, can you determine the least squares solution?
In general, what can we say about the value of the residual for the least squares solution?
What are the differences between least squares data fitting and interpolation?
Given the SVD of a matrix \(A\), how can we use the SVD to compute the residual of the least squares solution?
Given the SVD of a matrix \(A\), how can we use the SVD to compute the least squares solution? Be able to do this for a small problem.
Why would you use the SVD instead of normal equations to find the solution to \(A \boldsymbol{x} \cong \boldsymbol{b}\)?
Which costs less: solving a least squares problem via the normal equations or solving a least squares problem using the SVD?
What is the difference between a linear and a nonlinear least squares problem? What sort of model makes it a nonlinear problem? For data points \((t_i, y_i)\), is fitting \(y = a \cos(t) + b\) where \(a\) and \(b\) are the coefficients we are trying to determine a linear or nonlinear least squares problem?
For a given model and given data, what is the residual vector for a a nonlinear least squares problem?
How do we solve a nonlinear least squares problem? What do we minimize when solving a nonlinear least squares problem?
Consider solving a nonlinear least squares problem using gradient descent. For a given model, how do you compute the direction of the step?

ChangeLog

2017-11-29 Erin Carrier : fixes typos in lst-sq code, jacobian desc in nonlinear lst-sq
2017-11-17 Erin Carrier : fixes incorrect link
2017-11-16 Erin Carrier : adds review questions minor formatting changes throughout for consistency, adds normal equations and interp vs lst-sq sections removes Gauss-Newton from nonlinear least squares
2017-11-12 Yu Meng : first complete draft
2017-10-17 Luke Olson : outline