Online Recursive Least Squares

## 1 Introduction to Online Recursive Least Squares

The following online recursive least squares derivation comes from class notes provided for Dr. Shieh's ECE 7334 Advanced Digital Control Systems at the University of Houston. Similar derivations are presented in [1], [2], and [3].

The final update equations are collected in Section 3.

## 2 Derivation of Online Recursive Least Squares[4]

The derivation for the online recursive least squares (RLS) method is as follows

 $LaTeX: \hat{Y}_{p}=\begin{bmatrix}\hat{y}_{0} \\ \hat{y}_{1} \\ \vdots \\ \hat{y}_{p-1}\end{bmatrix} \in R^{p \times 1},$ 'Actual System Outputs'

 $LaTeX: A_{p}=\begin{bmatrix}\phi_{0}^{T} \\ \phi_{1}^{T} \\ \vdots \\ \phi_{p-1}^{T}\end{bmatrix} \in R^{p \times \left(n+m+1\right)},$ 'System Inputs'

 $LaTeX: \theta_{p}=\begin{bmatrix}b_{0} \\ b_{1} \\ \vdots \\ b_{m} \\ a_{1} \\ a_{2} \\ \vdots \\ a_{n}\end{bmatrix} \in R^{\left(n+m+1\right) \times 1},$ 'System Model Coefficients'

and

 $LaTeX: N_{p}=\begin{bmatrix}\nu_{0} \\ \nu_{1} \\ \vdots \\ \nu_{p-1}\end{bmatrix} \in R^{p \times 1}.$ 'Noise or Error'

The data set

 $LaTeX: \hat{Y}_{p}=A_{p}\theta_{p}+N_{p}$

leads to the optimal parameter equation

 $LaTeX: \hat{\theta}_{p}=\left(A_{p}^{T}\bar{W}_{p}A_{p}\right)^{-1}A_{p}^{T}\bar{W}_{p}\hat{Y}_{p},$

This is conceptually just $LaTeX: \theta_{p}=A_{p}^{-1}\hat{Y}_{p}$ with the noise term $LaTeX: N_{p}$ driven to zero, where

$LaTeX: \bar{W}_{p}$ is the weighting matrix, and
$LaTeX: A_{p}^{+}=\left(A_{p}^{T}A_{p}\right)^{-1}A_{p}^{T}$ is the pseudo-inverse that takes the place of $LaTeX: A_{p}^{-1}$, since $LaTeX: A_{p}$ is generally not square.
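As an illustration of the batch form above, the weighted least squares solution can be computed directly. The following is a minimal sketch (not from the original notes; the coefficients, data sizes, and use of NumPy are assumptions), solving the normal equations rather than forming the inverse explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = np.array([2.0, -1.0, 0.5])               # hypothetical coefficients
A = rng.standard_normal((100, 3))                     # regressor matrix A_p
Y = A @ theta_true + 0.01 * rng.standard_normal(100)  # outputs with small noise
W = np.eye(100)                                       # uniform weighting matrix

# Solve (A^T W A) theta = A^T W Y rather than forming the inverse explicitly.
theta_hat = np.linalg.solve(A.T @ W @ A, A.T @ W @ Y)
print(theta_hat)                                      # close to [2.0, -1.0, 0.5]
```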

This is a recursive algorithm, which means that data points are added sequentially. To add a new data point, use

 $LaTeX: \hat{Y}_{p+1}=\begin{bmatrix}\hat{Y}_{p} \\ \hat{y}_{p+1}\end{bmatrix}, A_{p+1}=\begin{bmatrix}A_{p} \\ \phi_{p+1}^{T}\end{bmatrix},$

where

 $LaTeX: \left(\hat{y}_{p+1}, \phi_{p+1}^{T}\right)$

is the new data pair and $LaTeX: \theta_{p}$ is the old set of model coefficients. The updated weighting matrix is

 $LaTeX: \bar{W}_{p+1}=\begin{bmatrix}\bar{W}_{p} & 0 \\ 0 & w_{p+1}\end{bmatrix}$

Use

 $LaTeX: \hat{\theta}_{p+1}=\left(A_{p+1}^{T}\bar{W}_{p+1}A_{p+1}\right)^{-1}A_{p+1}^{T}\bar{W}_{p+1}\hat{Y}_{p+1}$

 $LaTeX: \hat{\theta}_{p+1}=\left( \begin{bmatrix} A_{p}^{T} & \phi_{p+1} \end{bmatrix} \begin{bmatrix} \bar{W}_{p} & 0 \\ 0 & w_{p+1} \end{bmatrix} \begin{bmatrix} A_{p} \\ \phi_{p+1}^{T} \end{bmatrix} \right)^{-1} \begin{bmatrix} A_{p}^{T} & \phi_{p+1} \end{bmatrix} \begin{bmatrix} \bar{W}_{p} & 0 \\ 0 & w_{p+1} \end{bmatrix} \begin{bmatrix} \hat{Y}_{p} \\ \hat{y}_{p+1} \end{bmatrix}$

 $LaTeX: \hat{\theta}_{p+1}=\left[ A_{p}^{T}\bar{W}_{p}A_{p}+\phi_{p+1}w_{p+1}\phi_{p+1}^{T}\right]^{-1}\left[ A_{p}^{T}\bar{W}_{p}\hat{Y}_{p}+\phi_{p+1}w_{p+1}\hat{y}_{p+1}\right]$ Eqn. 1

to find the optimal model coefficients.

To find the relationship between the old coefficients $LaTeX: \hat{\theta}_{p}$ and the new $LaTeX: \hat{\theta}_{p+1}$ use

 $LaTeX: \hat{\theta}_{p}=P_{p}A_{p}^{T}\bar{W}_{p}\hat{Y}_{p} \Rightarrow P_{p}^{-1}\hat{\theta}_{p}=A_{p}^{T}\bar{W}_{p}\hat{Y}_{p}$

Where $LaTeX: P_{p}$ is defined as

 $LaTeX: P_{p} \equiv \left(A_{p}^{T}\bar{W}_{p}A_{p}\right)^{-1}$ Eqn. 2

$LaTeX: P_{p}$ is a symmetric positive-definite matrix and the updated value is

 $LaTeX: P_{p+1} \equiv \left(A_{p+1}^{T}\bar{W}_{p+1}A_{p+1}\right)^{-1}=\left[A_{p}^{T}\bar{W}_{p}A_{p}+\phi_{p+1}w_{p+1}\phi_{p+1}^{T}\right]^{-1}>0$ Eqn. 3

 $LaTeX: P_{p+1}^{-1}=A_{p+1}^{T}\bar{W}_{p+1}A_{p+1}$

 $LaTeX: P_{p+1}^{-1}=A_{p}^{T}\bar{W}_{p}A_{p}+\phi_{p+1}w_{p+1}\phi_{p+1}^{T}$

which reduces, using Eqn. 2, to

 $LaTeX: P_{p+1}^{-1}=P_{p}^{-1}+\phi_{p+1}w_{p+1}\phi_{p+1}^{T}$

and finally to

 $LaTeX: P_{p}^{-1}=P_{p+1}^{-1}-\phi_{p+1}w_{p+1}\phi_{p+1}^{T}$

Combine Eqns. 1 and 3 to obtain the equation

 $LaTeX: \hat{\theta}_{p+1}=P_{p+1}\left(A_{p}^{T}\bar{W}_{p}\hat{Y}_{p}+\phi_{p+1}w_{p+1}\hat{y}_{p+1}\right)$

 $LaTeX: \hat{\theta}_{p+1}=P_{p+1} \left( P_{p}^{-1} \hat{\theta}_{p}+\phi_{p+1} w_{p+1} \hat{y}_{p+1} \right)$

 $LaTeX: \hat{\theta}_{p+1}=P_{p+1}\left[\left(P_{p+1}^{-1}-\phi_{p+1}w_{p+1}\phi_{p+1}^{T}\right)\hat{\theta}_{p}+\phi_{p+1}w_{p+1}\hat{y}_{p+1}\right]$

which gives the new model coefficients in terms of $LaTeX: P$ (which summarizes all of the previous input and output data) and the current data point. The equation for the new model coefficients can be rearranged into the form

 $LaTeX: \hat{\theta}_{p+1}=\hat{\theta}_{p}+P_{p+1}\phi_{p+1}w_{p+1}\left(\hat{y}_{p+1}-\phi_{p+1}^{T}\hat{\theta}_{p}\right)$

where

$LaTeX: \hat{\theta}_{p}$ is the previous set of model coefficients and
$LaTeX: \left(\hat{y}_{p+1}-\phi_{p+1}^{T}\hat{\theta}_{p}\right)$ is the error between the actual and predicted system outputs. Defining a gain matrix gives the compact form

 $LaTeX: \hat{\theta}_{p+1}=\hat{\theta}_{p}+K_{p+1}\left(\hat{y}_{p+1}-\phi_{p+1}^{T}\hat{\theta}_{p}\right)$

where

 $LaTeX: K_{p+1} \equiv P_{p+1}\phi_{p+1} w_{p+1}$

is the discrete Kalman gain matrix.
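One way to check the gain-form update is to compare a single recursive step against a batch recomputation over all of the data; the two are algebraically identical and should agree to machine precision. A minimal NumPy sketch (the data, weighting, and coefficients here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
theta_true = np.array([1.0, 2.0, 3.0])                # hypothetical coefficients
A_p = rng.standard_normal((20, 3))                    # previous regressors
Y_p = A_p @ theta_true + 0.1 * rng.standard_normal(20)
W_p = np.eye(20)                                      # uniform weighting

P_p = np.linalg.inv(A_p.T @ W_p @ A_p)                # Eqn. 2
theta_p = P_p @ A_p.T @ W_p @ Y_p                     # old coefficients

# New data pair (y, phi) with weight w.
phi = rng.standard_normal(3)
y_new = phi @ theta_true + 0.1 * rng.standard_normal()
w = 1.0

# P_{p+1}^{-1} = P_p^{-1} + phi w phi^T and K_{p+1} = P_{p+1} phi w.
P_next = np.linalg.inv(np.linalg.inv(P_p) + w * np.outer(phi, phi))
K = P_next @ phi * w
theta_next = theta_p + K * (y_new - phi @ theta_p)

# The single recursive step matches a batch solve over all 21 points.
A_full = np.vstack([A_p, phi])
Y_full = np.append(Y_p, y_new)
theta_batch = np.linalg.solve(A_full.T @ A_full, A_full.T @ Y_full)
print(np.allclose(theta_next, theta_batch))           # True
```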

The algorithm derived to this point requires a matrix inversion at every step, which makes it too slow for online use. The inversion can be avoided through the use of the inverse lemma

 $LaTeX: \left(A+BCD\right)^{-1}=A^{-1}-A^{-1}B\left(DA^{-1}B+C^{-1}\right)^{-1}DA^{-1},$ Inverse Lemma

where $LaTeX: P_{p}^{-1}=A, \phi_{p+1}=B, w_{p+1}=C,$ and $LaTeX: \phi_{p+1}^{T}=D$. The inverse lemma leads to the following equations

 $LaTeX: P_{p+1}^{(i+1)}=P_{p}^{(i)}-K_{p+1}^{(i+1)}\left(P_{p}^{(i)}\phi_{p+1}^{(i+1)}\right)^{T}>0$

where

 $LaTeX: K_{p+1}^{(i+1)}=\frac{P_{p}^{(i)} \phi_{p+1}^{(i+1)}}{\left(\frac{1}{w_{p+1}^{(i+1)}}+\phi_{p+1}^{T(i+1)} P_{p}^{(i)} \phi_{p+1}^{(i+1)}\right)}$

and finally

 $LaTeX: \hat{\theta}_{p+1}^{(i+1)}=\hat{\theta}_{p}^{(i)}+K_{p+1}^{(i+1)}e_{p+1}^{(i+1)}$

The variable i represents the iteration number and the innovation sequence is

 $LaTeX: e_{p+1}^{(i+1)}=\hat{y}_{p+1}^{(i+1)}-\phi_{p+1}^{T(i+1)}\hat{\theta}_{p}^{(i)}$

While the resulting algorithm avoids any matrix inverses, it is not suitable for rapidly time-varying systems. After many samples the algorithm effectively stops updating at steady state: so much data has already been incorporated that a new point barely moves the estimate. For example, after 10,000 equally weighted data points a new data point is effectively scaled by 1/10,001 versus the combined effect of the prior points (10,000/10,001).
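The shrinking-gain behavior can be sketched on the simplest possible case, a scalar constant model with $LaTeX: \phi=1$ and $LaTeX: w=1$ (an illustrative sketch with $LaTeX: P$ initialized to 1 rather than the $LaTeX: \alpha=1000$ suggested later):

```python
# With no forgetting, the covariance collapses as P_p = 1/(p+1), so after
# 10,000 samples a new point moves the estimate by only 1/10,001 of the
# innovation.
P = 1.0
K = 0.0
for p in range(10_000):
    K = P / (1.0 + P)   # gain: K = P*phi / (1/w + phi^T P phi) with phi = w = 1
    P = P - K * P       # covariance update: P_{p+1} = P_p - K (P_p phi)^T
print(K)                # ~ 1/10001
```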

To improve the algorithm's performance a forgetting factor is added to

 $LaTeX: P_{p+1}^{-1}=A_{p+1}^{T}\bar{W}_{p+1}A_{p+1}=A_{p}^{T}\bar{W}_{p}A_{p}+\phi_{p+1}w_{p+1}\phi_{p+1}^{T}$

 $LaTeX: P_{p+1}^{-1}=A_{p}^{T}\left(\lambda_{p+1}\bar{W}_p\right)A_{p}+\phi_{p+1}\left(1\right) w_{p+1}\phi_{p+1}^{T}$

where

$LaTeX: \lambda_{p+1}<1$ is the forgetting factor, which discounts old data, and the factor $LaTeX: \left(1\right)$ indicates that the newest data point is not discounted.

If all $LaTeX: P_{p+1}$ are replaced with $LaTeX: \lambda_{p+1}P_{p+1}$ and $LaTeX: w_{p+1}$ with $LaTeX: \frac{w_{p+1}}{\lambda_{p+1}}$ then the final form of the algorithm is achieved. The equations are

 $LaTeX: \lambda_{p+1}^{(i+1)}=\lambda_{p}^{(i)}\lambda_{0}+\left(1-\lambda_{0}\right)$

where $LaTeX: \lambda_{0}<1$

 $LaTeX: K_{p+1}^{(i+1)}=\frac{P_{p}^{(i)}\phi_{p+1}^{(i+1)}}{\left(\frac{\lambda_{p+1}}{w_{p+1}}\right)+\phi_{p+1}^{T(i+1)}P_{p}^{(i)}\phi_{p+1}^{(i+1)}}$

 $LaTeX: P_{p+1}=\frac{P_{p}^{(i)}-K_{p+1}^{(i+1)}\left(P_{p}^{(i)}\phi_{p+1}^{(i+1)}\right)^{T}}{\lambda_{p+1}^{(i+1)}}$

 $LaTeX: e_{p+1}^{(i+1)}=\hat{y}_{p+1}^{(i+1)}-\phi_{p+1}^{T(i+1)}\hat{\theta}_{p}^{(i)}$

 $LaTeX: \hat{\theta}_{p+1}^{(i+1)}=\hat{\theta}_{p}^{(i)}+K_{p+1}^{(i+1)}e_{p+1}^{(i+1)}$

with initial conditions

 $LaTeX: \hat{\theta}_{p}^{(0)}=0_{\left(n+m+1\right) \times 1}$

and

 $LaTeX: P_{p}^{(0)}=\alpha I_{n+m+1}$

where

$LaTeX: \alpha=1000$ is a good starting value and
$LaTeX: I_{n+m+1}$ is the identity matrix of size $LaTeX: \left(n+m+1\right) \times \left(n+m+1\right)$.

## 3 Final Online Recursive Least Squares Equation Set

These are the final online recursive least squares update equations in order.

 $LaTeX: \lambda_{p+1}^{(i+1)}=\lambda_{p}^{(i)}\lambda_{0}+\left(1-\lambda_{0}\right)$

 $LaTeX: K_{p+1}^{(i+1)}=\frac{P_{p}^{(i)}\phi_{p+1}^{(i+1)}}{\left(\frac{\lambda_{p+1}}{w_{p+1}}\right)+\phi_{p+1}^{T(i+1)}P_{p}^{(i)}\phi_{p+1}^{(i+1)}}$

 $LaTeX: P_{p+1}=\frac{P_{p}^{(i)}-K_{p+1}^{(i+1)}\left(P_{p}^{(i)}\phi_{p+1}^{(i+1)}\right)^{T}}{\lambda_{p+1}^{(i+1)}}$

 $LaTeX: e_{p+1}^{(i+1)}=\hat{y}_{p+1}^{(i+1)}-\phi_{p+1}^{T(i+1)}\hat{\theta}_{p}^{(i)}$

 $LaTeX: \hat{\theta}_{p+1}^{(i+1)}=\hat{\theta}_{p}^{(i)}+K_{p+1}^{(i+1)}e_{p+1}^{(i+1)}$

with initial conditions

 $LaTeX: \hat{\theta}_{p}^{(0)}=0_{\left(n+m+1\right) \times 1}$

and

 $LaTeX: P_{p}^{(0)}=\alpha I_{n+m+1}$

where

$LaTeX: \lambda_{0}<1$,
$LaTeX: \alpha=1000$ is a good starting value, and
$LaTeX: I_{n+m+1}$ is the identity matrix of size $LaTeX: \left(n+m+1\right) \times \left(n+m+1\right)$.
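The final equation set can be sketched as a short NumPy routine. This is a minimal illustrative implementation, not code from the original notes; the second-order test system, the value of $LaTeX: \lambda_{0}$, and the white-noise excitation are assumptions:

```python
import numpy as np

def rls_step(theta, P, lam, phi, y, lam0=0.98, w=1.0):
    """One step of the final update equations with a variable forgetting factor."""
    lam = lam * lam0 + (1.0 - lam0)            # forgetting-factor recursion
    Pphi = P @ phi
    K = Pphi / (lam / w + phi @ Pphi)          # Kalman gain
    P = (P - np.outer(K, Pphi)) / lam          # covariance update
    e = y - phi @ theta                        # innovation sequence
    theta = theta + K * e                      # coefficient update
    return theta, P, lam

rng = np.random.default_rng(2)
theta_true = np.array([0.8, 0.3])              # y_k = 0.8 y_{k-1} + 0.3 u_k
theta = np.zeros(2)                            # theta^(0) = 0
P = 1000.0 * np.eye(2)                         # P^(0) = alpha * I, alpha = 1000
lam, y_prev = 0.95, 0.0

for _ in range(500):
    u = rng.standard_normal()                  # white-noise excitation
    phi = np.array([y_prev, u])                # regressor [y_{k-1}, u_k]
    y = theta_true @ phi                       # noiseless system output
    theta, P, lam = rls_step(theta, P, lam, phi, y)
    y_prev = y

print(theta)                                   # ~ [0.8, 0.3]
```

Because the data here are noiseless, the estimate converges essentially exactly; with measurement noise the choice of $LaTeX: \lambda_{0}$ trades tracking speed against noise sensitivity.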

## 4 Notes of caution about Least Squares based techniques

Least squares techniques calculate coefficients for a polynomial approximation. All polynomials have tails; that is, the ends of a polynomial trend toward positive or negative infinity. A polynomial fit can therefore be good for interpolation but not for extrapolation. When dealing with measured data it is often difficult to determine the full breadth of the input data without extensive analysis. Often there is plenty of data near the upper and lower bounds but very little in between. System behavior can lead to transitions that provide only a handful of data points, and when compared against the rest of the data these transition points can become lost. The weighting of the coefficients tends to mirror the distribution of data points across particular regions. With multi-dimensional input data, visualizing this weighting behavior is difficult.

### 4.1 Rich Excitation

Least squares methods are polynomial fits of input parameters to output parameters. This requires that the relationship be linear in the coefficients. It also means that, for output estimation to be accurate, the test inputs must fall within the range of inputs used to create the polynomial coefficients. In other words, a polynomial of arbitrary order is constructed fitting input data to output data. The input data used has some range (say [-1, 1]). If this polynomial is used to estimate an output for an input outside that range (for example, 2) then the estimate will likely be inaccurate.

In 2 dimensions the reason for this is simple. All polynomials have tails that go to infinity. The polynomial created with least squares is usually stable (and reasonably accurate) within the range of input values used in its creation. Least squares polynomials are good for interpolation but become unstable when used for extrapolation.

The method described here is for use in control systems. The transfer function identified with least squares is then used to design the system controller. If the transfer function is inaccurate for some portion of the possible inputs, the system can go unstable.

In short, the concept of Rich Excitation is to 'ping' the black box system with all possible inputs. The output of these inputs provides a range of all possible outputs. Thus the least squares fit can work with the entire range of applicable values and all output estimates are interpolations rather than extrapolations.

The entire range of applicable input values must be considered carefully. It must include inputs across the entire magnitude range as well as inputs of all possible frequencies. Careful consideration of the input signal is necessary in order to obtain the proper input/output data set.

In practice the standard solution is to use white noise as the input, since white noise contains all frequencies at equal magnitude. Thus, given enough input samples, the system is tested across all physically possible inputs. For many real systems, however, rich excitation is not possible, which limits the ability of least squares methods to accurately predict the output.
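The effect of rich versus poor excitation can be sketched by comparing the rank of $LaTeX: A^{T}A$ for a constant input and a white-noise input (the 2-tap FIR regressor used here is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)

def regressors(u):
    # phi_k = [u_k, u_{k-1}]: a hypothetical 2-tap FIR model.
    return np.column_stack([u[1:], u[:-1]])

A_const = regressors(np.ones(200))                 # constant (poor) excitation
A_noise = regressors(rng.standard_normal(200))     # white-noise (rich) excitation

print(np.linalg.matrix_rank(A_const.T @ A_const))  # 1: coefficients not identifiable
print(np.linalg.matrix_rank(A_noise.T @ A_noise))  # 2: both coefficients identifiable
```

With the constant input both regressor columns are identical, so infinitely many coefficient pairs fit the data equally well; the white-noise input excites both taps independently.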

## 5 References

• Broussard, K. J. and Trahan, R. E. Jr., "Automatic Control System Failure Detection via Parameter Identification Techniques," IEEE Proceedings of Southeastcon, April 1991, pp. 176-180.
• Hsia, T. C., "System Identification (Least-Squares Methods)," D. C. Heath and Company, Lexington, MA, 1977. ISBN 0669996300
• Ljung, L. and Soderstrom, T., "Theory and Practice of Recursive Identification," 2nd Ed., MIT Press, Cambridge, MA, 1985. ISBN 0262620588
• Spradlin, Gabriel T., "An Exploration of Parameter Identification Techniques: CMG Temperature Prediction Theory and Results," Master's Thesis, University of Houston, Houston, TX, December 2005.

### 5.1 Notes

1. Broussard et al.
2. Hsia
3. Ljung et al.
4. Spradlin