Online Recursive Least Squares



1 Introduction to Online Recursive Least Squares

The following online recursive least squares derivation comes from class notes provided for Dr. Shieh's ECE 7334 Advanced Digital Control Systems at the University of Houston. Similar derivations are presented in [1], [2], and [3].

If you wish to skip directly to the update equations, see the Final Online Recursive Least Squares Equation Set section below.

2 Derivation of Online Recursive Least Squares[4]

The derivation of the online recursive least squares (RLS) method is as follows.

LaTeX: \hat{Y}_{p}=\begin{bmatrix}\hat{y}_{0} \\ \hat{y}_{1} \\ \vdots \\ \hat{y}_{p-1}\end{bmatrix} \in R^{p \times 1}, 'Actual System Outputs'



LaTeX: A_{p}=\begin{bmatrix}\phi_{0}^{T} \\ \phi_{1}^{T} \\ \vdots \\ \phi_{p-1}^{T}\end{bmatrix} \in R^{p \times \left(n+m+1\right)}, 'System Inputs'



LaTeX: \theta_{p}=\begin{bmatrix}b_{0} \\ b_{1} \\ \vdots \\ b_{m} \\ a_{1} \\ a_{2} \\ \vdots \\ a_{n}\end{bmatrix} \in R^{\left(n+m+1\right) \times 1}, 'System Model Coefficients'


and

LaTeX: N_{p}=\begin{bmatrix}\nu_{0} \\ \nu_{1} \\ \vdots \\ \nu_{p-1}\end{bmatrix} \in R^{p \times 1}. 'Noise or Error'


The data set

LaTeX: \hat{Y}_{p}=A_{p}\theta_{p}+N_{p}


leads to the optimal parameter equation

LaTeX: \hat{\theta}_{p}=\left(A_{p}^{T}\bar{W}_{p}A_{p}\right)^{-1}A_{p}^{T}\bar{W}_{p}\hat{Y}_{p},


Note that this is just LaTeX: \hat{\theta}_{p}=A_{p}^{-1}\hat{Y}_{p} where, by definition of optimality, the error LaTeX: N_{p} is minimized and where

LaTeX: \bar{W}_{p} is the weighting matrix, and
LaTeX: A_{p}^{-1} is replaced by LaTeX: A_{p}^{+}=\left(A_{p}^{T}A_{p}\right)^{-1}A_{p}^{T}, the pseudo-inverse.
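
Before moving to the recursive form, the batch solution above can be sketched directly. The following is a minimal NumPy sketch; the regressor layout (rows of A_p built as phi_k = [u_k, u_{k-1}, y_{k-1}]) and the simulated model are illustrative assumptions, since the article does not fix a particular regressor structure.

import numpy as np

def batch_weighted_ls(A, Y, w):
    """theta = (A^T W A)^{-1} A^T W Y for a diagonal weighting matrix W = diag(w)."""
    AtW = A.T * w                                  # same as A.T @ np.diag(w)
    return np.linalg.solve(AtW @ A, AtW @ Y)       # solve rather than invert explicitly

# Simulate a hypothetical model y_k = 0.5*u_k + 0.3*u_{k-1} + 0.8*y_{k-1} + noise
rng = np.random.default_rng(0)
N = 500
u = rng.standard_normal(N)
y = np.zeros(N)
for k in range(1, N):
    y[k] = 0.5 * u[k] + 0.3 * u[k - 1] + 0.8 * y[k - 1] + 0.01 * rng.standard_normal()

A = np.column_stack([u[1:], u[:-1], y[:-1]])       # rows are the regressors phi_k^T
Y = y[1:]
w = np.ones(len(Y))                                # equal weighting
print(batch_weighted_ls(A, Y, w))                  # approximately [0.5, 0.3, 0.8]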

This is a recursive algorithm, which means that data points are added sequentially. To add a new data point, use

LaTeX: \hat{Y}_{p+1}=\begin{bmatrix}\hat{Y}_{p} \\ \hat{y}_{p+1}\end{bmatrix}, A_{p+1}=\begin{bmatrix}A_{p} \\ \phi_{p+1}^{T}\end{bmatrix},


where

LaTeX: \left(\hat{y}_{p+1}, \phi_{p+1}^{T}\right)


is the new data pair, LaTeX: \hat{\theta}_{p} is the old set of model coefficients, and LaTeX: \phi_{p+1}^{T}\hat{\theta}_{p} is the prediction of the new output based on the old coefficients. The updated weighting matrix is

LaTeX: \bar{W}_{p+1}=\begin{bmatrix}\bar{W}_{p} & 0 \\ 0 & w_{p+1}\end{bmatrix}


Use

LaTeX: \hat{\theta}_{p+1}=\left(A_{p+1}^{T}\bar{W}_{p+1}A_{p+1}\right)^{-1}A_{p+1}^{T}\bar{W}_{p+1}\hat{Y}_{p+1}



LaTeX: \hat{\theta}_{p+1}=\left( \begin{bmatrix} A_{p}^{T} & \phi_{p+1} \end{bmatrix} \begin{bmatrix} \bar{W}_{p} & 0 \\ 0 & w_{p+1} \end{bmatrix} \begin{bmatrix} A_{p} \\ \phi_{p+1}^{T} \end{bmatrix} \right)^{-1} \begin{bmatrix} A_{p}^{T} & \phi_{p+1} \end{bmatrix} \begin{bmatrix} \bar{W}_{p} & 0 \\ 0 & w_{p+1} \end{bmatrix} \begin{bmatrix} \hat{Y}_{p} \\ \hat{y}_{p+1} \end{bmatrix}



LaTeX: \hat{\theta}_{p+1}=\begin{bmatrix} A_{p}^{T}\bar{W}_{p}A_{p}+\phi_{p+1}w_{p+1}\phi_{p+1}^{T}\end{bmatrix}^{-1}\begin{bmatrix} A_{p}^{T}\bar{W}_{p}\hat{Y}_{p}+\phi_{p+1}w_{p+1}\hat{y}_{p+1}\end{bmatrix} Eqn. 1


to find the optimal model coefficients.

To find the relationship between the old coefficients LaTeX: \hat{\theta}_{p} and the new LaTeX: \hat{\theta}_{p+1} use

LaTeX: \hat{\theta}_{p}=P_{p}A_{p}^{T}\bar{W}_{p}\hat{Y}_{p} \Rightarrow P_{p}^{-1}\hat{\theta}_{p}=A_{p}^{T}\bar{W}_{p}\hat{Y}_{p}


where LaTeX: P_{p} is defined as

LaTeX: P_{p} \equiv \left(A_{p}^{T}\bar{W}_{p}A_{p}\right)^{-1} Eqn. 2


LaTeX: P_{p} is a symmetric, positive-definite matrix (LaTeX: P_{p}>0) and the updated value is

LaTeX: P_{p+1} \equiv \left(A_{p+1}^{T}\bar{W}_{p+1}A_{p+1}\right)^{-1}=\begin{bmatrix}A_{p}^{T}\bar{W}_{p}A_{p}+\phi_{p+1}w_{p+1}\phi_{p+1}^{T}\end{bmatrix}^{-1}>0 Eqn. 3


Eqn. 3 leads to

LaTeX: P_{p+1}^{-1}=A_{p+1}^{T}\bar{W}_{p+1}A_{p+1}



LaTeX: P_{p+1}^{-1}=\begin{bmatrix}A_{p}^{T}\bar{W}_{p}A_{p}+\phi_{p+1}w_{p+1}\phi_{p+1}^{T}\end{bmatrix}


and Eqn. 2 to

LaTeX: P_{p+1}^{-1}=\begin{bmatrix}P_{p}^{-1}+\phi_{p+1}w_{p+1}\phi_{p+1}^{T}\end{bmatrix}


and finally to

LaTeX: P_{p}^{-1}=P_{p+1}^{-1}-\phi_{p+1}w_{p+1}\phi_{p+1}^{T}


Combine Eqns. 1 and 3 to obtain the equation

LaTeX: \hat{\theta}_{p+1}=P_{p+1}\left(A_{p}^{T}\bar{W}_{p}\hat{Y}_{p}+\phi_{p+1}w_{p+1}\hat{y}_{p+1}\right)



LaTeX: \hat{\theta}_{p+1}=P_{p+1} \left( P_{p}^{-1} \hat{\theta}_{p}+\phi_{p+1} w_{p+1} \hat{y}_{p+1} \right)



LaTeX: \hat{\theta}_{p+1}=P_{p+1}\left[\left(P_{p}^{-1}-\phi_{p+1}w_{p+1}\phi_{p+1}^{T}\right)\hat{\theta}_{p}+\phi_{p+1}w_{p+1}\hat{y}_{p+1}\right]


which leads to the new model coefficients in terms of P (all the previous input and output data vectors) and the current data point. The equation for the new model coefficients can be rearranged into this form

LaTeX: \hat{\theta}_{p+1}=\hat{\theta}_{p}+P_{p+1}\phi_{p+1}w_{p+1}\left(\hat{y}_{p+1}-\phi_{p+1}^{T}\hat{\theta}_{p}\right)


where

LaTeX: \hat{\theta}_{p} is the previous set of model coefficients and
LaTeX: \left(\hat{y}_{p+1}-\phi_{p+1}^{T}\hat{\theta}_{p}\right) is the error between the actual system output and the predicted system output.

This can be written more compactly as

LaTeX: \hat{\theta}_{p+1}=\hat{\theta}_{p}+K_{p+1}\left(\hat{y}_{p+1}-\phi_{p+1}^{T}\hat{\theta}_{p}\right)


where

LaTeX: K_{p+1} \equiv P_{p+1}\phi_{p+1} w_{p+1}


is the discrete Kalman gain matrix.

The online recursive least squares algorithm derived to this point is slow and therefore inappropriate for online use. Specifically, matrix inversions should be avoided. The matrix inverse can be avoided through use of the matrix inversion lemma

LaTeX: \left(A+BCD\right)^{-1}=A^{-1}-A^{-1}B\left(DA^{-1}B+C^{-1}\right)^{-1}DA^{-1}, Inverse Lemma


where LaTeX: P_{p}^{-1}=A, \phi_{p+1}=B, w_{p+1}=C, and LaTeX: \phi_{p+1}^{T}=D. The inverse lemma leads to the following equations

LaTeX: P_{p+1}^{(i+1)}=P_{p}^{(i)}-K_{p+1}^{(i+1)}\left(P_{p}^{(i)}\phi_{p+1}^{(i+1)}\right)^{T}>0


where

LaTeX: K_{p+1}^{(i+1)}=\frac{P_{p}^{(i)} \phi_{p+1}^{(i+1)}}{\left(\frac{1}{w_{p+1}^{(i+1)}}+\phi_{p+1}^{T(i+1)} P_{p}^{(i)} \phi_{p+1}^{(i+1)}\right)}


and finally

LaTeX: \hat{\theta}_{p+1}^{(i+1)}=\hat{\theta}_{p}^{(i)}+K_{p+1}^{(i+1)}e_{p+1}^{(i+1)}


The variable i represents the iteration number and the innovation sequence is

LaTeX: e_{p+1}^{(i+1)}=\hat{y}_{p+1}^{(i+1)}-\phi_{p+1}^{T(i+1)}\hat{\theta}_{p}^{(i)}
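
Collected into code, one pass through these update equations might look like the following minimal NumPy sketch. The scalar weight w, the three-parameter regressor, and the simulated data are illustrative assumptions, and the iteration superscripts are dropped since each data point is processed once.

import numpy as np

def rls_step(theta, P, phi, y, w=1.0):
    """One recursive least-squares update for the new data pair (y, phi)."""
    Pphi = P @ phi
    K = Pphi / (1.0 / w + phi @ Pphi)   # gain K_{p+1}
    e = y - phi @ theta                 # innovation e_{p+1}
    theta = theta + K * e               # coefficient update
    P = P - np.outer(K, Pphi)           # P_{p+1} = P_p - K_{p+1} (P_p phi_{p+1})^T
    return theta, P

# Hypothetical model y_k = 0.5*u_k + 0.3*u_{k-1} + 0.8*y_{k-1} + noise
rng = np.random.default_rng(0)
N = 500
u = rng.standard_normal(N)
y = np.zeros(N)
for k in range(1, N):
    y[k] = 0.5 * u[k] + 0.3 * u[k - 1] + 0.8 * y[k - 1] + 0.01 * rng.standard_normal()

theta = np.zeros(3)                     # initial coefficient estimate
P = 1000.0 * np.eye(3)                  # large initial P
for k in range(1, N):
    theta, P = rls_step(theta, P, np.array([u[k], u[k - 1], y[k - 1]]), y[k])
print(theta)                            # approximately [0.5, 0.3, 0.8]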



While the resulting algorithm avoids any matrix inverses, it is not suitable for rapidly time-varying systems. This is because, after many samples, the algorithm effectively stops updating at steady state. The algorithm cannot update quickly because large amounts of data are already incorporated. In other words, after 10,000 data points a new data point is effectively scaled by 1/10,001 versus the combined effect of the prior points (10,000/10,001).

To improve the algorithm's performance a forgetting factor is added to

LaTeX: P_{p+1}^{-1}=A_{p+1}^{T}\bar{W}_{p+1}A_{p+1}=A_{p}^{T}\bar{W}_{p}A_{p}+\phi_{p+1}w_{p+1}\phi_{p+1}^{T}


which leads to

LaTeX: P_{p+1}^{-1}=A_{p}^{T}\left(\lambda_{p+1}\bar{W}_p\right)A_{p}+\phi_{p+1}\left(1\right) w_{p+1}\phi_{p+1}^{T}


where

LaTeX: \lambda_{p+1}<1 is the forgetting factor applied to the old data, and the factor LaTeX: \left(1\right) indicates that the newest data point is given full weight.

If all LaTeX: P_{p+1} are replaced with LaTeX: \lambda_{p+1}P_{p+1} and LaTeX: w_{p+1} with LaTeX: \frac{w_{p+1}}{\lambda_{p+1}}, then the final form of the algorithm is achieved. The equations are

LaTeX: \lambda_{p+1}^{(i+1)}=\lambda_{p}^{(i)}\lambda_{0}+\left(1-\lambda_{0}\right)


where LaTeX: \lambda_{0}<1

LaTeX: K_{p+1}^{(i+1)}=\frac{P_{p}^{(i)}\phi_{p+1}^{(i+1)}}{\left(\frac{\lambda_{p+1}}{w_{p+1}}\right)+\phi_{p+1}^{T(i+1)}P_{p}^{(i)}\phi_{p+1}^{(i+1)}}



LaTeX: P_{p+1}^{(i+1)}=\frac{P_{p}^{(i)}-K_{p+1}^{(i+1)}\left(P_{p}^{(i)}\phi_{p+1}^{(i+1)}\right)^{T}}{\lambda_{p+1}^{(i+1)}}



LaTeX: e_{p+1}^{(i+1)}=\hat{y}_{p+1}^{(i+1)}-\phi_{p+1}^{T(i+1)}\hat{\theta}_{p}^{(i)}



LaTeX: \hat{\theta}_{p+1}^{(i+1)}=\hat{\theta}_{p}^{(i)}+K_{p+1}^{(i+1)}e_{p+1}^{(i+1)}


with initial conditions

LaTeX: \hat{\theta}_{p}^{(0)}=0_{\left(n+m+1\right) \times 1}


and

LaTeX: P_{p}^{(0)}=\alpha I_{n+m+1}


where

LaTeX: \alpha=1000 is a good starting value and
LaTeX: I_{n+m+1} is the identity matrix of size (n+m+1) x (n+m+1).

3 Final Online Recursive Least Squares Equation Set

These are the final online recursive least squares update equations in order.

LaTeX: \lambda_{p+1}^{(i+1)}=\lambda_{p}^{(i)}\lambda_{0}+\left(1-\lambda_{0}\right)



LaTeX: K_{p+1}^{(i+1)}=\frac{P_{p}^{(i)}\phi_{p+1}^{(i+1)}}{\left(\frac{\lambda_{p+1}}{w_{p+1}}\right)+\phi_{p+1}^{T(i+1)}P_{p}^{(i)}\phi_{p+1}^{(i+1)}}



LaTeX: P_{p+1}^{(i+1)}=\frac{P_{p}^{(i)}-K_{p+1}^{(i+1)}\left(P_{p}^{(i)}\phi_{p+1}^{(i+1)}\right)^{T}}{\lambda_{p+1}^{(i+1)}}



LaTeX: e_{p+1}^{(i+1)}=\hat{y}_{p+1}^{(i+1)}-\phi_{p+1}^{T(i+1)}\hat{\theta}_{p}^{(i)}



LaTeX: \hat{\theta}_{p+1}^{(i+1)}=\hat{\theta}_{p}^{(i)}+K_{p+1}^{(i+1)}e_{p+1}^{(i+1)}


with initial conditions

LaTeX: \hat{\theta}_{p}^{(0)}=0_{\left(n+m+1\right) \times 1}


and

LaTeX: P_{p}^{(0)}=\alpha I_{n+m+1}


where

LaTeX: \lambda_{0}<1
LaTeX: \alpha=1000 is a good starting value and
LaTeX: I_{n+m+1} is the identity matrix of size (n+m+1) x (n+m+1).
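
The following is a minimal NumPy sketch of this final equation set. The regressor layout, the choice w = 1, the value lambda_0 = 0.95, and the starting value of the forgetting factor (taken equal to lambda_0, which is not specified above) are illustrative assumptions.

import numpy as np

class OnlineRLS:
    """Online recursive least squares with a variable forgetting factor."""

    def __init__(self, n_params, alpha=1000.0, lam0=0.95):
        self.theta = np.zeros(n_params)      # initial coefficients = 0
        self.P = alpha * np.eye(n_params)    # initial P = alpha * I
        self.lam0 = lam0                     # lambda_0 < 1
        self.lam = lam0                      # assumed starting forgetting factor

    def update(self, phi, y, w=1.0):
        self.lam = self.lam0 * self.lam + (1.0 - self.lam0)  # forgetting factor update
        Pphi = self.P @ phi
        K = Pphi / (self.lam / w + phi @ Pphi)                # gain
        e = y - phi @ self.theta                              # innovation
        self.theta = self.theta + K * e                       # coefficient update
        self.P = (self.P - np.outer(K, Pphi)) / self.lam      # covariance update
        return self.theta

# Identify the same hypothetical model used in the earlier sketches.
rng = np.random.default_rng(1)
est = OnlineRLS(n_params=3)
N = 500
u = rng.standard_normal(N)
y = np.zeros(N)
for k in range(1, N):
    y[k] = 0.5 * u[k] + 0.3 * u[k - 1] + 0.8 * y[k - 1] + 0.01 * rng.standard_normal()
    est.update(np.array([u[k], u[k - 1], y[k - 1]]), y[k])
print(est.theta)                        # approximately [0.5, 0.3, 0.8]

Note that the forgetting-factor recursion drives lambda toward 1, so the estimator discounts old data aggressively at the start of the run and behaves like ordinary recursive least squares once lambda has converged.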

4 Notes of caution about Least Squares based techniques

Least squares techniques calculate coefficients for a polynomial approximation. All polynomials have tails; that is, the ends of a polynomial trend toward positive or negative infinity. Therefore a polynomial fit can be good for interpolation but not for extrapolation. When dealing with measured data it is often difficult to determine the full breadth of the input data without extensive analysis. Often there is plenty of data toward the upper or lower bounds but very little data in between. System behavior can lead to transitions that provide only a handful of data points, and when compared to the rest of the data these few transition points can become lost. The weighting of the coefficients will tend to mirror the distribution of data points across particular regions. With multi-dimensional input data, visualizing this weighting behavior is difficult.

4.1 Rich Excitation

Least squares methods are polynomial fits of input parameters to output parameters. This requires that the input-to-output relationship be linear in the fitted coefficients. It also means that, for output estimation to be accurate, the test inputs must fall within the range of inputs used to create the polynomial coefficients. In other words, a polynomial of arbitrary order is constructed by fitting input data to output data. The input data used has some range (use [-1 1] for example). If this polynomial is used to estimate an output due to an input outside of that range (for example, 2) then the estimate will likely be inaccurate.

In 2 dimensions the reason for this is simple. All polynomials have tails that go to infinity. The polynomial created with least squares is usually stable (and reasonably accurate) within the range of input values used in its creation. Least squares polynomials are good for interpolation but become unreliable when used for extrapolation.
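
A small numeric illustration of this point, assuming NumPy's polyfit and an arbitrary 5th-order fit to noisy sine data on the interval [-1, 1]:

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = np.sin(2.0 * x) + 0.01 * rng.standard_normal(x.size)

coeffs = np.polyfit(x, y, deg=5)       # least-squares polynomial fit

for x_test in (0.5, 2.0):              # inside vs. outside the fitted range
    print(x_test, np.sin(2.0 * x_test), np.polyval(coeffs, x_test))
# Inside the range the fit matches the underlying function closely; at
# x = 2.0 the polynomial tail takes over and the estimate degrades badly.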

The method described is for use in control systems. The transfer function derived with least squares is then used to design the system controller. If the transfer function is inaccurate for some portion of the available inputs, the system can go unstable.

In short, the concept of Rich Excitation is to 'ping' the black box system with all possible inputs. The output of these inputs provides a range of all possible outputs. Thus the least squares fit can work with the entire range of applicable values and all output estimates are interpolations rather than extrapolations.

The entire range of applicable input values must be considered carefully. It must include inputs across the entire magnitude range as well as inputs of all possible frequencies. Careful consideration of the input signal is necessary in order to obtain the proper input/output data set.

In practice the standard solution is to use white noise as the input. White noise contains all frequencies at equal magnitude. Thus, given enough input samples, the system is tested across all physically possible inputs. For many real systems, however, Rich Excitation is not possible. This limits the ability of least squares methods to accurately predict the output.
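
The effect of excitation can be illustrated with the hypothetical model used in the earlier sketches: the same system is driven once with white noise and once with a single sinusoid, and the conditioning of the resulting regressor matrix is compared. The signal lengths and frequencies are arbitrary choices for the example.

import numpy as np

def simulate(u, rng):
    y = np.zeros(u.size)
    for k in range(1, u.size):
        y[k] = 0.5 * u[k] + 0.3 * u[k - 1] + 0.8 * y[k - 1] + 0.01 * rng.standard_normal()
    return y

def regressor(u, y):
    return np.column_stack([u[1:], u[:-1], y[:-1]])   # rows are phi_k^T

rng = np.random.default_rng(0)
k = np.arange(2000)
u_white = rng.standard_normal(k.size)   # broadband (richly exciting) input
u_sine = np.sin(0.3 * k)                # single-frequency input

for name, u in (("white noise", u_white), ("sinusoid", u_sine)):
    A = regressor(u, simulate(u, rng))
    print(name, "regressor condition number:", np.linalg.cond(A))
# The sinusoid-driven regressor matrix is far worse conditioned, so a least
# squares fit to that data cannot reliably separate all three coefficients.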

5 See Also

6 References

  • Broussard, K. J. and Trahan, R. E. Jr., "Automatic Control System Failure Detection via Parameter Identification Techniques," IEEE Proceedings of Southeastcon, April 1991, pp. 176-180.
  • Hsia, T. C., "System Identification (Least-Squares Methods)," 1977 D. C. Heath and Company, Lexington, MA. ISBN 0669996300
  • Ljung, L. and Soderstrom, T., "Theory and Practice of Recursive Identification," 2nd Ed. 1985 MIT Press, Cambridge, MA. ISBN 0262620588
  • Spradlin, Gabriel T., "An Exploration of Parameter Identification Techniques: CMG Temperature Prediction Theory and Results," Master's Thesis, University of Houston, Houston, TX, December 2005.

6.1 Notes

  1. Broussard and Trahan
  2. Hsia
  3. Ljung and Soderstrom
  4. Spradlin, pp. 82-87