Math.NET Numerics
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
 
 
 

265 lines
9.3 KiB

(*** hide ***)
#I "../../out/lib/net40"
#r "MathNet.Numerics.dll"
#r "MathNet.Numerics.FSharp.dll"
open MathNet.Numerics
open MathNet.Numerics.LinearRegression
open MathNet.Numerics.LinearAlgebra
(**
Curve Fitting: Linear Regression
================================
Regression is all about fitting a low order parametric model or curve to data, so we can
reason about it or make predictions on points not covered by the data. Both data and
model are known, but we'd like to find the model parameters that make the model fit best
or good enough to the data according to some metric.
We may also be interested in how well the model supports the data or whether we better
look for another more appropriate model.
In a regression, a lot of data is reduced and generalized into a few parameters.
The resulting model can obviously no longer reproduce all the original data exactly -
if you need the data to be reproduced exactly, have a look at interpolation instead.
Simple Regression: Fit to a Line
--------------------------------
In the simplest yet still common form of regression we would like to fit a line
$y : x \mapsto a + b x$ to a set of points $(x_j,y_j)$, where $x_j$ and $y_j$ are scalars.
Assuming we have two double arrays for x and y, we can use `Fit.Line` to evaluate the $a$ and $b$
parameters of the least squares fit:
[lang=csharp]
double[] xdata = new double[] { 10, 20, 30 };
double[] ydata = new double[] { 15, 20, 25 };
Tuple<double, double> p = Fit.Line(xdata, ydata);
double a = p.Item1; // == 10; intercept
double b = p.Item2; // == 0.5; slope
Or in F#:
*)
let a, b = Fit.Line ([|10.0;20.0;30.0|], [|15.0;20.0;25.0|])
(**
How well do these parameters fit the data? The data points happen to be positioned
exactly on a line. Indeed, the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination)
confirms the perfect fit:
[lang=csharp]
GoodnessOfFit.RSquared(xdata.Select(x => a+b*x), ydata); // == 1.0
Linear Model
------------
In practice, a line is often not an adequate model. But if we can choose a model that is linear,
we can leverage the power of linear algebra; otherwise we have to resort to iterative methods
(see Nonlinear Optimization).
A linear model can be described as linear combination of $N$ arbitrary but known
functions $f_i(x)$, scaled by the model parameters $p_i$. Note that none of the functions
$f_i$ depends on any of the $p_i$ parameters.
$$$
y : x \mapsto p_1 f_1(x) + p_2 f_2(x) + \cdots + p_N f_N(x)
If we have $M$ data points $(x_j,y_j)$, then we can write the regression problem as an
overdefined system of $M$ equations:
$$$
\begin{eqnarray}
y_1 &=& p_1 f_1(x_1) + p_2 f_2(x_1) + \cdots + p_N f_N(x_1) \\
y_2 &=& p_1 f_1(x_2) + p_2 f_2(x_2) + \cdots + p_N f_N(x_2) \\
&\vdots& \\
y_M &=& p_1 f_1(x_M) + p_2 f_2(x_M) + \cdots + p_N f_N(x_M)
\end{eqnarray}
Or in matrix notation with the predictor matrix $X$ and the response $y$:
$$$
\begin{eqnarray}
\mathbf y &=& \mathbf X \mathbf p \\
\begin{bmatrix}y_1\\y_2\\ \vdots \\y_M\end{bmatrix} &=&
\begin{bmatrix}f_1(x_1) & f_2(x_1) & \cdots & f_N(x_1)\\f_1(x_2) & f_2(x_2) & \cdots & f_N(x_2)\\ \vdots & \vdots & \ddots & \vdots\\f_1(x_M) & f_2(x_M) & \cdots & f_N(x_M)\end{bmatrix}
\begin{bmatrix}p_1\\p_2\\ \vdots \\p_N\end{bmatrix}
\end{eqnarray}
Provided the dataset is small enough, if transformed to the normal equation
$\mathbf{X}^T\mathbf y = \mathbf{X}^T\mathbf X \mathbf p$ this can be solved efficiently by the
Cholesky decomposition (do not use matrix inversion!).
[lang=csharp]
Vector<double> p = MultipleRegression.NormalEquations(X, y);
Using normal equations is comparably fast as it can dramatically reduce the linear algebra problem
to be solved, but that comes at the cost of less precision. If you need more precision, try using
`MultipleRegression.QR` or `MultipleRegression.Svd` instead, with the same arguments.
Polynomial Regression
---------------------
To fit to a polynomial we can choose the following linear model with $f_i(x) := x^i$:
$$$
y : x \mapsto p_0 + p_1 x + p_2 x^2 + \cdots + p_N x^N
The predictor matrix of this model is the [Vandermonde matrix](http://en.wikipedia.org/wiki/Vandermonde_matrix).
There is a special function in the `Fit` class for regressions to a polynomial,
but note that regression to high order polynomials is numerically problematic.
[lang=csharp]
double[] p = Fit.Polynomial(xdata, ydata, 3); // polynomial of order 3
Multiple Regression
-------------------
The $x$ in the linear model can also be a vector $\mathbf x = [x^{(1)}\; x^{(2)} \cdots x^{(k)}]$
and the arbitrary functions $f_i(\mathbf x)$ can accept vectors instead of scalars.
If we use $f_i(\mathbf x) := x^{(i)}$ and add an intercept term $f_0(\mathbf x) := 1$
we end up at the simplest form of ordinary multiple regression:
$$$
y : x \mapsto p_0 + p_1 x^{(1)} + p_2 x^{(2)} + \cdots + p_N x^{(N)}
For example, for the data points $(\mathbf{x}_j = [x^{(1)}_j\; x^{(2)}_j], y_j)$ with values
`([1,4],15)`, `([2,5],20)` and `([3,2],10)` we can evaluate the best fitting parameters with:
[lang=csharp]
double[] p = Fit.MultiDim(
new[] {new[] { 1.0, 4.0 }, new[] { 2.0, 5.0 }, new[] { 3.0, 2.0 }},
new[] { 15.0, 20, 10 },
intercept: true);
The `Fit.MultiDim` routine uses normal equations, but you can always choose to explicitly use e.g.
the QR decomposition for more precision by using the `MultipleRegression` class directly:
[lang=csharp]
double[] p = MultipleRegression.QR(
new[] {new[] { 1.0, 4.0 }, new[] { 2.0, 5.0 }, new[] { 3.0, 2.0 }},
new[] { 15.0, 20, 10 },
intercept: true);
Arbitrary Linear Combination
----------------------------
In multiple regression, the functions $f_i(\mathbf x)$ can also operate on the whole
vector or mix its components arbitrarily and apply any functions on them, provided they are
defined at all the data points. For example, let's have a look at the following complicated but still linear
model in two dimensions:
$$$
z : (x, y) \mapsto p_0 + p_1 \mathrm{tanh}(x) + p_2 \psi(x y) + p_3 x^y
Since we map (x,y) to (z) we need to organize the tuples in two arrays:
[lang=csharp]
double[][] xy = new[] { new[]{x1,y1}, new[]{x2,y2}, new[]{x3,y3}, ... };
double[] z = new[] { z1, z2, z3, ... };
Then we can call Fit.LinearMultiDim with our model, which will return an array with the best fitting 4 parameters $p_0, p_1, p_2, p_3$:
[lang=csharp]
double[] p = Fit.LinearMultiDim(xy, z,
d => 1.0, // p0*1.0
d => Math.Tanh(d[0]), // p1*tanh(x)
d => SpecialFunctions.DiGamma(d[0]*d[1]), // p2*psi(x*y)
d => Math.Pow(d[0], d[1])); // p3*x^y
Evaluating the model at specific data points
--------------------------------------------
Let's say we have the following model:
$$$
y : x \mapsto a + b \ln x
For this case we can use the `Fit.LinearCombination` function:
[lang=csharp]
double[] p = Fit.LinearCombination(
new[] {61.0, 62.0, 63.0, 65.0},
new[] {3.6,3.8, 4.8, 4.1},
x => 1.0,
x => Math.Log(x)); // -34.481, 9.316
In order to evaluate the resulting model at specific data points we can manually apply
the values of p to the model function, or we can use an alternative function with the `Func`
suffix that returns a function instead of the model parameters. The returned function
can then be used to evaluate the parametrized model:
[lang=csharp]
Func<double,double> f = Fit.LinearCombinationFunc(
new[] {61.0, 62.0, 63.0, 65.0},
new[] {3.6, 3.8, 4.8, 4.1},
x => 1.0,
x => Math.Log(x));
f(66.0); // 4.548
Linearizing non-linear models by transformation
-----------------------------------------------
Sometimes it is possible to transform a non-linear model into a linear one.
For example, the following power function
$$$
z : (x, y) \mapsto u x^v y^w
can be transformed into the following linear model with $\hat{z} = \ln z$ and $t = \ln u$
$$$
\hat{z} : (x, y) \mapsto t + v \ln x + w \ln y
[lang=csharp]
var xy = new[] {new[] { 1.0, 4.0 }, new[] { 2.0, 5.0 }, new[] { 3.0, 2.0 }};
var z = new[] { 15.0, 20, 10 };
var z_hat = z.Select(r => Math.Log(r)).ToArray(); // transform z_hat = ln(z)
double[] p_hat = Fit.LinearMultiDim(xy, z_hat,
d => 1.0,
d => Math.Log(d[0]),
d => Math.Log(d[1]));
double u = Math.Exp(p_hat[0]); // transform t = ln(u)
double v = p_hat[1];
double w = p_hat[2];
Weighted Regression
-------------------
Sometimes the regression error can be reduced by dampening specific data points.
We can achieve this by introducing a weight matrix $W$ into the normal equations
$\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{X}\mathbf{p}$. Such weight matrices
are often diagonal, with a separate weight for each data point on the diagonal.
$$$
\mathbf{X}^T\mathbf{W}\mathbf{y} = \mathbf{X}^T\mathbf{W}\mathbf{X}\mathbf{p}
[lang=csharp]
var p = WeightedRegression.Weighted(X,y,W);
Weighter regression becomes interesting if we can adapt them to the point of interest
and e.g. dampen all data points far away. Unfortunately this way the model parameters
are dependent on the point of interest $t$.
[lang=csharp]
// warning: preliminary api
var p = WeightedRegression.Local(X,y,t,radius,kernel);
Regularization
--------------
Iterative Methods
-----------------
*)