Welcome to the documentation of multilstsq!¶
Least squares fitting is a underlying method for numerous applications, the most common one being linear regression. It consists in finding the parameters vector β°
which minimizes ‖ε‖₂
in the equation y = Xβ + ε
, where X
is the design matrix, y
the observation vector, and ε the error vector.
Since it is a fundamental algorithm, a number of Python 3 implementations exist, with different feature sets and performance, such as: numpy.linalg.lstsq
, scipy.stats.linregress
, sklearn.linear_model.LinearRegression
and statsmodel.OLS
.
However, the current available libraries are not designed to work on a large quantity of simultaneous problems, for example solving a least square problem for each pixel of an image. Iterating over a large number of small problems is inefficient. Moreover, when doing linear regression, it is often tedious to build the design matrix X
.
The goal of multilstsq
is to work on arrays of problems, with good performance, low memory requirements, reliability and flexibility. It also provides a way to automate the construction of the relevant structures (mostly the design matrix), using a model given as a string. It however does not strive to be a complete statistical library such as what would be provided by statsmodel
or the language R
.
To reach these goals, multilstsq
uses the following techniques:
- It is possible to compute
β°=(XᵀX)⁻¹Xᵀy
incrementally, due to the linearity ofXᵀX
andXᵀy
, by providing data in chunks. - Inverting
XᵀX
is done by explicit formulas when the dimension is small. This has the advantage of being vector operations which can be applied simultaneously on all problems. - Masked data are handled as lines of zeros in the design matrix and the observation, which in fact have no effect. This allows adding different amount of data in different subproblems.
- For regression, an expression evaluator is implemented, which converts the input model from the user (for example
b0+b1*x0
) into the complex expression needed to build the design matrix from the vectorX
provided by the user. In that example, it is:np.concatenate([np.ones(o_dim)[(..., np.newaxis)], ((X)[..., :, 0])[(..., np.newaxis)]])
. This expression evaluator also may be useful for other purposes in other libraries.
Example use¶
import numpy as np
from multilstsq import MultiRegression
x1 = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65, 1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
y1 = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29, 63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])
x2 = np.arange(10)
y2 = np.arange(10)
X = np.ma.masked_all((2, max(len(x1), len(x2)), 1))
y = np.ma.masked_all((2, max(len(x1), len(x2)), 1))
X[0, :len(x1), 0] = x1
X[1, :len(x2), 0] = x2
y[0, :len(y1), 0] = y1
y[1, :len(y2), 0] = y2
mr = MultiRegression((2,), 'b0 + b1*x0 + b2*(x0**2)')
mr.add_data(X,y)
mr.switch_to_variance()
mr.add_data(X,y)
print(mr.beta)
#Identify parameter names in the parameter vector
print(mr.beta_names)
#Get the covariance matrix for the first problem
print(mr.variance[0])
#Get the expression to predict for the first problem
expr = mr.get_expr_for_idx((0,))
#Evaluate at x=1.79
print(expr(1.79))
The nice thing about this module is that it is possible to change the model by only changing the line instanciating the MultiRegression object. For example, for a quadratic regression:
mr = MultiRegression((2,), 'b0 + b1*x0 + b2*(x0**2)')
Documentation contents¶
Installation¶
multilstsq
can be installed either using pip
, or from source.
Since multilstsq
requires Python 3
, you should ensure that you’re using the correct pip
executable. On some distributions, the executable is named pip3
.
It is possible to check the Python
version of pip
using:
pip -V
Similarly, you may need to use py.test-3
instead of py.test
.
Manual installation from source¶
To install manually, run the following:
git clone https://github.com/UniNE-CHYN/multilstsq
cd multilstsq
pip install .
To contribute, use pip install -e
instead of pip install
. This sets a link to the source folder in the python installation, instead of copying the files.
It is advisable to run the tests to ensure that everything is working. This can be done by running the following command in the multilstsq
directory:
py.test
Tutorial¶
This page contains some information about how to start if you want to use multilstsq
in your project.
Three different use cases are described in this documentation:
- You want to solve
y = Xβ
in the least squares sense,y
andX
are already available. - You want to solve
y = f(x, β)
in the least squares sense (β
is the unknown), wheref
is linear inβ
, but not necessarly inx
. This is the linear regression case. UsingMultiregression
, it is possible to test multiple differentf
, without changing anything else, so it’s great for evaluating multiple models. - You only want to use the expression evaluator to add flexibility to some algorithm of your creation. This case is described in its own section: Using ExprEvaluator.
For the first two cases, we use the following terminology: X
is the design matrix, y
the observation vector, and β the parameter vector.
Since multilstsq
is usually used to solve multiple problems simultaneously, we define as problem dimension the size of the n-dimensional grid of problems.
The number of problem dimension doesn’t have much influence on the way the computation are done, but it makes sense to use dimensions that corresponds to the problem.
For instance, if a characteristics is required for each element of a 20×20 grid, it makes sense to define the problem dimension as (20, 20)
, even though it
would be possible to map the problem to a (400, )
or a (10, 10, 4)
problem dimension.
Solving a least squares problem¶
First, lets see how to solve a least squares problem.
To keep the example small, let consider problem_dimension = (2, 3)
, and the model to be y = β₀ + β₁x
.
We have two parameters, and the design matrix consists in rows [1 x]
.
Let’s see how to do such a regression using first only the class multilstsq.MultiLstSq
. This class requires us to build the explicit design matrix, and
does a textbook least square solving on it. First, we need to create the object:
from multilstsq import MultiLstSq
import numpy as np
# First argument is the problem dimensions
# Second argument is the number of parameters
mls = MultiLstSq((2, 3), 2)
Now, we need to construct the data to add. We can add one or multiple “observations” at the time. For a single problem, the matrix X
would be of dimensions
(n_observations, 2)
(since we have two parameters), while the observation vector should be a column vector of shape: (n_observations, 1)
. However, since
we are solving multiple simultaneous least squares problems, we need to add the data for all the problems simultaneously. Therefore in our case the matrix X
is of dimensions (2, 3, n_observations, 2)
(we added the problem dimensions) and the observation vector (2, 3, n_observations, 1)
.
# 4 observations
X = np.ma.masked_all((2, 3, 4, 2))
y = np.ma.masked_all((2, 3, 4, 1))
# This is for β₀
X[:, :, :, 0] = 1
# Values of x, as coefficients of β₁
# The first two dims are the problem index
# The third is the observation index
X[0, 0, :, 1] = [1, 2, 3, 4]
X[0, 1, :, 1] = [1, 2, 1, 2]
X[0, 2, :, 1] = [1, 1, 1, 1]
X[1, 0, :, 1] = [-5, -6, 13, 43]
X[1, 1, :3, 1] = [-1, 0, 1] # only 3 observations
X[1, 2, :2, 1] = [4, 8] # only 2 observations
# Observations
y[0, 0, :, 0] = [1, 2, 3, 4]
y[0, 1, :, 0] = [1.1, 2, 0.9, 2.1]
y[0, 2, :, 0] = [3, 4, 5, 6]
y[1, 0, :, 0] = [-5.9, -5.2, 11.9, 42.1]
y[1, 1, :3, 0] = [1, 2, 3] # only 3 observations
y[1, 2, :2, 0] = [4.5, 5] # only 2 observations
#Add the data to the regression
mls.add_data(X, y)
It is possible to run multilstsq.MultiLstSq.add_data()
multiple times, to “stream” the data to the solver, for example one observation at a time. It may
seem complicated but usually it’s quite easy to convert the data matrix into the design matrix. We will see later an even simpler method to do so.
Once we added all the data, we can get β
:
print(mls.beta.shape) # (2, 3, 2, 1)
# Since it is a big matrix, lets get the value for the first problem
print(mls.beta[0, 0]) # We get a column vector [0, 1], as expected
print(mls.beta[0, 2]) # This problem cannot be solved, we get [nan, nan]
It may be required to have variance information. We need to switch to variance mode, and then re-add all the data:
mls.switch_to_variance()
# We should add exactly the same data as in the previous step.
# It doesn't matter if the number of calls is the same.
mls.add_data(X, y)
Once all the data is added, it is possible to get the covariance matrix for each problem. The covariance matrix is of shape (n_parameters, n_parameters)
for
each subproblem.
# We can get the covariance matrix
print(mls.variance.shape) # (2, 3, 2, 2)
print(mls.variance[0, 0]) # all zero, as it is a perfect fit
print(mls.variance[1, 0]) # [[0.299, -0.006], [-0.006, 0.001]]
print(mls.variance[1, 2]) # all masked: no variance can be computed (only two points for the line)
print(mls.variance[0, 2]) # all masked: no variance can be computed (not possible to do the fit)
It is also possible to get some additional information, which enables additional processing (like implementing tests):
print(mls.n_observations)
# [[4 4 4]
# [4 3 2]]
print(mls.n_parameters)
# 2
print(mls.sigma_2)
# [[0.00 0.01 -- ]
# [0.90 0.00 -- ]]
print(mls.rss)
# [[0.00 0.03 -- ]
# [1.81 0.00 0.0]]
Solving a regression problem¶
The direct approach shown in the section above works well, but building the design matrix may be impractical when complex model are used, especially if the model
changes. In the following example, for readability we will use ()
as the problem size, this means we only do a single regression. Working with a bigger
problem size is exacly similar to the previous section.
Let’s say we have the following data:
import numpy as np
x0 = np.array([-0.44, -0.52, -0.65, 0.89, -1.15, 1.07, -0.1, 1.05, 1.5, 0.23, -0.87, 1.77, -0.42, 0.43, 1.58, -0.2, -1.69, -1.92, 1.18, -1.18])
x1 = np.array([1.36, -0.69, -0.96, -0.27, 0.34, -0.02, -0.63, -0.66, 0.96, -0.21, 0.01, -0.06, -1.3, 1.05, 1.08, -1.74, -0.87, 0.72, 0.7, 1.67])
y = np.array([-0.68, -1.74, -2.24, 2.65, -3.23, 3.29, -0.45, 2.97, 4.96, 0.71, -2.51, 5.35, -1.68, 1.81, 5.25, -1.2, -5.35, -5.41, 3.91, -2.79])
We suspect that this data follows a multilinear relationship according to the model y = β₀ + β₁x₀ + β₂x₁
.
The classical approach would be to create a design matrix with rows [1, x₀, x₁]
, but we can let the class multilstsq.MultiRegression
do the work for us:
from multilstsq import MultiRegression
mr = MultiRegression((), 'b0 + b1 * x0 + b2 * x1')
#Note that X is a matrix of two columns x₀, x₁
X = np.array([x0, x1]).T
Y = np.array([y]).T
mr.add_data(X,Y)
mr.switch_to_variance()
mr.add_data(X,Y)
mr.rss # 0.002193175857390197
# etc.
Now, let say we suspect that the model is not what we expected, we can compare models easily:
models = [
'b0 + b1 * x0 + b2 * x1',
'b0 + b1 * x0 + b2 * x1 + b3 * (x1**2)',
'b0 + b1 * x0 + b2 * x1 + b3 * (x0**2)',
'b0 + b1 * x0 + b2 * x1 + b3 * (x0**2) + b4 * (x1**2)',
]
for model in models:
mr = MultiRegression((), model)
mr.add_data(X,Y)
mr.switch_to_variance()
mr.add_data(X,Y)
print('{:0.05f}'.format(mr.rss), model)
# We obtain the following output:
# 0.00278 b0 + b1 * x0 + b2 * x1
# 0.00270 b0 + b1 * x0 + b2 * x1 + b3 * (x1**2)
# 0.00016 b0 + b1 * x0 + b2 * x1 + b3 * (x0**2)
# 0.00014 b0 + b1 * x0 + b2 * x1 + b3 * (x0**2) + b4 * (x1**2)
We can see that the best model is likely to be y = β₀ + β₁x₀ + β₂x₁ + β₂x₀²
. Using this kind of technique it is very simple to make step-wise regression.
The model string can be any valid Python expression, but requires it to be linear in b
’s. Each variable b0
, b1
, b2
… corresponds to a parameter,
while x0
, x1
, x2
… corresponds to columns of the matrix X
.
Using ExprEvaluator
¶
This page describes how to use multilstsq.ExprEvaluator
, which is a tool to parse and evaluate (convert to a Python object) expressions provided as strings.
At first, an expression is parsed:
from multilstsq import ExprEvaluator
e = ExprEvaluator('a*(2+c*b)')
This expression has 3 variables: a
, b
and c
. We can get this information using:
e.variables # returns {'a', 'b', 'c'}
As long as these variables are not substituted, it is not possible to evaluate the expression. We can substitute these variables and evaluate the results by doing:
e(a=3, b=5, c=1) # returns 21, as expected.
However, this is not the typical use case. The goal of multilstsq.ExprEvaluator
is to enable partial substitution, for example substituting b
and c
, but not a
:
e2 = e.substitute(constants={'b':5, 'c':1})
e2(a=3) # Will also return 21, as previously
We can see that e2
has two constants (defined variables) and one variable. We can also get the expression as a string for debugging:
e2.variables # returns {'a'}
e2.constants # returns {'b', 'c'}
str(e2) # returns '(a) * ((2) + ((c) * (b)))'
# The parentheses are added to ensure correct resolution order
By looking at the expression, we can observe that second operand of the multiplication is fully known, and therefore that it could be reduced:
e3 = e2.reduce()
str(e3) # returns '(a) * (__ExprEvaluator_0)'
e3.variables # returns {'a'}
e3.constants # returns {'__ExprEvaluator_0'}
We can observe that the expression has indeed be reduced. In the present case, it would have made sense to reduce it to (a) * 7
, but instead a (reserved) constant __ExprEvaluator_0
has been created. This makes sense the result of the parenthesis could have been a complex object, like a numpy array.
If the expression is evaluated multiple times, it is usually a large performance improvement to substitute everything possible before evaluating it.
Note that any python object can be substituted:
import numpy as np
e4 = e3.substitute(constants = {'a': np.array([1, 2, 3])})
e4.eval() # returns array([ 7, 14, 21])
e4() # returns the same thing
e4.reduce().eval() # still the same thing
Complex expression can also be used in the evaluation string. Here’s another example:
import numpy as np
e = ExprEvaluator('np.linalg.inv(A).dot(y)')
v1 = np.array([[1, 4], [2, -1]])
v2 = np.array([9, 0])
e(A=v1,y=v2) # returns array([1., 2.])
By default, multilstsq.ExprEvaluator
uses the modules defined at the calling stack frame. This makes np
work in the expression in the example above.
This behavious can be disabled if needed:
import numpy as np
e = ExprEvaluator('np.linalg.inv(A).dot(y)', enable_caller_modules=False)
v1 = np.array([[1, 4], [2, -1]])
v2 = np.array([9, 0])
e(A=v1,y=v2) # Will fail, np is not defined
e(A=v1,y=v2, np=np) # returns array([1., 2.])
Additionally, it is also possible to substitute a variable in the expression by something else:
import numpy as np
e = ExprEvaluator('np.linalg.inv(A).dot(y)')
e2 = e.substitute(expressions={'A':'A.T'})
str(e2) # '(((((np).linalg).inv)(A)).dot)(y)'
v1 = np.array([[1, 4], [2, -1]])
v2 = np.array([9, 0])
e(A=v1,y=v2) # returns array([1., 4.])
This functionnality is used extensively by the multilstsq.MultiRegression
class.
Warning
Using multilstsq.ExprEvaluator
to evaluate expressions from untrusted sources may lead to security vulnerabilities.
API Documentation¶
This is the API documentation of the public API of multilstsq. These are the most commonly used functions when using multilstsq.
Least squares¶
-
class
multilstsq.
MultiLstSq
(problem_dimensions, n_parameters, internal_dtype=<class 'float'>)¶ Multiple least squares problems.
y = Xβ + ε
This is a class which can exist in three modes:
MODE_REGRESSION
: add data to the least square matrices to obtain the meanMODE_VARIANCE
: compute the varianceMODE_READONLY
: the object is frozen.
The mode can only be switched “forward” (it is not possible to move from
MODE_VARIANCE
toMODE_REGRESSION
for example.)-
__init__
(problem_dimensions, n_parameters, internal_dtype=<class 'float'>)¶ Create a MultiLstSq object, which is originally in
MODE_REGRESSION
.Parameters: - problem_dimensions – Tuple, size of the problem array, it can be
()
(0-dimensional), for a single least square problem, or for example(800, 600)
for 800x600 times the regression problem - n_parameters – Number of parameters of the least squares problem.
- internal_dtype –
numpy.dtype
data type of the matrices
- problem_dimensions – Tuple, size of the problem array, it can be
-
__weakref__
¶ list of weak references to the object (if defined)
-
add_data
(X, y, w=None)¶ Add data to the object (depending on the mode, for either mean or variance computation)
-
beta
¶ The linear coefficients that minimize the least squares criterion.
-
evaluate_at
(X)¶ Evaluate Xβ
-
n_observations
¶ Number of observations n.
-
n_parameters
¶ Number of parameters
-
rss
¶ Residual sum of squares
-
sigma_2
¶ Scaling parameter for the covariance matrix.
-
switch_to_read_only
()¶ Switch to read-only mode.
-
switch_to_variance
()¶ Switch to variance computation mode.
-
variance
¶ Returns the variance/covariance matrix.
Regression¶
-
class
multilstsq.
MultiRegression
(problem_dimensions, model_str, internal_dtype=<class 'float'>)¶ -
__init__
(problem_dimensions, model_str, internal_dtype=<class 'float'>)¶ Create a MultiLstSq object, which is originally in
MODE_REGRESSION
.Parameters: - problem_dimensions – Tuple, size of the problem array, it can be
()
(0-dimensional), for a single least square problem, or for example(800, 600)
for 800x600 times the regression problem - n_parameters – Number of parameters of the least squares problem.
- internal_dtype –
numpy.dtype
data type of the matrices
- problem_dimensions – Tuple, size of the problem array, it can be
-
add_data
(X, y, w=None)¶ Add data to the object (depending on the mode, for either mean or variance computation)
-
Expression evaluator¶
-
class
multilstsq.
ExprEvaluator
(expr, constants=None, enable_caller_modules=True)¶ -
__call__
(*args, **kwargs)¶ Substitute the arguments in the expression, and then evaluate it and return the resulting Python object.
The positional arguments are
zip()
’ed with thevariable_list
ofmultilstsq.ExprEvaluator.enable_call()
, which the named arguments are directly substituted.Returns: The Python object of the evaluated expression.
Raises: - RuntimeError – if positional arguments are used, but
multilstsq.ExprEvaluator.enable_call()
was not called. - ValueError – if there are undefined variables.
- RuntimeError – if positional arguments are used, but
-
__init__
(expr, constants=None, enable_caller_modules=True)¶ Initialize an ExprEvaluator object
Parameters: - expr – Expression string, should be a valid Python 3 expression
- constant – Dictionary of constants
{'constantname': value}
. - enable_caller_modules – Boolean, if
True
modules of the first frame in the stack outside of multilstsq will be included in the context of the expression
-
__repr__
()¶ Return repr(self).
-
__str__
()¶ Returns: a string corresponding to the expression
-
__weakref__
¶ list of weak references to the object (if defined)
-
constants
¶ Returns: Set of constants names (values which have a defined substitution) in the current expression
-
enable_call
(variable_list)¶ Enables calling the expression, as a simplification for calling
multilstsq.ExprEvaluator.substitute()
followed bymultilstsq.ExprEvaluator.eval()
.Parameters: variable_list – List of variables names which correspond to the arguments which will be used in multilstsq.ExprEvaluator.__call__()
-
eval
()¶ Returns: The python object which results of the expression Raises: ValueError – if at least one variable is not defined
-
reduce
()¶ Returns: A copy of the current expression, where all known part of the syntax tree are simplified. For example, reducing
a*(b+c)
with knownb
andc
, will result ina*__ExprEvaluator_0
, where__ExprEvaluator_0
is defined as a constant.
-
substitute
(expressions=None, constants=None)¶ Substitute expressions or constants in the current expression.
Parameters: - expressions – dictionary of expression to substitute. The keys can be of
str
,ast.AST
ormultilstsq.ExprEvaluator
. - constants – dictionary of constants to substitute
{'constantname': value}
Returns: New
multilstsq.ExprEvaluator
object with the substitution done.- expressions – dictionary of expression to substitute. The keys can be of
-
variables
¶ Returns: Set of variables names (values which don’t have a defined substitution) in the current expression
-
Contributing guide¶
multilstsq
is a free software, and all contributions are welcome, whether they are bug reports, source code, or documentation.
Reporting bugs¶
To report bugs, open an issue in the issue tracker.
Ideally, a bug report should contain at least the following information:
- a minimum code example to trigger the bug
- the expected result
- the result obtained.
Quick guide to contributing code or documentation¶
To contribute, you’ll need Sphinx to build the documentation, and pytest to run the tests.
If for some reason you are not able to run the following steps, simply open an issue with your proposed change.
- Fork the multilstsq on GitHub.
- Clone your fork to your local machine:
git clone https://github.com/<your username>/multilstsq.git
cd multilstsq
pip install -e .
- Create a branch for your changes:
git checkout -b <branch-name>
- Make your changes.
- If you’re writing code, you should write some tests, ensure that all the tests pass and that the code coverage is good. This can be done using:
py.test --cov=multilstsq --pep8
- You should also check that the documentation compiles and that the result look good. The documentation can be seen by opening a browser in doc/html. You can (re)build it using the following command line (make sure that there is no warnings):
sphinx-build doc/source doc/html
- Commit your changes and push to your fork on GitHub:
git add .
git commit -m "<description-of-changes>"
git push origin <name-for-changes>
- Submit a pull request.