Welcome to PyQt-Fit’s documentation!¶
PyQt-Fit is a regression toolbox in Python with simple GUI and graphical tools to check your results. It currently handles regression based on user-defined functions with user-defined residuals (i.e. parametric regression) or non-parametric regression, either local-constant or local-linear, with the option to provide your own. The GUI currently provides an interface only to parametric regression.
Contents:
Introduction to PyQt-Fit¶
The GUI for 1D data analysis is invoked with:
$ pyqt_fit1d.py
PyQt-Fit can also be used from the python interpreter. Here is a typical session:
>>> import pyqt_fit
>>> from pyqt_fit import plot_fit
>>> import numpy as np
>>> from matplotlib import pylab
>>> x = np.arange(0,3,0.01)
>>> y = 2*x + 4*x**2 + np.random.randn(*x.shape)
>>> def fct(params, x):
...     a0, a1, a2 = params
...     return a0 + a1*x + a2*x*x
>>> fit = pyqt_fit.CurveFitting(x, y, (0,1,0), fct)
>>> result = plot_fit.fit_evaluation(fit, x, y)
>>> print(fit(x)) # Display the estimated values
>>> plot_fit.plot1d(result)
>>> pylab.show()
PyQt-Fit is a package for regression in Python. There are two sets of tools: one for parametric regression and one for non-parametric regression.
For the parametric regression, the user can define their own vectorized function (note that a normal function wrapped in numpy’s “vectorize” function is perfectly fine here), and find the parameters that best fit some data. It also provides bootstrapping methods (either on the samples or on the residuals) to estimate confidence intervals on the parameter values and/or the fitted functions.
The non-parametric regression can currently be either local constant (i.e. spatial averaging) in nD or local-linear in 1D only. There is a version of the bootstrapping adapted to non-parametric regression too.
The package also provides four evaluations of the regression: the plot of residuals vs. the X axis, the plot of normalized residuals vs. the Y axis, the QQ-plot of the residuals and the histogram of the residuals. All of this can be output to a CSV file for further analysis in your favorite software (including most spreadsheet programs).
Regression using the GUI - tutorial¶
Using the interface¶
The script is started from the command line with:
$ pyqt_fit1d.py
Once the script is started, the interface will look like this:

Main GUI of PyQt-Fit
The interface is organised in 4 sections:
- the top-left of the window to define the data to load and process;
- the bottom-left to define the function to be fitted and its parameters;
- the top-right to define the options to compute confidence intervals;
- the bottom-right to define the output options.
Loading the Data¶
The application can load CSV files. The first line of the file must contain the names of the available datasets. In case of missing data, only the rows where both selected datasets are available are kept.
Once loaded, the available datasets will appear as options in the combo boxes. You need to select the explaining variable for the X axis and the explained variable for the Y axis.
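For illustration, here is a minimal sketch of how a compatible file could be generated with NumPy (the file name and the dataset names “x” and “y” are arbitrary choices for this example):
>>> import numpy as np
>>> x = np.arange(0, 3, 0.01)
>>> y = 2*x + 4*x**2 + np.random.randn(*x.shape)
>>> # the header becomes the first line of the file, i.e. the dataset names
>>> np.savetxt('example.csv', np.c_[x, y], delimiter=',', header='x,y', comments='')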
Defining the regression function¶
First, you will want to choose the function. The available functions are listed in the combo box. When selecting a function, the list of its parameters appears in the list below. The values presented are a quick estimate computed from the data. You can edit them by double-clicking. This is also where you can specify whether a parameter has a known value and should therefore be fixed.
If needed, you can also change the computation of the residuals. By default there are two kinds of residuals:
- Standard
- residuals are simply the difference between the estimated and the observed values.
- Difference of the logs
- residuals are the difference of the logs of the values.
Plotting and output¶
By default, the output consists of the data points and the fitted function, interpolated over the whole range of the input data. It is, however, possible to change the range of data, or to evaluate the function on the existing data points rather than on interpolated ones.
The output also presents a window to evaluate the quality of the fitting:

In general, the dashed red line is the target to achieve for a good fitting. When present, the green line shows the estimate that should match the red line.
The top-left graph presents the distribution of the residuals against the explaining variable. The green line shows a local-linear regression of the residuals. It should be aligned with the dashed red line.
The top-right graph presents the distribution of the square root of the standardized residuals against the explained variable. The purpose of this graph is to test the uniformity of the distribution. The green line is again a local-linear regression of the points. The line should be as flat and horizontal as possible. If the distribution is normal, the green line should match the dashed red line.
The bottom right graph presents a histogram of the residuals. For parametric fitting, the residuals should globally be normal.
The bottom left graph presents a QQ-plot, matching the theoretical quantiles and the standardized residuals. If the residuals are normally distributed, the points should be on the dashed red line.
The result of the fitting can also be output. What is written corresponds exactly to what is displayed. The output is a CSV file, and is meant to be readable by a human.
Confidence interval¶
Confidence intervals can be computed using bootstrapping. There are two kinds of bootstrapping implemented:
- regular bootstrapping
The data are resampled and the pairs \((x,y)\) are kept together. No assumption is made, but this is often troublesome in regression, tending to flatten the results.
- residual resampling
After a first fit, for each pair \((x,y)\), we find the estimated value \(\hat{y}\). Then, the residuals are re-sampled, and new pairs \((x,\hat{y}+r')\) are created, \(r'\) being a resampled residual (see the sketch below).
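As an illustration, here is a minimal NumPy sketch of the residual-resampling idea; x and y stand for the data arrays and yhat for the values estimated by the fitted function (these names are assumptions for the sketch — the actual implementation lives in pyqt_fit.bootstrap):
>>> import numpy as np
>>> res = y - yhat                                  # residuals of the first fit
>>> idx = np.random.randint(0, len(res), len(res))  # draw residuals with replacement
>>> y_new = yhat + res[idx]                         # synthetic dataset (x, y_new) to be re-fitted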
The intervals should be given as a list of semicolon-separated percentages. Finally, the number of repeats defines how many re-samplings will be performed.
Defining your own function¶
First, you need to define the environment variable PYQTFIT_PATH as a colon-separated list of folders. In each folder, you can add python modules in a functions sub-folder. For example, if the path ~/.pyqtfit is in PYQTFIT_PATH, then you need to create a folder ~/.pyqtfit/functions, in which you can add your own python modules.
Each module will be loaded, and the functions defined in it will be added to the interface. A function is a class or an object with the following properties:
- name
- Name of the function
- description
- Equation of the function
- args
- List of arguments
- __call__(args, x)
Compute the function. The args argument is a tuple or list with as many elements as there are in the args attribute of the function.
- init_args(x,y)
- Function guessing some initial arguments from the data. It must return a list or tuple of values, one per argument to the function.
- Dfun(args, x)
Compute the jacobian of the function at the points x. If the function is not provided, the attribute should be set to None, and the jacobian will be estimated numerically.
As an example, here is the definition of the cosine function:
import numpy as np

class Cosine(object):
    name = "Cosine"
    args = "y0 C phi t".split()
    description = "y = y0 + C cos(phi x + t)"

    @staticmethod
    def __call__(args, x):
        y0, C, phi, t = args
        return y0 + C*np.cos(phi*x + t)

    Dfun = None

    @staticmethod
    def init_args(x, y):
        C = y.ptp()/2
        y0 = y.min() + C
        phi = 2*np.pi/x.ptp()
        t = 0
        return (y0, C, phi, t)
Defining your own residual¶
Similarly to the functions, it is possible to implement your own residuals. The residuals need to be in a residuals folder, and they need to be objects or classes with the following properties:
- name
- Name of the residuals
- description
- Formula used to compute the residuals
- __call__(y1, y0)
- Function computing the residuals, y1 being the original data and y0 the estimated data.
- invert(y, res)
- Function applying the residual to the estimated data.
- Dfun(y1, y0, dy)
Compute the jacobian of the residuals. y1 is the original data, y0 the estimated data and dy the jacobian of the function at y0.
As an example, here is the definition of the log-residuals:
import numpy as np

class LogResiduals(object):
    name = "Difference of the logs"
    description = "log(y1/y0)"

    @staticmethod
    def __call__(y1, y0):
        return np.log(y1/y0)

    @staticmethod
    def Dfun(y1, y0, dy):
        """
        J(log(y1/y0)) = -J(y0)/y0
        where J is the jacobian and division is element-wise (per row)
        """
        return -dy/y0[np.newaxis, :]

    @staticmethod
    def invert(y, res):
        """
        Multiply the value by the exponential of the residual
        """
        return y*np.exp(res)
Parametric regression tutorial¶
Introduction¶
Given a set of observations \((x_i, y_i)\), with \(x_i = (x_{i1}, \ldots, x_{ip})^T \in \mathbb{R}^p\), we assume there exists a function \(f(\theta, x)\) and a set of parameters \(\theta \in \mathbb{R}^q\) such that:
\[y_i = f(\theta, x_i) + \epsilon_i\]
with \(\epsilon_i \in \mathbb{R}\) such that \(E(\epsilon) = 0\).
The objective is to find the set of parameters \(\theta\). Obviously, the real function is inaccessible. Instead, we will try to find an estimate of the parameters, \(\hat{\theta}\), using the least square estimator, which is:
\[\hat{\theta} = \argmin_{\theta\in\mathbb{R}^q} \sum_i \left(y_i - f(\theta, x_i)\right)^2\]
The method is based on the SciPy function scipy.optimize.leastsq, which relies on the MINPACK’s functions lmdif and lmder. Both functions implement a modified Levenberg-Marquardt algorithm to solve the least-square problem. Most of the output of the main curve fitting option will be the output of the least-square function in scipy.
A simple example¶
As a simple example, we will take the function \(f\) to be:
\[f((a_0, a_1, a_2), x) = a_0 + a_1 x + a_2 x^2\]
Let’s assume the points look like this:

Raw data for curve fitting
The data points have been generated by this script:
>>> import numpy as np
>>> from matplotlib import pylab as plt
>>> x = np.arange(0,3,0.01)
>>> y = 2*x + 4*x**2 + 3*np.random.randn(*x.shape)
>>> plt.plot(x,y,'+',label='data')
>>> plt.legend(loc=0)
>>> plt.xlabel('X'); plt.ylabel('Y')
So we will expect to find something close to \((0,2,4)\).
To perform the analysis, we first need to define the function to be fitted:
>>> def f(params, x):
... a0, a1, a2 = params
... return a0 + a1*x+ a2*x**2
Then, we construct a CurveFitting object, which computes and stores the optimal parameters, and also behaves as a function for the fitted data:
>>> import pyqt_fit
>>> fit = pyqt_fit.CurveFitting(x,y,(0,1,0),f)
>>> print "The parameters are: a0 = {0}, a1 = {1}, a2 = {2}".format(*fit.popt)
The parameters are: a0 = 0.142870141922, a1 = 1.33420587099, a2 = 4.27241667343
>>> yfitted = fit(x)
The fit object, besides being a callable object that evaluates the fitted function at some points, contains the following attributes:
- fct
- Function being fitted (e.g. the one given as argument)
- popt
- Optimal parameters for the function
- res
- Residuals of the fitted data
- pcov
- Covariance of the parameters around the optimal values.
- infodict
- Additional estimation outputs, as given by scipy.optimize.leastsq()
Fitting analysis¶
PyQt-Fit also has tools to evaluate your fitting. You can use them as a whole:
>>> from pyqt_fit import plot_fit
>>> result = plot_fit.fit_evaluation(fit, x, y,
... fct_desc = "$y = a_0 + a_1 x + a_2 x^2$",
... param_names=['a_0', 'a_1', 'a_2'])
You can then examine the result variable. But you can also perform only the analysis you need. For example, you can compute the data needed for the residual analysis with:
>>> rm = plot_fit.residual_measures(fit.res)
rm is a named tuple with the following fields:
- scaled_res
Scaled residuals, sorted in ascending order. The scaled residuals are computed as \(sr_i = \frac{r_i}{\sigma_r}\), where \(\sigma_r\) is the standard deviation of the residuals.
- res_IX
- Ordering indices for the residuals in scaled_res. This orders the residuals in an ascending manner.
- prob
- List of quantiles used to compute the normalized quantiles.
- normq
Values expected for the quantiles in prob if the distribution is normal (a quick numerical check is sketched below). The formula is: \(\Phi^{-1}(p) = \sqrt{2}\,\mathrm{erf}^{-1}(2p-1),\ p\in[0;1]\)
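The expected normal quantiles given by this formula can be reproduced with SciPy (a small check, not part of PyQt-Fit's API):
>>> import numpy as np
>>> from scipy.stats import norm
>>> from scipy.special import erfinv
>>> p = np.linspace(0.01, 0.99, 5)
>>> np.allclose(np.sqrt(2)*erfinv(2*p - 1), norm.ppf(p))
True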
Plotting the results¶
At last, you can use the display used for the GUI:
>>> handles = plot_fit.plot1d(result)
What you will obtain are these two graphs:

Curve fitting output

Residual checking output
Do not hesitate to look at the code for pyqt_fit.plot_fit.plot1d() to examine how things are plotted. The function should return all the handles you may need to tune the presentation of the various curves.
Speeding up the fitting: providing the jacobian¶
The least-square algorithm uses the jacobian (i.e. the derivative of the function with respect to each parameter at each point). By default, the jacobian is estimated numerically, which can be quite expensive (if the function itself is). But in many cases, it is fairly easy to compute. For example, in our case we have:
\[\frac{\partial f}{\partial a_0} = 1 \qquad \frac{\partial f}{\partial a_1} = x \qquad \frac{\partial f}{\partial a_2} = x^2\]
By default, the derivatives should be given in columns (i.e. each line corresponds to a parameter, each column to a point):
>>> def df(params, x):
... result = np.ones((3, x.shape[0]), dtype=float)
... result[1] = x
... result[2] = x**2
... return result # result[0] is already 1
>>> fit.Dfun = df
>>> fit.fit()
Of course there is no change in the result, but it should be slightly faster (note that in this case, the function is so fast that to make it worthwhile, you need a lot of points as input).
Confidence Intervals¶
PyQt-Fit provides bootstrapping methods to compute confidence intervals. Bootstrapping is a method to estimate confidence intervals and probability distributions by resampling the data provided. For our problem, we will call:
>>> import pyqt_fit.bootstrap as bs
>>> xs = np.arange(0, 3, 0.01)
>>> result = bs.bootstrap(pyqt_fit.CurveFitting, x, y, eval_points = xs, fit_kwrds = dict(p0 = (0,1,0), function = f), CI = (95,99), extra_attrs = ('popt',))
This will compute the 95% and 99% confidence intervals for the curves and for the optimised parameters (popt). The result is a named tuple pyqt_fit.bootstrap.BootstrapResult. The most important fields are y_est and CIs, which provide the estimated values and the confidence intervals for the curve and for the parameters.
On the data, the result can be plotted with:
>>> plt.plot(xs, result.y_fit(xs), 'r', label="Fitted curve")
>>> plt.plot(xs, result.CIs[0][0,0], 'g--', label='95% CI')
>>> plt.plot(xs, result.CIs[0][0,1], 'g--')
>>> plt.fill_between(xs, result.CIs[0][0,0], result.CIs[0][0,1], color='g', alpha=0.25)
>>> plt.legend(loc=0)
The result is:

Drawing of the 95% confidence interval
The bounds for the parameters are obtained with:
>>> print "95% CI for p0 = {}-{}".format(*result.CIs[1][0])
>>> print "99% CI for p0 = {}-{}".format(*result.CIs[1][1])
95% CI for p0 = [-0.84216998 -0.20389559 3.77950689]-[ 1.14753806 2.8848943 4.7557855 ]
99% CI for p0 = [-1.09413524 -0.62373955 3.64217184]-[ 1.40142123 3.32762714 4.91391328]
It is also possible to obtain the full distribution of the values for the curve and for the parameters by providing the argument full_results=True and by looking at result.full_results.
Defining the functions and residuals¶
User-defined function¶
The function must be a Python function of two arguments:
- the parameters of the function, provided either as a tuple or a ndarray
- the values on which the function is to be evaluated, provided as a single value or a ndarray
If the second argument is a ndarray of shape (...,N), the output must be a ndarray of shape (N,).
It is also possible to provide the function computing the Jacobian of the estimation function. The arguments are the same as for the function, but the shape of the output must be (P,N), where P is the number of parameters to be fitted, unless the option col_deriv is set to 0, in which case the shape of the output must be (N,P).
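For instance, for a quadratic function with three parameters, the shape convention can be checked as follows (a small sketch; the parameter names are arbitrary):
>>> import numpy as np
>>> def f(params, x):
...     a0, a1, a2 = params
...     return a0 + a1*x + a2*x**2
>>> def df(params, x):
...     jac = np.empty((3, x.shape[0]))  # shape (P,N): one row per parameter
...     jac[0] = 1.0                     # df/da0
...     jac[1] = x                       # df/da1
...     jac[2] = x**2                    # df/da2
...     return jac
>>> x = np.linspace(0, 1, 50)
>>> f((1., 2., 3.), x).shape
(50,)
>>> df((1., 2., 3.), x).shape
(3, 50)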
User-defined residuals¶
It is also possible to redefine the notion of residuals. A common example is to use the difference of the logs (i.e. the log of the ratio). It is most applicable if the standard deviation of the residuals is proportional to the fitted quantity. The residuals should be a function of two arguments:
- the measured data
- the fitted data
For example, the log residuals would be:
>>> def log_residuals(y1, y0):
... return np.log(y1/y0)
As for the user-defined function, it is possible to provide the jacobian of the residuals. It must be provided as a function of 3 arguments:
- the measured data
- the fitted data
- the jacobian of the function on the fitted data
The shape of the output must be the same as the shape of the jacobian of the function. For example, if col_deriv is set to True, the jacobian of the log-residuals will be defined as:
>>> def Dlog_residual(y1, y0, dy):
...     return -dy/y0[np.newaxis,:]
This is because:
\[\frac{\partial}{\partial \theta_j} \log\left(\frac{y_1}{y_0}\right) = -\frac{1}{y_0}\,\frac{\partial y_0}{\partial \theta_j}\]
as \(y_1\) is a constant, and \(y_0\) depends on the parameters.
Also, methods like the residuals bootstrapping will require a way to apply residuals on fitted data. For this, you will need to provide a function such as:
>>> def invert_log_residuals(y, res):
... return y*np.exp(res)
This function should be such that the following expression always returns true:
>>> all(log_residuals(invert_log_residuals(y, res), y) == res)
Of course, when working with floating point values, this will usually not hold exactly. So a better test would be:
>>> sum((log_residuals(invert_log_residuals(y, res), y) - res)**2) < epsilon
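For completeness, here is a small self-contained version of that check, using the functions defined above (the arrays y and res and the threshold epsilon are arbitrary test values):
>>> import numpy as np
>>> y = np.linspace(1.0, 10.0, 100)   # strictly positive "fitted" data
>>> res = 0.1*np.random.randn(100)    # arbitrary residuals
>>> epsilon = 1e-10
>>> np.sum((log_residuals(invert_log_residuals(y, res), y) - res)**2) < epsilon
True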
Using the functions/residuals defined for the GUI¶
It is also possible to use the functions and residuals defined for the GUI. The interface for this is provided by the modules pyqt_fit.functions and pyqt_fit.residuals.
The list of available functions can be retrieved with:
>>> pyqt_fit.functions.names()
['Power law', 'Exponential', 'Linear', 'Logistic']
And a function is retrieved with:
>>> f = pyqt_fit.functions.get('Logistic')
The function is an object with the following properties:
- __call__
- Evaluate the function on a set of points, as described in the previous section
- Dfun
- Evaluate the jacobian of the function. If not available, this property is set to None
- args
- Name of the arguments
- description
- Formula or description of the evaluated function
- init_args
Function providing a reasonable first guess for the parameters. It should be called as f.init_args(x,y).
In the same way, the list of available residuals can be retrieved with:
>>> pyqt_fit.residuals.names()
['Difference of the logs', 'Standard']
And a residuals function is retrieved with:
>>> r = pyqt_fit.residuals.get('Difference of the logs')
The residuals object has the following properties:
- __call__
- Evaluate the residuals, as described in the previous section
- Dfun
- Evaluate the jacobian of the residuals. If not available, this property is set to None
- invert
Function that applies the residuals to a set of fitted data. It will be called as r.invert(y, res). It should have the properties of the invert function described in the previous section.
- description
- Description of the kind of residuals
- name
- Name of the residuals.
Non-Parametric regression tutorial¶
Introduction¶
In general, given a set of observations \((x_i,y_i)\), with \(x_i = (x_{i1}, \ldots, x_{ip})^T \in \mathbb{R}^p\), we assume there exists a function \(f(x)\) such that:
\[y_i = f(x_i) + \epsilon_i\]
with \(\epsilon_i \in \mathbb{R}\) such that \(E(\epsilon) = 0\). This function, however, is not accessible. So we will consider the function \(\hat{f}\) that minimises the expected estimation error:
\[\hat{f} = \argmin_{\hat{f}} E\left[\left(y - \hat{f}(x)\right)^2\right]\]
The various methods presented here consist of numerical approximations finding the minimum in a part of the function space. The most general method offered by this module is called the local-polynomial smoother. It uses the Taylor decomposition of the function f on each point, and a local weighting of the points, to find the values. The function is then defined as:
\[\hat{f}(x) = a_0 \quad \text{where} \quad \mathcal{P}_n = \argmin_{\mathcal{P}_n} \sum_i K\left(\frac{x-X_i}{h}\right)\left(Y_i - \mathcal{P}_n(X_i - x)\right)^2\]
Where \(\mathcal{P}_n\) is a polynomial of order \(n\) whose constant term is \(a_0\), \(K\) is a kernel used for weighting the values and \(h\) is the selected bandwidth. In particular, in 1D:
\[\hat{f}(x) = a_0 \quad \text{where} \quad (a_0,\ldots,a_n) = \argmin_{a_0,\ldots,a_n} \sum_i K\left(\frac{x-X_i}{h}\right)\left(Y_i - a_0 - a_1(x-X_i) - \ldots - a_n\frac{(x-X_i)^n}{n!}\right)^2\]
In general, higher polynomials will reduce the error term but will overfit the data, in particular at the boundaries.
A simple example¶
For our example, let's first define our target function:
>>> import numpy as np
>>> def f(x):
... return 3*np.cos(x/2) + x**2/5 + 3
Then, we will generate our data:
>>> xs = np.random.rand(200) * 10
>>> ys = f(xs) + 2*np.random.randn(*xs.shape)
We can then visualize the data:
>>> import matplotlib.pyplot as plt
>>> grid = np.r_[0:10:512j]
>>> plt.plot(grid, f(grid), 'r--', label='Reference')
>>> plt.plot(xs, ys, 'o', alpha=0.5, label='Data')
>>> plt.legend(loc='best')

Generated data with generative function.
At first, we will try to use a simple Nadaraya-Watson method, or spatial averaging, using a gaussian kernel:
>>> import pyqt_fit.nonparam_regression as smooth
>>> from pyqt_fit import npr_methods
>>> k0 = smooth.NonParamRegression(xs, ys, method=npr_methods.SpatialAverage())
>>> k0.fit()
>>> plt.plot(grid, k0(grid), label="Spatial Averaging", linewidth=2)
>>> plt.legend(loc='best')

Result of the spatial averaging.
As always in regression, we need to look at the residuals:
>>> from pyqt_fit import plot_fit
>>> yopts = k0(xs)
>>> res = ys - yopts
>>> plot_fit.plot_residual_tests(xs, yopts, res, 'Spatial Average')

Residuals of the Spatial Averaging regression
We can see from the data that the inside of the curve is well fitted. However, the boundaries are not. This is most visible on the right boundary, where the data is clearly under-fitted. This is a typical problem with spatial averaging, as it doesn't cope well with strong maxima, especially on the boundaries. As an improvement, we can try local-linear or local-polynomial regression. The process is exactly the same:
>>> k1 = smooth.NonParamRegression(xs, ys, method=npr_methods.LocalPolynomialKernel(q=1))
>>> k2 = smooth.NonParamRegression(xs, ys, method=npr_methods.LocalPolynomialKernel(q=2))
>>> k3 = smooth.NonParamRegression(xs, ys, method=npr_methods.LocalPolynomialKernel(q=3))
>>> k12 = smooth.NonParamRegression(xs, ys, method=npr_methods.LocalPolynomialKernel(q=12))
>>> k1.fit(); k2.fit(); k3.fit(); k12.fit()
>>> plt.figure()
>>> plt.plot(xs, ys, 'o', alpha=0.5, label='Data')
>>> plt.plot(grid, k12(grid), 'b', label='polynom order 12', linewidth=2)
>>> plt.plot(grid, k3(grid), 'y', label='cubic', linewidth=2)
>>> plt.plot(grid, k2(grid), 'k', label='quadratic', linewidth=2)
>>> plt.plot(grid, k1(grid), 'g', label='linear', linewidth=2)
>>> plt.plot(grid, f(grid), 'r--', label='Target', linewidth=2)
>>> plt.legend(loc='best')

Result of polynomial fitting with orders 1, 2, 3 and 12
In this example, we can see that linear, quadratic and cubic give very similar results, while a polynomial of order 12 is clearly over-fitting the data. Looking closer at the data, the quadratic and cubic fits seem to be better adapted than the linear one. Note that this is not to be generalised and is very dependent on the data you have! We can now redo the residual plots:
>>> yopts = k1(xs)
>>> res = ys - yopts
>>> plot_fit.plot_residual_tests(xs, yopts, res, 'Local Linear')

Residuals of the Local Linear Regression
We can also look at the residuals for the quadratic polynomial:
>>> yopts = k2(xs)
>>> res = ys - yopts
>>> plot_fit.plot_residual_tests(xs, yopts, res, 'Local Quadratic')

Residuals of the Local Quadratic Regression
We can see from the structure of the noise that the quadratic curve indeed seems to fit the data much better. Unlike in the local-linear regression, we do not have a significant bias along the X axis. Also, the residuals seem “more normal” (i.e. the points in the QQ-plot are better aligned) than in the linear case.
Confidence Intervals¶
Confidence intervals can be computed using bootstrapping. Based on the previous paragraph, you can get confidence intervals on the estimation with:
>>> import pyqt_fit.bootstrap as bs
>>> grid = np.r_[0:10:512j]
>>> def fit(xs, ys):
... est = smooth.NonParamRegression(xs, ys, method=npr_methods.LocalPolynomialKernel(q=2))
... est.fit()
... return est
>>> result = bs.bootstrap(fit, xs, ys, eval_points = grid, CI = (95,99))
This will compute the 95% and 99% confidence intervals for the quadratic fitting. The result is a named tuple pyqt_fit.bootstrap.BootstrapResult. The most important fields are y_est and CIs, which provide the estimated values and the confidence intervals for the curve.
The data can be plotted with:
>>> plt.plot(xs, ys, 'o', alpha=0.5, label='Data')
>>> plt.plot(grid, result.y_fit(grid), 'r', label="Fitted curve", linewidth=2)
>>> plt.plot(grid, result.CIs[0][0,0], 'g--', label='95% CI', linewidth=2)
>>> plt.plot(grid, result.CIs[0][0,1], 'g--', linewidth=2)
>>> plt.fill_between(grid, result.CIs[0][0,0], result.CIs[0][0,1], color='g', alpha=0.25)
>>> plt.legend(loc=0)

Confidence intervals
Types of Regressions¶
Kernel Density Estimation tutorial¶
Introduction¶
Kernel Density Estimation is a method to estimate the probability density function of a random variable, given a random sample of it.
Given a set of observations \((x_i)_{1\leq i \leq n}\), we assume the observations are a random sampling of a probability distribution \(f\). We first consider the kernel estimator:
\[\hat{f}(x) \triangleq \frac{1}{hW} \sum_{i=1}^n \frac{w_i}{\lambda_i} K\left(\frac{x - x_i}{h\lambda_i}\right)\]
Where:
- \(K: \mathbb{R}^p\rightarrow \mathbb{R}\) is the kernel, a function centered on 0 that integrates to 1;
- \(h\) is the bandwidth, a smoothing parameter that would typically tend to 0 when the number of samples tends to \(\infty\);
- \((w_i)\) are the weights of each of the points, and \(W\) is the sum of the weights;
- \((\lambda_i)\) are the adaptation factors of the kernel.
Also, it is desirable for the second moment of the kernel (i.e. its variance) to be 1, so that the bandwidth keeps a uniform meaning across kernels.
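For instance, one can check numerically that the standard Gaussian kernel satisfies both constraints (a quick sketch using SciPy, not part of PyQt-Fit):
>>> import numpy as np
>>> from scipy import integrate
>>> from scipy.stats import norm
>>> round(integrate.quad(norm.pdf, -np.inf, np.inf)[0], 6)                    # integrates to 1
1.0
>>> round(integrate.quad(lambda t: t**2*norm.pdf(t), -np.inf, np.inf)[0], 6)  # second moment (variance) is 1
1.0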
A simple example¶
First, let’s assume we have a random variable following a normal law \(\mathcal{N}(0,1)\), and let’s plot its histogram:
>>> import numpy as np
>>> from scipy.stats import norm
>>> from matplotlib import pylab as plt
>>> f = norm(loc=0, scale=1)
>>> x = f.rvs(500)
>>> xs = np.r_[-3:3:1024j]
>>> ys = f.pdf(xs)
>>> h = plt.hist(x, bins=30, normed=True, color=(0,.5,0,1), label='Histogram')
>>> plt.plot(xs, ys, 'r--', linewidth=2, label='$\mathcal{N}(0,1)$')
>>> plt.xlim(-3,3)
>>> plt.xlabel('X')
We can then estimate the density with:
>>> from pyqt_fit import kde
>>> est = kde.KDE1D(x)
>>> plt.plot(xs, est(xs), label='Estimate (bw={:.3g})'.format(est.bandwidth))
>>> plt.legend(loc='best')
You may wonder why use a KDE rather than a histogram. Let's test the variability of both methods. To that purpose, let's first generate a thousand datasets and the corresponding histograms and KDEs, making sure the width of the KDE and of the histogram are the same:
>>> import numpy as np
>>> from scipy.stats import norm
>>> from pyqt_fit import kde
>>> f = norm(loc=0, scale=1)
>>> xs = np.r_[-3:3:1024j]
>>> nbins = 20
>>> x = f.rvs(1000*1000).reshape(1000,1000)
>>> hs = np.empty((1000, nbins), dtype=float)
>>> kdes = np.empty((1000, 1024), dtype=float)
>>> hs[0], edges = np.histogram(x[0], bins=nbins, range=(-3,3), density=True)
>>> mod = kde.KDE1D(x[0])
>>> mod.fit() # Force estimation of parameters
>>> mod.bandwidth = mod.bandwidth # Prevent future recalculation
>>> kdes[0] = mod(xs)
>>> for i in range(1, 1000):
...     hs[i] = np.histogram(x[i], bins=nbins, range=(-3,3), density=True)[0]
...     mod.xdata = x[i]
...     kdes[i] = mod(xs)
Now, let’s find the mean and the 90% confidence interval:
>>> h_mean = hs.mean(axis=0)
>>> h_ci = np.array(np.percentile(hs, (5, 95), axis=0))
>>> h_err = np.empty(h_ci.shape, dtype=float)
>>> h_err[0] = h_mean - h_ci[0]
>>> h_err[1] = h_ci[1] - h_mean
>>> kde_mean = kdes.mean(axis=0)
>>> kde_ci = np.array(np.percentile(kdes, (5, 95), axis=0))
>>> width = edges[1:]-edges[:-1]
>>> fig = plt.figure()
>>> ax1 = fig.add_subplot(1,2,1)
>>> ax1.bar(edges[:-1], h_mean, yerr=h_err, width = width, label='Histogram',
... facecolor='g', edgecolor='k', ecolor='b')
>>> ax1.plot(xs, f.pdf(xs), 'r--', lw=2, label='$\mathcal{N}(0,1)$')
>>> ax1.set_xlabel('X')
>>> ax1.set_xlim(-3,3)
>>> ax1.legend(loc='best')
>>> ax2 = fig.add_subplot(1,2,2)
>>> ax2.fill_between(xs, kde_ci[0], kde_ci[1], color=(0,1,0,.5), edgecolor=(0,.4,0,1))
>>> ax2.plot(xs, kde_mean, 'k', label='KDE (bw = {:.3g})'.format(mod.bandwidth))
>>> ax2.plot(xs, f.pdf(xs), 'r--', lw=2, label='$\mathcal{N}(0,1)$')
>>> ax2.set_xlabel('X')
>>> ax2.legend(loc='best')
>>> ymax = max(ax1.get_ylim()[1], ax2.get_ylim()[1])
>>> ax2.set_ylim(0, ymax)
>>> ax1.set_ylim(0, ymax)
>>> ax1.set_title('Histogram, max variation = {:.3g}'.format((h_ci[1] - h_ci[0]).max()))
>>> ax2.set_title('KDE, max variation = {:.3g}'.format((kde_ci[1] - kde_ci[0]).max()))
>>> fig.suptitle('Comparison Histogram vs. KDE')
Note that the KDE doesn't tend toward the true density. Instead, given a kernel \(K\), the mean value will be the convolution of the true density with the kernel. But for that price, we get a much narrower variation of the values. We also avoid boundary issues linked with the choice of where the bars of the histogram start and stop.
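This can be visualised by comparing the mean KDE computed above with the true density convolved with the kernel (a sketch reusing xs, f, mod and kde_mean from the code above, and assuming the default Gaussian kernel):
>>> from scipy.ndimage import gaussian_filter1d
>>> dx = xs[1] - xs[0]
>>> smoothed = gaussian_filter1d(f.pdf(xs), sigma=mod.bandwidth/dx)  # density convolved with the kernel
>>> plt.figure()
>>> plt.plot(xs, kde_mean, 'k', lw=2, label='Mean KDE')
>>> plt.plot(xs, smoothed, 'b--', lw=2, label='Density convolved with kernel')
>>> plt.plot(xs, f.pdf(xs), 'r:', lw=2, label='$\mathcal{N}(0,1)$')
>>> plt.legend(loc='best')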
Boundary Conditions¶
Simple Boundary¶
One of the main focuses of the implementation is the estimation of the density on a bounded domain. As an example, let's try to estimate the KDE of a dataset following a \(\chi^2_2\) distribution. As a reminder, the PDF of this distribution is:
\[p(x) = \frac{1}{2}e^{-\frac{x}{2}}, \quad x \geq 0\]
This distribution is only defined for \(x>0\). So first let’s look at the histogram and the default KDE:
>>> from scipy import stats
>>> from matplotlib import pylab as plt
>>> from pyqt_fit import kde, kde_methods
>>> import numpy as np
>>> chi2 = stats.chi2(2)
>>> x = chi2.rvs(1000)
>>> plt.hist(x, bins=20, range=(0,8), color=(0,.5,0), label='Histogram', normed=True)
>>> est = kde.KDE1D(x)
>>> xs = np.r_[0:8:1024j]
>>> plt.plot(xs, est(xs), label='KDE (bw = {:.3g})'.format(est.bandwidth))
>>> plt.plot(xs, chi2.pdf(xs), 'r--', lw=2, label=r'$\chi^2_2$')
>>> plt.legend(loc='best')
We can see that the estimation is correct far from 0, but closer than about twice the bandwidth, the estimation becomes incorrect. The reason is that the method “sees” there are no points below 0, and therefore assumes the density continuously decreases to reach 0 at slightly negative values. Moreover, if we integrate the KDE over the domain \([0,\infty]\):
>>> from scipy import integrate
>>> integrate.quad(est, 0, np.inf)
(0.9138087148449997, 2.7788548831933142e-09)
we can see the distribution sums up only to about 0.91, instead of 1. In short, we are “losing weight”.
There are a number of ways to take into account the bounded nature of the distribution and correct for this loss. A common one consists of truncating the kernel if it goes below 0. This is called “renormalizing” the kernel. The method can be specified by setting the method attribute of the KDE object to pyqt_fit.kde_methods.renormalization:
>>> est_ren = kde.KDE1D(x, lower=0, method=kde_methods.renormalization)
>>> plt.plot(xs, est_ren(xs), 'm', label=est_ren.method.name)
>>> plt.legend(loc='best')
It can be shown that the convergence at the boundary with the renormalization method is slower than in the rest of the dataset. Another method is a linear approximation of the density toward the boundaries. Being an approximation, the estimate will not sum up to exactly 1. However, it often approximates the density much better:
>>> from pyqt_fit import kde_methods
>>> est_lin = kde.KDE1D(x, lower=0, method=kde_methods.linear_combination)
>>> plt.plot(xs, est_lin(xs), 'y', label=est_lin.method.name)
>>> plt.legend(loc='best')
Reflective Boundary¶
Sometimes, not only do we have a boundary, but we expect the density to be reflective; that is, the derivative at the boundary is 0, and we expect the data to behave the same as if it were repeated by reflection at the boundaries. An example is the distribution of the distance between a 2D point drawn from a 2D gaussian distribution and the center.
First, let’s look at the histogram:
>>> from scipy import stats, integrate
>>> from matplotlib import pylab as plt
>>> from pyqt_fit import kde, kde_methods
>>> import numpy as np
>>> f = stats.norm(loc=0, scale=1)
>>> x = f.rvs(1000)
>>> y = f.rvs(1000)
>>> z = np.abs(x-y)
>>> plt.hist(z, bins=20, facecolor=(0,.5,0), normed=True)
Then, the KDE assuming reflective boundary conditions:
>>> xs = np.r_[0:8:1024j]
>>> est = kde.KDE1D(z, lower=0, method=kde_methods.reflection)
>>> plt.plot(xs, est(xs), color='b', label=est.method.name)
To estimate the “real” distribution, we will increase the number of samples:
>>> xx = f.rvs(1000000)
>>> yy = f.rvs(1000000)
>>> zz = np.abs(xx-yy)
If you try to estimate the KDE directly, it will now be very slow. To speed up the process, you can use the grid method. The grid method will compute the result using a DCT or FFT if possible. It will work only if you don't have a variable bandwidth and the boundary conditions are either reflective, cyclic, or non-existent (i.e. unbounded):
>>> est_large = kde.KDE1D(zz, lower=0, method=kde_methods.reflection)
>>> xxs, yys = est_large.grid()
>>> plt.plot(xxs, yys, 'r--', lw=2, label='Estimated')
>>> plt.xlim(0, 6)
>>> plt.ylim(ymin=0)
>>> plt.legend(loc='best')
Cyclic Boundaries¶
Cyclic boundaries work very much like reflective boundaries. The main difference is that they require two bounds, while reflective conditions can be used with only one bound.
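For example, for angular data on \([0, 2\pi)\), a cyclic KDE could be set up as in the sketch below (it assumes the cyclic method is exposed as pyqt_fit.kde_methods.cyclic):
>>> import numpy as np
>>> from matplotlib import pylab as plt
>>> from pyqt_fit import kde, kde_methods
>>> theta = np.random.vonmises(np.pi, 1, 1000) % (2*np.pi)  # angular data in [0, 2*pi)
>>> est = kde.KDE1D(theta, lower=0, upper=2*np.pi, method=kde_methods.cyclic)  # assumed attribute name
>>> xs = np.r_[0:2*np.pi:512j]
>>> plt.plot(xs, est(xs), label='Cyclic KDE')
>>> plt.legend(loc='best')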
Methods for Bandwidth Estimation¶
There are a number of ways to estimate the bandwidth. The simplest way is to specify it numerically, either during construction or after:
>>> est = kde.KDE1D(x, bandwidth=.1)
>>> est.bandwidth = .2
It is sometimes more convenient to specify the variance of the kernel (which is the square of the bandwidth). So these are equivalent to the two previous lines:
>>> est = kde.KDE1D(x, covariance=.01)
>>> est.covariance = .04
But often you will want to use a pre-defined method:
>>> est = kde.KDE1D(x, covariance = kde.scotts_covariance)
At last, if you want to define your own method, you simply need to define a function. For example, you can re-define and use the function for the Scotts rule of thumb (which computes the variance of the kernel):
>>> def my_scotts(x, model=None):
... return (0.75 * len(x))**-.2 * x.var()
>>> est = kde.KDE1D(x, covariance=my_scotts)
The model object is a reference to the KDE object and is mostly useful if you need to know about the boundaries of the domain.
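As an illustration, here is a small sketch of a bandwidth function that uses the model argument to look at the domain bounds (the rule used here and the reliance on the lower/upper attributes are assumptions for the example, not PyQt-Fit requirements):
>>> import numpy as np
>>> def bounded_rule_of_thumb(x, model=None):
...     bw = (0.75*len(x))**-0.2 * np.std(x)               # rough rule-of-thumb bandwidth
...     if model is not None and np.isfinite(model.lower) and np.isfinite(model.upper):
...         bw = min(bw, (model.upper - model.lower)/10.)  # hypothetical cap based on the domain width
...     return bw*bw                                       # return the variance, as for my_scotts above
>>> est = kde.KDE1D(x, lower=0, covariance=bounded_rule_of_thumb)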
Transformations¶
Sometimes, it is not really possible to estimate the density correctly in the current domain, and a transformation is required. As an example, let's try to estimate a log-normal distribution, i.e. the distribution of a variable whose logarithm is normally distributed:
>>> from scipy import stats
>>> from matplotlib import pylab as plt
>>> from pyqt_fit import kde, kde_methods
>>> import numpy as np
>>> f = stats.lognorm(1)
>>> x = f.rvs(1000)
>>> xs = np.r_[0:10:4096j]
>>> plt.hist(x, bins=20, range=(0,10), color='g', normed=True)
>>> plt.plot(xs, f.pdf(xs), 'r--', lw=2, label='log-normal')
>>> est = kde.KDE1D(x, method=kde_methods.linear_combination, lower=0)
>>> plt.plot(xs, est(xs), color='b', label='KDE')
>>> plt.legend(loc='best')
You can note that even the histogram doesn't reflect the distribution very well here. The linear combination method, although not perfect, also gives a better idea of what is going on. But really, we should be working in log space:
>>> plt.figure()
>>> lx = np.log(x)
>>> h, edges = np.histogram(lx, bins=30, range=(-np.log(30), np.log(10)))
>>> width = np.exp(edges[1:]) - np.exp(edges[:-1])
>>> h = h / width
>>> h /= len(x)
>>> plt.bar(np.exp(edges[:-1]), h, width = width, facecolor='g', linewidth=0, ecolor='b')
>>> plt.plot(xs, f.pdf(xs), 'r--', lw=2, label='log-normal')
>>> plt.xlim(xmax=10)
>>> plt.legend(loc='best')
We can do the same for the KDE by using the transformKDE1D method. This method requires the transformation as argument:
>>> trans = kde.KDE1D(x, method=kde_methods.transformKDE1D(kde_methods.LogTransform))
>>> plt.plot(xs, trans(xs), color='b', lw=2, label='Transformed KDE')
>>> plt.legend(loc='best')
Modules of PyQt-Fit¶
Module pyqt_fit.plot_fit¶
Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>
This module implements functions to test and plot parametric regression.
Analyses of the residuals¶
- pyqt_fit.plot_fit.fit_evaluation(fit, xdata, ydata, eval_points=None, CI=(), CIresults=None, xname='X', yname='Y', fct_desc=None, param_names=(), residuals=None, res_name='Standard')[source]¶
This function takes the output of a curve fitting experiment and stores all the relevant information for evaluating its success in the result.
Parameters: - fit (fitting object) – object configured for the fitting
- xdata (ndarray of shape (N,) or (k,N) for function with k predictors) – The independent variable where the data is measured
- ydata (ndarray) – The dependent data
- eval_points (ndarray or None) – Contain the list of points on which the result must be expressed. It is used both for plotting and for the bootstrapping.
- CI (tuple of int) – List of confidence intervals to calculate. If empty, none are calculated.
- xname (string) – Name of the X axis
- yname (string) – Name of the Y axis
- fct_desc (string) – Formula of the function
- param_names (tuple of strings) – Name of the various parameters
- residuals (callable) – Residual function
- res_desc (string) – Description of the residuals
Return type: ResultStruct
Returns: Data structure summarising the fitting and its evaluation
- pyqt_fit.plot_fit.residual_measures(res)[source]¶
Compute quantities needed to evaluate the quality of the estimation, based solely on the residuals.
Return type: ResidualMeasures Returns: the scaled residuals, their ordering, the theoretical quantile for each residuals, and the expected value for each quantile.
Plotting the residuals¶
- pyqt_fit.plot_fit.plot_dist_residuals(res)[source]¶
Plot the distribution of the residuals.
Returns: the handle toward the histogram and the plot of the fitted normal distribution
- pyqt_fit.plot_fit.plot_residuals(xname, xdata, res_desc, res)[source]¶
Plot the residuals against the X axis
Parameters: - xname (str) – Name of the X axis
- xdata (ndarray) – 1D array with the X data
- res_desc (str) – Name of the Y axis
- res (ndarray) – 1D array with the residuals
The shapes of xdata and res must be the same
Returns: The handles of the plots of the residuals and of the smoothed residuals.
- pyqt_fit.plot_fit.scaled_location_plot(yname, yopt, scaled_res)[source]¶
Plot the scaled location, given the dependent values and scaled residuals.
Parameters: - yname (str) – Name of the Y axis
- yopt (ndarray) – Estimated values
- scaled_res (ndarray) – Scaled residuals
Returns: the handles for the data and the smoothed curve
- pyqt_fit.plot_fit.qqplot(scaled_res, normq)[source]¶
Draw a Q-Q Plot from the sorted, scaled residuals (i.e. residuals sorted and normalized by their standard deviation)
Parameters: - scaled_res (ndarray) – Scaled residuals
- normq (ndarray) – Expected value for each scaled residual, based on its quantile.
Returns: handle to the data plot
- pyqt_fit.plot_fit.plot_residual_tests(xdata, yopts, res, fct_name, xname='X', yname='Y', res_name='residuals', sorted_yopts=None, scaled_res=None, normq=None, fig=None)[source]¶
Plot, in a single figure, all four residuals evaluation plots: plot_residuals(), plot_dist_residuals(), scaled_location_plot() and qqplot().
Parameters: - xdata (ndarray) – Explaining variables
- yopt (ndarray) – Optimized explained variables
- fct_name (str) – Name of the fitted function
- xname (str) – Name of the explaining variables
- yname (str) – Name of the dependent variables
- res_name (str) – Name of the residuals
- sorted_yopts (ndarray) – yopt, sorted to match the scaled residuals
- scaled_res (ndarray) – Scaled residuals
- normq (ndarray) – Estimated value of the quantiles for a normal distribution
- fig (handle or None) – Handle of the figure to put the plots in, or None to create a new figure
Return type: ResTestResult
Returns: The handles to all the plots
General plotting¶
- pyqt_fit.plot_fit.plot1d(result, loc=0, fig=None, res_fig=None)[source]¶
Use matplotlib to display the result of a fit, and return the list of plots used
Return type: Plot1dResult Returns: handles to the various figures and plots
Output to a file¶
- pyqt_fit.plot_fit.write1d(outfile, result, res_desc, CImethod)[source]¶
Write the result of a fitting and its evaluation to a CSV file.
Parameters: - outfile (str) – Name of the file to write to
- result (ResultStruct) – Result of the fitting evaluation (e.g. output of fit_evaluation())
- res_desc (str) – Description of the residuals (in more details than just the name of the residuals)
- CImethod (str) – Description of the confidence interval estimation method
Return types¶
Most functions return a tuple. For easier access, these are named tuples, i.e. tuples whose elements can be accessed by name.
- class pyqt_fit.plot_fit.ResultStruct(...)¶
Note
This is a class created with pyqt_fit.utils.namedtuple().
- fct¶
Fitted function (i.e. result of the fitted function)
- fct_desc¶
Description of the function being fitted
- param_names¶
Name of the parameters fitted
- xdata¶
Explaining variables used for fitting
- ydata¶
Dependent variables observed during experiment
- xname¶
Name of the explaining variables
- yname¶
Name of the dependent variables
- res_name¶
Name of the residuals
- residuals¶
Function used to compute the residuals
- popt¶
Optimal parameters
- res¶
Residuals computed with the parameters popt
- yopts¶
Evaluation of the optimized function on the observed points
- eval_points¶
Points on which the function has been interpolated (may be equal to xdata)
- interpolation¶
Interpolated function on eval_points (may be equal to yopt)
- sorted_yopts¶
Evaluated function for each data points, sorted in increasing residual order
- scaled_res¶
Scaled residuals, ordered by increasing residuals
- normq¶
Expected values for the residuals, based on their quantile
- CI¶
List of confidence intervals evaluated (in percent)
- CIs¶
List of arrays giving the confidence intervals for the dependent variables and for the parameters.
- CIresults¶
Object returned by the confidence interval method
- class pyqt_fit.plot_fit.ResidualMeasures(scaled_res, res_IX, prob, normq)¶
Note
This is a class created with pyqt_fit.utils.namedtuple().
- scaled_res¶
Scaled residuals, sorted
- res_IX¶
Sorting indices for the residuals
- prob¶
Quantiles of the scaled residuals
- normq¶
Expected values of the quantiles for a normal distribution
- class pyqt_fit.plot_fit.ResTestResult(res_figure, residuals, scaled_residuals, qqplot, dist_residuals)¶
Note
This is a class created with pyqt_fit.utils.namedtuple().
- res_figure¶
Handle to the figure
- residuals¶
Handles created by plot_residuals()
- scaled_residuals¶
Handles created by scaled_location_plot()
- dist_residuals¶
Handles created by plot_dist_residuals()
- class pyqt_fit.plot_fit.Plot1dResult(figure, estimate, data, CIs, *ResTestResult)¶
Note
This is a class created with pyqt_fit.utils.namedtuple(). Also, it contains all the fields of ResTestResult at the end of the tuple.
- figure¶
Handle to the figure with the data and fitted curve
- estimate¶
Handle to the fitted curve
- data¶
Handle to the data
- CIs¶
Handles to the confidence interval curves
Module pyqt_fit.curve_fitting¶
Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>
This module specifically implements the curve fitting, wrapping the default scipy.optimize.leastsq function. It allows for fixing parameter values, different kinds of residuals and an added constraints function.
The main class of the module is the CurveFitting class.
- class pyqt_fit.curve_fitting.CurveFitting(xdata, ydata, **kwords)[source]¶
Fit a curve using the scipy.optimize.leastsq() function
Parameters: - xdata (ndarray) – Explaining values
- ydata (ndarray) – Target values
Once fitted, the following variables contain the result of the fitting:
Variables: - popt (ndarray) – The solution (or the result of the last iteration for an unsuccessful call)
- pcov (ndarray) – The estimated covariance of popt. The diagonals provide the variance of the parameter estimate.
- res (ndarray) – Final residuals
- infodict (dict) –
a dictionary of outputs with the keys:
- nfev
- the number of function calls
- fvec
- the function evaluated at the output
- fjac
- A permutation of the R matrix of a QR factorization of the final approximate Jacobian matrix, stored column wise. Together with ipvt, the covariance of the estimate can be approximated.
- ipvt
- an integer array of length N which defines a permutation matrix, p, such that fjac*p = q*r, where r is upper triangular with diagonal elements of nonincreasing magnitude. Column j of p is column ipvt(j) of the identity matrix.
- qtf
- the vector (transpose(q) * fvec)
- CI
- list of tuple of parameters, each being the lower and upper bounds for the confidence interval in the CI argument at the same position.
- est_jacobian
- True if the jacobian is estimated, false if the user-provided functions have been used
Note
In this implementation, residuals are supposed to be a generalisation of the notion of difference. In the end, the mathematical expression of this minimisation is:
\[\hat{\theta} = \argmin_{\theta\in \mathbb{R}^p} \sum_i r(y_i, f(\theta, x_i))^2\]Where \(\theta\) is the vector of \(p\) parameters to optimise, \(r\) is the residual function and \(f\) is the function being fitted.
Module pyqt_fit.bootstrap¶
Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>
This module provides functions for bootstrapping a regression method.
Bootstrap Shuffling Methods¶
- pyqt_fit.bootstrap.bootstrap_residuals(fct, xdata, ydata, repeats=3000, residuals=None, add_residual=None, correct_bias=False, **kwrds)[source]¶
This implements the residual bootstrapping method for non-linear regression.
Parameters: - fct (callable) – Function evaluating the function on xdata at least with fct(xdata)
- xdata (ndarray of shape (N,) or (k,N) for function with k predictors) – The independent variable where the data is measured
- ydata (ndarray) – The dependent data
- residuals (ndarray or callable or None) – Residuals for the estimation on each xdata. If callable, the call will be residuals(ydata, yopt).
- repeats (int) – Number of repeats for the bootstrapping
- add_residual (callable or None) – Function that adds a residual to a value. The call add_residual(yopt, residual) should return the new ydata, with the residuals ‘applied’. If None, it is assumed the residuals should simply be added.
- correct_bias (boolean) – If true, the additive bias of the residuals is computed and restored
- kwrds (dict) – Dictionary present to absorb unknown named parameters
Return type: (ndarray, ndarray)
Returns: 1. xdata, with a new axis at position -2. This corresponds to the ‘shuffled’ xdata (even though they are not actually shuffled here)
2. The shuffled ydata. There is one line per repeat, and each line is shuffled independently.
- pyqt_fit.bootstrap.bootstrap_regression(fct, xdata, ydata, repeats=3000, **kwrds)[source]¶
This implements the shuffling of standard bootstrapping method for non-linear regression.
Parameters: - fct (callable) – This is the function to optimize
- xdata (ndarray of shape (N,) or (k,N) for function with k predictors) – The independent variable where the data is measured
- ydata (ndarray) – The dependent data
- repeats (int) – Number of repeats for the bootstrapping
- kwrds (dict) – Dictionary to absorb unknown named parameters
Return type: (ndarray, ndarray)
Returns: 1. The shuffled x data. The axis -2 has one element per repeat, and the other axes are shuffled independently.
2. The shuffled ydata. There is a line per repeat, each line is shuffled independently.
Main Bootstrap Functions¶
- pyqt_fit.bootstrap.bootstrap(fit, xdata, ydata, CI, shuffle_method=bootstrap_residuals, shuffle_args=(), shuffle_kwrds={}, repeats=3000, eval_points=None, full_results=False, nb_workers=None, extra_attrs=(), fit_args=(), fit_kwrds={})[source]¶
This function implements the bootstrap algorithm for a regression algorithm. It is capable of spreading the load across many threads using shared memory and the multiprocess module.
Parameters: - fit (callable) –
Method used to compute regression. The call is:
f = fit(xdata, ydata, *fit_args, **fit_kwrds)
Fit should return an object that would evaluate the regression on a set of points. The next call will be:
f(eval_points)
- xdata (ndarray of shape (N,) or (k,N) for function with k predictors) – The independent variable where the data is measured
- ydata (ndarray) – The dependent data
- CI (tuple of float) – List of percentiles to extract
- shuffle_method (callable) –
Create shuffled dataset. The call is:
shuffle_method(xdata, ydata, y_est, repeat=repeats, *shuffle_args, **shuffle_kwrds)
where y_est is the estimated dependent variable on the xdata.
- shuffle_args (tuple) – List of arguments for the shuffle method
- shuffle_kwrds (dict) – Dictionnary of arguments for the shuffle method
- repeats (int) – Number of repeats for the bootstrapping
- eval_points (ndarray or None) – List of points to evaluate. If None, eval_points is xdata.
- full_results (bool) – if True, output also the whole set of evaluations
- nb_workers – Number of worker threads. If None, the number of detected CPUs will be used. If 1 or less, a single thread will be used.
- extra_attrs (tuple of str) – List of attributes of the fitting method to extract on top of the y values for confidence intervals
- fit_args (tuple) – List of extra arguments for the fit callable
- fit_kwrds (dict) – Dictionnary of extra named arguments for the fit callable
Return type: BootstrapResult
Returns: Estimated y on the data, on the evaluation points, the requested confidence intervals and, if requested, the shuffled X, Y and the full estimated distributions.
- class pyqt_fit.bootstrap.BootstrapResult(y_fit, y_est, y_eval, CIs, shuffled_xs, shuffled_ys, full_results)¶
Note
This is a class created with pyqt_fit.utils.namedtuple().
- y_fit¶
Estimator object, fitted on the original data (type: fun(xs) -> ys)
- y_est¶
Y estimated on xdata (type: ndarray)
- eval_points¶
Points on which the confidence interval are evaluated
- y_eval¶
Y estimated on eval_points
- CIs_val¶
Tuple containing the list of percentiles extracted (i.e. this is a copy of the CIs argument of the bootstrap function).
- CIs¶
List of confidence intervals. The first element is for the estimated values on eval_points. The others are for the extra attributes specified in extra_attrs. Each array is a 3-dimensional array (Q,2,N), where Q is the number of confidence intervals (i.e. the length of CIs_val) and N is the number of data points. Values (x,0,y) give the lower bounds and (x,1,y) the upper bounds of the confidence intervals.
- shuffled_xs¶
if full_results is True, the shuffled x’s used for the bootstrapping
- shuffled_ys¶
if full_results is True, the shuffled y’s used for the bootstrapping
- full_results¶
if full_results is True, the estimated y’s for each shuffled_ys
Module pyqt_fit.nonparam_regression¶
Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>
Module implementing non-parametric regressions using kernel methods.
- class pyqt_fit.nonparam_regression.NonParamRegression(xdata, ydata, **kwords)[source]¶
Class performing kernel-based non-parametric regression.
The calculation is split in three parts:
- The kernel (kernel)
- Bandwidth computation (bandwidth, covariance)
- Regression method (method)
- bandwidth[source]¶
Bandwidth of the kernel.
This is defined as the square root of the covariance matrix
- fitted_method[source]¶
Method actually used after fitting.
The main method may choose to provide a more tuned method during fitting.
- kernel[source]¶
Kernel object. Should provide the following methods:
- kernel.pdf(xs)
- Density of the kernel, denoted \(K(x)\)
- kernel_type[source]¶
Type of the kernel. The kernel type is a class or function accepting the dimension of the domain as argument and returning a valid kernel object.
- method[source]¶
Regression method itself. It should be an instance of the class following the template pyqt_fit.npr_methods.RegressionKernelMethod.
Module pyqt_fit.npr_methods¶
Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>
Module implementing non-parametric regressions using kernel methods.
Non-Parametric Regression Methods¶
Methods must either inherit from or follow the same definition as pyqt_fit.npr_methods.RegressionKernelMethod.
- pyqt_fit.npr_methods.compute_bandwidth(reg)[source]¶
Compute the bandwidth and covariance for the model, based on its xdata attribute
- class pyqt_fit.npr_methods.RegressionKernelMethod[source]¶
Base class for regression kernel methods
The following methods are interface methods that should be overridden with ones specific to the implemented method.
- fit(reg)[source]¶
Fit the method and returns the fitted object that will be used for actual evaluation.
The object needs to call the pyqt_fit.nonparam_regression.NonParamRegression.set_actual_bandwidth() method with the computed bandwidth and covariance.
Default: Compute the bandwidth based on the real data and set it in the regression object
- evaluate(points, out)[source]¶
Evaluate the regression on the provided points.
Parameters: - points (ndarray) – 2d-array of points to compute the regression on. Each column is a point.
- out (ndarray) – 1d-array in which to store the result
Return type: ndarray
Returns: The method must return the out array, updated with the regression values
Provided methods¶
Only extra methods will be described:
- class pyqt_fit.npr_methods.SpatialAverage[source]¶
Perform a Nadaraya-Watson regression on the data (i.e. also called local-constant regression) using a gaussian kernel.
The Nadaraya-Watson estimate is given by:
\[f_n(x) \triangleq \frac{\sum_i K\left(\frac{x-X_i}{h}\right) Y_i} {\sum_i K\left(\frac{x-X_i}{h}\right)}\]Where \(K(x)\) is the kernel and must be such that \(E(K(x)) = 0\) and \(h\) is the bandwidth of the method.
Parameters: - xdata (ndarray) – Explaining variables (at most 2D array)
- ydata (ndarray) – Explained variables (should be 1D array)
- cov (ndarray or callable) – If an ndarray, it should be a 2D array giving the matrix of covariance of the gaussian kernel. Otherwise, it should be a function cov(xdata, ydata) returning the covariance matrix.
- class pyqt_fit.npr_methods.LocalLinearKernel1D[source]¶
Perform a local-linear regression using a gaussian kernel.
The local-linear regression is the function that minimises, for each position:
\[f_n(x) \triangleq \argmin_{a_0\in\mathbb{R}} \sum_i K\left(\frac{x-X_i}{h}\right) \left(Y_i - a_0 - a_1(x-X_i)\right)^2\]Where \(K(x)\) is the kernel and must be such that \(E(K(x)) = 0\) and \(h\) is the bandwidth of the method.
This class uses the following function:
- pyqt_fit.py_local_linear.local_linear_1d(bw, xdata, ydata, points, kernel, out)[source]¶
We are trying to find the fitting for points \(x\) given a gaussian kernel. Given the following definitions:
\[\begin{split}x_0 &=& x-x_i\end{split}\]\[\begin{split}\begin{array}{rlc|rlc} w_i &=& \mathcal{K}\left(\frac{x_0}{h}\right) & W &=& \sum_i w_i \\ X &=& \sum_i w_i x_0 & X_2 &=& \sum_i w_i x_0^2 \\ Y &=& \sum_i w_i y_i & Y_2 &=& \sum_i w_i y_i x_0 \end{array}\end{split}\]The fitted value is given by:
\[f(x) = \frac{X_2 Y - X Y_2}{W X_2 - X^2}\]
- class pyqt_fit.npr_methods.LocalPolynomialKernel1D(q=3)[source]¶
Perform a local-polynomial regression using a user-provided kernel (Gaussian by default).
The local-polynomial regression is the function that minimises, for each position:
\[f_n(x) \triangleq \argmin_{a_0\in\mathbb{R}} \sum_i K\left(\frac{x-X_i}{h}\right) \left(Y_i - a_0 - a_1(x-X_i) - \ldots - a_q \frac{(x-X_i)^q}{q!}\right)^2\]Where \(K(x)\) is the kernel such that \(E(K(x)) = 0\), \(q\) is the order of the fitted polynomial and \(h\) is the bandwidth of the method. It is also recommended to have \(\int_\mathbb{R} x^2K(x)dx = 1\), (i.e. variance of the kernel is 1) or the effective bandwidth will be scaled by the square-root of this integral (i.e. the standard deviation of the kernel).
Parameters: - xdata (ndarray) – Explaining variables (at most 2D array)
- ydata (ndarray) – Explained variables (should be 1D array)
- q (int) – Order of the polynomial to fit. Default: 3
- cov (float or callable) – If a float, it should be the variance of the gaussian kernel. Otherwise, it should be a function cov(xdata, ydata) returning the variance. Default: scotts_covariance
- class pyqt_fit.npr_methods.LocalPolynomialKernel(q=3)[source]¶
Perform a local-polynomial regression in N-D using a user-provided kernel (Gaussian by default).
The local-polynomial regression is the function that minimises, for each position:
\[f_n(x) \triangleq \argmin_{a_0\in\mathbb{R}} \sum_i K\left(\frac{x-X_i}{h}\right) \left(Y_i - a_0 - \mathcal{P}_q(X_i-x)\right)^2\]Where \(K(x)\) is the kernel such that \(E(K(x)) = 0\), \(q\) is the order of the fitted polynomial, \(\mathcal{P}_q(x)\) is a polynomial of order \(q\) in \(x\) and \(h\) is the bandwidth of the method.
The polynomial \(\mathcal{P}_q(x)\) is of the form:
\[\mathcal{F}_d(k) = \left\{ n \in \mathbb{N}^d \middle| \sum_{i=1}^d n_i = k \right\}\]\[\mathcal{P}_q(x_1,\ldots,x_d) = \sum_{k=1}^q \sum_{n\in\mathcal{F}_d(k)} a_{k,n} \prod_{i=1}^d x_i^{n_i}\]For example we have:
\[\mathcal{P}_2(x,y) = a_{110} x + a_{101} y + a_{220} x^2 + a_{211} xy + a_{202} y^2\]Parameters: - xdata (ndarray) – Explaining variables (at most 2D array). The shape should be (N,D) with D the dimension of the problem and N the number of points. For 1D array, the shape can be (N,), in which case it will be converted to (N,1) array.
- ydata (ndarray) – Explained variables (should be 1D array). The shape must be (N,).
- q (int) – Order of the polynomial to fit. Default: 3
- kernel (callable) – Kernel to use for the weights. Call is kernel(points) and should return an array of values the same size as points. If None, the kernel will be normal_kernel(D).
- cov (float or callable) – If a float, it should be the variance of the gaussian kernel. Otherwise, it should be a function cov(xdata, ydata) returning the variance. Default: scotts_covariance
- pyqt_fit.npr_methods.default_method¶
Default non-parametric regression method. :Default: LocalPolynomialKernel(q=1)
Utility functions and classes¶
- class pyqt_fit.npr_methods.PolynomialDesignMatrix(dim, deg)[source]¶
Class used to create a design matrix for polynomial regression
- __call__(x, out=None)[source]¶
Creates the design matrix for polynomial fitting using the points x.
Parameters: - x (ndarray) – Points to create the design matrix for. Shape must be (D,N) or (N,), where D is the dimension of the problem (D is 1 if the array is 1D).
- deg (int) – Degree of the fitting polynomial
- factors (ndarray) – Scaling factor for the columns of the design matrix. The shape should be (M,) or (M,1), where M is the number of columns of out. This value can be obtained using the designMatrixSize() function.
Returns: The design matrix as a (M,N) matrix.
Module pyqt_fit.kde¶
Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>
Module implementing kernel-based estimation of density of probability.
Given a kernel \(K\), the density function is estimated from a sampling \(X = \{X_i \in \mathbb{R}^n\}_{i\in\{1,\ldots,m\}}\) as:
\[f(x) \triangleq \frac{1}{\sum_i w_i} \sum_i \frac{w_i}{(\lambda_i h)^n} K\left(\frac{x - X_i}{\lambda_i h}\right)\]
where \(h\) is the bandwidth of the kernel, \(w_i\) are the weights of the data points and \(\lambda_i\) are the adaptation factor of the kernel width.
The kernel is a function of \(\mathbb{R}^n\) such that:
\[\begin{split}\begin{array}{rcl} \int_{\mathbb{R}^n} K(z) dz &=& 1 \\ \int_{\mathbb{R}^n} zK(z) dz &=& 0 \\ \int_{\mathbb{R}^n} zz^T K(z) dz &=& \mathbb{I}_n \end{array}\end{split}\]
The constraint on the covariance is only required to provide a uniform meaning for the bandwidth of the kernel.
If the domain of the density estimation is bounded to the interval \([L,U]\), the density is then estimated with:
\[f(x) \triangleq \frac{1}{\sum_i w_i} \sum_i \frac{w_i}{\lambda_i h} \hat{K}\left(x; X_i, \lambda_i h, L, U\right)\]
where \(\hat{K}\) is a modified kernel that depends on the exact method used. Currently, only 1D KDE supports bounded domains.
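To make the unbounded estimate concrete, here is a plain 1D numpy sketch with uniform weights and no per-point adaptation (\(w_i = 1\), \(\lambda_i = 1\)); it illustrates the formula, not the module's implementation:
import numpy as np

def kde_1d(points, xdata, h):
    # gaussian kernel evaluated for every pair (evaluation point, data point)
    z = (points[:, None] - xdata[None, :]) / h
    K = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return K.mean(axis=1) / h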
Kernel Density Estimation Methods¶
- class pyqt_fit.kde.KDE1D(xdata, **kwords)[source]¶
Perform a kernel based density estimation in 1D, possibly on a bounded domain \([L,U]\).
Parameters: - data (ndarray) – 1D array with the data points
- kwords (dict) –
setting attributes at construction time. Any named argument will be equivalent to setting the property after the fact. For example:
>>> xs = [1,2,3]
>>> k = KDE1D(xs, lower=0)
will be equivalent to:
>>> k = KDE1D(xs)
>>> k.lower = 0
The calculation is separated in three parts:
- The kernel (kernel)
- The bandwidth or covariance estimation (bandwidth, covariance)
- The estimation method (method)
- bandwidth[source]¶
Bandwidth of the kernel. Can be set either as a fixed value or using a bandwidth calculator, that is a function of signature w(xdata) that returns a single value.
Note
A ndarray with a single value will be converted to a floating point value.
- cdf_grid(N=None, cut=None)[source]¶
Compute the cdf from the lower bound to the points given as argument.
- covariance[source]¶
Covariance of the gaussian kernel. Can be set either as a fixed value or using a bandwidth calculator, that is a function of signature w(xdata) that returns a single value.
Note
A ndarray with a single value will be converted to a floating point value.
- grid(N=None, cut=None)[source]¶
Evaluate the density on a grid of N points spanning the whole dataset.
Returns: a tuple with the mesh on which the density is evaluated and the density itself
- icdf_grid(N=None, cut=None)[source]¶
Compute the inverse cumulative distribution (quantile) function on a grid.
- kernel[source]¶
Kernel object. This must be an object modeled on pyqt_fit.kernels.Kernel1D. It is recommended to inherit this class to provide numerical approximation for all methods.
By default, the kernel is an instance of pyqt_fit.kernels.normal_kernel1d
- lambdas[source]¶
Scaling of the bandwidth, per data point. It can be either a single value or an array with one value per data point.
When deleted, the lambdas are reset to 1.
- method[source]¶
Select the method to use. The method should be an object modeled on pyqt_fit.kde_methods.KDE1DMethod, and it is recommended to inherit the model.
The available methods are listed in the pyqt_fit.kde_methods sub-module.
Default: pyqt_fit.kde_methods.default_method
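Putting the pieces together, a typical session could look like the following (the data set is made up; only attributes and methods documented above are used):
>>> import numpy as np
>>> from pyqt_fit import kde
>>> data = np.random.exponential(1., 1000)   # positive data, so the domain is bounded at 0
>>> k = kde.KDE1D(data, lower=0)
>>> xs, ys = k.grid()                        # mesh and estimated density on that mesh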
Bandwidth Estimation Methods¶
- pyqt_fit.kde.variance_bandwidth(factor, xdata)¶
Returns the covariance matrix:
\[\mathcal{C} = \tau^2 cov(X)\]where \(\tau\) is a correcting factor that depends on the method.
- pyqt_fit.kde.silverman_covariance(xdata, model=None)¶
The Silverman bandwidth is defined as a variance bandwidth with factor:
\[\tau = \left( n \frac{d+2}{4} \right)^\frac{-1}{d+4}\]
- pyqt_fit.kde.scotts_covariance(xdata, model=None)¶
The Scotts bandwidth is defined as a variance bandwidth with factor:
\[\tau = n^\frac{-1}{d+4}\]
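As a quick numerical check of the two factors (here for a 1D sample, so \(d = 1\)):
>>> import numpy as np
>>> xdata = np.random.randn(1000)
>>> n, d = xdata.shape[0], 1
>>> tau_scott = n ** (-1. / (d + 4))
>>> tau_silverman = (n * (d + 2) / 4.) ** (-1. / (d + 4))
>>> C = tau_scott ** 2 * np.cov(xdata)       # covariance used by the gaussian kernel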
- pyqt_fit.kde.botev_bandwidth(N=None, **kword)¶
Implementation of the KDE bandwidth selection method outlined in:
Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density estimation via diffusion. The Annals of Statistics, 38(5):2916-2957, 2010.
Based on the implementation of Daniel B. Smith, PhD.
The object is a callable returning the bandwidth for a 1D kernel.
Module pyqt_fit.kde_methods¶
Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>
This module contains a set of methods to compute univariate KDEs. See the objects in the pyqt_fit.kde module for more details on these methods.
These methods provide various variations on \(\hat{K}(x;X,h,L,U)\), the modified kernel evaluated on the point \(x\) based on the estimation points \(X\), a bandwidth \(h\) and on the domain \([L,U]\).
The definitions of the methods rely on the following quantities:
\[a_0(l,u) = \int_l^u K(z) dz \qquad a_1(l,u) = \int_l^u zK(z) dz \qquad a_2(l,u) = \int_l^u z^2K(z) dz\]
These definitions correspond to:
- \(a_0(l,u)\) – The partial cumulative distribution function
- \(a_1(l,u)\) – The partial first moment of the distribution. In particular, \(a_1(-\infty, \infty)\) is the mean of the kernel (and should be 0).
- \(a_2(l,u)\) – The partial second moment of the distribution. In particular, \(a_2(-\infty, \infty)\) is the variance of the kernel (which should be close to 1, unless a higher order kernel is used).
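These quantities can be checked numerically, e.g. for the normal kernel (a standalone sketch using scipy, not part of PyQt-Fit):
>>> import numpy as np
>>> from scipy import stats, integrate
>>> K = stats.norm.pdf
>>> a0 = lambda l, u: integrate.quad(K, l, u)[0]                       # partial cumulative distribution
>>> a1 = lambda l, u: integrate.quad(lambda z: z * K(z), l, u)[0]      # partial first moment
>>> a2 = lambda l, u: integrate.quad(lambda z: z * z * K(z), l, u)[0]  # partial second moment
>>> a0(-np.inf, np.inf), a1(-np.inf, np.inf), a2(-np.inf, np.inf)      # approximately (1, 0, 1)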
References:¶
[1] | (1, 2) Jones, M. C. 1993. Simple boundary correction for kernel density estimation. Statistics and Computing 3: 135–146. |
Univariate KDE estimation methods¶
The exact definition of such a method is found in pyqt_fit.kde.KDE1D.method
- pyqt_fit.kde_methods.generate_grid(kde, N=None, cut=None)[source]¶
Helper method returning a regular grid on the domain of the KDE.
Parameters: - kde (KDE1D) – Object describing the KDE computation. The object must have been fitted!
- N (int) – Number of points in the grid
- cut (float) – for unbounded domains, how far past the maximum the grid should extend, in terms of KDE bandwidth
Returns: A vector of N regularly spaced points
- pyqt_fit.kde_methods.compute_bandwidth(kde)[source]¶
Compute the bandwidth and covariance for the model, based on its xdata attribute
- class pyqt_fit.kde_methods.KDE1DMethod[source]¶
Base class providing a default grid method and a default method for unbounded evaluation of the PDF and CDF. It also provides default methods for the other metrics, based on PDF and CDF calculations.
Note: - It is expected that all grid methods will return the same grid if used with the same arguments.
- It is fair to assume all array-like arguments will be at least 1D arrays.
The following methods are interface methods that should be overridden with ones specific to the implemented method.
- fit(kde)[source]¶
Method called by the KDE1D object right after fitting to allow for one-time calculation.
Parameters: kde (pyqt_fit.kde.KDE1D) – KDE object being fitted Default: Compute the bandwidth and covariance if specified as functions
- pdf(kde, points, out)[source]¶
Compute the PDF of the estimated distribution.
Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- points (ndarray) – Points to evaluate the distribution on
- out (ndarray) – Result object. It must have the same shape as points
Return type: ndarray
Returns: Returns the out variable, updated with the PDF.
Default: Direct implementation of the formula for unbounded pdf computation.
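As a rough, simplified sketch of such a direct implementation (uniform weights, no bandwidth adaptation, points assumed to be a 1D ndarray; the attribute names kde.xdata, kde.bandwidth and kde.kernel follow the KDE1D documentation above, but this is not the library's actual code):
def unbounded_pdf(kde, points, out):
    # evaluate the kernel estimate directly on the requested points
    z = (points[:, None] - kde.xdata[None, :]) / kde.bandwidth
    out[...] = kde.kernel.pdf(z).mean(axis=1) / kde.bandwidth
    return out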
- grid(kde, N=None, cut=None)[source]¶
Evaluate the PDF of the distribution on a regular grid with at least N elements.
Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- N (int) – minimum number of elements in the returned grid. Most methods will want to round it to the next power of 2.
- cut (float) – for unbounded domains, how far from the last data point the grid should go, as a fraction of the bandwidth.
Return type: (ndarray, ndarray)
Returns: The array of positions the PDF has been estimated on, and the estimations.
Default: Evaluate \(pdf(x)\) on a grid generated using generate_grid()
- cdf(kde, points, out)[source]¶
Compute the CDF of the estimated distribution, defined as:
\[cdf(x) = P(X \leq x) = \int_l^x p(t) dt\]where \(l\) is the lower bound of the distribution domain and \(p\) the density of probability
Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- points (ndarray) – Points to evaluate the CDF on
- out (ndarray) – Result object. It must have the same shape as points
Return type: ndarray
Returns: Returns the out variable, updated with the CDF.
Default: Direct implementation of the formula for unbounded CDF computation.
- cdf_grid(kde, N=None, cut=None)[source]¶
Evaluate the CDF of the distribution on a regular grid with at least N elements.
Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- N (int) – minimum number of elements in the returned grid. Most methods will want to round it to the next power of 2.
- cut (float) – for unbounded domains, how far from the last data point the grid should go, as a fraction of the bandwidth.
Return type: (ndarray, ndarray)
Returns: The array of positions the CDF has been estimated on, and the estimations.
Default: Evaluate \(cdf(x)\) on a grid generated using generate_grid()
- icdf(kde, points, out)[source]¶
Compute the inverse cumulative distribution (quantile) function, defined as:
\[icdf(p) = \inf\left\{x\in\mathbb{R} : cdf(x) \geq p\right\}\]Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- points (ndarray) – Points to evaluate the iCDF on
- out (ndarray) – Result object. It must have the same shape as points
Return type: ndarray
Returns: Returns the out variable, updated with the iCDF.
Default: First approximate the result using linear interpolation on the CDF and refine the result numerically using the Newton method.
- icdf_grid(kde, N=None, cut=None)[source]¶
Compute the inverse cumulative distribution (quantile) function on a grid.
Note: The default implementation is not as good an approximation as the plain icdf default method.
Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- N (int) – minimum number of elements in the returned grid. Most methods will want to round it to the next power of 2.
- cut (float) – for unbounded domains, how far from the last data point the grid should go, as a fraction of the bandwidth.
Return type: (ndarray, ndarray)
Returns: The array of positions the inverse CDF has been estimated on, and the estimations.
Default: Linear interpolation of the inverse CDF on a grid
- sf(kde, points, out)[source]¶
Compute the survival function, defined as:
\[sf(x) = P(X \geq x) = \int_x^u p(t) dt = 1 - cdf(x)\]Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- points (ndarray) – Points to evaluate the survival function on
- out (ndarray) – Result object. It must have the same shape as points
Return type: ndarray
Returns: Returns the out variable, updated with the survival function.
Default: Compute explicitly \(1 - cdf(x)\)
- sf_grid(kde, N=None, cut=None)[source]¶
Compute the survival function on a grid.
Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- N (int) – minimum number of elements in the returned grid. Most methods will want to round it to the next power of 2.
- cut (float) – for unbounded domains, how far from the last data point the grid should go, as a fraction of the bandwidth.
Return type: (ndarray, ndarray)
Returns: The array of positions the survival function has been estimated on, and the estimations.
Default: Compute explicitly \(1 - cdf(x)\)
- isf(kde, points, out)[source]¶
Compute the inverse survival function, defined as:
\[isf(p) = \sup\left\{x\in\mathbb{R} : sf(x) \geq p\right\}\]Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- points (ndarray) – Points to evaluate the iSF on
- out (ndarray) – Result object. It must have the same shape as points
Return type: ndarray
Returns: Returns the out variable, updated with the inverse survival function.
Default: Compute \(icdf(1-p)\)
- isf_grid(kde, N=None, cut=None)[source]¶
Compute the inverse survival function on a grid.
Note: The default implementation is not as good an approximation as the plain isf default method.
Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- N (int) – minimum number of elements in the returned grid. Most methods will want to round it to the next power of 2.
- cut (float) – for unbounded domains, how far from the last data point the grid should go, as a fraction of the bandwidth.
Return type: (ndarray, ndarray)
Returns: The array of positions the inverse survival function has been estimated on, and the estimations.
Default: Linear interpolation of the inverse survival function on a grid
- hazard(kde, points, out)[source]¶
Compute the hazard function evaluated on the points.
The hazard function is defined as:
\[h(x) = \frac{p(x)}{sf(x)}\]where \(p(x)\) is the probability density function and \(sf(x)\) is the survival function.
Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- points (ndarray) – Points to evaluate the hazard function on
- out (ndarray) – Result object. It must have the same shape as points
Return type: ndarray
Returns: Returns the out variable, updated with the hazard function
Default: Compute explicitly \(pdf(x) / sf(x)\)
- hazard_grid(kde, N=None, cut=None)[source]¶
Compute the hazard function on a grid.
Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- N (int) – minimum number of elements in the returned grid. Most methods will want to round it to the next power of 2.
- cut (float) – for unbounded domains, how far from the last data point the grid should go, as a fraction of the bandwidth.
Return type: (ndarray, ndarray)
Returns: The array of positions the hazard function has been estimated on, and the estimations.
Default: Compute explicitly \(pdf(x) / sf(x)\)
- cumhazard(kde, points, out)[source]¶
Compute the cumulative hazard function evaluated on the points.
The cumulative hazard function is defined as:
\[ch(x) = \int_l^x h(t) dt = -\ln sf(x)\]where \(l\) is the lower bound of the domain, \(h\) the hazard function and \(sf\) the survival function.
Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- points (ndarray) – Points to evaluate the cumulative hazard function on
- out (ndarray) – Result object. It must have the same shape as points
Return type: ndarray
Returns: Returns the out variable, updated with the cumulative hazard function
Default: Compute explicitly \(-\ln sf(x)\)
- cumhazard_grid(kde, N=None, cut=None)[source]¶
Compute the cumulative hazard function on a grid.
Parameters: - kde (pyqt_fit.kde.KDE1D) – KDE object
- N (int) – minimum number of elements in the returned grid. Most methods will want to round it to the next power of 2.
- cut (float) – for unbounded domains, how far from the last data point the grid should go, as a fraction of the bandwidth.
Return type: (ndarray, ndarray)
Returns: The array of positions the cumulative hazard function has been estimated on, and the estimations.
Default: Compute explicitly \(-\ln sf(x)\)
- name¶
Type: str Specify a human-readable name for the method, for presentation purposes.
The class also provides a number of utility methods to help you implement your own.
Estimation methods¶
Here are the methods implemented in pyqt_fit. The simplest way to access them is to use the instances provided below.
- pyqt_fit.kde_methods.unbounded¶
Instance of the KDE1DMethod class.
- pyqt_fit.kde_methods.renormalization¶
Instance of the RenormalizationMethod class.
- pyqt_fit.kde_methods.reflection¶
Instance of the ReflectionMethod class.
- pyqt_fit.kde_methods.linear_combination¶
Instance of the LinearCombinationMethod class.
- pyqt_fit.kde_methods.cyclic¶
Instance of the CyclicMethod class.
- pyqt_fit.kde_methods.transformKDE1D(trans, method=None, inv=None, Dinv=None)[source]¶
Creates an instance of TransformKDE1DMethod
- pyqt_fit.kde_methods.default_method¶
Method used by pyqt_fit.kde.KDE1D by default. :Default: reflection
Classes implementing the estimation methods¶
- class pyqt_fit.kde_methods.RenormalizationMethod[source]¶
This method consists in using the normal kernel method, but renormalizing it to take into account only the part of the kernel within the domain of the density [1].
The kernel is then replaced with:
\[\hat{K}(x;X,h,L,U) \triangleq \frac{1}{a_0(l,u)} K(z)\]where:
\[z = \frac{x-X}{h} \qquad l = \frac{L-x}{h} \qquad u = \frac{U-x}{h}\]
- class pyqt_fit.kde_methods.ReflectionMethod[source]¶
This method consists in simulating the reflection of the data left and right of the boundaries. If one of the boundaries is infinite, then the data is not reflected in that direction. To this purpose, the kernel is replaced with:
\[\hat{K}(x; X, h, L, U) \triangleq K(z) + K\left(\frac{x+X-2L}{h}\right) + K\left(\frac{x+X-2U}{h}\right)\]where:
\[z = \frac{x-X}{h}\]See the pyqt_fit.kde_methods module introduction for a description of the various symbols.
When computing grids, if the bandwidth is constant, the result is computed using the DCT (Discrete Cosine Transform).
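A plain numpy illustration of the reflected kernel for data bounded on \([L,U]\), using a gaussian kernel (not the class's actual code), could be:
import numpy as np

def reflected_kde_1d(points, xdata, h, L, U):
    K = lambda z: np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    x, X = points[:, None], xdata[None, :]
    # original kernel plus its reflections about the lower and upper boundaries
    val = K((x - X) / h) + K((x + X - 2 * L) / h) + K((x + X - 2 * U) / h)
    return val.mean(axis=1) / h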
- class pyqt_fit.kde_methods.LinearCombinationMethod[source]¶
This method uses the linear combination correction published in [1].
The estimation is done with a modified kernel given by:
\[\hat{K}(x;X,h,L,U) \triangleq \frac{a_2(l,u) - a_1(-u, -l) z}{a_2(l,u)a_0(l,u) - a_1(-u,-l)^2} K(z)\]where:
\[z = \frac{x-X}{h} \qquad l = \frac{L-x}{h} \qquad u = \frac{U-x}{h}\]
- class pyqt_fit.kde_methods.CyclicMethod[source]¶
This method assumes cyclic boundary conditions and works only for closed boundaries.
The estimation is done with a modified kernel given by:
\[\hat{K}(x; X, h, L, U) \triangleq K(z) + K\left(z - \frac{U-L}{h}\right) + K\left(z + \frac{U-L}{h}\right)\]where:
\[z = \frac{x-X}{h}\]When computing grids, if the bandwidth is constant, the result is computed using the FFT.
- pyqt_fit.kde_methods.create_transform(obj, inv=None, Dinv=None)[source]¶
Create a transform object.
Parameters: - obj (fun) – This can be either a simple function, or a function object with ‘inv’ and/or ‘Dinv’ attributes containing the inverse function and its derivative (respectively)
- inv (fun) – If provided, inverse of the main function
- Dinv (fun) – If provided, derivative of the inverse function
Return type: Transform
Returns: A transform object with function, inverse and derivative of the inverse
The inverse function must be provided, either as argument or as attribute to the object. The derivative of the inverse will be estimated numerically if not provided.
Note: All the functions should accept an out argument to store the result.
- class pyqt_fit.kde_methods.TransformKDE1DMethod(trans, method=None, inv=None, Dinv=None)[source]¶
Compute the Kernel Density Estimate of a dataset, transforming it first to a domain where distances are “more meaningful”.
Often, KDE is best estimated in a different domain. This object takes a KDE1D object (or one compatible), and a transformation function.
Given a random variable \(X\) of distribution \(f_X\), the random variable \(Y = g(X)\) has a distribution \(f_Y\) given by:
\[f_Y(y) = \left| \frac{1}{g'(g^{-1}(y))} \right| \cdot f_X(g^{-1}(y))\]In these terms, \(Y\) is the random variable the user is interested in, and \(X\) the random variable we can estimate using the KDE. In this case, \(g\) is the transform from \(Y\) to \(X\).
So to estimate the distribution on a set of points given in \(x\), we need a total of three functions:
- Direct function: transform from the original space to the one in which the KDE will be performed (i.e. \(g^{-1}: y \mapsto x\))
- Inverse function: transform from the KDE space to the original one (i.e. \(g: x \mapsto y\))
- Derivative of the inverse function
If the derivative is not provided, it will be estimated numerically.
Parameters: - trans – Either a simple function, or a function object with attributes inv and Dinv to use in case they are not provided as arguments. The helper create_transform() will provide numeric approximation of the derivative if required.
- method – instance of KDE1DMethod used in the transformed domain. Default is pyqt_fit.kde_methods.KDE1DMethod
- inv – Inverse of the function. If not provided, trans must have it as an attribute.
- Dinv – Derivative of the inverse function.
Note: all given functions should accept an optional out argument providing a pre-allocated array to store their result. Also, the out parameter may be one of the input arguments.
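The change-of-variable formula can be illustrated directly, e.g. with a log transform for positive data (a standalone numpy sketch, not the class's implementation): here \(g(X) = e^X\), so \(g^{-1}(y) = \log y\) and the correction factor is \(1/y\).
import numpy as np

def log_transform_kde(y_points, ydata, h):
    # estimate the density of X = log(Y) with a gaussian kernel, then map it back to Y
    xdata, x = np.log(ydata), np.log(y_points)
    z = (x[:, None] - xdata[None, :]) / h
    f_X = (np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)).mean(axis=1) / h
    return f_X / y_points           # |d log(y) / dy| = 1/y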
Module pyqt_fit.kernels¶
Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>
Module providing a set of kernels for use with either the pyqt_fit.kde or the kernel_smoothing module.
Kernels should be created following this template:
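The template itself did not survive extraction; based on the Kernel1D interface described below, a new kernel is expected to be a class shaped roughly like this (attribute and method names follow Kernel1D and the bounded kernels; treat this as a sketch, not the exact original template):
import numpy as np

class MyKernel(object):
    """1D kernel following the Kernel1D interface."""
    cut = 3.              # point beyond which the PDF is negligible
    lower = -np.inf       # lower bound of the support
    upper = np.inf        # upper bound of the support

    def pdf(self, z, out=None):
        """Density K(z); must accept arrays of any shape."""
        raise NotImplementedError

    def cdf(self, z, out=None):
        """Cumulative distribution of the kernel."""
        raise NotImplementedError

    def pm1(self, z, out=None):
        """Partial first moment, used by the bounded estimation methods."""
        raise NotImplementedError

    def pm2(self, z, out=None):
        """Partial second moment, used by the bounded estimation methods."""
        raise NotImplementedError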
Helper class¶
This class is provided with default implementations of everything in terms of the PDF.
- class pyqt_fit.kernels.Kernel1D[source]¶
A 1D kernel \(K(z)\) is a function with the following properties:
\[\begin{split}\begin{array}{rcl} \int_\mathbb{R} K(z) dz &=& 1 \\ \int_\mathbb{R} zK(z)dz &=& 0 \\ \int_\mathbb{R} z^2K(z) dz &<& \infty \quad (\approx 1) \end{array}\end{split}\]This means the function should have:
- a sum of 1 (i.e. a valid density of probability);
- an average of 0 (i.e. centered);
- a finite variance. It is even recommended that the variance is close to 1 to give a uniform meaning to the bandwidth.
- cut¶
Type: float Cutting point after which there is a negligible part of the probability. More formally, if \(c\) is the cutting point:
\[\int_{-c}^c p(x) dx \approx 1\]
- lower¶
Type: float Lower bound of the support of the PDF. Formally, if \(l\) is the lower bound:
\[\int_{-\infty}^l p(x)dx = 0\]
- upper¶
Type: float Upper bound of the support of the PDF. Formally, if \(u\) is the upper bound:
\[\int_u^\infty p(x)dx = 0\]
- cdf(z, out=None)[source]¶
Returns the cumulative density function on the points z, i.e.:
\[K_0(z) = \int_{-\infty}^z K(t) dt\]
- dct(z, out=None)[source]¶
DCT of the kernel on the points of z. The points will always be provided as a grid with \(2^n\) points, representing the whole frequency range to be explored.
- fft(z, out=None)[source]¶
FFT of the kernel on the points of z. The points will always be provided as a grid with \(2^n\) points, representing the whole frequency range to be explored. For convenience, the second half of the points will be provided as negative values.
- pdf(z, out=None)[source]¶
Returns the density of the kernel on the points z. This is the function \(K(z)\) itself.
Parameters: - z (ndarray) – Array of points to evaluate the function on. The method should accept any shape of array.
- out (ndarray) – If provided, it will be of the same shape as z and the result should be stored in it. Ideally, it should be used for as many intermediate computation as possible.
Gaussian Kernels¶
- class pyqt_fit.kernels.normal_kernel(dim)[source]¶
Returns a function-object for the PDF of a Normal kernel of variance identity and average 0 in dimension dim.
- class pyqt_fit.kernels.normal_kernel1d[source]¶
1D normal density kernel with extra integrals for 1D bounded kernel estimation.
- cdf(z, out=None)[source]¶
Cumulative density of probability. The formula used is:
\[\text{cdf}(z) \triangleq \int_{-\infty}^z \phi(z) dz = \frac{1}{2}\text{erf}\left(\frac{z}{\sqrt{2}}\right) + \frac{1}{2}\]
- pdf(z, out=None)[source]¶
Return the probability density of the function. The formula used is:
\[\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-\frac{z^2}{2}}\]Parameters: xs (ndarray) – Array of any shape Returns: an array of shape identical to xs
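For example, the kernel can be evaluated directly:
>>> import numpy as np
>>> from pyqt_fit import kernels
>>> k = kernels.normal_kernel1d()
>>> z = np.linspace(-3, 3, 7)
>>> k.pdf(z)          # gaussian density on the grid
>>> k.cdf(z)          # and its cumulative distribution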
Tricube Kernel¶
- class pyqt_fit.kernels.tricube[source]¶
Return the kernel corresponding to a tri-cube distribution. The tri-cube function is given by:
\[\begin{split}f_r(x) = \left\{\begin{array}{ll} \left(1-|x|^3\right)^3 & \text{, if } x \in [-1;1]\\ 0 & \text{, otherwise} \end{array}\right.\end{split}\]As \(f_r\) is not a probability and is not of variance 1, we use a normalized function:
\[f(x) = a b f_r(ax)\]\[a = \sqrt{\frac{35}{243}}\]\[b = \frac{70}{81}\]- cdf(z, out=None)[source]¶
CDF of the distribution:
\[\begin{split}\text{cdf}(x) = \left\{\begin{array}{ll} \frac{1}{162} {\left(60 (ax)^{7} - 7 {\left(2 (ax)^{10} + 15 (ax)^{4}\right)} \mathrm{sgn}\left(ax\right) + 140 ax + 81\right)} & \text{, if}x\in[-1/a;1/a]\\ 0 & \text{, if} x < -1/a \\ 1 & \text{, if} x > 1/a \end{array}\right.\end{split}\]
- pm1(z, out=None)[source]¶
Partial moment of order 1:
\[\begin{split}\text{pm1}(x) = \left\{\begin{array}{ll} \frac{7}{3564a} {\left(165 (ax)^{8} - 8 {\left(5 (ax)^{11} + 33 (ax)^{5}\right)} \mathrm{sgn}\left(ax\right) + 220 (ax)^{2} - 81\right)} & \text{, if} x\in [-1/a;1/a]\\ 0 & \text{, otherwise} \end{array}\right.\end{split}\]
- pm2(z, out=None)[source]¶
Partial moment of order 2:
\[\begin{split}\text{pm2}(x) = \left\{\begin{array}{ll} \frac{35}{486a^2} {\left(4 (ax)^{9} + 4 (ax)^{3} - {\left((ax)^{12} + 6 (ax)^{6}\right)} \mathrm{sgn}\left(ax\right) + 1\right)} & \text{, if} x\in[-1/a;1/a] \\ 0 & \text{, if } x < -1/a \\ 1 & \text{, if } x > 1/a \end{array}\right.\end{split}\]
Epanechnikov Kernel¶
- class pyqt_fit.kernels.Epanechnikov[source]¶
1D Epanechnikov density kernel with extra integrals for 1D bounded kernel estimation.
- cdf(xs, out=None)[source]¶
CDF of the distribution. The CDF is defined on the interval \([-\sqrt{5}:\sqrt{5}]\) as:
\[\begin{split}\text{cdf}(x) = \left\{\begin{array}{ll} \frac{1}{2} + \frac{3}{4\sqrt{5}} x - \frac{3}{20\sqrt{5}}x^3 & \text{, if } x\in[-\sqrt{5}:\sqrt{5}] \\ 0 & \text{, if } x < -\sqrt{5} \\ 1 & \text{, if } x > \sqrt{5} \end{array}\right.\end{split}\]
- pdf(xs, out=None)[source]¶
The PDF of the kernel is usually given by:
\[\begin{split}f_r(x) = \left\{\begin{array}{ll} \frac{3}{4} \left(1-x^2\right) & \text{, if} x \in [-1:1]\\ 0 & \text{, otherwise} \end{array}\right.\end{split}\]As \(f_r\) is not of variance 1 (and therefore would need adjustments for the bandwidth selection), we use a normalized function:
\[f(x) = \frac{1}{\sqrt{5}}f_r\left(\frac{x}{\sqrt{5}}\right)\]
- pm1(xs, out=None)[source]¶
First partial moment of the distribution:
\[\begin{split}\text{pm1}(x) = \left\{\begin{array}{ll} -\frac{3\sqrt{5}}{16}\left(1-\frac{2}{5}x^2+\frac{1}{25}x^4\right) & \text{, if } x\in[-\sqrt{5}:\sqrt{5}] \\ 0 & \text{, otherwise} \end{array}\right.\end{split}\]
- pm2(xs, out=None)[source]¶
Second partial moment of the distribution:
\[\begin{split}\text{pm2}(x) = \left\{\begin{array}{ll} \frac{5}{20}\left(2 + \frac{1}{\sqrt{5}}x^3 - \frac{3}{5^{5/2}}x^5 \right) & \text{, if } x\in[-\sqrt{5}:\sqrt{5}] \\ 0 & \text{, if } x < -\sqrt{5} \\ 1 & \text{, if } x > \sqrt{5} \end{array}\right.\end{split}\]
Higher Order Kernels¶
High order kernels are kernels that give up being valid probabilities. We will say a kernel \(K_{[n]}\) is of order \(n\) if:
\[\int_\mathbb{R} K_{[n]}(z) dz = 1 \qquad \int_\mathbb{R} z^j K_{[n]}(z) dz = 0 \quad \text{for } 1 \leq j < n \qquad \int_\mathbb{R} z^n K_{[n]}(z) dz \neq 0\]
PyQt-Fit implements two high order kernels.
Module pyqt_fit.utils¶
Author: Pierre Barbier de Reuille <pierre.barbierdereuille@gmail.com>
Module containing a variety of small useful functions.
- pyqt_fit.utils.namedtuple(typename, field_names, verbose=False, rename=False)[source]¶
Returns a new subclass of tuple with named fields.
>>> Point = namedtuple('Point', 'x y')
>>> Point.__doc__                   # docstring for the new class
'Point(x, y)'
>>> p = Point(11, y=22)             # instantiate with positional args or keywords
>>> p[0] + p[1]                     # indexable like a plain tuple
33
>>> x, y = p                        # unpack like a regular tuple
>>> x, y
(11, 22)
>>> p.x + p.y                       # fields also accessible by name
33
>>> d = p._asdict()                 # convert to a dictionary
>>> d['x']
11
>>> Point(**d)                      # convert from a dictionary
Point(x=11, y=22)
>>> p._replace(x=100)               # _replace() is like str.replace() but targets named fields
Point(x=100, y=22)
- pyqt_fit.utils.approx_jacobian(x, func, epsilon, *args)[source]¶
Approximate the Jacobian matrix of callable function func
Parameters: - x (ndarray) – The state vector at which the Jacobian matrix is desired
- func (callable) – A vector-valued function of the form f(x,*args)
- epsilon (ndarray) – The perturbation used to determine the partial derivatives
- args (tuple) – Additional arguments passed to func
Returns: An array of dimensions (lenf, lenx) where lenf is the length of the outputs of func, and lenx is the number of elements in x
Note
The approximation is done using forward differences
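For reference, a forward-difference approximation of this kind can be sketched as follows (epsilon is taken as a scalar step here for simplicity; this mirrors the behaviour described above rather than the exact implementation):
import numpy as np

def forward_jacobian(x, func, epsilon, *args):
    x = np.asarray(x, dtype=float)
    f0 = np.asarray(func(x, *args), dtype=float)
    jac = np.zeros((f0.size, x.size))
    for j in range(x.size):
        dx = np.zeros_like(x)
        dx[j] = epsilon
        # forward difference in the j-th direction
        jac[:, j] = (np.asarray(func(x + dx, *args)) - f0) / epsilon
    return jac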