Welcome to EmpiricalRisks’s documentation!

This package provides the basic components for (regularized) empirical risk minimization, which is generally formulated as follows:

    minimize (over θ):   Σ_{i=1}^n loss(f(x_i; θ), y_i) + r(θ)
As we can see, this formulation involves several components:

- Prediction model: f(x; θ), which takes an input x and a parameter θ and produces an output (say u).
- Loss function: loss(u, y), which compares the predicted output u and a desired response y, and produces a real value measuring the loss. Generally, better prediction yields smaller loss.
- Risk model: risk(θ; x, y) := loss(f(x; θ), y). The prediction model and the loss together are referred to as the risk model. When the data x and y are given, the risk model can be considered as a function of θ.
- Regularizer: r(θ), which is often introduced to regularize the parameter and, when used properly, can improve the numerical stability of the problem and the generalization performance of the estimated model.
This package provides these components, as well as the gradient computation routines and proximal operators, to support the implementation of various empirical risk minimization algorithms.
All functions in this package are well optimized and systematically tested.
Contents:

Prediction Models

A prediction model is a function f(x; θ) with two arguments: the input feature x and the predictor parameter θ. All prediction models are instances of the abstract type PredictionModel, defined as follows:
abstract PredictionModel{NDIn, NDOut}
# NDIn: The number of dimensions of each input (0: scalar, 1: vector, 2: matrix, ...)
# NDOut: The number of dimensions of each output (0: scalar, 1: vector, 2: matrix, ...)
Common Methods

Each prediction model implements the following methods:

- inputlen(pm): Return the length of each input.
- inputsize(pm): Return the size of each input.
- outputlen(pm): Return the length of each output.
- outputsize(pm): Return the size of each output.
- paramlen(pm): Return the length of the parameter.
- paramsize(pm): Return the size of the parameter.
- ninputs(pm, x): Verify the validity of x as a single input or as a batch of inputs. If x is valid, it returns the number of inputs in the array x; otherwise, it raises an error.
- predict(pm, theta, x): Predict the output given the parameter theta and the input x. Here, x can be either a single sample or an array comprised of multiple samples.
Predefined Models

The package provides the following prediction models:

(Univariate) Linear Prediction

The linear prediction f(x; w) = w'x has:

- parameter: w, a vector of length d.
- input: x, a vector of length d.
- output: a scalar.
immutable LinearPred <: PredictionModel{1,0}
dim::Int
LinearPred(d::Int) = new(d)
end
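For instance, with a univariate linear model, prediction reduces to an inner product. A minimal sketch using the methods above:

pm = LinearPred(5)     # linear model with input dimension 5
w = randn(5)           # parameter vector
x = randn(5)           # a single input
predict(pm, w, x)      # --> dot(w, x)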
(Univariate) Affine Prediction

The affine prediction f(x; θ) = w'x + a·b has θ = [w; a]. Here, b is a model constant that serves as the base of the bias term.

- parameter: θ, a vector of length d + 1, in the form [w; a].
- input: x, a vector of length d.
- output: a scalar.
immutable AffinePred <: PredictionModel{1,0}
dim::Int
bias::Float64
AffinePred(d::Int) = new(d, 1.0)
AffinePred(d::Int, b::Real) = new(d, convert(Float64, b))
end
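Analogously, an affine model appends a bias term, with the last entry of theta scaled by the bias base b. A minimal sketch, assuming the form w'x + a·b described above:

pm = AffinePred(5, 2.0)    # affine model with d = 5 and bias base b = 2.0
theta = randn(6)           # parameter [w; a] of length d + 1
x = randn(5)
predict(pm, theta, x)      # --> dot(theta[1:5], x) + theta[6] * 2.0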
(Multivariate) Linear Prediction

The multivariate linear prediction f(x; W) = W·x has:

- parameter: W, a matrix of size (k, d).
- input: x, a vector of length d.
- output: a vector of length k.
immutable MvLinearPred <: PredictionModel{1,1}
dim::Int
k::Int
MvLinearPred(d::Int, k::Int) = new(d, k)
end
(Multivariate) Affine Prediction

The multivariate affine prediction f(x; θ) = W·x + a·b has θ = [W a]. Here, b is a model constant that serves as the base of the bias term.

- parameter: θ, a matrix of size (k, d+1), in the form [W a], where W is a coefficient matrix of size (k, d) and a is a bias-coefficient vector of size (k,).
- input: x, a vector of length d.
- output: a vector of length k.
immutable MvAffinePred <: PredictionModel{1,1}
dim::Int
k::Int
bias::Float64
MvAffinePred(d::Int, k::Int) = new(d, k, 1.0)
MvAffinePred(d::Int, k::Int, b::Real) = new(d, k, convert(Float64, b))
end
Examples

Here is an example that illustrates a prediction model:
pm = MvLinearPred(5, 3) # construct a prediction model
# with input dimension 5
# output dimension 3
inputlen(pm) # --> 5
inputsize(pm) # --> (5,)
outputlen(pm) # --> 3
outputsize(pm) # --> (3,)
paramlen(pm) # --> 15
paramsize(pm) # --> (3, 5)
W = randn(3, 5)   # W is a parameter matrix
x = randn(5)      # x is a single input
ninputs(pm, x)    # --> 1
predict(pm, W, x) # make prediction: --> W * x
X = randn(5, 10)  # X is a matrix with 10 samples
ninputs(pm, X)    # --> 10
predict(pm, W, X) # make predictions: --> W * X
Loss Functions

Generally, a loss function measures the loss between a predicted output u and the desired response y.

In this package, all loss functions are instances of the abstract type Loss, defined as below:
# N is the number of dimensions of each predicted output
# 0 - scalar
# 1 - vector
# 2 - matrix, ...
#
abstract Loss{N}
typealias UnivariateLoss Loss{0}
typealias MultivariateLoss Loss{1}
Common Methods

Methods for Univariate Loss

Each univariate loss function implements the following methods:

- value(loss, u, y): Compute the loss value, given the predicted output u and the desired response y.
- deriv(loss, u, y): Compute the derivative w.r.t. u.
- value_and_deriv(loss, u, y): Compute both the loss value and the derivative (w.r.t. u) at the same time.

  Note: This can be more efficient than calling value and deriv separately when you need both the value and the derivative.
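For instance, with the squared loss (see Predefined Loss Functions below), these methods behave as follows. A minimal sketch, assuming loss(u, y) = (u - y)^2 / 2:

loss = SqrLoss()
value(loss, 2.0, 1.5)            # --> 0.125, i.e. (2.0 - 1.5)^2 / 2
deriv(loss, 2.0, 1.5)            # --> 0.5, i.e. u - y
value_and_deriv(loss, 2.0, 1.5)  # --> (0.125, 0.5)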
Methods for Multivariate Loss

Each multivariate loss function implements the following methods:

- value(loss, u, y): Compute the loss value, given the predicted output u and the desired response y.
- grad!(loss, g, u, y): Compute the gradient w.r.t. u, and write the results to g. This function returns g.

  Note: g is allowed to be the same as u, in which case the content of u will be overwritten by the derivative values.

- value_and_grad!(loss, g, u, y): Compute both the loss value and the gradient w.r.t. u at the same time. This function returns (v, g), where v is the loss value.

  Note: g is allowed to be the same as u, in which case the content of u will be overwritten by the derivative values.

For multivariate loss functions, the package also provides the following two generic functions for convenience:

- grad(loss, u, y): Compute and return the gradient w.r.t. u.
- value_and_grad(loss, u, y): Compute both the loss value and the gradient w.r.t. u, and return them as a 2-tuple.

Both grad and value_and_grad are thin wrappers of the type-specific methods grad! and value_and_grad!.
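For instance, using the SumSqrLoss defined later in this page. A minimal sketch, assuming the sum-of-halved-squared-differences form:

loss = SumSqrLoss()
u = [1.0, 2.0, 3.0]          # predicted output
y = [1.0, 1.0, 1.0]          # desired response
value(loss, u, y)            # --> 2.5, i.e. sum((u - y).^2) / 2
grad(loss, u, y)             # --> [0.0, 1.0, 2.0], i.e. u - y
value_and_grad(loss, u, y)   # --> (2.5, [0.0, 1.0, 2.0])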
Predefined Loss Functions

This package provides a collection of loss functions that are commonly used in machine learning practice.

Absolute Loss

The absolute loss, loss(u, y) = |u - y|, is often used for real-valued robust regression:
immutable AbsLoss <: UnivariateLoss end
Squared Loss

The squared loss, loss(u, y) = (u - y)^2 / 2, is widely used in real-valued regression:
immutable SqrLoss <: UnivariateLoss end
Quantile Loss

The quantile loss, loss(u, y) = t·(u - y) for u ≥ y and (1 - t)·(y - u) otherwise, is used in models for predicting quantiles of the response. It can be considered as a skewed version of the absolute loss.
immutable QuantileLoss <: UnivariateLoss
t::Float64
function QuantileLoss(t::Real)
...
end
end
Huber Loss

The Huber loss, loss(u, y) = (u - y)^2 / 2 for |u - y| ≤ h and h·(|u - y| - h/2) otherwise, is used mostly in real-valued regression; it is a smoothed version of the absolute loss.
immutable HuberLoss <: UnivariateLoss
h::Float64
function HuberLoss(h::Real)
...
end
end
Hinge Loss

The hinge loss, loss(u, y) = max(1 - y·u, 0), is mainly used for large-margin classification (e.g. SVM).
immutable HingeLoss <: UnivariateLoss end
Smoothed Hinge Loss

The smoothed hinge loss is a smoothed version of the hinge loss that is differentiable everywhere; the parameter h controls the width of the region over which the hinge is quadratically smoothed.
immutable SmoothedHingeLoss <: UnivariateLoss
h::Float64
function SmoothedHingeLoss(h::Real)
...
end
end
Logistic Loss

The logistic loss, loss(u, y) = log(1 + exp(-y·u)), is the loss used in logistic regression.
immutable LogisticLoss <: UnivariateLoss end
Sum Loss

The package provides the SumLoss type that turns a univariate loss into a multivariate loss by summing it over corresponding components: loss(u, y) = Σ_i intern(u_i, y_i). Here, intern is the internal univariate loss.
immutable SumLoss{L<:UnivariateLoss} <: MultivariateLoss
intern::L
end
SumLoss{L<:UnivariateLoss}(loss::L) = SumLoss{L}(loss)
Moreover, recognizing that the sum of squared differences is very widely used, we provide SumSqrLoss as a typealias, as follows:
typealias SumSqrLoss SumLoss{SqrLoss}
SumSqrLoss() = SumLoss{SqrLoss}(SqrLoss())
Multinomial Logistic Loss

The multinomial logistic loss, loss(u, y) = log(Σ_k exp(u_k)) - u_y, is the loss used in multinomial logistic regression (for multi-way classification). Here, y is the index of the correct class.

immutable MultiLogisticLoss <: MultivariateLoss end
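For example, with three classes. A minimal sketch, assuming the log-sum-exp form above:

loss = MultiLogisticLoss()
u = [1.0, 2.0, 3.0]   # predicted scores over 3 classes
y = 3                 # index of the correct class
value(loss, u, y)     # --> log(exp(1.0) + exp(2.0) + exp(3.0)) - 3.0 ≈ 0.4076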
Risk Models

The prediction model together with a (compatible) loss function constitutes a risk model, which can be expressed as risk(θ; x, y) := loss(f(x; θ), y).

In this package, we use the type SupervisedRiskModel to capture this:
abstract RiskModel
immutable SupervisedRiskModel{PM<:PredictionModel,L<:Loss} <: RiskModel
predmodel::PM
loss::L
end
We also provide a function to construct a risk model:

- riskmodel(pm, loss): Construct a risk model, given the prediction model pm and a loss function loss.

  Here, pm and loss need to be compatible, which means that the output of the prediction model and the first argument of the loss function should have the same number of dimensions. Actually, the definition of riskmodel explicitly enforces this consistency:

  riskmodel{N,M}(pm::PredictionModel{N,M}, loss::Loss{M}) = SupervisedRiskModel{typeof(pm), typeof(loss)}(pm,loss)

Note: We may provide other risk models in addition to supervised risk models in the future. Currently, we focus on supervised risk models, which evaluate the risk by comparing the predictions and the desired responses.
Common Methods

When a set of inputs and the corresponding outputs are given, the risk model can be considered as a function of the parameter θ. The package provides methods for computing the total risk and the derivative of the total risk w.r.t. the parameter.
- value(rmodel, theta, x, y): Compute the total risk w.r.t. the risk model rmodel, given:

  - the prediction parameter theta;
  - the inputs x; and
  - the desired responses y.

  Here, x and y can be a single sample or arrays comprised of a set of samples.

Example:

# construct a risk model, with a linear prediction
# and a squared loss:
#
#   risk := (theta'x - y)^2 / 2
#
rmodel = riskmodel(LinearPred(5), SqrLoss())

theta = randn(5)            # parameter
x = randn(5)                # a single input
y = randn()                 # a single output
value(rmodel, theta, x, y)  # evaluate the risk on a single sample (x, y)

X = randn(5, 8)             # a matrix of 8 inputs
Y = randn(8)                # corresponding outputs
value(rmodel, theta, X, Y)  # evaluate the total risk on (X, Y)
- value_and_addgrad!(rmodel, beta, g, alpha, theta, x, y): Compute the total risk on x and y, and its gradient w.r.t. the parameter theta, and add it to g in the following manner:

  g ← beta·g + alpha·∇_θ risk

  Here, x and y can be a single sample or a set of multiple samples. The function returns both the evaluated value and g as a 2-tuple.

  Note: When beta is zero, the computed gradient (or its scaled version) will be written to g without using the original data in g (in this case, g need not be initialized).
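For example, to obtain the gradient of the total risk over a batch without pre-initializing g, one can pass beta = 0. A minimal sketch:

rmodel = riskmodel(LinearPred(5), SqrLoss())
theta = randn(5)
X = randn(5, 8)         # 8 inputs
Y = randn(8)            # corresponding responses
g = Array(Float64, 5)   # gradient buffer (may be uninitialized when beta = 0)
(v, g) = value_and_addgrad!(rmodel, 0.0, g, 1.0, theta, X, Y)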
- value_and_grad(rmodel, theta, x, y): Compute and return both the total risk on x and y and its gradient w.r.t. the parameter theta, as a 2-tuple.

  This is just a thin wrapper of value_and_addgrad!.
Note that the addgrad! method is provided for risk models with certain combinations of prediction models and loss functions. Below is a list of combinations that we currently support:

- LinearPred + UnivariateLoss
- AffinePred + UnivariateLoss
- MvLinearPred + MultivariateLoss
- MvAffinePred + MultivariateLoss

If you have a new prediction model that is not defined by the package, you can write your own addgrad! method, based on the description above.
Regularizers

Regularization is important, especially when we don’t have a huge amount of training data. Effective regularization can often substantially improve the generalization performance of the estimated model.

In this package, all regularizers are instances of the abstract type Regularizer.
Common Methods

Each regularizer type implements the following methods:

- value(reg, theta): Evaluate the regularization value at theta and return it.
- value_and_addgrad!(reg, beta, g, alpha, theta): Compute the regularization value and its gradient w.r.t. theta, and add the gradient to g in the following way:

  g ← beta·g + alpha·∇_θ r(theta)

  Note: When beta is zero, the computed gradient (or its scaled version) will be written to g without using the original data in g (in this case, g need not be initialized).
- prox!(reg, r, theta, lambda): Evaluate the proximal operator, as follows:

  r ← argmin_x { (1/2)·‖x - theta‖^2 + lambda·value(reg, x) }

  This method is needed when proximal methods are used to solve the problem.
In addition, the package also provides a set of generic wrappers to simplify some use cases.
- value_and_grad(reg, theta): Compute and return the regularization value and its gradient w.r.t. theta, as a 2-tuple.

  This is a wrapper of value_and_addgrad!.

- prox(reg, theta[, lambda]): Evaluate the proximal operator at theta. When lambda is omitted, it is set to 1 by default.

  This is a wrapper of prox!.
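For example, a minimal sketch, assuming the regularizer definitions given below and that the proximal operator of the L1 regularizer is soft-thresholding:

reg = SqrL2Reg(0.1)
theta = [1.0, -2.0, 3.0]
value(reg, theta)             # --> 0.7, i.e. (0.1 / 2) * (1 + 4 + 9)
(v, g) = value_and_grad(reg, theta)

lreg = L1Reg(0.5)
prox(lreg, theta)             # --> [0.5, -1.5, 2.5], soft-thresholding with t = 0.5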
Predefined Regularizers

The package provides several commonly used regularizers:

Zero Regularizer

The zero regularizer always yields the value zero. It is mainly used to supply a regularizer argument to functions that require one, when you do not intend to impose any regularization.
immutable ZeroReg <: Regularizer end
Squared L2 Regularizer

The squared L2 regularizer, r(θ) = (c/2)·‖θ‖₂², is one of the most widely used regularizers in practice:
immutable SqrL2Reg{T<:FloatingPoint} <: Regularizer
c::T
end
SqrL2Reg{T<:FloatingPoint}(c::T) = SqrL2Reg{T}(c)
L1 Regularizer

The L1 regularizer, r(θ) = c·‖θ‖₁, is often used for sparse learning:
immutable L1Reg{T<:FloatingPoint} <: Regularizer
c::T
end
L1Reg{T<:FloatingPoint}(c::T) = L1Reg{T}(c)
Elastic Regularizer

The elastic regularizer, r(θ) = c1·‖θ‖₁ + (c2/2)·‖θ‖₂², is also known as the L1/L2 regularizer and is used in the Elastic Net formulation:
immutable ElasticReg{T<:FloatingPoint} <: Regularizer
c1::T
c2::T
end
ElasticReg{T<:FloatingPoint}(c1::T, c2::T) = ElasticReg{T}(c1, c2)
Auxiliary Utilities

The package also provides other convenience tools.

- no_op(...): This function accepts arbitrary arguments and returns nothing. It is mainly used as a callback in places where you don’t really need the callback to do anything.
- shrink(x, t): Compute the soft-thresholding function:

  shrink(x, t) = sign(x)·max(|x| - t, 0)

  Here, x can be either a scalar or an array (for vectorized computation).
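For example, assuming the soft-thresholding form above:

shrink(3.0, 1.0)                 # --> 2.0
shrink(-0.5, 1.0)                # --> 0.0
shrink([3.0, -0.5, -2.0], 1.0)   # --> [2.0, 0.0, -1.0]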