Welcome to dl’s documentation!¶
Part I: Applied Math and Machine Learning Basics¶
5 Machine Learning Basics¶
5.4 Estimator, Bias and Variance¶
Point Estimator¶
Denote a point estimate of a parameter \(\theta\) by \(\hat{\theta}\). Let \(\{x^{(1)}, ..., x^{(m)}\}\) be a set of m independent and identically distributed (iid) data points. A point estimator or statistic is any function of the data:
Frequentist perspective on statistics: assume the true parameter value \(\theta\) is fixed but unknown. The point estimate is a function of the data. Since the data are drawn from a random process, any function of the data is random. Therefore \(\hat{\theta}\) is a random variable.
Bias¶
The expectation is over the data. \(\theta\) is the true underlying value used to define the data-generating distribution. \(\hat{\theta}\) is unbiased when \(bias(\hat{\theta}) = 0\), and asymptotically unbiased when \(\lim_{m\rightarrow \infty} bias(\hat{\theta}) = 0\).
See the examples on pages 121-124 for estimating the mean, variance, and so on.
Variance¶
Variance of the estimator:
Standard Error:
The variance provides a measure of how we would expect the estimate we compute from data to vary as we independently resample the dataset from the underlying data-generating process.
The standard error of the mean, \(SE(\hat{\mu}_m)=\sqrt{Var\left(\frac{1}{m} \sum_{i} x^{(i)}\right)} = \frac{\sigma}{\sqrt{m}}\), is very useful because we often estimate the generalization error by computing the sample mean of the error on the test set. See the example of the Bernoulli distribution on p. 125.
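As a minimal numpy sketch (hypothetical 0/1 per-example test errors; the 1.96 factor gives the usual approximate 95% confidence interval), the standard error of the test-set error can be used like this:

    import numpy as np

    errors = (np.random.rand(10000) < 0.22).astype(float)   # hypothetical per-example 0/1 test errors
    mean_err = errors.mean()                                 # sample mean of the error
    se = errors.std(ddof=1) / np.sqrt(len(errors))           # SE(mu_hat) = sigma / sqrt(m)

    # Approximate 95% confidence interval for the generalization error
    print(f"test error = {mean_err:.4f} +/- {1.96 * se:.4f}")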
Trading off Bias and Variance to Minimize Mean Squared Error¶
- Bias measures the expected deviation from the true value of the function or parameter.
- Variance provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause.
How do we compare a model with large bias and a model with large variance? The most common way is to use cross-validation. Alternatively, we can use the mean squared error (MSE):
To prove it, try it from right to left.
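A short derivation of this identity (standard algebra, working from right to left as suggested; the expectation is over the data):

    Bias(\hat{\theta})^2 + Var(\hat{\theta})
        = (E[\hat{\theta}] - \theta)^2 + E[\hat{\theta}^2] - E[\hat{\theta}]^2
        = E[\hat{\theta}^2] - 2\theta E[\hat{\theta}] + \theta^2
        = E[(\hat{\theta} - \theta)^2]
        = MSE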
When generalization error (defined in 5.2 on p. 107 as the expected value of the error on a new input) is measured by the MSE (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias.
5.6 Bayesian Statistics¶
Part II: Modern Practical Deep Networks¶
1. Introduction
Deep feedforward networks are also called feedforward neural networks or multilayer perceptrons (MLPs).
- Goal: to approximate some function \(f^*(x)\).
- Example: A classifier such as the Digit Recognizer. We are trying to map an input \(\boldsymbol{x}\) to a category \(y\). In the Digit Recognizer example, the input is an image of a handwritten digit, and we are trying to approximate the function \(f^*(x)\) that maps this image to the correct digit.
Feedforward networks form the basis of many important commercial applications, such as convolutional neural networks and recurrent neural networks.
2. Why do we call it a Network?
It represents the composition of different functions. For example, we might have three functions \(f_1, f_2, f_3\) connected in a chain to form \(f(x) = f_3(f_2(f_1(x)))\). These chain structures are the most commonly used structures of neural networks. In this case, \(f_1\) is called the first layer of the network, \(f_2\) is called the second layer, and so on.
The final layer of a feedforward network is called the output layer. During neural network training, we drive \(f(x)\) to match \(f^*(x)\).
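A minimal numpy sketch of this chain structure (layer sizes and the ReLU nonlinearity are my own illustrative choices):

    import numpy as np

    def layer(x, W, b):
        """One layer: affine transform followed by a ReLU nonlinearity."""
        return np.maximum(0, x @ W + b)

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)   # parameters of f1 (first layer)
    W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)   # parameters of f2 (second layer)
    W3, b3 = rng.normal(size=(8, 3)), np.zeros(3)   # parameters of f3 (output layer)

    def f(x):
        h1 = layer(x, W1, b1)     # f1(x)
        h2 = layer(h1, W2, b2)    # f2(f1(x))
        return h2 @ W3 + b3       # f3(f2(f1(x))), linear output layer

    y = f(rng.normal(size=(2, 4)))   # 2 inputs of dimension 4 -> 2 outputs of dimension 3
    print(y.shape)                   # (2, 3)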
3. Hidden Layers
The training examples specify directly what the output layer must do at each point \(x\): it must produce a value that is close to \(y\). The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to best implement an approximation of \(f^*\). Because the training data do not show the desired output for each of these layers, they are called hidden layers.
8 Optimization for Training Deep Models¶
Here is the link to the textbook.
Batch Normalization¶
One of the assignments in the class cs231n is to implement batch normalization, in both forward and backward propagation. Forward propagation is pretty straightforward, but the backward pass took me a really long time to understand and implement.
As you can see from the assignment, there are two ways to implement backpropagation.
- Simulate forward and backward prop through multiple steps
- Calculate the relationship between input and output. Then give a one step calculation.
No matter which method you choose, you have to draw a computation graph to help yourself understand the operation dynamics. Below is the figure from my scratch paper.
There are a few things to note:
- This figure does not show the process from \(intmdt\_out\) to \(out\), in order to simplify the problem.
- For a node that has two paths coming out of it, the gradient should be the sum of the gradients calculated individually on the different paths. So in order to get the gradient of \(x\_centered\), we need to calculate the gradients with respect to path (1) and path (2) and add them together.
Simulation method¶
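Below is a minimal numpy sketch of the simulation approach, roughly following the cs231n setup: cache every intermediate in the forward pass, then step backwards through the graph one node at a time (names and shapes are my own; the gamma/beta scale-and-shift is included, the running-average logic is not):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        """x: (N, D). Cache every intermediate so the backward pass can reuse it."""
        mu = x.mean(axis=0)
        x_centered = x - mu
        var = (x_centered ** 2).mean(axis=0)
        std = np.sqrt(var + eps)
        x_norm = x_centered / std
        out = gamma * x_norm + beta
        cache = (x_centered, std, x_norm, gamma)
        return out, cache

    def batchnorm_backward(dout, cache):
        """Walk the computation graph backwards, one node at a time."""
        x_centered, std, x_norm, gamma = cache
        N = dout.shape[0]
        dbeta = dout.sum(axis=0)
        dgamma = (dout * x_norm).sum(axis=0)
        dx_norm = dout * gamma
        # x_centered is used on two paths: directly in x_norm (path 1)
        # and through var/std (path 2); the gradients of the two paths are summed.
        dvar = np.sum(dx_norm * x_centered * -0.5 * std ** -3, axis=0)
        dx_centered = dx_norm / std + dvar * 2.0 * x_centered / N
        dmu = -dx_centered.sum(axis=0)
        dx = dx_centered + dmu / N
        return dx, dgamma, dbeta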
9 Convolutional Networks¶
9.1 The Convolutional Operation¶
Convolution is an operation on two functions of a real-valued argument.
The convolution operation is usually denoted with an asterisk.
In the continuous case:
In the discrete case:
Note: w needs to be 0 for all negative arguments
- x: input
- w: kernel
- s: feature map
For 2D input and 2D kernel:
It is commutative, so we can also write:
Cross-correlation:
Many machine learning libraries implement cross-correlation but call it convolution. Check the reference for the difference.
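A quick numpy/scipy check of the difference, showing that the "convolution" most libraries implement is really cross-correlation, i.e., convolution with an unflipped kernel:

    import numpy as np
    from scipy.signal import convolve2d, correlate2d

    x = np.random.rand(5, 5)   # input
    w = np.random.rand(3, 3)   # kernel

    cc = correlate2d(x, w, mode='valid')     # cross-correlation: slide w without flipping
    conv = convolve2d(x, w, mode='valid')    # true convolution: w is flipped in both dimensions

    # Convolution equals cross-correlation with a flipped kernel
    assert np.allclose(conv, correlate2d(x, np.flip(w), mode='valid'))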
Any neural network algorithm that works with matrix multiplication and does not depend on specific properties of the matrix structure should work with convolution, without requiring any further changes to the neural network.
9.2 Motivation¶
Three advantages of convolution:
- Sparse interactions
- Parameter sharing
- Equivariant representations
Sparse Interactions¶
- Traditional matrix multiplication: every output unit interacts with every input unit.
- Convolution: Sparse connectivity
We need to store fewer parameters, which improves:
- Memory requirement
- statistical efficiency
This allows the network to efficiently describe complicated interactions between many variables by constructing such interactions from simple building blocks that each describe only sparse interactions.
Parameter sharing¶
Use the same parameter for more than one function in a model.
- Matrix Multiplication: Each element of the weight matrix is used exactly once when computing the output. It is multiplied by one element of the input and never revisited.
- Convolution: tied weights; the value of a weight applied to one input is tied to the value of a weight applied elsewhere. Rather than learning a separate set of parameters for every location, we learn only one set. This dramatically decreases the storage requirements.
Combine sparse connectivity and parameter sharing:
Equivariance to Translation¶
If the input changes, the output changes in the same way. Specifically, a function f(x) is equivariant to function g if \(f(g(x)) = g(f(x))\). Eg:
- g: shift image
- f: convolution
When processing time-series data, convolution produces a sort of timeline that shows when different features appear in the input. If we move an event later in time in the input, the exact same representation of it will appear in the output, just later.
Convolution is not naturally equivariant to some other transformations, such as changes in the scale or rotation of an image.
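A quick numpy check of equivariance to translation for 1-D convolution (interior positions only, since np.roll wraps around at the boundary):

    import numpy as np

    x = np.random.rand(20)     # 1-D input signal
    w = np.random.rand(3)      # kernel

    f = lambda s: np.convolve(s, w, mode='same')   # f: convolution
    g = lambda s: np.roll(s, 2)                    # g: shift the signal by 2 positions

    # f(g(x)) == g(f(x)) away from the boundary
    assert np.allclose(f(g(x))[3:-3], g(f(x))[3:-3])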
Enables processing flexible shaped data¶
Discussed in 9.7
Resource¶
9.3 Pooling¶
A pooling function replaces the output of the net at a certain location with a summary statistic of the nearby outputs. Examples (a small max-pooling sketch follows this list):
- Max Pooling
- Average Pooling
- L2 Norm
- Weighted Average based on the distance from the central pixels
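A minimal numpy sketch of the first of these, max pooling over non-overlapping 2x2 windows (names and window size are my own choices):

    import numpy as np

    def max_pool(x, size=2, stride=2):
        """Replace each size x size window of a 2-D feature map with its maximum."""
        H, W = x.shape
        out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
        out = np.empty((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                window = x[i * stride:i * stride + size, j * stride:j * stride + size]
                out[i, j] = window.max()
        return out

    x = np.arange(16, dtype=float).reshape(4, 4)
    print(max_pool(x))   # [[ 5.  7.] [13. 15.]]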
Pooling helps to make the representation approximately invariant to small translations of the input. Invariance to local translation can be a useful property if we care more about whether some feature is present than exactly where it is.
Use of pooling: it adds an infinitely strong prior that the function the layer learns must be invariant to small translations. If we pool over the outputs of separately parameterized convolutions, the features can learn which transformations to become invariant to.
Pooling with downsampling improves:
- computational efficiency
- statistical efficiency and memory requirements, since it reduces the input size of the next layer
Pooling is essential for handling inputs of varying size.
Which kind of pooling to use:
- Dynamically pool features together
- Learn a single pooling structure that is applied to all images.
9.4 Convolution and Pooling as an Infinitely Strong Prior¶
Prior: a probability distribution over the parameters of the model that encodes our beliefs about which models are reasonable.
- A weak prior: high entropy, e.g., a Gaussian with high variance. Such a prior allows the data to move the parameters more or less freely.
- A strong prior: low entropy, e.g., a Gaussian with low variance. Such a prior plays a more active role in determining where the parameters end up.
Check this and try different variance.
We can imagine a conv net as being similar to a fully connected net, but with an infinitely strong prior over its weights: the weights for one hidden unit must be identical to the weights of its neighbor, but shifted in space.
Overall, the infinitely strong priors say:
- Conv layer: the function the layer should learn contains only local interaction and is equivariant to translation
- Pooling layer: each unit should be invariant to small translation
Like any other prior, convolution and pooling are only useful when the assumptions made by the prior are reasonably accurate; if not, they cause underfitting. When a task involves incorporating information from very distant locations in the input, the prior imposed by convolution may be inappropriate.
Resource¶
9.5 Variants of the Basic Convolution Function¶
Convolution in the context of NN means an operation that consists of many applications of convolution in parallel.
- Kernel K with element \(K_{i, j, k, l}\) giving the connection strength between a unit in channel i of output and a unit in channel j of the input, with an offset of k rows and l columns between the output unit and the input unit.
- Input: \(V_{i, j, k}\) with channel i, row j and column k
- Output Z same format as V
- Use 1 as first entry
Full Convolution¶
0 Padding 1 stride¶
0 Padding s stride¶
Convolution with a stride greater than 1 pixel is equivalent to convolution with unit stride followed by downsampling:
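A small numpy/scipy illustration of this equivalence (a real strided implementation would skip computing the discarded values, which is the point of striding):

    import numpy as np
    from scipy.signal import correlate2d

    x = np.random.rand(8, 8)
    w = np.random.rand(3, 3)
    s = 2                                        # stride

    full_res = correlate2d(x, w, mode='valid')   # unit-stride convolution, shape (6, 6)
    strided = full_res[::s, ::s]                 # keep every s-th output in each direction
    print(strided.shape)                         # (3, 3)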
Some 0 Paddings and 1 stride¶
Without zero padding, the width of the representation shrinks by the kernel width minus one pixel at each layer. We are forced to choose between shrinking the spatial extent of the network rapidly and using small kernels. Zero padding allows us to control the kernel width and the size of the output independently.
Special case of 0 padding:
- Valid: no 0 padding is used. Limited number of layers.
- Same: enough zero padding is added to keep the size of the output equal to the size of the input. Allows an unlimited number of layers. Pixels near the border influence fewer output pixels than pixels near the center.
- Full: enough zeros are added for every pixel to be visited k (kernel width) times in each direction, resulting in an output of width m + k - 1. It is difficult to learn a single kernel that performs well at all positions in the convolutional feature map.
Usually the optimal amount of zero padding lies somewhere between 'valid' and 'same'.
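A tiny helper (my own naming) summarizing the output width in the three special cases, for stride-1 convolution over an input of width m with a kernel of width k:

    def output_width(m, k, padding):
        if padding == 'valid':   # no zero padding
            return m - k + 1
        if padding == 'same':    # pad so the output matches the input width
            return m
        if padding == 'full':    # pad so every pixel is visited k times per direction
            return m + k - 1
        raise ValueError(padding)

    for mode in ('valid', 'same', 'full'):
        print(mode, output_width(m=32, k=5, padding=mode))   # 28, 32, 36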
Tiled Convolution¶
Learn a set of kernels that we rotate through as we move through space. Immediately neighboring locations will have different filters, but the memory requirement for storing the parameters increases only by a factor of the size of this set of kernels. Comparison of locally connected layers, tiled convolution, and standard convolution:
K: a 6-D tensor, with t different choices of kernel stack.
Locally connected layers and tiled convolutional layers with max pooling: the detector units of these layers are driven by different filters. If the filters learn to detect different transformed versions of the same underlying features, then the max-pooled units become invariant to the learned transformation.
Review:
Backprop in a conv layer¶
Backprop of a conv layer uses the following quantities (a single-channel sketch follows this list):
- K: Kernel stack
- V: Input image
- Z: Output of conv layer
- G: gradient on Z
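A minimal single-channel numpy/scipy sketch of the two backward formulas (the book's version operates on full multi-channel tensors with strides; this strips the idea down to one 2-D channel):

    import numpy as np
    from scipy.signal import correlate2d, convolve2d

    V = np.random.rand(6, 6)                    # input image
    K = np.random.rand(3, 3)                    # kernel
    Z = correlate2d(V, K, mode='valid')         # forward pass: output of the conv layer
    G = np.random.rand(*Z.shape)                # gradient of the loss with respect to Z

    dK = correlate2d(V, G, mode='valid')        # gradient with respect to the kernel
    dV = convolve2d(G, K, mode='full')          # gradient with respect to the input

    assert dK.shape == K.shape and dV.shape == V.shape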

Bias after conv¶
We generally add some bias term to each output before applying the nonlinearity.
- For locally connected layers: give each unit its own bias.
- For tiled conv layers: share the biases with the same tiling pattern as the kernels.
- For conv layers: have one bias per channel of the output and share it across all locations within each convolution map. If the input is of fixed size, it is also possible to learn a separate bias at each location of the output map.
9.6 Structured Output¶
Convolutional networks can be used to output a high-dimensional structured object, rather than just predicting a class label for a classification task or a real value for a regression task. Eg:
The model might emit a tensor S where \(S_{i, j, k}\) is the probability that pixel (j, k) of the input belongs to class i.
Issue: the output plane can be smaller than the input plane. Review:
Strategy for size reduction issue:
- avoid pooling altogether
- emit a lower-resolution grid of labels
- pooling operator with unit stride
One strategy for pixel-wise labeling of images is to:
- produce an initial guess of the image labels
- refine this initial guess using the interactions between neighboring pixels
Repeating this refinement step several times corresponds to using the same convolution at each stage, sharing weights between the last layers of the deep net. This yields a recurrent convolutional network:
9.7 Data Types¶
Conv nets can also process inputs with varying spatial extents.
Images with different sizes:
- No further design needed: label each pixel.
- Further design needed, such as adding pooling layers whose pooling regions scale in size proportionally to the size of the input: one label per image.
Review:
Convolution over inputs of different sizes:
- Makes sense: the same kind of observation in varying amounts, such as images of different sizes, recordings of different lengths, observations of different widths over space, and so forth.
- Does not make sense: the input optionally has different kinds of observations, e.g., convolving the same weights over features corresponding to grades as well as features corresponding to test scores.
9.8 Efficient Convolution Algorithms¶
How to speed up convolution?
Parallel Computation Resources
Selecting Appropriate Algorithms
- Fourier transform: convert the input and kernel into frequency space, perform point-wise multiplication, and convert them back to the time domain using an inverse Fourier transform.
- When a d-dimensional kernel can be expressed as the outer product of d vectors, the kernel is called separable. Composing d 1-D convolutions with each of these vectors is significantly faster than performing one d-D convolution with their outer product. The naive approach requires \(O(w^d)\) runtime and parameter storage, while the separable approach requires \(O(w \times d)\).
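A quick numpy/scipy check of the separable case, building a 5x5 kernel as the outer product of two vectors and applying it as two 1-D passes:

    import numpy as np
    from scipy.signal import correlate2d

    col = np.random.rand(5, 1)                  # vertical vector
    row = np.random.rand(1, 5)                  # horizontal vector
    kernel = col @ row                          # separable 5x5 kernel (outer product)

    x = np.random.rand(100, 100)

    naive = correlate2d(x, kernel, mode='valid')                                   # one 2-D pass
    separable = correlate2d(correlate2d(x, col, mode='valid'), row, mode='valid')  # two 1-D passes

    assert np.allclose(naive, separable)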
Even techniques that improve the efficiency of only forward propagation are useful, because in commercial settings it is typical to devote more resources to the deployment of a network than to its training.
9.9 Random or Unsupervised Features¶
Typically, the most expensive part of conv network training is learning the features. There are 3 basic strategies for obtaining convolution kernels without supervised training.
Simply initialize them randomly: random filters often work well in convolutional networks. This gives an inexpensive way to choose the architecture of a convolutional network:
- Evaluate the performance of several convolutional network architectures by training only the last layer.
- Take the best of these architectures and train the entire architecture using a more expensive approach.
Design them by hand
Learn the kernels with an unsupervised criterion: learning the features with an unsupervised criterion allows them to be determined separately from the classifier layer at the top of the architecture.
An intermediate approach is greedy layer-wise pretraining, e.g., the convolutional deep belief network.
Instead of training an entire convolutional layer at a time, we can train a model of a small patch and then use the parameters from this patch-based model to define the kernels of a convolutional layer.
Today, most convolution networks are trained in a purely supervised fashion, using full forward and back-propagation through the entire network on each training iteration.
11 Practical Methodology¶
One can usually do much better with a correct application of a commonplace algorithm than by sloppily applying an obscure algorithm. Below is the recommended design process.
- Determine your goals. Error metric, target value for the error metric.
- Establish an end-to-end pipeline as soon as possible, including estimation of the appropriate performance metrics.
- Instrument the system well to determine bottlenecks in performance. Diagnose which components are performing worse than expected and whether poor performance is due to overfitting, underfitting, or a defect in the data or software.
- Repeatedly make incremental changes, such as gathering new data, adjusting hyperparameters, or changing algorithms, based on specific findings from your instrumentation.
11.1 Performance Metrics¶
It is impossible to achieve absolutely zero error. Reasons:
- Your input features may not contain complete information about the output variable
- The system is intrinsically stochastic.
- Limited training data
12 Applications¶
External Resources On Computer Vision¶
YOLO¶
See Yolo Paper
1. Intro¶
Overall structure/pipeline:
- Resize to 448 * 448
- Conv Net
- Threshold the resulting detections by the model's confidence, using non-max suppression.
A single conv net simultaneously predicts multiple bounding boxes and class probabilities for those boxes. It is trained on full images and directly optimizes detection performance. Benefits over traditional methods of object detection:
- Fast. Detection is framed as a regression problem, so there is no need for a complex pipeline, and precision is high.
- Reasons globally about the image when making predictions.
- Learns generalizable representations of objects, so it is less likely to break down when applied to new domains or unexpected inputs.
Drawback: lags behind state-of-the-art detection systems in accuracy; struggles to precisely localize some objects, especially small ones.
2. Unified Detection¶
Divide the image into an S * S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. Each grid cell predicts B bounding boxes and confidence scores for those boxes. We define confidence as \(Pr(Object) \cdot IOU^{truth}_{pred}\), where the IOU is between the predicted box and the ground truth.

For each bounding box (not grid cell), we have 5 predictions: x, y, w, h, and confidence.
- (x, y) represents the center of the box relative to the bounds of the grid cell.
- (w, h) represent the width and height relative to the whole image.
- confidence represents the IOU between the predicted box and any ground truth box.
Each grid cell (not bounding box) also predicts C conditional class probabilities \(Pr(Class_i | Object)\). These probabilities are conditioned on the grid cell containing an object. We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
At test time, we multiply the conditional class probabilities by the individual box confidence predictions to obtain class-specific confidence scores for each box: \(Pr(Class_i | Object) \cdot Pr(Object) \cdot IOU^{truth}_{pred} = Pr(Class_i) \cdot IOU^{truth}_{pred}\).
These scores encode both:
- probability of that class appearing in the bounding box
- How well the predicted box fits the object
2.1 Network Design¶
2.2 Training¶
Pretrain: on the ImageNet 1000-class dataset, using the first 20 conv layers followed by a fully connected layer.
Convert to perform detection: add 4 conv layers and 2 fully connected layers with randomly initialized weights, and increase the input resolution from 224 * 224 to 448 * 448.
Normalize the width, height, x, and y to be bounded between 0 and 1.
Activation
- Final Layer: Linear Activation
- Other layers: leaky rectified linear activation function, see equation 2
Loss function: sum-squared error
- Reason: Easy to optimize
- Problem: (1) It does not perfectly align with our goal of maximizing average precision. (2) In every image many grid cells do not contain any object. This pushes the confidence scores of those cells towards 0, often overpowering the gradient from cells that do contain objects.
- Solution: increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don't contain objects. We use two parameters, \(\lambda_{coord} = 5\) and \(\lambda_{noobj} = 0.5\).
- Sum-squared error also equally weights errors in large boxes and small boxes.
Only one bounding box should be responsible for each object. We assign one predictor to be responsible for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall but struggling to generalize. See the limitations of YOLO below.
- Loss from the bounding box coordinates (x, y). This loss comes from only the one responsible bounding box in a grid cell, and only when an object appears in that grid cell in the ground truth.
- Loss from the width w and height h. Again, only from the responsible bounding box, and only when an object appears in that grid cell in the ground truth.
- Loss from the confidence of each bounding box. This term applies to every bounding box; boxes in cells without a ground-truth object are down-weighted by \(\lambda_{noobj}\).
- Loss from the class probabilities of a grid cell, applied only when an object appears in that grid cell in the ground truth.
For generalization:
- Use dropout with a rate of 0.5.
- Data augmentation: random scaling and translation of up to 20% of the original image size.
- Randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
2.3 Inference¶
YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.
Often it is clear which grid cell an object falls into, and the network predicts one box per object. However, some large objects, or objects near the border of multiple cells, can be localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections.
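A minimal numpy sketch of non-maximal suppression (the box format (x1, y1, x2, y2) and the 0.5 threshold are my own illustrative choices):

    import numpy as np

    def iou(box, boxes):
        """IOU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
        return inter / (area(box) + area(boxes) - inter)

    def non_max_suppression(boxes, scores, iou_threshold=0.5):
        """Keep the highest-scoring box, drop overlapping lower-scoring duplicates, repeat."""
        order = np.argsort(scores)[::-1]
        keep = []
        while order.size > 0:
            best, rest = order[0], order[1:]
            keep.append(best)
            order = rest[iou(boxes[best], boxes[rest]) < iou_threshold]
        return keep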
Limitation of YOLO¶
- YOLO imposes strong spatial constraints on bounding box predictions, since each grid cell predicts only B boxes and can only have one class. This limits the number of nearby objects the model can predict; it struggles with small objects that appear in groups, such as flocks of birds.
- It learns to predict bounding boxes from data, so it struggles to generalize to objects in new or unusual aspect ratios or configurations. It also uses relatively coarse features for predicting bounding boxes, since the architecture has multiple downsampling layers from the input image.
- The loss function treats errors the same in small bounding boxes and large bounding boxes. A small error in a large box is generally okay, but a small error in a small box has a much greater effect on IOU. The main source of error is localization.
To understand the second limitation, revisit the paragraph from Training:
YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
To understand the third limitation, consider the following example:
Prediction on large bounding box
- Predicted bounding box size: 16 * 16
- Ground truth bounding box size: 25 * 25
Prediction on small bounding box
- Predicted bounding box size: 9 * 9
- Ground truth bounding box size: 16 * 16
See the picture below
Here are the results:
Large bounding box:
- cost on width and height: 4
- IOU: 0.4096 <- (16 * 16) / (25 * 25)
Small bounding box
- cost on width and height: 4
- IOU: 0.3164 <- (9 * 9) / (16 * 16)
The cost from the width and height of the bounding box is the same for the large and small bounding boxes, but the IOU is not. The cost does not penalize the smaller IOU of the small bounding box.
Comparison to other models¶
Deformable Parts Models (DPM): a sliding-window approach with a disjoint pipeline to extract features, classify regions, and predict bounding boxes for high-scoring regions.
R-CNN: about 2000 region proposals from Selective Search; a CNN extracts features, SVMs score the boxes, and a linear model adjusts the bounding boxes.
Fast and Faster R-CNN: share computation and use a neural network to propose regions instead of Selective Search, but still fall short of real-time performance.
Deep MultiBox: predicts regions of interest instead of using Selective Search. It cannot perform general object detection and is still just a piece of a larger detection pipeline, requiring further image classification.
OverFeat: efficiently performs sliding-window detection but is still a disjoint system. It optimizes for localization, not detection performance, and the localizer only sees local information when making a prediction.
MultiGrasp: similar in design to YOLO, but works on grasp detection, a much simpler task than object detection. It only needs to predict a single graspable region for an image containing one object, and does not have to estimate the size, location, or boundaries of the object or predict its class; it only finds a region suitable for grasping.
YOLO performs the following tasks concurrently:
- Feature extraction
- Bounding box prediction
- Non-maximal suppression
YOLO is a general-purpose detector that learns to detect a variety of objects simultaneously. Unlike OverFeat, it reasons about global context and thus does not require significant post-processing to produce coherent detections.
YOLO 2¶
1. Introduction¶
Comparison:
- Object detection: fewer datasets, fewer labels/tags/classes
- Classification: more datasets, more labels/tags/classes
Our method uses a hierarchical view of object classification that allows us to combine distinct datasets together.
A joint training algorithm allows us to train detectors on both detection and classification data. The model leverages labeled detection images to learn to precisely localize objects, while it uses classification images to increase its vocabulary and robustness.
This model:
- Improves the base detector YOLO to YOLOv2.
- Uses the dataset combination method and the joint training algorithm to train the model.
2. Better¶
Shortcomings compared to region-proposal-based methods:
- A significant number of localization errors.
- Low recall.
Thus YOLOv2 focuses mainly on improving recall and localization while maintaining classification accuracy.
Computer vision generally trends towards larger and deeper networks. Instead of scaling up the network, YOLOv2 simplifies the network and makes the representation easier to learn. Ideas to achieve this:
Batch Normalization: with batch normalization, we can remove dropout from the model without overfitting (more than 2% mAP increase).
High Resolution Classification (4% mAP increase):
- First fine-tune the classification network at the full 448 * 448 resolution for 10 epochs on ImageNet. This gives the network time to adjust its filters to work better on higher-resolution input.
- Then fine-tune the resulting network on detection.
Convolution With Anchor Boxes
RCNN¶
1. Introduction¶
Challenge 1: localizing objects.
- Framing localization as a regression problem does not fare well in practice.
- Sliding-window detector: units high up in the network have large receptive fields and strides in the input image, which makes precise localization within the sliding-window paradigm an open technical challenge.
- RCNN: operates within the "recognition using regions" paradigm.
Steps in test time:
- Generate around 2000 category-independent region proposals for the input image.
- Extract a fixed-length feature vector from each proposal using a CNN.
- Categorize each region with category-specific linear SVMs.
RCNN uses affine image warping to compute a fixed-size CNN input from each region proposal. The overall structure is shown below.
Challenge 2: labeled data is scarce and the amount currently available is insufficient to train a large CNN.
- Unsupervised pretraining, followed by supervised fine-tuning.
- Supervised pretraining on a large auxiliary dataset, followed by domain-specific fine-tuning on a small dataset. RCNN uses this method.
Challenge 3: efficiency
The only class-specific computations are reasonably small matrix-vector products and greedy non-max suppression.
2. Object detection with RCNN¶
2.1 Module design¶
Region Proposal available method:
- objectness
- selective search (used in RCNN)
- category independent object proposals
- constrained parametric min-cut
- multi-scale combinatorial grouping
Feature extraction: RCNN extracts a 4096-dimensional feature vector from each region proposal. Features are computed by forward propagating a mean-subtracted RGB image through five conv layers and 2 fully connected layers.
2.2 Test-time detection¶
- RCNN runs selective search on the test image to extract around 2000 region proposals.
- Warp each proposal to the same size.
- Forward propagate each warped proposal through the CNN to compute features.
- For each class, score each extracted feature vector using the SVM trained for that class.
- Given all scored regions in an image, apply non-max suppression (for each class independently), rejecting a region if it has an IOU overlap with a higher-scoring selected region larger than a learned threshold.
Two properties make the computation efficient:
- All CNN parameters are shared across all categories.
- Feature vectors computed by the CNN are low-dimensional compared with other common approaches.
The feature matrix is typically 2000 * 4096 and the SVM weight matrix is 4096 * N, where N is the number of classes.
2.3 Training¶
Pretrain the CNN on a large auxiliary dataset using image level annotation.
Continue stochastic gradient descent (SGD) training of the CNN parameters using only warped region proposals.
- Replace the CNN's ImageNet-specific 1000-way classification layer with a randomly initialized (N+1)-way classification layer (N object classes plus 1 for background).
- CNN architecture is unchanged.
- Treat all region proposals with >= 0.5 IOU with a ground-truth bounding box as positives for that box's class and the rest as negatives.
- Start SGD at a learning rate of 0.001 (1/10th of the initial pretraining rate). This allows fine-tuning to make progress without clobbering the initialization.
- In each SGD iteration, uniformly sample 32 positive windows (over all classes) and 96 background windows to construct a minibatch of size 128. We bias the sampling towards positive windows because they are extremely rare compared to background.
Object category classifier
- How to label a region that partially overlaps a car: threshold the IoU. The threshold 0.3 is selected by grid search over {0, 0.1, ..., 0.5}. Positive examples are defined simply as the ground-truth bounding boxes for each class.
- Once features are extracted and training labels are applied, we optimize one SVM per class. Since the training data are too large to fit into memory, we adopt the standard hard negative mining method.
3 Visualization, Ablation, and Modes of Error¶
3.1 Visualizing learned features¶
First-layer filters can be visualized directly and are easy to understand: they capture oriented edges and opponent colors. Understanding the subsequent layers is more challenging. The idea is to single out a particular unit (feature) in the network and use it as if it were an object detector in its own right. We compute the unit's activations on a large set of held-out region proposals, sort the proposals from highest to lowest activation, perform non-maximum suppression, and then display the top-scoring regions. The method lets the selected unit "speak for itself" by showing exactly which inputs it fires on.
Focal Loss for Dense Object Detection¶
1. Introduction¶
Current state-of-the-art object detectors follow the R-CNN framework. Process:
- Generate a sparse set of candidate object locations.
- Classify each candidate location as one of the foreground classes or as background.
This two-stage framework consistently achieves top accuracy on the challenging COCO benchmark.
One-stage detectors: YOLO, SSD. Faster, with accuracy within 10-40% relative to state-of-the-art two-stage methods.
This paper: a one-stage object detector that matches the state-of-the-art COCO AP of more complex two-stage detectors. To achieve this result, the authors identify class imbalance during training as the main obstacle impeding one-stage detectors and propose a new loss function that eliminates this barrier.
Class imbalance is addressed in R-CNN-like detectors by 2-stage cascade and sampling heuristics.
- The proposal stage rapidly narrows down the number of candidate object locations to a small number (e.g., 1-2k).
- In the second classification stage, sampling heuristics are used to maintain a manageable balance between foreground and background.
A one-stage detector must process a much larger set of candidate object locations regularly sampled across an image (~100k locations). While similar sampling heuristics may also be applied, they are inefficient because the training procedure is still dominated by easily classified background examples. This inefficiency is typically addressed via techniques such as bootstrapping or hard example mining.
In this paper, a new loss function is proposed that acts as a more effective alternative to previous approaches for dealing with class imbalance. The loss function is a dynamically scaled cross-entropy loss, where the scaling factor decays to 0 as confidence in the correct class increases.
This scaling factor automatically down-weights the contribution of easy examples during training and rapidly focuses the model on hard examples. Finally, the exact form of the focal loss is not crucial, and other instantiations can achieve similar results.
3. Focal Loss¶
I am not using the binary-class example from the paper; instead, I found a great explanatory tutorial here.
Cross Entropy:
Where \(y_i = 1\) if the training example belongs to class i (\(\vec{y}\) is a one-hot vector) and \(P_i\) is the predicted probability. Even examples that are easily classified, with \(P_i \gg 0.5\), incur a loss of non-trivial magnitude. Although the contribution from a single easy example is small, the total contribution from a large number of such examples can be non-trivial: summed over many easy examples, these small loss values can overwhelm the rare class.
3.2 Focal Loss Definition¶
\((1 - p_i)^{\gamma}\): the modulating factor. \(\gamma\): the focusing parameter.
- When the class is misclassified and \(p_i\) is small, the modulating factor is close to 1 and the loss is unaffected. As \(p_i \to 1\), the factor goes to 0, and well-classified examples are down-weighted.
- The focusing parameter \(\gamma\) smoothly adjusts the rate at which easy examples are down-weighted. When \(\gamma = 0\), FL is equivalent to cross-entropy. As \(\gamma\) increases, the effect of the modulating factor likewise increases.
In practice, we use the \(\alpha\)-balanced variant of the focal loss:
This slightly improves accuracy over the non-\(\alpha\)-balanced form. The implementation of the loss layer combines the sigmoid operation for computing p with the loss computation, resulting in greater numerical stability.
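A minimal numpy sketch of the \(\alpha\)-balanced focal loss for the binary case (the paper folds the sigmoid into the loss for numerical stability; that detail is omitted here):

    import numpy as np

    def focal_loss(p, y, gamma=2.0, alpha=0.25):
        """p: predicted probability of the positive class, y: 0/1 label."""
        p_t = np.where(y == 1, p, 1 - p)                # probability of the true class
        alpha_t = np.where(y == 1, alpha, 1 - alpha)    # alpha-balanced weighting
        return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

    # Easy example (p_t = 0.9) vs hard example (p_t = 0.1):
    print(focal_loss(np.array([0.9, 0.1]), np.array([1, 1])))
    # The easy example's loss is scaled by (1 - 0.9)^2 = 0.01 relative to cross-entropy,
    # while the hard example keeps most of its loss.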
3.3 Class Imbalance and Model Initialization¶
Problem: if we initialize equal probability for all classes, the loss due to the frequent class can dominate the total loss and cause instability early in training. Solution: introduce a "prior" for the value of p estimated by the model for the rare class at the start of training. We denote the prior by \(\pi\) and set it so that the model's estimated p for examples of the rare class is low. This improves training stability for both cross-entropy and focal loss in the case of heavy class imbalance.
3.4 Class Imbalance and 2-stage detectors¶
How 2-stage detector address class imbalance:
Cascade:
- reduces the nearly infinite set of possible object locations down to 1000 or 2000;
- the proposals are not random: they are likely to correspond to true object locations, which removes the vast majority of easy negatives.
Biased minibatch:
- construct minibatches that contain, for instance, a 1:3 ratio of positive to negative examples. The ratio is like an implicit \(\alpha\)-balancing factor implemented via sampling.
4. RetinaNet Detector¶
One backbone network: compute convolutional feature map over an entire image, off-the-shelf conv net.
Two task specific subnetworks
- Convolutional object classification on the backbone’s output
- Bounding box regression
Feature Pyramid Network (FPN) backbone: augments a standard convolutional network with a top-down pathway and lateral connections so the network efficiently constructs a rich, multi-scale feature pyramid from a single-resolution input image. Each level of the pyramid can be used for detecting objects at a different scale. The use of the FPN backbone is important; preliminary experiments using features from only the final ResNet layer yield low AP.
Anchors: in total there are A = 9 anchors per level, and across levels they cover the scale range 32-813 pixels with respect to the network's input image. Each anchor is assigned a length-K one-hot vector of classification targets, where K is the number of object classes, and a 4-vector of box regression targets. Anchors are assigned to ground-truth object boxes using an IoU threshold of 0.5, and to background if their IoU is in [0, 0.4). As each anchor is assigned to at most one object box, we set the corresponding entry in its length-K label vector to 1 and all other entries to 0. If an anchor is unassigned, which may happen with IoU in [0.4, 0.5), it is ignored during training. Box regression targets are computed as the offset between each anchor and its assigned object box, or omitted if there is no assignment.
Classification subnet: predicts the probability of object presence at each spatial position for each of the A anchors and K object classes. This subnet is a small FCN attached to each FPN level; its parameters are shared across all pyramid levels. Taking an input feature map with C channels from a given pyramid level, the subnet applies four 3 * 3 conv layers, each with C filters and each followed by ReLU, followed by a 3 * 3 conv layer with KA filters. Finally, sigmoid activations are attached to output the KA binary predictions per spatial location.
Box regression subnet: in parallel with the object classification subnet, we attach another small FCN to each pyramid level for the purpose of regressing the offset from each anchor box to a nearby ground-truth object. The design of the box regression subnet is identical to the classification subnet, except that it terminates in 4A linear outputs per spatial location; these 4 outputs predict the relative offset between the anchor and the ground-truth box.
4.1 Inference and training¶
Inference: inference involves simply forwarding an image through the network. To improve speed, we only decode box predictions from at most the 1k top-scoring predictions per FPN level, after thresholding detector confidence at 0.05. The top predictions from all levels are merged, and non-maximum suppression with a threshold of 0.5 is applied to yield the final detections.
Focal loss: it is applied to all ~100k anchors in each sampled image. The total focal loss of an image is computed as the sum of the focal loss over all ~100k anchors, normalized by the number of anchors assigned to a ground-truth box. Reason: the vast majority of anchors are easy negatives and receive negligible loss values under the focal loss. In general, \(\alpha\) should be decreased slightly as \(\gamma\) is increased.
Initialization: pretrained on ImageNet-1k. All new conv layers except the final one in the RetinaNet subnets are initialized with bias b = 0 and a Gaussian weight fill with \(\sigma = 0.01\). For the final layer of the classification subnet, we set the bias initialization to \(b = -\log((1-\pi)/\pi)\), where \(\pi\) specifies that at the start of training every anchor should be labeled as foreground with confidence of about \(\pi\). This initialization prevents the large number of background anchors from generating a large, destabilizing loss value in the first iteration of training.
Optimization: trained with stochastic gradient descent (SGD).
5. Experiments¶
5.1 Training on dense detection¶
- ResNet of depth 50 or 101
- Feature Pyramid Network constructed on top of the ResNet
- 600 Pixel image
Below are the attempts to improve learning.
Network Initialization
- With no modification, training fails quickly.
- Simply changing the last layer so that the prior probability of detecting an object is 0.01 enables effective learning.
Use \(\alpha-balanced\) learning.
Use Focal Loss
Analyze Focal Loss
- Model: ResNet-101, trained with \(\gamma = 2\).
- Apply this model to a large number of random images.
- Sample the predicted distributions for ~\(10^7\) negative windows and ~\(10^5\) positive windows.
- Separately for positives and negatives, compute the focal loss and normalize it so that it sums to 1.
- Given the normalized loss, sort the losses from lowest to highest and plot the cumulative distribution function (CDF) for both positive and negative samples and for different settings of \(\gamma\).
- For positive samples, the CDF looks fairly similar for different values of \(\gamma\).
- For negative samples, the CDF looks dramatically different for different values of \(\gamma\). As \(\gamma\) increases, substantially more weight becomes concentrated on the hard negative examples. With \(\gamma = 2\), the vast majority of the loss comes from a small fraction of samples.
Online Hard Example Mining (OHEM) was proposed to train two-stage detectors by constructing minibatches from high-loss examples. Result: FL is more effective than OHEM for training dense detectors.
- Each example is scored by its loss.
- Non-maximum suppression is then applied
- Minibatch is then constructed with highest-loss examples.
- Unlike FL, OHEM completely discards easy examples.
Hinge loss: set the loss to 0 above a certain value of \(p_t\).
5.2 Model Architecture Design¶
Anchor density: sweep over the number of scales and aspect ratios of the anchors used at each spatial position and each pyramid level in the FPN. Three scales and three aspect ratios yield the best result.
Speed vs. accuracy: a larger backbone gives better accuracy but slower inference.
Deep Driving: Learning Affordance for Direct Perception in Autonomous Driving¶
1 Introduction¶
Mediated perception approach¶
Involves multiple components:
- recognizing driving-relevant objects: lanes, traffic lights, ...
- combining the recognition results with a world representation of the car's immediate surroundings
- an AI-based engine takes all of this information into account and makes a decision
This encompasses the current state-of-the-art approaches for autonomous driving.
Only a small portion of the detected objects are relevant to the driving decision, so this adds unnecessary complexity. The final output is only:
- direction
- speed
Mediated perception computes a high-dimensional world representation, possibly including redundant information.
Most of these systems rely on laser range finders, GPS, radar, and very accurate maps of the environment to reliably parse objects in a scene. Many open challenges remain, which increases complexity and cost.
Behavior reflex approach¶
Construct a direct mapping from the sensory input to a driving action.
It struggles to deal with traffic and complicated driving maneuvers for several reasons:
- Even when input images are similar, different drivers may make completely different decisions, resulting in an ill-posed problem that is confusing when training a regressor.
- Decision-making for behavior reflex is too low-level. A direct mapping cannot see the bigger picture of the situation. E.g., passing a car involves turning slightly in one direction and then in the other direction for some period of time; this level of abstraction fails to capture what is really going on.
- The input is the whole image, so the learning algorithm must determine which parts of the image are relevant. The level of supervision is too weak to force the algorithm to learn this critical information.
Proposal¶
Directly predicts the affordance for driving action, instead of visually parsing the entire scene or blindly mapping an image to steering angles. Learn a mapping from an image to several meaningful affordance indicators of the road situation, including
- angle of the car relative to the road
- distance to the lane markings
- distance to the cars in the current and adjacent lanes
Training: a ConvNet on screenshots from 12 hours of playing a video game.
Define affordance: Affordances are clues about how an object should be used, typically provided by the object itself or its context. For example, even if you’ve never seen a coffee mug before, its use is fairly natural. The handle is shaped for easy grasping and the vessel has a large opening at the top with an empty well inside.
2. Learning affordance for driving perception¶
From the game engine, collect:
- speed of the host car
- relative position to the road's central line
- distance to the preceding cars
Prepare:
- drive the label-collecting car in first-person view
- collect the ground-truth values of the affordance indicators
Training:
Testing: at each time step,
- the trained model takes a driving-scene image from the game and estimates the affordance indicators for driving;
- a driving controller processes the indicators and computes the steering and acceleration/brake commands;
- Driving commands are then sent back to the game and drive the host car.
2.1 Mapping from an image to affordance¶
A ConvNet is used as the direct perception model to map an image to the affordance indicators.
Train a single ConvNet to handle 3 configurations: 1 lane, 2 lanes, and 3 lanes.
Two major types of action:
- follow the lane center line
- change lanes and slow down to avoid collision with the preceding cars
To support these actions, define two sets of representations under two coordinate systems:
- In lane system
- On marking system
3 types of indicators represent the driving situation:
- heading angle
- distance to the nearby lane markings
- distance to the preceding cars
In total, 13 indicators:
All 13 indicators:
Always
- angle: angle between the car’s heading and the tangent of the road
“in lane system”, when driving in the lane
- toMarking_LL: distance to the left lane marking of the left lane
- toMarking_ML: distance to the left lane marking of the current lane
- toMarking_MR: distance to the right lane marking of the current lane
- toMarking_RR: distance to the right lane marking of the right lane
- dist_LL: distance to the preceding car in the left lane
- dist_MM: distance to the preceding car in the current lane
- dist_RR: distance to the preceding car in the right lane
“on marking system”, when driving on the lane marking
- toMarking_L: distance to the left lane marking
- toMarking_M: distance to the central lane marking
- toMarking_R: distance to the right lane marking
- dist_L: distance to the preceding car in the left lane
- dist_R: distance to the preceding car in the right lane
The "in lane" system and the "on marking" system are activated under different conditions. To have a smooth transition, we define an overlapping area.
Except for the heading angle, all the indicators may output an inactive state. There are 2 cases in which an indicator will be inactive:
- When the car is driving in either the "in lane" system or the "on marking" system, the other system is deactivated, and all the indicators belonging to that system are inactive.
- When the car is driving in a boundary lane (left-most or right-most lane) and there is either no left lane or no right lane, the indicators corresponding to the non-existent adjacent lane are inactive.
External Resource on Natural Language Processing¶
This section is about the application of natural language processing topics from outside of the textbook.
Building a Production Model for Retrieval-Based Chatbots¶
Introduction¶
Chatbot dialogue system:
- Generative approach (not focused in this paper)
- Retrieval approach: better control over response quality than generative approaches. Selects outputs from a whitelist of candidate responses.
Challenge:
- Inference speed
- The whitelist selection process and the associated retrieval evaluation; recall is an over-simplified metric.
3. Model Architecture¶
2 inputs:
- context c: the concatenation of all utterances in the conversation. Special tokens indicate whether each utterance comes from the customer or the agent.
- candidate response r.
1 output:
- a score s(c, r) indicating the relevance of the response to the context.
3.1 Dual encoder¶
The core of the model is 2 neural encoders \(f_c\) and \(f_r\) that encode the context and the response, respectively. They have identical architectures but separate weights.
Input of the encoders: \(w = \{w_1, w_2, ..., w_n\}\), either a context or a response.
fastText is used as the word embedding method due to the prevalence of typos in both user and agent utterances in real chats.
Each encoder consists of 1 recurrent neural network and 1 multi-headed attention layer.
Recurrent neural network: multi-layer, bidirectional SRUs, each layer of SRU involves the following computation
where \(\sigma\) is the sigmoid activation function. \(W, W_f, W_r \in R^{d_h * d_e}\) are learned matrices and \(v_f, v_r, b_f, b_v \in R^{d_h}\) are learned parameters
The multi-headed attention layer compresses the encoded sequence \(h = \{h_1, h_2, ..., h_n\}\) into a single vector. For each attention head i, attention weights are generated with the following computation:
where \(\sigma\) is a non-linear activation function, \(W_a^{(i)} \in R^{d_h * d_a}\) is a learned parameter matrix and \(v_a^{(i)} \in R^{d_a}\) is a learned parameter vector.
The encoded sequence representation is then pooled into a single vector for each attention head i by summing the attended representations.
Finally, the pooled encodings are averaged across the \(n_h\) attention heads.
The output of the encoder function is the vector \(f(w) = \hat{h}\)
3.2 Scoring¶
To determine the relevance of a response r to a context c, the model computes a matching scoring between the context encoding \(f_c(c)\) and the response encoding \(f_r(r)\). This score is simply the dot product of the encodings:
\(s(c, r) = f_c(c) \cdot f_r(r)\)
3.3 Training¶
We optimize the model to maximize the score between the context c and the response \(r^+\) actually sent by the agent, while minimizing the score between the context and each of k random "negative" responses \(r_1^-, ..., r_k^-\). Although negative responses could be sampled separately for each context-response pair, we instead share a set of negative responses across all examples in a batch. This has the benefit of reducing the number of responses that need to be encoded in each batch of size b from O(kb) to O(k + b).
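A minimal numpy sketch of this scheme with random stand-in encodings (a softmax cross-entropy over the positive and the shared negatives is one common way to realize this objective; the paper's exact loss may differ):

    import numpy as np

    b, k, d = 32, 8, 256                       # batch size, shared negatives, encoding dim
    ctx = np.random.randn(b, d)                # f_c(c) for each context in the batch
    pos = np.random.randn(b, d)                # f_r(r+) for each true agent response
    neg = np.random.randn(k, d)                # f_r(r-) for k negatives shared by the whole batch

    pos_scores = np.sum(ctx * pos, axis=1)     # s(c, r+) = f_c(c) . f_r(r+), one per example
    neg_scores = ctx @ neg.T                   # s(c, r_j-) for every context/negative pair, (b, k)

    logits = np.concatenate([pos_scores[:, None], neg_scores], axis=1)   # (b, 1 + k)
    logits = logits - logits.max(axis=1, keepdims=True)                  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -log_probs[:, 0].mean()             # the true response is the target "class"
    # Only b + k responses are encoded per batch, instead of k per example.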
3.5 Whitelist Generation¶
Two methods of creating a whitelist from which the model selects responses at inference time:
- Frequency-based method: select the 1,000 or 10,000 most common agent responses.
- Clustering-based method: encode all agent responses using the response encoder \(f_r\) and use k-means clustering with k = 1,000 or k = 10,000 to cluster the responses; then select the most common response from each cluster to create the whitelist.
Part III: Deep Learning Research¶
14 Autoencoders¶
14.1 Undercomplete Autoencoders¶
An autoencoder whose code dimension is less than the input dimension is called undercomplete.
The learning process: minimizing a loss function
where L is a loss function penalizing g(f(x)) for being dissimilar from x, such as the mean squared error.
When the decoder is linear and L is the mean squared error, an undercomplete autoencoder learns to span the same subspace as PCA. In this case, an autoencoder trained to perform the copying task has learned the principal subspace of the training data as a side effect.
In PCA, we apply a transform to higher-dimensional data \(x \in R^d\) with \(U \in R^{d \times p}\), where \(d > p\):
To reconstruct the original data you can apply:
You can imagine a simple autoencoder applying \(U^T\) as the encoder and \(U\) as the decoder. That is the same logic as PCA.
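A small numpy sketch of this correspondence on synthetic data, using the top-p principal directions as the linear encoder/decoder pair:

    import numpy as np

    X = np.random.randn(500, 10)                # data with d = 10
    X = X - X.mean(axis=0)                      # center, as PCA assumes
    p = 3                                       # code dimension (p < d)

    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    U = Vt[:p].T                                # U in R^{d x p}, top-p principal directions

    code = X @ U                                # "encoder": apply U^T to each x
    recon = code @ U.T                          # "decoder": apply U to each code

    # Mean squared reconstruction error of projecting onto the principal subspace
    print(np.mean((X - recon) ** 2))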
Review on info theory¶
Low probability: more information. High probability: less information.
Information gain:
Entropy (expected information gain)
KL divergence, which measures how much one distribution differs from another:
KL divergence is not a true distance metric. The KL divergence above is of Q with respect to P. There are two important properties of KL divergence:
- \(KL >= 0\)
- \(KL(P||Q) \neq KL (Q||P)\) (not symmetric)
Variational inference¶
Suppose x is an observation and z is a hidden variable. In order to compute the relationship between z and x, we have to compute:
Computing \(P(x) = \int P(x|z)p(z)dz\) is very challenging (imagine z with a very high dimension); sometimes it is intractable. There are two ways to deal with this challenge:
- Monte Carlo Method (not focused here)
- Variational Inference (explained below)
We can approximate P(z|x) with Q(z). If we choose Q to be a tractable distribution, such as a Gaussian or Bernoulli, then we have an approximate distribution that we can compute. So the goal is to minimize:
So we can rewrite the equation as
Since logP(x) is fixed with respect to q, minimizing \(KL(q(z) \| p(z|x))\) is the same as maximizing \(\sum_z q(z) \log \frac{p(x, z)}{q(z)}\) (call it L). This term L is also called the variational lower bound. Since the KL divergence is always non-negative, logP(x) is always greater than or equal to L.
To maximize L, we need to make \(E_{q(z)} \log p(x|z)\) as large as possible; we can think of \(E_{q(z)} \log p(x|z)\) as the reconstruction term. If x is Gaussian, maximizing \(E_{q(z)} \log p(x|z)\) is the same as minimizing \(\|x - \hat{x}\|^2\); if x is Bernoulli, it is the same as minimizing the cross-entropy. At the same time, we need to make q(z) as similar to p(z) as possible.
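A compact summary of this decomposition (standard variational-inference algebra, written out here since the rendered equations were not preserved):

    \log P(x) = \underbrace{\sum_z q(z) \log \frac{p(x, z)}{q(z)}}_{L} + KL\big(q(z) \,\|\, p(z|x)\big)

    L = E_{q(z)}[\log p(x|z)] - KL\big(q(z) \,\|\, p(z)\big), \qquad \log P(x) \ge L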
Reference¶
14.3 Representational Power, Layer Size and Depth¶
Review¶
The universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units.
According to the universal approximation theorem, there exists a network large enough to achieve any degree of accuracy we desire.
A feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.
Choosing a deep model encodes a very general belief that the function we want to learn should involve the composition of several simpler functions.
Apply to autoencoder¶
The encoder and decoder are both feedforward networks, so they can individually benefit from depth, as can the autoencoder as a whole.
An autoencoder with a single hidden layer is able to represent the identity function, but the mapping from input to code is shallow, so it is not able to enforce arbitrary constraints, e.g., that the code should be sparse.
Depth exponentially reduces: 1. the computational cost, 2. the amount of training data needed.
A common training strategy is to greedily pretrain the deep architecture by training a stack of shallow autoencoders.
14.4 Stochastic Encoders and Decoders¶
Given a hidden code h, we may think of the decoder as providing a conditional distribution \(p_{decoder}(x|h)\). We may train the autoencoder by minimizing \(-\log p_{decoder}(x|h)\).
- x is Gaussian: the negative log-likelihood yields the mean squared error.
- x is Bernoulli: the negative log-likelihood yields the cross-entropy loss.
See p129 5.5.1 for review
14.5 Denoising Autoencoders¶
A denoising autoencoder (DAE) is an autoencoder that receives a corrupted data point as input and is trained to predict the original, uncorrupted data point as its output.
It learns a reconstruction distribution \(p_{reconstruct}(x|\hat{x})\) by:
- Sampling x from the training data
- Sampling a corrupted version \(\hat{x}\) from \(C(\hat{x}|x)\)
- Using \((x, \hat{x})\) as a training example for estimating the autoencoder reconstruction distribution \(p_{reconstruct}(x|\hat{x})\), which is equal to \(p_{decoder}(x|h)\) (h: the output of the encoder \(f(\hat{x})\); decoder: g(h))
We can view the DAE as performing stochastic gradient descent on the following expectation:
where \(\hat{p}_{data}(x)\) is the training distribution.
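A toy sketch of this training loop for a one-hidden-layer DAE with tied weights, Gaussian corruption C, and squared error (all of these choices are illustrative assumptions, not the only option):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, sigma=0.5):
    """C(x_hat | x): add isotropic Gaussian noise (one common choice)."""
    return x + sigma * rng.normal(size=x.shape)

def dae_step(x, W, b, c, lr=0.1):
    """One stochastic gradient step of a tiny tied-weight DAE with a
    sigmoid encoder and linear decoder (a toy sketch)."""
    x_tilde = corrupt(x)
    h = 1.0 / (1.0 + np.exp(-(W @ x_tilde + b)))   # encoder f(x_tilde)
    r = W.T @ h + c                                 # decoder g(h)
    err = r - x                                     # gradient of 0.5*||r - x||^2 w.r.t. r
    dh = (W @ err) * h * (1 - h)                    # backprop through encoder
    W -= lr * (np.outer(dh, x_tilde) + np.outer(h, err))
    b -= lr * dh
    c -= lr * err
    return 0.5 * np.sum(err ** 2)

d, k = 5, 3
W, b, c = 0.1 * rng.normal(size=(k, d)), np.zeros(k), np.zeros(d)
x = rng.normal(size=d)
for _ in range(100):
    loss = dae_step(x, W, b, c)
print(loss)
```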
Score matching: encourage the model to have the same score as the data distribution at every training point x. In this context, the score is the gradient field \(\nabla_x log p(x)\).
Learning the gradient field of \(log p_{data}(x)\) is one way to learn the structure of \(p_{data}\) itself.
Important property of the DAE: a conditionally Gaussian p(x | h) makes the autoencoder learn a vector field (g(f(x)) - x) that estimates the score of the data distribution.
Training with squared error criterion
and corruption:
A generic encoder-decoder architecture may be made to estimate the score.
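As a concrete instance of what the score looks like, a sketch using a 1-D Gaussian, where \(\nabla_x log p(x)\) has a closed form:

```python
import numpy as np

def gaussian_score(x, mu=0.0, sigma=1.0):
    """Score of a 1-D Gaussian: d/dx log p(x) = -(x - mu) / sigma^2.
    The DAE's learned vector field g(f(x)) - x points toward regions of
    higher probability, like this quantity does."""
    return -(x - mu) / sigma**2

xs = np.linspace(-3, 3, 7)
print(gaussian_score(xs))
```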
Reference¶
14.6 Learning Manifolds with Autoencoder¶
Tangent planes: At a point x on a d-dimensional manifold, the tangent plane is given by d basis vectors that span the local directions of variation allowed on the manifold.
Two forces of autoencoder training procedure:
Learning a representation h of training examples x. The autoencoder need not successfully reconstruct inputs that are not probable under the data-generating distribution, since x is drawn from the training data, not directly from the data-generating distribution.
Satisfying the constraint or regularization penalty, such as below, which prefer solutions that are less sensitive to the input.
- Architectural constraint
- regularization term
The two forces together are useful because they force the hidden representation to capture information about the structure of the data-generating distribution. The autoencoder can afford to represent only the variations that are needed to reconstruct training examples. The encoder learns a mapping from the input space x to a representation space, a mapping that is only sensitive to changes along the manifold directions, but that is insensitive to changes orthogonal to the manifold.
How to characterize a manifold: a representation of the data points on (or near) the manifold. Such a representation for a particular example is also called an “embedding”. It is typically given by a low-D vector, with fewer dimensions than the ambient space of which the manifold is a low-D subset.
Nearest Neighbor Graph:
A global coordinate system can then be obtained through an optimization or by solving a linear system
Fundamental problem with such local nonparametric approaches: if the manifold is not very smooth, one may need a very large number of training examples to cover each one of these variations, with no chance to generalize to unseen variations. This motivates the use of distributed representations and deep learning for capturing manifold structure.
14.7 Contractive Autoencoders¶
The contractive autoencoder introduces an explicit regularizer on the code h = f(x), encouraging the derivatives of f to be as small as possible.
This penalty: the sum of squared elements of the Jacobian matrix of partial derivatives associated with the encoder function.
With small Gaussian noise on the input, the DAE is equivalent to a contractive penalty on the reconstruction function that maps x to \(r=g(f(x))\). In other words, the DAE makes the reconstruction function resist small but finite-sized perturbations of the input.
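A sketch of the contractive penalty for a single sigmoid encoder layer, where the Jacobian has a simple closed form (the weights here are hypothetical stand-ins):

```python
import numpy as np

def contractive_penalty(x, W, b):
    """||J_f(x)||_F^2 for h = sigmoid(W x + b).
    The Jacobian of this encoder is diag(h*(1-h)) @ W, so the squared
    Frobenius norm is sum_j (h_j (1 - h_j))^2 * ||W_j||^2."""
    h = 1.0 / (1.0 + np.exp(-(W @ x + b)))
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 5)), np.zeros(3), rng.normal(size=5)
print(contractive_penalty(x, W, b))
```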
Review from 14.5.1 Estimating the score
Score matching (Hyvärinen, 2005) is an alternative to maximum likelihood. It provides a consistent estimator of probability distributions based on encouraging the model to have the same score as the data distribution at every training point x. In this context, the score is a particular gradient field: \(\nabla_x log p(x)\).
Denoising training of a specific kind of autoencoder (sigmoidal hidden units,linear reconstruction units) using Gaussian noise and mean squared error as the reconstruction cost is equivalent (Vincent, 2011) to training a specific kind of undirected probabilistic model called an RBM with Gaussian visible units.
Because the CAE is trained to resist perturbations of its input, it is encouraged to map a neighborhood of input points to a smaller neighborhood of output points.
The CAE is contractive only locally – all perturbations of a training point x are mapped near to f(x). Globally, two different points x and x’ may be mapped to points f(x) and f(x’) that are farther apart than the original points.
We can think of J at a point x as approximating the nonlinear encoder by a linear operator. A linear operator is said to be contractive if the norm of Jx remains less than or equal to 1 for all unit-norm x. We can think of the CAE as penalizing the Frobenius norm of the local linear approximation of f(x) at every training point x in order to encourage each of these local linear operators to become a contraction.
Review on 14.6: Two forces of autoencoder training procedure:
- Learning a representation h of training examples x. The autoencoder need not successfully reconstruct inputs that are not probable under the data-generating distribution, since x is drawn from the training data, not directly from the data-generating distribution.
- Satisfying the constraint or regularization penalty, such as below, which prefer solutions that are less sensitive to the input.
In the case of the CAE, we have two forces:
- Reconstruction error
- Contractive penalty
Directions with a large Jx rapidly change h, so they are likely to be the directions that approximate the tangent planes of the manifold. Training the CAE results in most singular values of J dropping below 1. The directions corresponding to the largest singular values are interpreted as the tangent directions that the CAE has learned.
14.8 Predictive Sparse Decomposition¶
Training proceeds by minimizing
During training, h is controlled by the optimization algorithm. The training alternates between minimization with respect to h and minimization with respect to the model parameters.
The iterative optimization is used only during training. The parametric encoder f is used to compute the learned features when the model is deployed. PSD models may be stacked and used to initialize a deep network to be trained with another criterion.
14.9 Applications of Autoencoders¶
Autoencoders have been successfully applied to dimensionality reduction and information retrieval.
Lower dimensional representations can improve performance on many tasks, such as classification.
Information retrieval: the task of finding entries in a database that resemble a query entry. Benefits:
- The usual benefits that dimensionality reduction provides for other tasks
- Search can be extremely efficient in certain kinds of low-dimensional spaces. Semantic hashing: train the dimensionality reduction algorithm to produce a code that is low-dimensional and binary, then store all database entries in a hash table that maps binary code vectors to entries. We can also search over slightly less similar entries efficiently, just by flipping individual bits from the encoding of the query.
Semantic hashing has been applied to both textual input and images.
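A toy sketch of the bit-flip lookup described above; the 4-bit codes are made up rather than produced by a trained binary encoder:

```python
from itertools import combinations

# Hypothetical 4-bit codes produced by some trained binary encoder.
db = {
    (0, 1, 1, 0): ["doc_a"],
    (0, 1, 0, 0): ["doc_b"],
    (1, 1, 1, 0): ["doc_c"],
}

def search(code, max_flips=1):
    """Return entries whose code is within `max_flips` bit flips of `code`."""
    hits = list(db.get(code, []))
    for r in range(1, max_flips + 1):
        for idx in combinations(range(len(code)), r):
            flipped = tuple(b ^ 1 if i in idx else b for i, b in enumerate(code))
            hits += db.get(flipped, [])
    return hits

print(search((0, 1, 1, 0), max_flips=1))  # exact match plus 1-bit neighbors
```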
15 Representation Learning¶
Information processing could be easy or difficult depending on how the information is represented.
Eg 1:
- 210 / 6: easy, straightforward.
- CCX / VI: difficult.
Eg 2: Insert a number into
- sorted linked list: O(n) runtime
- red-black tree: O(log n)
Feedforward networks trained by supervised learning: the last layer is a linear classifier or nearest neighbor classifier. The rest of the network learns to provide a representation to this classifier. Training with a supervised criterion naturally leads to the representation at every hidden layer taking on properties that make the classification easier.
Just like supervised networks, unsupervised deep learning algorithms have a main training objective but also learn a representation as a side effect. Most representation learning problems face a trade-off between
- Preserving as much information about the input as possible
- Attaining nice properties such as independence
15.1 Greedy Layer-Wise Unsupervised Pretraining¶
Greedy layer-wise unsupervised pretraining relies on a single-layer representation learning algorithm. Each layer is pretrained using unsupervised learning, taking the output of the previous layer and producing as output a new representation of the data, whose distribution is hopefully simpler.
Greedy layer-wise unsupervised pretraining name explanation:
Greedy: Optimize each piece of the solution independently, one piece at a time.
Layer-Wise: The independent pieces are the layers of the network. Training proceeds one layer at a time, training the k-th layer while keeping the previous ones fixed.
Unsupervised: each layer is trained with an unsupervised representation learning algorithm.
Pretraining:
- 1st step/phase: Greedy Layer-Wise Unsupervised Pretraining
- 2nd step/phase: Fine-tune all the layers together
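A schematic of the two phases (`train_autoencoder`, `encode` and the supervised fine-tuning step are hypothetical placeholders for whatever single-layer learner and fine-tuning procedure are used):

```python
def greedy_layerwise_pretrain(data, layer_sizes, train_autoencoder, encode):
    """Phase 1: pretrain one layer at a time, keeping earlier layers fixed."""
    layers, h = [], data
    for size in layer_sizes:
        layer = train_autoencoder(h, size)  # unsupervised single-layer learner
        layers.append(layer)
        h = encode(layer, h)                # its output becomes the next layer's input
    return layers

# Phase 2 (sketch): initialize a supervised network with `layers` and
# fine-tune all layers together on labeled data, e.g.:
# model = build_network(layers); fine_tune_supervised(model, labeled_data)
```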
In the context of supervised learning, it could be viewed as
- A regularizer
- A form of parameter initialization
Greedy Layer-Wise unsupervised pretraining can also be used as initialization for other unsupervised learning algorithms such as
Deep Autoencoders
Probabilistic models with many layers of latent variables, including:
- Deep Belief Network
- Deep Boltzmann Machines
When and why does unsupervised pretraining work¶
Unsupervised pretraining combines 2 ideas:
Idea 1: The choice of initial parameters for a deep NN can have a significant regularizing effect on the model¶
It remains possible that pretraining initializes the model in a location that would otherwise be inaccessible. Eg:
- a region surrounded by areas where the cost function varies so much from one example to another that minibatches give only a very noisy estimate of the gradient
- a region surrounded by areas where the Hessian matrix is so poorly conditioned that gradient descent methods must use very small steps.
We cannot characterize exactly what aspects of the pretrained parameters are retained during the supervised training stage. This is one of the reasons why modern approaches typically use simultaneous unsupervised learning and supervised learning rather than two sequential stages.
You may also freeze the parameters for the feature extractor and use supervised learning only to add a classifier on top of the learned features.
Idea 2: Learning about the input distribution can help with learning about the mapping from input to output.¶
Some features that are useful for the unsupervised task may also be useful for the supervised task. This is not yet understood at a mathematical, theoretical level. Many aspects of this approach are highly dependent on the specific models used. E.g.: if we wish to add a linear classifier on top of pretrained features, the features must make the underlying classes linearly separable. This is another reason that simultaneous supervised and unsupervised learning can be preferable.
Interpretations of the function of unsupervised training¶
Unsupervised training as learning a representation: We can expect unsupervised pretraining to be more effective when the initial representation is poor. E.g.: the use of word embeddings. It is less useful for image processing, perhaps because images already lie in a rich vector space where distances provide only a low-quality similarity metric.
Unsupervised training as a regularizer: We can also expect unsupervised pretraining to be most helpful when the number of labeled examples is very small.
Unsupervised pretraining is likely to be most useful when the function to be learned is extremely complicated.
Success cases of Unsupervised Pretraining¶
Unsupervised pretraining has usually been used to improve classifiers and is usually interesting from the point of view of reducing test set error. It can also help tasks other than classification.
Improvements to training error and test error may both be explained in terms of unsupervised pretraining taking the parameters into a region that would otherwise be inaccessible.
- NNs with unsupervised pretraining: halt in the same smaller region of function space, suggesting that pretraining reduces the variance of the estimation process => reduces the risk of severe overfitting.
- NNs without unsupervised pretraining: halt in another region.
In other words: unsupervised pretraining initializes NN parameters in a region that they do not escape, and the results following this initialization are more consistent and less likely to be very bad than without pretraining.
Disadvantage of Unsupervised Pretraining¶
It operates with two separate training phases.
- Unsupervised pretraining does not offer a clear way to adjust the strength of the regularization arising from the unsupervised stage. When we perform unsupervised and supervised learning simultaneously, instead of using the pretraining strategy, there is a single hyperparameter, usually a coefficient attached to the unsupervised cost, that determines how strongly the unsupervised objective will regularize the supervised model.
- Each of the two separate training phases has its own hyperparameters. The performance of the second phase cannot be predicted during the first phase, so there is a long delay between proposing hyperparameters for the first phase and being able to update them using feedback from the second phase. Most principled approach: use validation set error in the supervised phase to select the hyperparameters of the pretraining phase.
Today, unsupervised pretraining has been largely abandoned, except for natural language processing.
Deep learning techniques based on supervised learning, regularized with dropout or batch normalization, are able to achieve human-level performance on many tasks, but only with extremely large labeled datasets. On extremely small datasets, such as the alternative splicing dataset, Bayesian methods outperform methods based on unsupervised pretraining.
The idea of pretraining has been generalized to supervised pretraining, as a common approach for transfer learning.
Supervised pretraining for transfer learning is popular for use with convolutional networks pretrained on ImageNet dataset. Practitioners publish the parameters of the trained networks for this purpose, just as pretrained word vectors are published for natural language processing tasks.
Review on 8.7.4 Supervised pretraining:
15.2 Transfer Learning and Domain Adaptation¶
Transfer learning and domain adaptation refer to situations where what has been learned in one setting (e.g. distribution P1) is exploited to improve generalization in another setting (say, distribution P2).
Transfer learning¶
The learner must perform two or more different tasks; we assume that many of the factors that explain the variations in P1 are relevant to the variations that need to be captured for learning P2.
Transfer learning, multitask learning and domain adaptation can be achieved via representation learning when there exist features that are useful for the different settings or tasks, corresponding to underlying factors that appear in more than one setting. Review:
Two extreme form of transfer learning:
(1) One-shot learning¶
Only one labeled example of the transfer task is given for one-shot learning. It is possible because the representation learned during the first stage cleanly separates the underlying classes. During the transfer learning stage, only one labeled example is needed to infer the label of many possible test examples that all cluster around the same point in representation space.
(2) Zero-shot learning¶
No labeled examples are given at all. It is only possible because additional information has been exploited during training. Three random variables:
- traditional input x
- traditional output y
- an additional random variable describing the task, T
The model is trained to estimate P(y|x, T). E.g., even if we have no labeled examples translating word A in language X to word B in language Y, we can generalize and guess a translation for word A. Illustrated below:
Domain Adaptation¶
The task (and the optimal input-to-output mapping) remains the same between each setting, but the input distribution is slightly different. E.g., a sentiment predictor trained on customer reviews of media content, such as books and videos, is later used to analyze comments about consumer electronics such as TVs and smartphones. Simple unsupervised pretraining (with a DAE) has been found to be very successful for sentiment analysis with domain adaptation.
In all these cases, the objective is to take advantage of data from the first setting to extract information that may be useful when learning or even when directly making predictions in the second setting. Core idea of representation learning: the same representation may be useful in both settings.
15.3 Semi-Supervised Disentangling of Causal Factors¶
- Question: What makes one representation better than another?
- One hypothesis: an ideal representation is one in which the features within the representation correspond to the underlying causes of the observed data, with separate features or directions in the feature space corresponding to different causes, so that the representation disentangles the causes from one another.
This hypothesis motivates approaches in which we first seek a good representation for p(x). Such a representation may also be a good representation for computing p(y|x) if y is among the most salient causes. The following properties coincide for many AI tasks:
- We are able to obtain the underlying explanations for what we observe
- It generally becomes easier to isolate individual attributes from the others.
Situation 1: unsupervised learning fails to help supervised learning¶
- P(x): uniformly distributed
- Goal: learn E[y | x]
Situation 2: unsupervised learning succeeds in helping supervised learning¶
If y is closely associated with one of the causal factors of x, then p(x) and p(y|x) will be strongly tied, and unsupervised representation learning that tries to disentangle the underlying factors of variation is likely to be useful as a semi-supervised learning strategy.
Problem: Most of observations are formed by an extremely large number of underlying causes.
Brute-force solution: an unsupervised learner learns a representation that captures all the reasonably salient generative factors and disentangles them from each other, thus making it easy to predict y from h, regardless of which \(h_i\) is associated with y. This is not feasible: it is not possible to capture all or most of the factors of variation that influence an observation.
Two main strategies are used:
- Use a supervised learning signal at the same time as the unsupervised learning signal, so that the model will choose to capture the most relevant factors of variation.
- Use a much larger representation if using purely unsupervised learning.
An emerging strategy for unsupervised learning is to modify the definition of which underlying causes are most salient. Historically, autoencoders and generative models have been trained to optimize a fixed criterion. These fixed criteria determine which causes are considered salient. See the figure below for how MSE fails to learn to encode a small ping-pong ball:
Other definitions of salient are possible, e.g., a recognizable pattern. One way to implement such a definition of salience is to use a Generative Adversarial Network (GAN):
- A feedforward classifier: attempts to recognize all samples from the generative model as being fake and samples from training set as real
- Generative model: provide “fake samples”
The network will learn what is salient.
If the true generative process has x as an effect and y as a cause, then modeling p(x|y) is robust to changes in p(y). If the cause-effect relationship were reversed, that would not be true. Very often, when we consider changes in distribution due to different domains, temporal nonstationarity, or changes in the nature of the task, the causal mechanisms remain invariant. So, better generalization and robustness to all kinds of change can be expected via learning a generative model that attempts to recover the causal factors h and p(x|h).
15.4 Distributed Representation¶
Distributed representations: representations composed of many elements that can be set separately from each other.
Symbolic representation: the input is associated with a single symbol or category.
- If there are n symbols in the dictionary, one can imagine n feature detectors, each corresponding to the detection of the presence of the associated category.
- Only n different configurations of the representation space are possible, carving n different regions in input space.
- Also called one-hot representation
- One example of non-distributed representation.
Nondistributed representations: representations that may contain many entries but without significant meaningful separate control over each entry.
Examples of learning algorithms based on nondistributed representations:
- Clustering methods, including the k-means algorithm: each input is assigned to exactly one cluster
- k-nearest neighbors: if k > 1, multiple values describe each input, but they cannot be controlled separately from each other, so this does not qualify as a true distributed representation.
- Decision trees: only one leaf is activated when the input is given
- Gaussian mixtures and mixtures of experts: each input is represented with multiple values, but those values cannot readily be controlled separately from each other.
- Kernel machines with a Gaussian kernel
- Language or translation models based on n-grams
See the review of some of these ML algorithms at the end of this summary. E.g. of a representation with 3 binary features:
E.g. of a non-distributed representation: nearest neighbor
One important concept in distributed representations: generalization arises due to shared attributes between different concepts. Distributed representations induce a rich similarity space, in which semantically close concepts (or inputs) are close in distance. This is a property absent from purely symbolic representations.
Distributed representations can have a statistical advantage when an apparently complicated structure can be compactly represented using a small number of parameters.
E.g., see figure 15.7. Each binary feature in this representation divides \(\mathbb{R}^d\) into a pair of half-spaces. The exponentially large number of intersections of the n corresponding half-spaces determines how many regions this distributed representation learner can distinguish. The number of regions this binary feature representation can distinguish is \(\sum_{j=0}^{d} \binom{n}{j} = O(n^d)\).
We can see a growth that is exponential in the input size and polynomial in the number of hidden units.
With O(nd) parameters (for n linear threshold features in \(\mathbb{R}^d\)), we can distinctly represent \(O(n^d)\) regions in the input space. If we made no assumptions at all about the data, and used a representation with one unique symbol for each region and separate parameters for each symbol to recognize its corresponding portion of \(\mathbb{R}^d\), then specifying \(O(n^d)\) regions would require \(O(n^d)\) examples.
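A quick numerical check of the counting argument, using the closed-form count \(\sum_{j=0}^{d} \binom{n}{j}\) for n hyperplanes in general position in \(\mathbb{R}^d\):

```python
from math import comb

def num_regions(n, d):
    """Number of regions that n hyperplanes in general position carve
    out of R^d: sum_{j=0}^{d} C(n, j), which grows like O(n^d)."""
    return sum(comb(n, j) for j in range(d + 1))

print(num_regions(10, 2))   # 56 regions from 10 features in R^2
print(num_regions(10, 3))   # 176 regions in R^3
```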
If a parametric transformation with k parameters can learn about r regions in input space, with k << r, and if obtaining such a representation is useful to the task of interest, then we could potentially generalize much better in this way than in a nondistributed setting, where we would need O(r) examples to obtain the same features and associated partitioning of the input space into r regions.
Another reason why distributed representations generalize well: their capacity remains limited despite being able to distinctly encode so many different regions. Reason for the limited capacity: while we can assign very many unique codes in representation space, we cannot use absolutely all of the code space, nor can we learn arbitrary functions mapping from the representation space h to the output space y using a linear classifier. The use of a distributed representation combined with a linear classifier thus expresses a prior belief that the classes to be recognized are linearly separable as a function of the underlying causal factors captured by h. E.g., we typically do not want to partition data into the set of all red cars and green trucks as one class and the set of all green cars and red trucks as another class.
Validation of distributed representations:
Review on KNN 5.7.3¶
One weakness of k-nearest neighbors is that it cannot learn that one feature is more discriminative than another. For example, imagine we have a regression task with \(x \in \mathbb{R}^{100}\) drawn from an isotropic Gaussian distribution, but only a single variable \(x_1\) is relevant to the output. Suppose further that this feature simply encodes the output directly, that \(y = x_1\) in all cases. Nearest neighbor regression will not be able to detect this simple pattern. The nearest neighbor of most points x will be determined by the large number of features \(x_2\) through \(x_{100}\), not by the lone feature \(x_1\). Thus the output on small training sets will essentially be random.
Review on decision tree 5.7.3¶
Resources¶
15.5 Exponential Gains from Depth¶
Multilayer perceptrons are universal approximators, but some functions can be represented by exponentially smaller deep networks compared to shallow networks. A decrease in model size leads to improved statistical efficiency.
The factors that can be chosen almost independently from each other yet still correspond to meaningful inputs are more likely to be very high level and related in highly non-linear ways to the input. This demands deep distributed representations.
- Higher-level features: seen as functions of the input
- Higher-level factors: seen as generative causes
Both are obtained through the composition of many non-linearities.
Exponential boost to statistical efficiency comes from
- Organizing computation through the composition of many nonlinearities
- A hierarchy of reused features
- Distributed representation
Theoretically: there are families of functions that can be represented efficiently by an architecture of depth k, but that would require an exponential number of hidden units (with respect to the input size) with insufficient depth (depth 2 or depth k - 1).
- Deterministic feedforward networks: universal approximators of functions
- Structured probabilistic models with a single hidden layer of latent variables, including restricted Boltzmann machines and deep belief networks: universal approximators of probability distributions.
Resource¶
15.6 Providing Clues to Discover Underlying Causes¶
Most strategies of representation learning are based on introducing clues that help the learner find the underlying factors of variation. The clues can help the learner separate these underlying factors from the others.
- Supervised learning provides a very strong clue: a label y that usually specifies the value of at least one of the factors of variation directly.
- To make use of abundant unlabeled data, representation learning makes use of other, less direct hints. These hints take the form of implicit prior beliefs we impose in order to guide the learning.
Generic regularization strategies (to achieve better generalization)
- Smoothness: This is the assumption that \(f(x+\epsilon d) \approx f(x)\) for unit d and small \(\epsilon\). Allows learners to generalize from training examples to nearby points in the input space. Insufficient to overcome the curse of dimensionality.
- Linearity: Many learning algorithms assume that relationship between some variables are linear.
- Multiple explanatory factors: assume that the data is generated by multiple underlying explanatory factors and that most tasks can be solved easily given the state of each of these factors.
- Causal factors: Treats the factors of variation described by the learned representation h as the causes of the observed data x, and not vice versa.
- Depth or a hierarchical organization of explanatory factors: high-level, abstract concepts can be defined in terms of simple concepts, forming a hierarchy.
- Shared factors across tasks
- Natural clustering: each connected manifold in the input space may be assigned to a single class.
- Temporal and spatial coherence: Slow Feature Analysis assumes that the most important explanatory factors change slowly over time.
- Sparsity: most features should presumably not be relevant to describing most inputs.
- Simplicity of factor dependencies: in good high-level representations, the factors are related to each other through simple dependencies. This is assumed when plugging a linear predictor or a factorized prior on top of a learned representation.
16 Structured Probabilistic Models for Deep Learning¶
A structured probabilistic model is a way of describing a probability distribution, using a graph to describe which random variables in the distribution interact with each other directly.
16.1 The Challenge of Unstructured Modeling¶
The goal of deep learning: scale ML to the kinds of challenges needed to solve artificial intelligence. This means being able to understand high-dimensional data with rich structure.
The process of classification discards most of the information in the input and produces a single output. The classifier is also often able to ignore many parts of the input. It is possible to ask probabilistic models to do many other tasks. Most require a complete understanding of the entire structure of the input, with no option to ignore sections of it.
- Density Estimation: Given an input x, the ML system returns an estimation of the true density p(x) under the data generating distribution.
- Denoising: Given a damaged or incorrectly observed input \(\tilde{x}\), the ML system returns an estimate of the original or correct x.
- Missing value imputation: Given observations of some elements of x, the model is asked to return estimates of, or a probability distribution over, some or all of the unobserved elements of x.
- Sampling: The model generates new samples from the distribution p(x)
In general, if we wish to model a distribution over a random vector x containing n discrete variables capable of taking k values each, then the naive approach of representing P(x) by storing a lookup table with one probability value per possible outcome requires \(k^n\) parameters. This is not feasible because of:
- Memory: the cost of storing the representations
- Statistical efficiency: as the number of parameters in a model increases, so does the amount of training data needed to choose the values of those parameters using statistical estimators.
- Runtime, the cost of inference: suppose we want to perform an inference to compute \(P(x_1)\) or \(P(x_1|x_2)\). Computing these distributions will require summing across the entire table.
- Runtime, the cost of sampling: suppose we want to draw a sample from the model. The naive way: sample some value \(u \sim U(0, 1)\), then iterate through the table, adding up the probability values until they exceed u, and return the outcome corresponding to that position in the table. This requires reading through the whole table in the worst case, so it has the same exponential cost as the other operations.
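A tiny illustration of the naive table-based sampler described above, for a made-up lookup table over 3 binary variables (\(2^3\) entries):

```python
import numpy as np

rng = np.random.default_rng(0)

# Lookup table over all 2^3 joint outcomes of (x1, x2, x3); entries sum to 1.
outcomes = [(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)]
probs = np.array([0.20, 0.05, 0.10, 0.15, 0.05, 0.25, 0.10, 0.10])

def naive_sample():
    """Sample u ~ U(0,1) and walk the table until cumulative mass exceeds u.
    Worst case reads the whole table, hence the exponential cost in n."""
    u, total = rng.uniform(), 0.0
    for outcome, p in zip(outcomes, probs):
        total += p
        if total > u:
            return outcome
    return outcomes[-1]

print(naive_sample())
```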
The problem with the table-based approach: we are explicitly modeling every possible kind of interaction between every possible subset of variables. The real tasks we encounter are much simpler than this.
Structured probabilistic models provide a formal framework for modeling only direct interactions between random variables. This allows the models to have
- significantly fewer parameters,
- and therefore to be estimated reliably from less data,
- and dramatically reduced computational cost in terms of storing the model, performing inference in the model, and drawing samples from the model.
16.2 Using Graphs to Describe Model Structure¶
Structured probabilistic models use graphs to represent interactions between random variables. Each node represents a random variable. Each edge represents a direct interaction.
16.2.1 Directed Models¶
Directed graphical model, aka belief network or Bayesian network. The edges are directed. The direction of the arrow/edge indicates which variable’s probability distribution is defined in terms of the other’s. E.g.:
A directed graphical model defined on variables x is defined by a directed acyclic graph G whose vertices are the random variables in the model, and a set of local conditional probability distributions \(p(x_i|Pa_G(x_i))\), where \(Pa_G(x_i)\) gives the parents of \(x_i\) in G. The probability distribution over x is given by \(p(x) = \prod_i p(x_i|Pa_G(x_i))\).
In general, to model n discrete variables each having k values, the cost of the single table approach scales like \(O(k^n)\). Suppose we build a directed graphical model over these variables. If m is the maximum number of variables appearing (on either side of the conditioning bar) in a single conditional probability distribution, then the cost of the tables for the directed models scales like \(O(k^m)\). As long as we can design a model such that m << n, we can get very dramatic savings.
In other words, as long as each variable has few parents in the graph, the distribution can be represented with very few parameters. Some restrictions on the graph, such as requiring it to be a tree, can guarantee that operations like computing marginal or conditional distributions over subsets of variables are efficient.
The graph encodes only simplifying assumptions about which variables are conditionally independent from each other. The directed graph syntax does not place any constraint on how we define our conditional distributions. It only defines which variables they are allowed to take in as arguments.
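A minimal sketch of the factorization \(p(x) = \prod_i p(x_i|Pa_G(x_i))\) for a hypothetical 3-variable chain \(x_1 \rightarrow x_2 \rightarrow x_3\) with binary states; the conditional tables are made up:

```python
# p(x1), p(x2 | x1), p(x3 | x2) for binary variables; numbers are made up.
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(x1, x2, x3):
    """p(x) = p(x1) p(x2|x1) p(x3|x2): three small tables (O(k^m) entries)
    instead of one table with 2^3 entries."""
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# The factored joint still sums to 1 over all 8 outcomes.
print(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)))
```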
16.2.2 Undirected Models¶
Undirected model, aka, Markov random fields (MRFs) or Markov Networks.
- Directed models: applicable to situations where there is a clear reason to draw each arrow in one particular direction, where we understand the causality and causality flows in only one direction.
- Undirected model: when the interactions seem to have no intrinsic direction, or to operate in both directions
As with directed models, if 2 nodes in an undirected model are connected by an edge, then the random variables corresponding to those nodes interact with each other directly. E.g.:
An undirected graphical model is a structured probabilistic model defined on an undirected graph G. For each clique C in the graph, a factor \(\phi(C)\) (also called a clique potential) measures the affinity of the variables in that clique for being in each of their possible joint states. \(\phi(C) > 0\). Unnormalized probability distribution:
A clique of the graph is a subset of nodes that are all connected to each other by an edge of the graph.
It is efficient as long as all the cliques are small. It encodes the idea that states with higher affinity are more likely. Check the example on page 558.
16.2.3 The Partition Function¶
To obtain a valid probability distribution, we must use the corresponding normalized probability distribution.
where Z is the value that results in the probability distribution summing or integrating to 1.
You can think of Z as a constant when the \(\phi\) functions are held constant. Note that if the \(\phi\) functions have parameters, then Z is a function of those parameters. The normalizing constant Z is known as the partition function.
Z is an integral or sum over all possible joint assignments of the state x, and is therefore often intractable to compute. So, the model structure and the definitions of the \(\phi\) functions must be conducive to computing Z efficiently. In the context of DL, we must resort to approximations.
When designing undirected models, it is possible to specify the factors in such a way that Z does not exist. This happens if some of the variables in the model are continuous and the integral of \(\hat{p}(x)\) over their domain diverges. Sometimes the choice of some parameter of the \(\phi\) functions determines whether the probability distribution is defined.
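A brute-force illustration of the unnormalized distribution and its partition function for a toy 3-node chain MRF with made-up pairwise clique potentials (feasible only because the state space is tiny):

```python
from itertools import product

# Pairwise clique potentials phi(a, b) > 0 for the chain x1 - x2 - x3.
phi_12 = {(0, 0): 2.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}
phi_23 = {(0, 0): 1.5, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.5}

def p_tilde(x1, x2, x3):
    """Unnormalized probability: product of clique potentials."""
    return phi_12[(x1, x2)] * phi_23[(x2, x3)]

# Partition function: sum of p_tilde over every joint state (2^3 terms here;
# in general this sum is exponential in the number of variables).
Z = sum(p_tilde(*x) for x in product((0, 1), repeat=3))

def p(x1, x2, x3):
    return p_tilde(x1, x2, x3) / Z

print(Z, p(0, 0, 0))
```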
Difference between directed and undirected modeling
- Directed models: defined directly in terms of probability distribution from the start.
- Undirected models: defined more loosely by \(\phi\) functions that are then converted into probability distributions.
The domain of each of the variables has a dramatic effect on the kind of probability distribution that a given set of \(\phi\) functions corresponds to. E.g.:
- An n-dimensional vector-valued random variable x
- An undirected model parameterized by a vector of biases b.
- If \(x \in \{ 0, 1 \}^n\)
Suppose:
Then we have:
- If x is a one-hot vector {[1, 0, 0, 0 …], [0, 1, 0, 0 …], [0, 0, 1, 0 …]}
so we have
So:
Often, it is possible to leverage the effect of a carefully chosen domain of a variable to obtain complicated behavior from a relatively simple set of \(\phi\) functions.
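A numerical check of the domain effect above, assuming (consistent with this example) the unnormalized distribution \(\hat{p}(x) = exp(b^T x)\), with made-up bias values:

```python
import numpy as np
from itertools import product

b = np.array([0.5, -1.0, 2.0])           # made-up bias vector
p_tilde = lambda x: np.exp(b @ x)        # assumed unnormalized \hat{p}(x) = exp(b^T x)

# Domain {0,1}^n: normalizing over all 2^n binary vectors gives a product of
# independent Bernoullis with p(x_i = 1) = sigmoid(b_i).
binary = [np.array(v) for v in product((0, 1), repeat=3)]
Z_bin = sum(p_tilde(x) for x in binary)
print(p_tilde(np.array([1, 0, 1])) / Z_bin)
sigm = 1 / (1 + np.exp(-b))
print(sigm[0] * (1 - sigm[1]) * sigm[2])   # same value

# Domain = one-hot vectors: normalizing over only n states gives softmax(b).
one_hot = [np.eye(3)[i] for i in range(3)]
Z_oh = sum(p_tilde(x) for x in one_hot)
print([float(p_tilde(x) / Z_oh) for x in one_hot])
print(np.exp(b) / np.exp(b).sum())         # softmax(b), same values
```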
16.2.4 Energy-Based Model¶
Many interesting theoretical results about undirected models depend on the assumption that \(\forall x, \hat{p}(x) > 0\). A convenient way to enforce this condition is to use an energy-based model (EBM), where
and E(x) is known as the energy function. By learning the energy function, we can use unconstrained optimization. Any distribution of the form given by the equation above is an example of a Boltzmann distribution.
- The term Boltzmann machine is today most often used to designate models with latent variables
- Boltzmann machines without latent variables are more often called Markov random fields or log-linear models.
Because exp(a)exp(b) = exp(a + b), different cliques in the undirected graph correspond to different terms of the energy function. In other words, an energy-based model is just a special kind of Markov network: the exponentiation makes each term in the energy function correspond to a factor for a different clique.
One can view an energy-based model with multiple terms in its energy function as being a product of experts. Each term in the energy function corresponds to a factor in the probability distribution. Each term of the energy function can be thought of as an “expert” that determines whether a particular soft constraint is satisfied. Each expert may enforce only one constraint that concerns only a low-dimensional projection of the random variables, but when combined by multiplication of probabilities, the experts together enforce a complicated high-dimensional constraint.
Many algorithms that operate on probabilistic models need to compute not \(p_{model}(x)\) but only \(log \hat{p}_{model}(x)\). For EBMs with latent variables h, these algorithms are sometimes phrased in terms of the negative of this quantity, called the free energy:
16.2.5 Separation and D-Separation¶
We often need to know which variables indirectly interact. Some of these interactions can be enabled or disabled by observing other variables. More formally, we would like to know which subsets of variables are conditionally independent from each other, given the values of other subsets of variables.
Conditional independence implied by the graph is called separation. We say a set of variables A is separated from another set of variables B given a third set of variables S if the graph structure implies that A is independent from B given S.
- If two variables a and b are connected by a path involving only unobserved variables, then those variables are not separated.
- If no path exists between them, or all paths contain an observed variable, then they are separated.
- A path involving only unobserved variables is called “active”.
- A path involving an observed variable is called “inactive”.
e.g.
In the context of directed models, these concepts are referred to as d-separation. The “d” stands for “dependence”. A set of variables A is d-separated from another set of variables B given a third set of variables S if the graph structure implies that A is independent from B given S.
Two variables are dependent if there is an active path between them and d-separated if no such path exists. How to check if an active path exists:
Note: separation and d-separation tell us only about those conditional independences that are implied by the graph. There is no requirement that the graph imply all independences that are present. Some distributions contain independences that are not possible to represent with existing graphical notation. Context-specific independences are independences that are present depending on the value of some variables in the network.
In general, a graph will never imply that an independence exists when it does not. However, it may fail to encode an independence.
16.2.6 Converting between undirected and directed graphs¶
We typically refer to RBMs as undirected and sparse coding as directed. Some models are most easily described using a directed graph, or most easily described using a undirected graph.
We may choose to use either directed or undirected modeling based on
- which approach uses the fewest edges to describe the distribution
- if we observe a certain subset of variables
- if we wish to perform different computational task.
- others
e.g.
- the directed model description often provides a straightforward approach to efficiently drawing samples from the model (16.3).
- undirected model description is often useful for deriving approximate inference procedures (P640 equation 19.56)
Every probability distribution can be represented by either a directed or an undirected model. In the worst case, one can always represent any distribution by using a “complete graph”:
- For directed graph: impose some ordering on the random variables.
- For undirected graph: simply a graph containing single clique encompassing all the variables
Please note: “arc from each variable to every variable that comes after it in the ordering”.
- Directed models are able to use one specific substructure that undirected models cannot represent: Immorality
- Vice versa. Specifically, when an undirected model contains a loop of length greater than three, unless the loop contains a chord. A chord is a connection between any two nonconsecutive variables in the sequence defining a loop. If the undirected model has loops of length >= 4 and does not have chords for those loops, we must add the chords before we can convert it to a directed model. Adding chords discards some of the independence information that was encoded in U. The graph formed by adding chords to U is known as a chordal or triangulated graph, because all the loops can be described in terms of smaller, triangular loops. To build a directed graph from the chordal graph, we need to also assign directions to the edges. When doing so, we must not create a directed cycle in the model. One way to assign direction to edges is to impose an ordering on the random variables.
Examples
16.2.7 Factor Graphs¶
Factor graphs are another way of drawing undirected models that resolve an ambiguity in the graphical representation of standard undirected model syntax. In an undirected model, the scope of every \(\phi\) function must be a subset of some clique in the graph. Ambiguity arises because it is not clear whether each clique actually has a corresponding factor whose scope encompasses the entire clique. Circles: random variables. Squares: factors \(\phi\).
No factor may be connected to another factor in the graph, nor can a variable be connected to another variable.
Resources¶
16.3 Sampling from Graphical Models¶
Advantage of directed model: ancestral sampling can produce a sample from the joint distribution represented by the model.
Sort the variables in the graph into a topological ordering: for all i and j, j > i if \(x_i\) is a parent of \(x_j\). We first sample \(x_1 \sim p(x_1)\), then sample from \(p(x_2|Pa_G(x_2))\), and so on.
Ancestral sampling can be used with any available topological ordering.
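A sketch of ancestral sampling on a hypothetical binary chain \(x_1 \rightarrow x_2 \rightarrow x_3\) with made-up conditional tables:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up conditional tables for the chain x1 -> x2 -> x3 (binary variables).
p_x1 = [0.6, 0.4]
p_x2_given_x1 = {0: [0.7, 0.3], 1: [0.2, 0.8]}
p_x3_given_x2 = {0: [0.9, 0.1], 1: [0.5, 0.5]}

def ancestral_sample():
    """Visit variables in topological order, sampling each one conditioned
    on its already-sampled parents."""
    x1 = int(rng.choice(2, p=p_x1))
    x2 = int(rng.choice(2, p=p_x2_given_x1[x1]))
    x3 = int(rng.choice(2, p=p_x3_given_x2[x2]))
    return x1, x2, x3

print([ancestral_sample() for _ in range(5)])
```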
Drawback of ancestral sampling
only applies to directed graphical model.
- We can sample from undirected models by converting them into directed models: this requires solving intractable inference problems or introducing so many edges that the resulting directed model becomes intractable.
- Sample from undirected model without converting them into directed models: require resolving cyclical dependencies. Expensive multipass process.
does not support every conditional sampling operation.
- we often require that all conditioning variables come earlier than the variables to be sampled in the ordered graph. In this case we sample from the local conditional probability distributions specified by the model distribution.
- Otherwise, the conditional distributions we need to sample from are the posterior distributions given the observed variables. The posterior distributions are usually not explicitly specified and parameterized in the model. Inferring these posterior distributions can be costly. Ancestral sampling is no longer efficient.
To sample from an undirected model, the conceptually simplest approach is Gibbs sampling. E.g., for an n-dimensional vector x, we iteratively visit each variable \(x_i\) and draw a sample conditioned on all the other variables, from \(p(x_i | x_{-i})\). Due to the separation properties of the graphical model, we can equivalently condition on only the neighbors of \(x_i\). We must repeat the process and resample all n variables using the updated values of their neighbors. Asymptotically, after many repetitions, this process converges to sampling from the correct distribution. It can be difficult to determine when the samples have reached a sufficiently accurate approximation of the desired distribution.
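A sketch of Gibbs sampling for a toy chain Ising model (made-up coupling strength), where each spin is resampled from its conditional given its neighbors:

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_ising_chain(n=10, J=0.8, sweeps=100):
    """Gibbs sampling for a chain Ising model with spins in {-1, +1} and
    energy E(s) = -J * sum_i s_i s_{i+1}.
    Each spin is resampled from p(s_i = +1 | neighbors) = sigmoid(2 J S_i),
    where S_i is the sum of its neighboring spins."""
    s = rng.choice([-1, 1], size=n)
    for _ in range(sweeps):
        for i in range(n):
            neighbor_sum = (s[i - 1] if i > 0 else 0) + (s[i + 1] if i < n - 1 else 0)
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * J * neighbor_sum))
            s[i] = 1 if rng.uniform() < p_plus else -1
    return s

print(gibbs_ising_chain())
```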
16.4 Advantages of Structured Modeling¶
Allow us to dramatically reduce the cost of representing probability distributions as well as learning and inference.
Sampling is accelerated in the case of directed models.
It allows us to explicitly separate the representation of knowledge from the learning of knowledge or inference given existing knowledge. This makes our models easier to develop and debug.
- We can design, analyze and evaluate learning algorithms and inference algorithms that are applicable to broad classes of graphs.
- Independently, we can design models that capture the relationships we believe are important in our data.
- Then combine different algorithms and structures.
16.5 Learning about Dependencies¶
A good generative model needs to accurately capture the distribution over the observed or visible variables v. Often the different elements of v are highly dependent on each other. Common solution in DL: introduce several latent variables, h. The model can then capture dependencies between any pair of variables \(v_i\) and \(v_j\) indirectly, via direct dependencies between \(v_i\) and h, and direct dependencies between \(v_j\) and h.
A good model of v that did not contain any latent variables would need to have a large number of parents per node in a Bayesian network, or very large cliques in a Markov network. This is costly in
- a computational sense: the number of parameters that must be stored in memory scales exponentially with the number of members in a clique
- a statistical sense: the exponential number of parameters requires a wealth of data to estimate accurately.
A model intended to capture dependencies between visible variables with direct connections must be designed to connect those variables that are tightly coupled and to omit edges between other variables. Structure learning is devoted to solving this problem: a structure is proposed, and a model with that structure is trained and then given a score. The score rewards high training set accuracy and penalizes model complexity. The next proposal adds or removes a small number of edges.
Using latent variables avoids the need for 1. discrete searches and 2. multiple rounds of training. A fixed structure over visible and hidden variables can use direct interactions between visible and hidden units to impose indirect interactions between visible units.
Many approaches accomplish feature learning by learning latent variables.
16.6 Inference and Approximate Inference¶
In a latent variable model, we might want to extract features E[h | v]. Sometimes we need to solve such problems in order to perform other tasks. We often train our model using the principle of maximum likelihood. Because
we often want to compute p(h|v) in order to implement a learning rule. All of these are examples of inference problems, in which we must predict the value of some variables given other variables, or predict the probability distribution over some variables given the value of other variables.
For most deep models, these inference problems are intractable, even when we use a structured graphical model to simplify them. This motivates the use of approximate inference. In the context of DL, this usually refers to variational inference, in which we approximate the true distribution p(h|v) by seeking an approximate distribution q(h|v) that is as close to the true one as possible.
16.7 The Deep Learning Approach to Structured Probabilistic Models¶
In the context of graphical models, the depth of a model: a latent variable \(h_i\) is at depth j if the shortest path from \(h_i\) to an observed variable is j steps.
Many generative models used for DL have no latent variables or only one layer of latent variables, but use deep computational graphs to define the conditional distributions within the model.
DL
- always makes use of the idea of distributed representations.
- DL models typically have more latent variables than observed variables.
- Complicated nonlinear interactions between variables are accomplished via indirect connections that flow through multiple latent variables.
- does not intend for the latent variables to take on any specific semantics ahead of time; the training algorithm is free to invent the concepts it needs to model a particular dataset. Harder to interpret, but easier to scale and reuse.
- Large group of units are typically all connected to other groups of units, so that the interaction between 2 groups can be described by a single matrix.
- Loopy belief propagation is almost never used for DL. Most deep models are instead designed to make Gibbs sampling or variational inference algorithms efficient.
- The large number of latent variables makes efficient numerical code essential. This additionally motivates grouping the units into layers, with a matrix describing the interaction between 2 layers. This allows individual steps of the algorithm to be implemented with efficient matrix product operations, or sparsely connected generalizations.
- Increase the power of the model until it is just barely possible to train. The DL approach is often to figure out what the minimum amount of information we absolutely need is, and then to figure out how to get a reasonable approximation of that information as quickly as possible.
Traditional graphical model
- usually contain mostly variables that are at least occasionally observed, even if many of the variables are missing at random from some training examples.
- Use high order terms and structure learning to capture complicated nonlinear interactions between variables.
- latent variables are designed with some specific semantics in mind. Easier to interpret and with more theoretical guarantees, but less able to scale to complex problems and not reusable
- Typically have very few connections. The choice of each variable might be individually designed.
- The design of the model structure is tightly linked with the choice of inference algorithms.
- Typically aim to maintain the tractability of exact inference. When this constraint is too limiting, a popular approximate inference algorithm is loopy belief propagation.
16.7.1 Example: The Restricted Boltzmann Machine¶
- Units are organized into large groups called layers
- Connectivity described by a matrix
- Connectivity relatively dense
- Designed to allow efficient Gibbs sampling
- Frees the training algorithm to learn latent variables whose semantics are not specified by the designer.
Review on undirected model and energy based model:
An undirected graphical model is a structured probabilistic model defined on an undirected graph G. For each clique C in the graph, a factor \(\phi(C)\) (also called a clique potential) measures the affinity of the variables in that clique for being in each of their possible joint states. \(\phi(C) > 0\). Unnormalized probability distribution:
A clique of the graph is a subset of nodes that are all connected to each other by an edge of the graph.
Many interesting theoretical results about undirected models depend on the assumption that \(\forall x, \hat{p}(x) > 0\). A convenient way to enforce this condition is to use an energy-based model (EBM), where
and E(x) is known as the energy function. By learning the energy function, we can use unconstrained optimization. Any distribution of the form given by the equation above is an example of a Boltzmann distribution.
- The term Boltzmann machine is today most often used to designate models with latent variables
- Boltzmann machines without latent variables are more often called Markov random fields or log-linear models.
Now we come back to the RBM.
The canonical RBM is an energy-based model with binary visible and hidden units.
With the graph above, we can see that there is no triangular clique, so each edge forms one clique, and the factor for the clique \(\{v_i, h_j\}\) looks like this:
That is why the factor function for the graph as a whole looks like:
We also have Z as the normalizing constant:
Now we derive the conditional distribution from the joint distribution (ignoring the vector notation for simplicity):
Since we are conditioning on the visible units v, we can treat these as constant with respect to the distribution P(h|v). Next, we can derive:
Together, those properties allow for efficient block Gibbs sampling, which alternates between:
- sample all of h simultaneously
- sample all of v simultaneously
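A sketch of one round of block Gibbs sampling for a binary RBM, using the standard conditionals \(P(h_j=1|v)=\sigma(c_j + v^T W_{:,j})\) and \(P(v_i=1|h)=\sigma(b_i + W_{i,:}h)\); the weights are random stand-ins for a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

n_v, n_h = 6, 4
W = 0.1 * rng.normal(size=(n_v, n_h))   # stand-in for trained weights
b, c = np.zeros(n_v), np.zeros(n_h)     # visible and hidden biases

def block_gibbs_step(v):
    """One alternation: sample all hidden units given v, then all visible
    units given h. Units are conditionally independent within each block."""
    h = (rng.uniform(size=n_h) < sigmoid(c + v @ W)).astype(float)
    v = (rng.uniform(size=n_v) < sigmoid(b + W @ h)).astype(float)
    return v, h

v = rng.integers(0, 2, size=n_v).astype(float)
for _ in range(10):
    v, h = block_gibbs_step(v)
print(v, h)
```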
Compare to linear factor model:
Efficient Gibbs sampling and efficient derivatives make training convenient.
The RBM demonstrates the typical DL approach to graphical models: representation learning accomplished via layers of latent variables, combined with efficient interactions between layers parameterized by matrices.
17 Monte Carlo Methods¶
Randomized algorithms fall roughly into 2 categories:
Las Vegas algorithms
- always return a precisely correct answer or report failure
- consume random amount of resources: time or memory
Monte Carlo algorithms
- return answers with a random amount of error
- error amount can typically be reduced by expending more resources
- fixed computational budget, a Monte Carlo algorithm can provide an approximate answer
17.1 Sampling and Monte Carlo Methods¶
Many ML goals:
- drawing samples from some probability distribution
- using these samples to form a Monte Carlo estimate of some desired quantity.
17.1.1 Why sampling¶
- Sampling provides a flexible way to approximate many sums and integrals at reduced cost.
- Some sums or integrals are intractable, such as the gradient of the log partition function of an undirected model.
- We want to train a model that can sample from the training distribution.
17.1.2 Basics of Monte Carlo Sampling¶
The idea: view the sum or integral as if it were an expectation under some distribution, and approximate the expectation by a corresponding average. The sum or integral to estimate:
We can approximate s by drawing n samples \(x^{(1)}, x^{(2)} .... x^{(n)}\) from p and then forming the empirical average
Properties of this approximation
- estimator is unbiased
- law of large number
By the law of large numbers, if the samples \(x^{(i)}\) are i.i.d., then the average converges almost surely to the expected value:
provided that the variance of the individual terms, \(Var[f(x^{(i)})]\), is bounded.
The variance \(Var[\hat{s}_n]\) decreases and converges to 0, so long as \(Var[f(x^{(i)})] < \infty\).
This result also tells us how to estimate the uncertainty in a Monte Carlo average or equivalently the amount of expected error of the Monte Carlo approximation:
- Compute the empirical average of the \(f(x^{(i)})\)
- Compute the empirical variance \(Var[f(x)]\)
- Divide the estimated variance by the number of samples n to obtain an estimator of \(Var[\hat{s}_n]\)
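A quick sketch of the estimator and its standard error for the toy expectation \(E_{x \sim N(0,1)}[x^2]\) (true value 1):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 10_000
x = rng.normal(size=n)          # samples from p(x) = N(0, 1)
f = x ** 2                      # f(x); E[f] = 1 for a standard normal

s_hat = f.mean()                # Monte Carlo estimate of s
var_hat = f.var(ddof=1) / n     # estimate of Var[s_hat]
print(s_hat, np.sqrt(var_hat))  # estimate and its standard error
```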
Review on Central limit thorem (3.9.3 P-62):
The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. This means that in practice, many complicated systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into parts with more structured behavior
So, the distribution of the average, \(\hat{s}_n\), converges to a normal distribution with mean s and variance \(\frac{Var[f(x)]}{n}\), allowing us to estimate confidence intervals around the estimate \(\hat{s}_n\) using the cumulative distribution of the normal density.
All this relies on our ability to sample from the base distribution p(x). When it is not feasible to sample from p:
- use importance sampling
- form a sequence of estimators that converge toward the distribution of interest. This is the approach of Monte Carlo Markov chains.
Resources¶
17.2 Importance Sampling¶
Review basics of Monte Carlo Samplings:
Note that there is no unique decomposition because
now we sample from q and average \(\frac{pf}{q}\). The optimal \(q^*\) corresponds to what is called optimal importance sampling.
Monte Carlo estimator:
is now transformed into:
Expected value of estimator:
The variance of an importance sampling:
The minimum variance occurs when q is
where Z is the normalization constant, chosen so that \(q^*(x)\) sums or integrates to 1 as appropriate.
A better importance sampling distribution puts more weight where the integrand is larger. When f(x) does not change sign, \(Var[\hat{s}_{q^*}] = 0\), meaning that a single sample is sufficient when the optimal distribution is used. This is only because the computation of \(q^*\) has essentially solved the original problem, so it is usually not practical to use this approach.
Sampling from \(q^*\) is usually infeasible, but other choices of q can be feasible while still reducing the variance somewhat
Biased importance sampling does not require normalizing p or q:
where \(\tilde{p}, \tilde{q}\) are the unnormalized forms of p and q and the \(x^{(i)}\) are the samples from q. This estimator is biased because \(E[\hat{s}_{BIS}] \neq s\), except asymptotically when \(n \to \infty\) and the denominator of the first equation above converges to 1. Hence this estimator is called asymptotically unbiased.
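A minimal sketch of both estimators on an assumed toy problem (estimating the tail probability \(P(x > 2)\) under \(p = N(0, 1)\) using the proposal \(q = N(3, 1)\); all specific numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy setup (assumption, not from the text): s = E_p[f(x)] = P(x > 2) under p = N(0, 1),
# estimated with the proposal q = N(3, 1), which puts mass where f is nonzero.
gauss_pdf = lambda x, mu: np.exp(-0.5 * (x - mu) ** 2) / np.sqrt(2 * np.pi)
f = lambda x: (x > 2).astype(float)

x = rng.normal(3.0, 1.0, size=n)        # samples from q

# Unbiased importance sampling: average of p(x) f(x) / q(x).
w = gauss_pdf(x, 0.0) / gauss_pdf(x, 3.0)
s_is = np.mean(w * f(x))

# Biased (self-normalized) importance sampling: only unnormalized densities needed.
p_tilde = lambda x: np.exp(-0.5 * x ** 2)
q_tilde = lambda x: np.exp(-0.5 * (x - 3.0) ** 2)
w_tilde = p_tilde(x) / q_tilde(x)
s_bis = np.sum(w_tilde * f(x)) / np.sum(w_tilde)

print(s_is, s_bis)   # both should be close to P(x > 2) ≈ 0.0228 under N(0, 1)
```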
The q distribution is usually chosen to be a simple distribution so that it is easy to sample from. When x is high dimensional, this simplicity causes q to match p or p|f| poorly.
- When \(q(x^{(i)}) >> p(x^{(i)})|f(x^{(i)})|\), importance sampling collects useless samples (summing up tiny numbers or zeros)
- When \(q(x^{(i)}) << p(x^{(i)})|f(x^{(i)})|\) (more rarely), the ratio can be huge
This yields a typical underestimation of s, compensated only rarely by gross overestimation. Such very large or very small numbers are typical when x is high dimensional, because in high dimensions the dynamic range of joint probabilities can be very large.
Importance sampling is very useful in several places:
- section 12.4.3.3 P457: accelerate training in neural language model with a large vocabulary.
- section 18.7 p614
- 20.10 p687
Importance sampling may also be used to improve the estimate of the gradient of the cost function used to train model parameters with stochastic gradient descent. Sampling more difficult examples more frequently can reduce the variance of the gradient in such cases.
17.3 Markov Chain Monte Carlo Methods¶
- Problem: we wish to use a Monte Carlo technique, but there is no tractable method for drawing exact samples from the distribution \(p_{model}\) or from a good (low variance) importance sampling distribution q(x). In DL, this most often happens when \(p_{model}\) is represented by an undirected model.
- Solution: use a Markov chain to approximately sample from \(p_{model}\).
Markov chain Monte Carlo (MCMC): Use Markov chain to perform Monte Carlo estimates.
Most standard generic guarantees for MCMC techniques are only applicable when the model does not assign zero probability to any state. It is most convenient to present these techniques as sampling from an energy-based model.
Review on Energy based model
Many interesting theoretical results about undirected models depend on the assumption that \(\forall x, \hat{p}(x) > 0\). A convenient way to enforce this condition is to use an energy-based model (EBM) where
and E(x) is known as the energy function. By learning the energy function, we can use unconstrained optimization. Any distribution of the form given by the equation above is an example of a Boltzmann distribution.
- Boltzmann Machine is today most often used to designate models with latent variables
- Boltzmann machines without latent variables are more often called Markov random fields or log-linear models.
The problem with sampling from an EBM, e.g. with two variables defining a distribution p(a, b): in order to sample a, we must draw it from p(a|b), and to sample b we must draw it from p(b|a): a chicken-and-egg problem. It can be solved with a Markov chain.
Core idea of a Markov chain: have a state x that begins as an arbitrary value. Over time, we randomly update x repeatedly. Eventually x becomes (very nearly) a fair sample from p(x). Formally:
- random state x
- transition distribution \(T(x'|x)\) specifying the probability that a random update will go to state \(x'\) if it starts in state x
- repeatedly updating the state x to a value \(x'\) sampled from \(T(x'|x)\)
We restrict our attention to the case where the random variable \(\vec{x}\) has countably many states. We can then represent the state as just a positive integer x.
Consider running infinitely many Markov chains in parallel. All the states of the different Markov chains are drawn from some distribution \(q^{(t)}(x)\), where t indicates the number of time steps that have elapsed. \(q^{(t)}(x)\) is influenced by all the Markov chain steps that have run so far.
v: a vector to describe probability distribution q. \(q(x = i) = v_i\)
The probability of a single chain landing in state \(x'\) is given by
Define matrix A
rewrite the Markov Chain update:
Applying the Markov chain update repeatedly:
Each column represents a probability distribution. Such matrices are called stochastic matrices.
If there is a non-zero probability of transitioning from any state x to any other state x’ for some power t, it can be guaranteed that the largest eigenvalue is real and equal to 1.
This process causes all the eigenvalues that are not equal to 1 to decay to zero. Under some additional mild conditions, A is guaranteed to have only one eigenvector with eigenvalue 1. The process thus converges to a stationary distribution, sometimes also called the equilibrium distribution. At convergence,
and this same condition holds for every additional step. To be a stationary point, v must be an eigenvector with corresponding eigenvalue 1. This condition guarantees that once we have reached the stationary distribution, repeated applications of the transition sampling procedure do not change the distribution over the states of all the various Markov chains.
If we choose T correctly, the stationary distribution q will be equal to the distribution p we wish to sample from.
In general, a Markov chain with transition operator T will converge, under mild conditions, to a fixed point described by the equation.
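A small sketch of this convergence, using an assumed toy 3-state stochastic matrix (each column is a transition distribution, so each column sums to 1); repeatedly applying A drives v toward the eigenvector with eigenvalue 1:

```python
import numpy as np

# Toy stochastic matrix (assumption: a 3-state chain invented for illustration).
# Column i gives the transition distribution T(x' | x = i); each column sums to 1.
A = np.array([[0.90, 0.10, 0.30],
              [0.05, 0.80, 0.30],
              [0.05, 0.10, 0.40]])

v = np.array([1.0, 0.0, 0.0])      # q^(0): start every chain in state 0
for t in range(200):                # v^(t) = A^t v^(0)
    v = A @ v

# The stationary distribution is the eigenvector of A with eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(A)
stat = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
stat = stat / stat.sum()
print(v, stat)                      # the two should agree up to numerical error
```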
Running Markov chain until it reaches its equilibrium distribution is called burning in the Markov chain. After the chain has reached equilibrium, a sequence of infinitely many samples may be drawn from the equilibrium distribution.
Problem: 2 successive samples highly correlated with each other.
Solution:
- Return only every nth sample, so that our estimate of the statistics of the equilibrium distribution is not as biased by the correlation between an MCMC sample and the next several samples. Expensive: burn-in time plus the time required to transition from one sample to another reasonably decorrelated sample.
- Run multiple Markov chains in parallel, at the cost of extra computation.
- DL practitioners usually use a number of chains that is similar to the number of examples in a minibatch and then draw as many samples as are needed from this fixed set of Markov chains. A commonly used number of Markov chains is 100.
The number of steps the Markov chain must run before reaching its equilibrium distribution is called the mixing time. Difficulties in knowing whether a Markov chain has mixed:
- To know the mixing time is difficult
- To test whether a Markov chain has reached equilibrium is difficult.
- The number of states that our probabilistic model can visit is exponentially large in the number of variables, so it is infeasible to represent v, A, or the eigenvalues of A.
Instead, we simply run the Markov chain for an amount of time that we roughly estimate to be sufficient, then use heuristic methods to determine whether the chain has mixed. Those heuristic methods:
- manually inspect samples
- measure correlations between successive samples.
17.4 Gibbs Sampling¶
Draw samples from a distribution q(x) by repeatedly updating \(x \leftarrow x' \sim T(x'|x)\).
Problem: how to ensure that q(x) is a useful distribution
Solution:
- Derive T from a given learned model \(p_{model}\)
- Directly parameterize T and learn it, so that its stationary distribution implicitly defines the \(p_{model}\) of interest.
In DL, we commonly use Markov chains to draw samples from an energy-based model defining a distribution \(p_{model}(\vec{x})\). In this case, we want the \(q(\vec{x})\) for the Markov chain to be \(p_{model}(\vec{x})\). To obtain the desired q(x), we must choose an appropriate \(T(x'|x)\).
Gibbs sampling: sampling from \(T(x'|x)\) is accomplished by selecting one variable \(x_i\) and sampling it from \(p_{model}\) conditioned on its neighbors in the undirected graph G defining the structure of the energy-based model. We can also sample several variables at the same time as long as they are conditionally independent given all their neighbors.
e.g. the Restricted Boltzmann Machine (RBM):
all the hidden units of an RBM may be sampled simultaneously because they are conditionally independent of each other given all the visible units. Gibbs sampling approaches that update many variables simultaneously in this way are called blocked Gibbs sampling.
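A minimal NumPy sketch of blocked Gibbs sampling in a toy binary RBM with random (untrained) parameters, an assumption made just to show the two alternating block updates:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy RBM (assumption: random parameters, not a trained model).
n_v, n_h = 6, 4
W = rng.normal(scale=0.5, size=(n_v, n_h))   # visible-hidden weights
b = np.zeros(n_v)                            # visible biases
c = np.zeros(n_h)                            # hidden biases

def gibbs_step(v):
    """One blocked Gibbs step: all hidden units are conditionally independent
    given the visible units (and vice versa), so each block is sampled at once."""
    p_h = sigmoid(c + v @ W)                 # p(h_j = 1 | v)
    h = (rng.random(n_h) < p_h).astype(float)
    p_v = sigmoid(b + h @ W.T)               # p(v_i = 1 | h)
    v = (rng.random(n_v) < p_v).astype(float)
    return v, h

v = rng.integers(0, 2, n_v).astype(float)    # arbitrary initial state
for _ in range(1000):                        # burn in the chain
    v, h = gibbs_step(v)
print(v, h)                                  # an (approximate) sample from the RBM
```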
17.5 The Challenge of Mixing between Separated Modes¶
The primary difficulty involved with MCMC methods: the tendency to mix poorly.
Slow mixing or failure to mix: especially in high-dimensional cases, MCMC samples become very correlated. Sampling can be seen as inadvertently performing noisy gradient descent on the energy function. The chain tends to take small steps, from a configuration \(x^{(i-1)}\) to a configuration \(x^{(i)}\), with the energy \(E(x^{(i)})\) lower than or approximately equal to the energy \(E(x^{(i - 1)})\), with a preference for moves that yield lower energy configurations.
Once the chain has found a region of low energy, which we call a mode, the chain will tend to walk around that mode. Once in a while it will step out of that mode and generally return to it or move toward another mode. Problem: successful escape routes are rare for many interesting distributions.
Consider the probability of going from one mode to a nearby mode within a given number of steps. This is determined by the shape of the “energy barrier” between those modes. Transitions between two modes that are separated by a high energy barrier (a region of low probability) are exponentially less likely.
The problem arises when there are multiple modes with high probability that are separated by regions of low probability, especially when each Gibbs sampling step must update only a small subset of variables whose values are largely determined by the other variables.
Example:
- variables a and b, taking on values -1 and 1
- E(a, b) = -wab for some large positive value of w
The model expresses a strong belief that a and b have the same sign. Consider updating b using a Gibbs sampling step with a = 1. The conditional distribution over b is given by:
I still don’t understand how the \(\sigma(w)\) get derived.
If w is large:
- when a = 1, the probability that b is assigned 1 is close to 1
- when a = -1, the probability that b is assigned -1 is close to 1
According to \(P_{model}(a, b)\), both signs of the two variables are equally likely. According to \(P_{model}(a|b)\), however, the two variables should have the same sign. This means that Gibbs sampling will only very rarely flip the signs of these variables.
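A small simulation of this effect (the specific values of w are assumptions). Working through the normalization for this model gives \(P(b=1|a=1) = e^{w}/(e^{w}+e^{-w}) = \sigma(2w)\), i.e. a sigmoid in w, which may be where the sigmoid quoted above comes from; the sketch uses that form and counts how often the sign flips:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def gibbs_flip_rate(w, n_steps=10_000):
    """Gibbs sampling on p(a, b) proportional to exp(w * a * b), a, b in {-1, +1}.
    Returns how often the sign of a flips across successive updates."""
    a, flips = 1, 0
    for _ in range(n_steps):
        # p(b = 1 | a) = exp(w a) / (exp(w a) + exp(-w a)) = sigmoid(2 w a)
        b = 1 if rng.random() < sigmoid(2 * w * a) else -1
        a_new = 1 if rng.random() < sigmoid(2 * w * b) else -1
        flips += (a_new != a)
        a = a_new
    return flips / n_steps

for w in [0.5, 2.0, 5.0]:
    print(w, gibbs_flip_rate(w))   # the flip rate collapses as w grows: the chain gets stuck in one mode
```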
In more practical scenarios, we care about making transitions between all the many modes that a real model might contain -> a greater challenge.
Sometimes this problem could be solved by finding groups of highly dependent units and updating all of them simultaneously in a block. Unfortunately, when the dependencies are complicated, it can be computationally intractable to draw a sample from the group. After all, the problem that the Monte Carlo Markov Chain was originally introduced to solve is this problem of sampling from a large group of variables.
In the context of models with latent variables, we often draw samples of x by alternating between sampling from \(p_{model}(x|h)\) and sampling from \(p_{model}(h|x)\).
- From the point of view of mixing rapidly, we would like \(p_{model}(h|x)\) to have high entropy.
- From the point of view of learning a useful representation of h, we would like h to encode enough information about x to reconstruct it well, which implies that h and x should have high mutual information.
These two goals are at odds with each other. We often learn generative models that very precisely encode x into h but are not able to mix very well.
17.5.1 Tempering to Mix between Modes¶
- Problem: when a distribution has sharp peaks of high probability surrounded by regions of low probability, it is difficult to mix between the different modes of the distribution.
- Solution: several techniques for faster mixing are based on constructing alternative versions of the target distribution in which the peaks are not as high and the surrounding valleys are not as low.
Energy based model:
Energy based models may be augmented with an extra parameter \(\beta\) controlling how sharply peaked the distribution is
\(\beta\): the reciprocal of the temperature.
- When the temperature falls to zero, and \(\beta\) rises to infinity, the energy based model becomes deterministic
- When the temperature rises to infinity, and \(\beta\) falls to 0, the distribution becomes uniform
Typically, a model is trained to be evaluated at \(\beta = 1\), but we can also make use of other temperatures, particularly those where \(\beta < 1\).
Tempering: the general strategy of mixing between modes of \(p_1\) rapidly by drawing samples with \(\beta < 1\)
Markov chains based on tempered transitions: temporarily sample from higher-temperature distributions to mix between different modes, then resume sampling from the unit-temperature distribution
Parallel tempering: the Markov chain simulates many different states in parallel, at different temperatures. The highest-temperature states mix easily between modes, while the lowest-temperature states, at temperature 1, provide accurate samples from the model. The transition operator includes stochastically swapping states between two different temperature levels, so that a sufficiently high-probability sample from a high-temperature slot can jump into a lower-temperature slot.
17.5.2 Depth May Help Mixing¶
- Problem: When drawing samples from a latent variable model p(h, x), we have seen that if p(h|x) encodes x too well, the sampling from p(x|h) will not change x very much, and mixing will be poor.
- Solution: one way to solve this problem is to make h a deep representation, encoding x into h in such a way that a Markov chain in the space of h can mix more easily.
Many representation learning algorithms, such as RBMs, tend to yield a marginal distribution over h that is more uniform and more unimodal than the original data distribution over x.
Training an RBM in that higher-level space allowed Gibbs sampling to mix faster between modes.
18 Confronting the Partition Function¶
Review:
An undirected graphical model is a structured probabilistic model defined on an undirected graph G. For each clique C in the graph, a factor \(\phi(C)\) (also called a clique potential) measures the affinity of the variables in that clique for being in each of their possible joint states. \(\phi(C) > 0\). Unnormalized probability distribution:
A clique of the graph is a subset of nodes that are all connected to each other by an edge of the graph.
To obtain a valid probability distribution, we must use the corresponding normalized probability distribution.
where Z is the value that results in the probability distribution summing or integrating to 1
You can think of Z as a constant when the \(\phi\) functions are held constant. Note that if the \(\phi\) functions have parameters, then Z is a function of those parameters. The normalizing function Z is known as the partition function.
Computing Z is intractable for many interesting models. How to confront the challenge
- models designed to have a tractable normalizing constant
- designed to be used in ways that do not involve computing p(x) at all.
- directly confront the challenge of intractable partition function, as described in this chapter
18.1 The Log-Likelihood Gradient¶
Target probability distribution:
The gradient
- \(\nabla_{\theta} \log\hat{p}(x;\theta)\) : Positive phase. Models with no latent variables or with few interactions among the latent variables typically have a tractable positive phase.
- \(\nabla_{\theta} \log Z(\theta)\): Negative phase. For undirected models, the negative phase is difficult.
Closer look at gradient of log Z, assuming p(x) > 0 for all x:
This is the basis for a variety of Monte Carlo methods for approximately maximizing the likelihood of models with intractable partition functions.
- In the positive phase, we increase \(\log \hat{p}(x)\) for x drawn from the data.
- In the negative phase, we decrease the partition function by decreasing \(\log \hat{p}(x)\) for x drawn from the model distribution.
Review on Monte Carlo Methods:
The idea: view the sum or integral as if it were an expectation under some distribution and to approximate the expectation by a corresponding average. The sum or integral to estimate:
We can approximate s by drawing n samples \(x^{(1)}, x^{(2)} .... x^{(n)}\) from p and then forming the empirical average
18.2 Stochastic Maximum Likelihood and Contrastive Divergence¶
A naive way of implementing this
is by burning in a set of Markov chains from a random initialization every time the gradient is needed. When learning is performed using stochastic gradient descent, the chains must be burned in once per gradient step.
Review on Monte Carlo Markov Chain:
Core idea of Markov Chain: have a state x that begins as a arbitrary value. Over time, we randomly update x repeately. Eventually x becomes (very nearly) a fair sample from p(x). Formally:
- random state x
- transition distribution \(T(x'|x)\) specifying the probability that a random update will go to state \(x'\) if it starts in state x
- repeately updating the state x to a value x sampled from \(T(x'|x)\)
algorithm 1: A naive MCMC
The high cost of burning in the Markov chains in the inner loop makes this procedure computationally infeasible.
We can view the MCMC approach to maximum likelihood as trying to achieve balance between two forces:
- one pushing up on the model distribution where the data occurs. It corresponds to maximizing \(\log \hat{p}(x)\)
- another pushing down on the model distribution where the model samples occur. It corresponds to minimizing \(\log Z\).
Because the negative phase involves drawing samples from the model’s distribution, we can think of it as finding points that the model believes in strongly. Because the negative phase acts to reduce the probability of those points, they are generally considered to represent the model’s incorrect beliefs about the world.
- problem: the main cost of the naive MCMC algorithm is the cost of burning in Markov chains from a random initialization at each step.
- solution: the natural solution is to initialize the Markov chains from a distribution that is very close to the model distribution, so that the burn-in operation does not take as many steps, as described below
algorithm 2: contrastive divergence (CD, or CD-k to indicate CD with k Gibbs steps). CD initializes the Markov chain at each step with samples from the data distribution.
Initially, the data distribution is not close to the model distribution, so the negative phase is not accurate. Positive phase can still accurately increase the model probability of the data. After positive phase had some time to act, the model distribution is closer to the data distribution, and the negative phase starts to become accurate.
Where CD qualitatively fails: it fails to suppress regions of high probability that are far from actual training examples.
Spurious modes: regions that have high probability under the model but low probability under the data-generating distribution. Modes in the model distribution that are far from the data distribution will not be visited by Markov chains initialized at training points, unless k is very large.
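A minimal NumPy sketch of one CD-k parameter update for a binary RBM (toy shapes and random data are assumptions; a real implementation would add minibatching, momentum, etc.). The defining trait is that the negative-phase Gibbs chains are initialized at the training data:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

def cd_k_update(W, b, c, v_data, k=1, lr=0.01):
    """One CD-k step for a binary RBM (a minimal sketch, not production code).
    W: (n_v, n_h) weights, b: (n_v,) visible biases, c: (n_h,) hidden biases,
    v_data: (batch, n_v) minibatch of training examples."""
    # Positive phase: hidden statistics driven by the data.
    ph_data = sigmoid(c + v_data @ W)

    # Negative phase: chains initialized at the data (the defining trait of CD),
    # then run for k steps of blocked Gibbs sampling.
    v = v_data.copy()
    for _ in range(k):
        p_h = sigmoid(c + v @ W)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(b + h @ W.T)
        v = (rng.random(p_v.shape) < p_v).astype(float)
    ph_model = sigmoid(c + v @ W)

    # Approximate log-likelihood gradient: positive minus negative statistics.
    W += lr * (v_data.T @ ph_data - v.T @ ph_model) / len(v_data)
    b += lr * (v_data - v).mean(axis=0)
    c += lr * (ph_data - ph_model).mean(axis=0)
    return W, b, c

# Hypothetical usage with random data, just to show the shapes involved.
n_v, n_h = 6, 4
W, b, c = rng.normal(scale=0.1, size=(n_v, n_h)), np.zeros(n_v), np.zeros(n_h)
v_batch = rng.integers(0, 2, size=(8, n_v)).astype(float)
W, b, c = cd_k_update(W, b, c, v_batch, k=1)
```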
Review:
- Boltzmann Machine is today most often used to designate models with latent variables
- Boltzmann machines without latent variables are more often called Markov random fields or log-linear models.
The CD estimator is biased for RBMs and fully visible Boltzmann machines. It is useful for training shallow models like RBMs. These can in turn be stacked to initialize deeper models like DBNs or DBMs. It does not provide much help for training deep models directly, because it is difficult to obtain samples of the hidden units given samples of the visible units.
The CD algorithm can be thought of as penalizing the model for having a Markov chain that changes the input rapidly when the input comes from the data. It is useful for pretraining shallow models that will later be stacked. This is because the earliest models in the stack are encouraged to copy more information up to their latent variables, thereby making it available to the later models.
algorithm 3: stochastic maximum likelihood (SML), AKA persistent contrastive divergence (PCD). Initialize the Markov chains at each gradient step with their states from the previous gradient step.
Basic idea of SML: as long as the steps taken by the stochastic gradient algorithm are small, the model from the previous step will be similar to the model from the current step. It follows that the samples from the previous model’s distribution will be very close to being fair samples from the current model’s distribution, so a Markov chain initialized with these samples will not require much time to mix.
Pros:
- Considerably more resistant to forming models with spurious modes than CD is.
- Because it is possible to store the state of all the sampled variables, whether visible or latent, SML provides an initialization point for both the hidden and visible units. SML is able to train deep models efficiently.
Cons:
- SML is vulnerable to becoming inaccurate if the stochastic gradient algorithm can move the model faster than the Markov chain can mix between steps. This can happen when k is too small or \(\epsilon\) is too large. If \(\epsilon\) is too high for k, the human operator will be able to observe much more variance in the negative phase samples across gradient steps than across different Markov chains.
- Care must be taken when evaluating the samples from a model trained with SML. It is necessary to draw the samples starting from a fresh Markov chain initialized from a random starting point after the model is done training.
- CD proves to have lower variance than the estimator based on exact sampling. SML has higher variance.
All these methods based on using MCMC to draw samples from model can in principle be used with almost any variant of MCMC
Fast PCD involves replacing parameters \(\theta\) with
There are twice as many parameters as before, and they are added together element-wise to provide the parameters used by the original model definition. The fast copy of the parameters is trained with a much larger learning rate, allowing it to adapt rapidly in response to the negative phase of learning and push the Markov chain to new territory. Typically one also applies significant weight decay to the fast weights, encouraging them to converge to small values after transiently taking on large values long enough to encourage the Markov chain to change modes.
Key benefit of MCMC based methods: we can decompose the problem into the \(\log \hat{p}\) contribution and log Z contribution.
18.3 Pseudolikelihood¶
- Monte Carlo approximation to partition function and its gradient: directly confront the partition function.
- Some other approaches are based on the observation that it is easy to compute ratios of probabilities in an undirected probabilistic model. There is no need to compute the partition function.
Pseudolikelihood is based on the observation that conditional probabilities take this ratio-based form and thus can be computed without knowledge of the partition function. Suppose that we partition x into a, b and c where
- a: variables we want to find the conditional distribution over
- b: variables we want to condition on
- c: variables that are not part of our query
This quantity requires marginalizing out a, which can be a very efficient operation provided that a and c do not contain many variables.
In order to compute the log-likelihood, we need to marginalize out large sets of variables. If there are n variables total, we must marginalize a set of size n - 1.
In this case, we have made a maximally small, but c can be as large as \(x_{2:n}\). This yields the pseudolikelihood objective function, based on predicting the value of feature \(x_i\) given all the other features \(x_{-i}\):
If each random variable has k different values, this requires only \(k \times n\) evaluations of \(\hat{p}\) to compute, as opposed to the \(k^n\) evaluations needed to compute the partition function.
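A small sketch of computing the pseudolikelihood objective for an assumed toy pairwise binary model (an Ising-style energy invented for illustration, not a model from the text); note that it only touches the unnormalized \(\hat{p}\), with k = 2 evaluations per variable:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Toy unnormalized distribution over binary vectors (assumption):
# log p_tilde(x) = x^T J x + h^T x, so no partition function is ever needed.
J = rng.normal(scale=0.3, size=(n, n)); J = (J + J.T) / 2
h = rng.normal(scale=0.1, size=n)
log_p_tilde = lambda x: x @ J @ x + h @ x

def pseudolikelihood(x):
    """log PL(x) = sum_i log p(x_i | x_{-i}); with k = 2 values per variable this
    needs only k * n evaluations of p_tilde, versus k**n to compute Z."""
    total = 0.0
    for i in range(n):
        logs = []
        for value in (0.0, 1.0):            # the k = 2 possible values of x_i
            x_v = x.copy()
            x_v[i] = value
            logs.append(log_p_tilde(x_v))
        log_norm = np.logaddexp(logs[0], logs[1])   # log-sum over values of x_i
        total += logs[int(x[i])] - log_norm
    return total

x = rng.integers(0, 2, n).astype(float)
print(pseudolikelihood(x))
```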
Generalized pseudolikelihood estimator: uses m different sets \(S^{(i)}, i = 1, 2, 3 ... m\) of indices of variables that appear together on the left side of the conditioning bar. The objective function:
Pseudolikelihood tends to perform
- poorly on tasks that require a good model of the full joint distribution p(x), such as density estimation and sampling
- better than maximum likelihood for tasks that require only the conditional distributions used during training, such as filling in small amounts of missing values.
Generalized pseudolikelihood techniques are especially powerful if the data has regular structure that allows the S index sets to be designed to capture the most important correlations while leaving out groups of variables that have only negligible correlation.
Weakness of the pseudolikelihood estimator: it cannot be used with other approximations that provide only a lower bound on \(\hat{p}(x)\), such as variational inference. This is because \(\hat{p}\) appears in the denominator. A lower bound on the denominator provides only an upper bound on the expression as a whole, and there is no benefit to maximizing an upper bound. -> hard to apply to deep models such as deep Boltzmann machines, since variational methods are one of the dominant approaches to approximately marginalizing out the many layers of hidden variables that interact with each other.
Pseudolikelihood can be used to train single-layer models, or deep models using approximate inference methods that are not based on lower bounds.
Pseudolikelihood can be thought of as having something resembling a negative phase. The denominators of each conditional distribution result in the learning algorithm suppressing the probability of all states that have only one variable differing from a training example.
Resources¶
18.4 Score Matching and Ratio Matching¶
Score:
The strategy of score matching: minimize the expected squared difference between the derivative of the model’s log density with respect to the input and the derivative of the data’s log density with respect to the input
This objective function avoids the difficulties associated with differentiating the partition function Z because Z is not a function of x and therefore \(\nabla_x Z = 0\).
Minimizing the expected value of \(L(x, \theta)\) is equivalent to minimizing the expected value of
where n is the dimensionality of x. Score matching is not applicable to discrete data, but the latent variables in the model may be discrete.
It is not compatible with methods that provide only a lower bound on \(\log \hat{p}(x)\), because score matching requires the derivatives and second derivatives of \(\log \hat{p}(x)\), and a lower bound conveys no information about its derivatives. This means that score matching cannot be applied to estimating models with complicated interactions between the hidden units, such as sparse coding models and deep Boltzmann machines. While score matching can be used to pretrain the first hidden layer of a large model, it has not been applied as a pretraining strategy for the deeper layers of a larger model.
An approach to extending the basic ideas of score matching to discrete data is ratio matching. Ratio matching applies specifically to binary data. Objective function:
where f(x, j) returns x with the bit at position j flipped. Ratio matching avoids the partition function using the same trick as the pseudolikelihood estimator: in a ratio of two probabilities, the partition function cancels out.
Ratio matching requires n evaluations of \(\hat{p}\) per data point, making its computational cost per update roughly n times higher than that of SML.
Review: stochastic maximum likelihood (SML), AKA persistent contrastive divergence (PCD), initializes the Markov chains at each gradient step with their states from the previous gradient step.
As with pseudolikelihood, ratio matching can be thought of as pushing down all fantasy states that have only one variable different from a training example.
Ratio matching can be useful as the basis for dealing with high-dimensional sparse data, such as word count vectors. This kind of data poses a challenge for MCMC-based methods because the data is extremely expensive to represent in dense format, yet the MCMC sampler does not yield sparse values until the model has learned to represent the sparsity in the data distribution.
18.5 Denoising Score Matching¶
We may wish to regularize score matching, by fitting a distribution
rather than the true \(p_{data}\). The distribution q(x|y) is a corruption process, usually one that forms x by adding a small amount of noise to y.
Denoising score matching is especially useful because in practice we usually do not have access to the true \(p_{data}\) but rather only an empirical distribution defined by samples from it.
Several autoencoder training algorithms are equivalent to score matching or denoising score matching. These autoencoder training algorithms are therefore a way of overcoming the partition function problem.
18.6 Noise-Contrastive Estimation¶
Noise-contrastive estimation estimates probability distribution by
where c is explicitly introduced as an approximation of \(-\log Z(\theta)\). Rather than estimating only \(\theta\), the NCE procedure treats c as just another parameter and estimates \(\theta\) and c simultaneously, using the same algorithm for both. The resulting \(\log p_{model}(x)\) might not correspond exactly to a valid probability distribution, but it will become closer and closer to being valid as the estimate of c improves.
NCE works by reducing the unsupervised learning problem of estimating p(x) to that of learning a probabilistic binary classifier in which one of the categories corresponds to the data generated by the model. Specifically, we introduce a noise distribution \(p_{noise}(x)\) which is tractable to evaluate and sample from. We can now construct a model over both x and a new, binary class variable y. In the new joint model:
y is a switch variable that determines whether we will generate x from the model or from the noise distribution.
We can construct a similar joint model of training data. In this case, the switch variable determines whether we draw x from the data or from the noise.
We can now just use standard maximum likelihood learning on the supervised learning problem of fitting \(p_{joint}\) to \(p_{train}\):
\(p_{joint}\): a logistic regression model applied to the difference in log probabilities of the model and the noise distribution:
NCE is simple to apply as long as
- \(\hat{p}_{model}\) is easy to back-propagate
- \(p_{noise}(x)\) is easy to evaluate in order to evaluate \(p_{joint}(x)\)
- \(p_{noise}(x)\) is easy to sample from to generate training data
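A minimal sketch of the NCE objective for an assumed 1-D toy model (an unnormalized Gaussian with a free parameter c standing in for \(-\log Z\), against Gaussian noise; all the specific numbers are invented). The classifier's logit is the difference of log probabilities described above, and a crude grid search stands in for gradient-based optimization:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy setup (assumption): unnormalized model log p_tilde(x; m) = -(x - m)^2 / 2, with an
# extra parameter c standing in for -log Z, and fixed noise p_noise = N(0, 4).
log_p_model = lambda x, m, c: -0.5 * (x - m) ** 2 + c
log_p_noise = lambda x: -0.5 * x ** 2 / 4.0 - 0.5 * np.log(2 * np.pi * 4.0)

def nce_loss(m, c, x_data, x_noise):
    """Binary cross-entropy of a data-vs-noise logistic regression classifier
    whose logit is log p_model(x) - log p_noise(x)."""
    logit_data = log_p_model(x_data, m, c) - log_p_noise(x_data)
    logit_noise = log_p_model(x_noise, m, c) - log_p_noise(x_noise)
    return -(np.log(sigmoid(logit_data)).mean() +
             np.log(1.0 - sigmoid(logit_noise)).mean())

x_data = rng.normal(1.5, 1.0, size=2000)    # "real" data with mean 1.5
x_noise = rng.normal(0.0, 2.0, size=2000)   # samples from p_noise

# Crude grid search over (m, c); a real implementation would use gradients.
losses = [(nce_loss(m, c, x_data, x_noise), m, c)
          for m in np.linspace(-1, 3, 41) for c in np.linspace(-3, 1, 41)]
print(min(losses))  # best m should land near 1.5 and c near -log sqrt(2*pi) ≈ -0.92
```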
When to use Noise Contrastive Estimation:
- NCE is most successful when applied to problems with few random variables, but it can work well even if those random variables can take on a high number of values. E.g.: modeling the conditional distribution over a word given the context of the word.
- It is less efficient when applied to problems with many random variables. The logistic regression classifier can reject a noise sample by identifying any one variable whose value is unlikely. This means that learning slows down greatly after \(p_{model}\) has learned the basic marginal statistics.
The constraint that \(p_{noise}\) must be easy to evaluate and easy to sample from can be overly restrictive. When \(p_{noise}\) is too simple, most samples are likely to be too obviously distinct from the data to force \(p_{model}\) to improve noticeably.
Like score matching and pseudolikelihood, NCE does not work if only a lower bound on \(\hat{p}\) is available.
When the model distribution is copied to define a new noise distribution before each gradient step, NCE defines a procedure called self-contrastive estimation, whose expected gradient is equivalent to the expected gradient of maximum likelihood.
- The special case of NCE where the noise samples are those generated by the model suggests that maximum likelihood can be interpreted as a procedure that forces a model to constantly learn to distinguish reality from its own evolving beliefs.
- Noise contrastive estimation achieves some reduced computational cost by only forcing the model to distinguish reality from a fixed baseline (the noise model).
18.7 Estimating the Partition Function¶
eg:
- Model A: \(p_A(x; \theta_A) = \frac{1}{Z_A}\hat{p}_A(x; \theta_A)\)
- Model B: \(p_B(x; \theta_B) = \frac{1}{Z_B}\hat{p}_B(x; \theta_B)\)
Suppose the test set consists of m examples \(\{ x^{(1)}, ..., x^{(m)} \}\), if
we say that model A is a better model than model B, in the sense that it has a better log-likelihood. If we knew the ratio of the two partition functions, \(r=\frac{Z(\theta_B)}{Z(\theta_A)}\), and we knew the actual value of just one of the two, we could compute the value of the other:
A simple way to estimate the partition function is to use a Monte Carlo method such as simple importance sampling:
Proposal distribution: \(p_0 = \frac{1}{Z_0}\hat{p}_0(x)\), which supports tractable sampling and tractable evaluation of both the partition function and the unnormalized distribution.
So, we make a Monte Carlo estimator, \(\hat{Z}_1\)
This value also allows us to estimate the ratio between the partition functions as
- If \(p_0\) is close to \(p_1\): effective way of estimating partition function
- If not (which is the case for most pairs of distributions), it is difficult to find a tractable \(p_0\) that is simple enough to evaluate while still being close enough to \(p_1\) to result in a high-quality approximation. Most samples from \(p_0\) will have low probability under \(p_1\) and therefore make a negligible contribution to the sum.
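A minimal sketch of this simple importance-sampling estimator of a partition function, on an assumed 1-D toy target whose true \(Z_1\) is known in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy target (assumption): p1_tilde(x) = exp(-(x - 1)^2 / 2), whose true Z_1 = sqrt(2*pi).
# Proposal p_0 = N(0, 1) with known Z_0 and tractable sampling.
p1_tilde = lambda x: np.exp(-0.5 * (x - 1.0) ** 2)
p0_tilde = lambda x: np.exp(-0.5 * x ** 2)
Z0 = np.sqrt(2 * np.pi)

x = rng.normal(0.0, 1.0, size=100_000)            # samples from p_0
Z1_hat = Z0 * np.mean(p1_tilde(x) / p0_tilde(x))  # Z_1 ≈ Z_0 * E_{p_0}[p1_tilde / p0_tilde]
print(Z1_hat, np.sqrt(2 * np.pi))                 # estimate vs. the true value
```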
Two strategies to cope with the challenging task of estimating partition functions for complex distributions over high-dimensional spaces:
- Annealed Importance Sampling
- Bridge Sampling
Both attempt to overcome the problem of the proposal \(p_0\) being too far from \(p_1\) (i.e. \(D_{KL}(p_0||p_1)\) is large) by introducing intermediate distributions that bridge the gap between \(p_0\) and \(p_1\).
18.7.1 Annealed Importance Sampling¶
Annealed importance sampling attempts to bridge the gap by introducing intermediate distributions: \(p_{\eta_0}, .... p_{\eta_n}\), where \(0=\eta_0< \eta_1 < ... < \eta_n=1\).
We begin with a simpler model with a known partition function and estimate the ratio between the two models’ partition functions. The estimate of this ratio is based on the estimates of the ratios of a sequence of many similar distributions, such as the sequence of RBMs with weights interpolating between 0 and the learned weights.
Provided the distributions \(p_{\eta_j}\) and \(p_{\eta_{j+1}}\), for all \(0 \le j \le n-1\), are sufficiently close, we can reliably estimate each of the factors \(\frac{Z_{\eta_{j+1}}}{Z_{\eta_j}}\) using simple importance sampling and then use these to obtain an estimate of \(\frac{Z_1}{Z_0}\).
One general purpose and popular choice for the intermediate distribution is to use the weighted geometric average of the target distribution \(p_1\) and the starting proposal distribution \(p_0\):
In order to sample from these intermediate distributions, we define a series of Markov chain transition functions \(T_{\eta_j}(x'|x)\) that give the conditional probability of transitioning to x’ given that we are currently at x. The transition operator \(T_{\eta_j}(x'|x)\) is defined to leave \(p_{\eta_j}(x)\) invariant:
Strategy of AIS: generate sample from \(p_0\) and use the transition operator to sequentially generate samples from the intermediate distribution until we arrive at samples from the target distribution \(p_1\):
For sample k, we can derive the importance weight by chaining together the importance weights for the jumps between the intermediate distributions:
To avoid numerical issues such as overflow, it is probably best to compute \(\log w^{(k)}\) by adding and subtracting log probabilities rather than computing \(w^{(k)}\) by multiplying and dividing probabilities.
The estimate of the ratio of the partition functions:
AIS is currently the most common way of estimating the partition function for undirected probability models.
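A minimal sketch of AIS on an assumed 1-D toy problem (geometric-average intermediate distributions, a random-walk Metropolis transition operator at each level, and log-space weight accumulation; the target and the number of levels are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (assumption): estimate Z_1 of p1_tilde(x) = exp(-(x - 3)^2 / 0.5), starting
# from p_0 = N(0, 1) with known Z_0. Intermediate distributions are weighted geometric
# averages: log p_eta = (1 - eta) * log p0_tilde + eta * log p1_tilde.
log_p0 = lambda x: -0.5 * x ** 2
log_p1 = lambda x: -(x - 3.0) ** 2 / 0.5
log_p_eta = lambda x, eta: (1.0 - eta) * log_p0(x) + eta * log_p1(x)
Z0 = np.sqrt(2 * np.pi)
etas = np.linspace(0.0, 1.0, 101)

def metropolis_step(x, eta, step=0.5):
    """A transition operator that leaves p_eta invariant (random-walk Metropolis)."""
    prop = x + step * rng.standard_normal(x.shape)
    accept = np.log(rng.random(x.shape)) < log_p_eta(prop, eta) - log_p_eta(x, eta)
    return np.where(accept, prop, x)

n_chains = 2000
x = rng.standard_normal(n_chains)              # exact samples from p_0
log_w = np.zeros(n_chains)
for eta_prev, eta_next in zip(etas[:-1], etas[1:]):
    log_w += log_p_eta(x, eta_next) - log_p_eta(x, eta_prev)   # chain the importance weights
    x = metropolis_step(x, eta_next)

# Z_1 / Z_0 is estimated by the mean importance weight, computed in log space.
log_ratio = np.log(np.mean(np.exp(log_w - log_w.max()))) + log_w.max()
print(Z0 * np.exp(log_ratio), np.sqrt(np.pi * 0.5))   # estimate vs. true Z_1 ≈ 1.2533
```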
18.7.2 Bridge Sampling¶
Bridge sampling relies on a single distribution \(p_*\), known as the bridge, to interpolate between a distribution with known partition function, \(p_0\), and a distribution \(p_1\) for which we want to estimate the partition function \(Z_1\).
Bridge sampling estimates the ratio as the ratio of the expected importance weights between \(\hat{p}_0\) and \(\hat{p}_*\), and between \(\hat{p}_*\) and \(\hat{p}_1\).
If the bridge distribution \(p_*\) is chosen carefully to have a large overlap of support with both \(p_0\) and \(p_1\), then bridge sampling can allow the distance between the two distributions (or \(D_{KL}(p_0||p_1)\)) to be much larger than with standard importance sampling.
Optimal bridging distribution:
where \(r=Z_0/Z_1\). It is possible to start with a coarse estimate of r and use the resulting bridge distribution to refine our estimate iteratively.
Linked importance sampling: leverages the power of the bridge sampling strategy to bridge the intermediate distributions used in AIS and significantly improve the overall partition function estimation.
Estimating the partition function while training: AIS is the standard method for estimating the partition function for many undirected models. However, it is so computationally intensive that it remains infeasible to use during training. Alternative strategies have been explored to maintain an estimate of the partition function throughout training.
19 Approximate Inference¶
- v: visible variables
- h: hidden variables
Challenge of inference: compute p(h|v) or take expectation with respect to it.
Review:
Direct interactions in undirected models, or explaining-away interactions between mutual ancestors of the same visible unit in directed models ==> interactions between latent variables => intractable inference problem.
19.1 Inference as Optimization¶
Exact inference can be described as an optimization problem.
- Goal: compute \(\log p(v; \theta)\)
- Challenge: costly to marginalize out h
- Solution: Compute the lower bound of \(\log p(v; \theta)\), aka evidence lower bound (ELBO).
Introduce q as an arbitrary distribution over h
Now we have
Because KL divergence is always nonnegative we have lower bound
For appropriate choices of q, L is tractable to compute. For q(h|v) that are better approximations of p(h|v), the lower bound will be tighter, meaning closer to \(\log p(v)\). When q(h|v) = p(h|v), the approximation is perfect, and \(L(v, \theta, q) = \log p(v; \theta)\).
We can think of inference as the procedure for finding the q that maximizes L. Solutions
- Exact inference maximizes L perfectly by searching over a family of functions q that includes p(h|v). (Brute Force)
- Approximate by restricting the family of distributions q that the optimization is allowed to search over
- Approximate by using an imperfect optimization procedure that may not completely maximize L but may only increase it by a significant amount.
Review on Shannon entropy
We can quantify the amount of uncertainty in an entire probability distribution using the Shannon entropy:
19.2 Expectation Maximization¶
EM is not an approximate inference algorithm (for p(h|v)), but rather an approach to learning with an approximate posterior.
E-step (expectation step):
- \(\theta^{(0)}\): value of parameters at the beginning of the step.
- Set \(q(h^{(i)}|v) = p(h^{(i)}|v^{(i)}, \theta^{(0)})\) for all indices i of the training example \(v^{(i)}\) we want to train on (batch or minibatch)
- q is defined with current value of \(\theta^{(0)}\). If we change \(\theta\), \(p(h|v, \theta)\) will change, but q(h| v) will remain equal to \(p(h|v, \theta^{(0)})\)
The M-step (Maximization step):
- Completely or partially maximize \(\sum_i L(v^{(i)}, h, \theta)\) with respect to \(\theta\)
Review on ELBO:
Because KL divergence is always nonnegative we have lower bound
E.g (Batch with size 2):
Step 0 (E-step)
- \(q^{(0, 0)}(h) = p(h^{(0)}|x^{(0)}; \theta^{(0)})\)
- \(q^{(0, 1)}(h) = p(h^{(1)}|x^{(1)}; \theta^{(0)})\)
Step 0 (M-step)
- draw h from \(q^{(0, 0)}\), we have one lower bound \(L^{(0, 0)}(v^{(0)}, q^{(0, 0)}, \theta^{(0)})\)
- draw h from \(q^{(0, 1)}\), we have one lower bound \(L^{(0, 1)}(v^{(1)}, q^{(0, 1)}, \theta^{(0)})\)
- Maximize \(L^{(0, 0)} + L^{(0, 1)}\) with respect to \(\theta\). We get \(\theta^{(1)}\)
Step 1 (E-step)
- \(q^{(1, 0)}(h) = p(h^{(0)}|x^{(0)}; \theta^{(1)})\)
- \(q^{(1, 1)}(h) = p(h^{(1)}|x^{(1)}; \theta^{(1)})\)
Step 1 (M-step)
- draw h from \(q^{(1, 0)}\), we have one lower bound \(L^{(1, 0)}(v^{(0)}, q^{(1, 0)}, \theta^{(1)})\)
- draw h from \(q^{(1, 1)}\), we have one lower bound \(L^{(1, 1)}(v^{(1)}, q^{(1, 1)}, \theta^{(1)})\)
- Maximize \(L^{(1, 0)} + L^{(1, 1)}\) with respect to \(\theta\). We get \(\theta^{(2)}\)
… and so on, alternating E-steps and M-steps.
Stop when \(L^{(n, 0)} + L^{(n, 1)}\) improves by less than some tolerance, or declare convergence if EM is improving too slowly. (A small numeric sketch of this loop follows.)
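A minimal sketch of these alternating E- and M-steps for an assumed toy model (a two-component 1-D Gaussian mixture with unit variances, invented for illustration), where the E-step posteriors and M-step updates are available in closed form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): a 1-D mixture of two unit-variance Gaussians.
x = np.concatenate([rng.normal(-2.0, 1.0, 300), rng.normal(2.0, 1.0, 300)])
mu = np.array([-0.5, 0.5])                 # theta^(0): initial component means
pi = np.array([0.5, 0.5])                  # initial mixing proportions

gauss = lambda x, m: np.exp(-0.5 * (x - m) ** 2) / np.sqrt(2 * np.pi)

for step in range(50):
    # E-step: q(h = k | x_i) = p(h = k | x_i; theta^(t)), the component responsibilities.
    resp = pi * gauss(x[:, None], mu)      # (n, 2) unnormalized posteriors
    resp /= resp.sum(axis=1, keepdims=True)

    # M-step: maximize the expected complete-data log-likelihood w.r.t. theta.
    pi = resp.mean(axis=0)
    mu = (resp * x[:, None]).sum(axis=0) / resp.sum(axis=0)

print(mu, pi)                               # the means should approach -2 and 2
```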
This can be viewed as a coordinate ascent algorithm to maximize L. On one step, we maximize L with respect to q, and on the other, we maximize L with respect to \(\theta\).
Stochastic gradient ascent can be viewed as a special case of the EM algorithm where the M-step consists of taking a single gradient step. Other variants of the EM algorithm can make much larger steps.
Even though the E-step involves exact inference, we can think of the EM algorithm as using approximate inference in some sense: the M-step assumes that the same value of q can be used for all values of \(\theta\). This introduces a gap between L and the true \(\log p(v)\).
Insights from EM algo:
Basic structure of the learning process: we update the model parameters to improve the likelihood of a completed dataset, where the missing variables have their values provided by an estimate of the posterior distribution. This is not unique to EM; e.g. gradient descent to maximize the log-likelihood also takes expectations with respect to the posterior distribution over the hidden units.
We can continue to use one value of q even after we have moved to a different value of \(\theta\). This is more unique to EM.
- Used throughout classical ML to derive large M step
- In DL, most models are too complicated to admit a tractable solution for an optimal large M-step update.
Resource¶
19.3 MAP Inference and Sparse Coding¶
An alternative form of inference is to compute the single most likely value of the missing variables h, rather than to infer the entire distribution p(h|v) over their possible values.
AKA, maximum a posteriori inference.
It is helpful to think of MAP inference as a procedure that provides a value of q for \(L(v, \theta, q)\), the ELBO. In this sense we can think of MAP inference as approximate inference, because it does not provide the optimal q.
Review on ELBO:
with respect to q over an unrestricted family of probability distributions, using an exact optimization algorithm. We can derive MAP inference as a form of approximate inference by restricting the family of distributions that q may be drawn from.
This means that we can control q entirely via \(\mu\). Dropping terms of L that do not vary with \(\mu\), we are left with the optimization problem:
equivalent to MAP inference problem:
Learning procedure similar to the EM algorithm, alternating between:
- Perform MAP inference to infer \(h^*\)
- Update \(\theta\) to increase \(\log p(h^*, v)\)
MAP inference is commonly used in DL as both a feature extractor and a learning mechanism. It is primarily used for sparse coding models.
Review on Sparse Coding:
Sparse coding is a linear factor model that imposes a sparsity-inducing prior on its hidden units. A common choice is a factorial Laplace prior, with
The visible units are then generated by performing a linear transformation and adding noise:
- Challenge: Every pair of variables \(h_i\) and \(h_j\) are both parents of v => if v is observed, the graphical model contains an active path connecting \(h_i\) and \(h_j\) => All the hidden units participate in one massive clique in p(h|v). => Computing or even representing p(h|v) is difficult.
- Solution: Use MAP inference and learn parameters by maximizing ELBO defined by the Dirac distribution around the MAP estimate of h.
If we concatenate all h vectors in the training set into a matrix H and concatenate all the v vectors into matrix V, then sparse coding learning process consists of minimizing:
We can minimize J by alternating between
- minimization with respect to H, a convex problem
- minimization with respect to W, a convex problem
19.4 Variational Inference and Learning¶
Core idea of variational learning: We can maximize L over a restricted family of distribution q.
Common restriction:
This is called the mean field approach.
We do not need to specify a specific parametric form of q. We specify how it should factorize; the optimization problem then determines the optimal probability distribution within those factorization constraints.
- For discrete latent variables: use traditional optimization techniques to optimize a finite number of variables describing the q distribution.
- For continuous latent variables: use a branch of mathematics called the calculus of variations to perform optimization over a space of functions and actually determine which function should be used to represent q.
Because \(L(v, \theta, q) = \log p(v) - KL[q(h) || p(h|v)]\), maximize L => minimize KL[q(h) || p(h|v)]. We are fitting q to p.
- When we use maximum likelihood learning to fit a model to data, we minimize \(D_{KL}(p_{data} || p_{model})\), which encourages the model to have high probability everywhere that the data has high probability.
- The optimization-based inference procedure encourages q to have low probability everywhere the true posterior has low probability.
Review:
In the inference optimization problem, we choose to use \(D_{KL}(q(h|v) || p(h|v))\) for computational reasons.
19.4.1 Discrete Latent Variables¶
We define a distribution q, typically one where each factor of q is just defined by a lookup table over discrete states.
After determining how to represent q, we simply optimize its parameters. In principle the selection of q could be done with any optimization algorithm, such as gradient descent.
Because this optimization must occur in the inner loop of a learning algorithm, it must be very fast. We typically use special optimization algorithms that are designed to solve comparatively small and simple problems in few iterations. A popular choice is to iterate fixed-point equations:
Example of applying variational inference to binary sparse coding model.
Input \(v \in R^n\) is generated from the model by adding Gaussian noise to the sum of m different components.
- b: learnable set of biases
- W: learnable weight matrix
- \(\beta\): learnable, diagonal precision matrix
Now training the model with maximum likelihood requires taking the derivatives with respect to the parameters. Consider the derivative with respect to one of the biases:
See the graph model:
p(h|v) is a complicated distribution.
Solution: variational inference and variational learning.
Mean field approximation:
Use \(\hat{h}\) to represent q: \(q(h_i | v) = \hat{h_i}\)
Tractable lower bound:
L can be expressed in a small number of simple arithmetic operations. The ELBO L is therefore tractable.
We could simply run gradient ascent on both \(\theta\) and \(\hat{h}\). However, we do not do this, for two reasons:
- This would require storing \(\hat{h}\) for every v.
- We would like to extract the features \(\hat{h}\) very quickly, in order to recognize the content of v.
Instead, we rapidly estimate them with fixed-point equations.
We can iteratively apply the solution to the equation for i = 1, … m, and repeat the cycle until we satisfy a convergence criterion. Common convergence criteria (a small sketch of this fixed-point loop follows the list):
- Stop when a full cycle of updates does not improve L by more than some tolerance amount
- Stop when the cycle does not change \(\hat{h}\) by more than some amount.
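A minimal sketch of such a fixed-point cycle for an assumed toy pairwise binary model (a small fully visible Boltzmann machine, not the binary sparse coding model of the text); the standard mean field update for that model is \(\hat{h}_i \leftarrow \sigma(b_i + \sum_j J_{ij}\hat{h}_j)\):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy pairwise binary model (assumption): log p_tilde(x) = sum_{i<j} J_ij x_i x_j + sum_i b_i x_i.
n = 8
J = rng.normal(scale=0.5, size=(n, n)); J = np.triu(J, 1); J = J + J.T   # symmetric, zero diagonal
b = rng.normal(scale=0.2, size=n)

# Mean field: q(x) = prod_i q(x_i), with q(x_i = 1) = mu_i.
mu = np.full(n, 0.5)
for sweep in range(100):
    mu_old = mu.copy()
    for i in range(n):                        # one full cycle of coordinate updates
        mu[i] = sigmoid(b[i] + J[i] @ mu)
    if np.max(np.abs(mu - mu_old)) < 1e-6:    # stop when a cycle barely changes mu
        break

print(sweep, mu)
```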
Binary sparse coding example:
To apply the fixed-point update inference rule, we solve for the \(\hat{h}\) that sets equation 19.43 to 0.
The mean field fixed-point equations define a recurrent neural network. The task of this network is to perform inference.
In the case of binary sparse coding, the recurrent network specified in equation 19.44 consists of repeatedly updating the hidden units based on the changing values of the neighboring hidden units. The input always sends a fixed message of \(v^T\beta W\) to the hidden units, but the hidden units constantly update the messages they send to each other. More specifically, two units \(\hat{h}_i\) and \(\hat{h}_j\) inhibit each other when their weight vectors are aligned. Between two hidden units that both explain the input, only the one that explains the input the best will be allowed to remain active. This competition is the mean field approximation’s attempt to capture the explaining-away interactions in the binary sparse coding posterior. The explaining-away effect actually should cause a multimodal posterior. Unfortunately, explaining-away interactions cannot be modeled by the factorial q used for mean field, so the mean field approximation is forced to choose one mode to model.
Rewriting equation 19.44:
We can think of unit i as attempting to encode the residual error in v given the code of the other units. We can thus think of sparse coding as an iterative autoencoder, which repeatedly encodes and decodes its input, attempting to fix mistakes in the reconstruction after each iteration.
19.4.2 Calculus of Variations¶
- A function of a function f is known as a functional J[f].
- We can take the functional derivative of the functional J with respect to the individual values of the function f(x) at any specific value of x.
- The functional derivative of the functional J with respect to the value of the function f at point x is denoted \(\frac{\delta}{\delta f(x)}J\)
- For differentiable f(x) and differentiable g(y, x) with continuous derivatives:
We can optimize a functional by solving for the function where the functional derivative at every point is equal to zero.
e.g. find the probability distribution function over \(x \in R\) that has maximal differential entropy.
We cannot simply maximize H[p] with respect to the function p(x). Restrictions we want to impose:
- p(x) integrates to 1
- Entropy increases without bound as the variance increases => the problem becomes uninteresting => fix the variance to \(\sigma^2\)
- The distribution can be shifted without changing its entropy => fix the mean of the distribution to \(\mu\)
Review on Lagrange function:
So the Lagrangian functional for this optimization problem is:
To minimize the Lagrangian with respect to p, we set the functional derivatives to 0:
We are free to choose any \(\lambda\) values, because the gradient of the Lagrangian with respect to \(\lambda\) is 0 as long as the constraints are satisfied. We may set \(\lambda_1 = 1 - \log \sigma \sqrt{2 \pi}\), \(\lambda_2 = 0\), and \(\lambda_3 = -\frac{1}{2\sigma^2}\) so that we obtain \(p(x) = \mathcal{N}(x; \mu, \sigma^2)\).
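A hedged reconstruction of the omitted steps (the standard maximum-entropy derivation, written to be consistent with the \(\lambda\) values quoted above):

```latex
% Lagrangian functional for the constrained maximum-entropy problem:
\mathcal{L}[p] = \lambda_1\!\left(\textstyle\int p(x)\,dx - 1\right)
  + \lambda_2\,(\mathbb{E}[x] - \mu)
  + \lambda_3\,(\mathbb{E}[(x-\mu)^2] - \sigma^2)
  + H[p]
% Setting the functional derivative to zero for every x:
\frac{\delta \mathcal{L}}{\delta p(x)}
  = \lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2 - 1 - \log p(x) = 0
\;\Rightarrow\;
p(x) = \exp\!\big(\lambda_1 + \lambda_2 x + \lambda_3 (x-\mu)^2 - 1\big)
% With \lambda_1 = 1 - \log(\sigma\sqrt{2\pi}), \lambda_2 = 0, \lambda_3 = -1/(2\sigma^2):
p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
     = \mathcal{N}(x;\mu,\sigma^2)
```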
This is one reason for using the normal distribution when we do not know the true distribution. Because the normal distribution has the maximum entropy, we impose the least possible amount of structure by making this assumption.
For a model with continuous latent variables, if we make the mean field assumption:
and fix \(q(h_j|v)\) for all \(j \neq i\), then the optimal \(q(h_i|v)\) may be obtained by normalizing the unnormalized distribution
as long as p does not assign 0 probability to any joint configuration of the variables. This is a fixed-point equation, designed to be applied iteratively for each value of i repeatedly until convergence.
The training algorithm tends to adapt the model in a way that makes the approximating assumption underlying the approximate inference algorithm become more true. When training the parameters, variational learning increases
for specific v, this
- increase p(h|v) for values of h that have high probability under q(h|v)
- decrease p(h|v) for values of h that have low probability under q(h|v)
We often estimate \(\log p(v; \theta)\) after training the model and find that the gap with \(L(v, \theta, q)\) is small.
- From this, we can conclude that our variational approximation is accurate for the specific value of \(\theta\) that we obtained from the learning process.
- We should not conclude that our variational approximation is accurate in general or that the variational approximation did little harm to the learning process.
19.5 Learned Approximate Inference¶
Explicitly performing optimization via
- fixed point equition
- gradient-based optimization
is often very expensive and time consuming.
Solution: we can think of the optimization process as a function f that maps an input v to an approximate distribution \(q^* = argmax_q L(v, q)\). Once we think of the multistep iterative optimization process as just being a function, we can approximate it with a neural network that implements an approximation \(\hat{f}(v, \theta)\).
19.5.1 Wake-Sleep¶
- The main difficulty with training a model to infer h from v is that we do not have a supervised training set with which to train the model. Given a v, we do not know the appropriate h.
- Solution: the wake-sleep algorithm resolves this problem by drawing samples of both h and v from the model distribution.
- Drawback of this approach: we will only be able to train the inference network on values of v that have high probability under the model. Early in learning, the model distribution will not resemble the data distribution, so the inference network will not have an opportunity to learn on samples that resemble data.
Review on 18.1:
Target probability distribution:
The gradient
- \(\nabla_{\theta} \log\hat{p}(x;\theta)\) : Positive phase. Models with no latent variables or with few interaction between the latent variables typically have a tractable positive phase.
- \(\nabla_{\theta} \log Z(\theta)\): Negative phase. For undirected model, the negative phase is difficult.
Review on 18.2
Because the negative phase involves drawing samples from the model’s distribution, we can think of it as finding points that the model believes in strongly. Because the negative phase acts to reduce the probability of those points, they are generally considered to represent the model’s incorrect beliefs about the world. They are frequently referred to in the literature as “hallucinations” or “fantasy particles.” In fact, the negative phase has been proposed as a possible explanation for dreaming in humans and other animals, the idea being that the brain maintains a probabilistic model of the world and follows the gradient of \(\log \hat{p}\) when experiencing real events while awake and follows the negative gradient of \(\log \hat{p}\) to minimize log Z while sleeping and experiencing events sampled from the current model.
Another possible explanation for biological dreaming is that it is providing samples from p(h, v) which can be used to train an inference network to predict h given v.
Paper: Auto-Encoding Variational Bayes¶
How to perform efficient approximate inference and learning with directed probabilistic models whose continuous latent variables and/or parameters have intractable posteriors.
Variational Bayesian methods involve the optimization of an approximation to the intractable posterior. The common mean field approach requires analytical solutions of expectations w.r.t. the approximate posterior, which are also intractable in the general case.
This paper shows how a reparameterization of the variational lower bound yields a simple differentiable unbiased estimator of the lower bound.
This Stochastic Gradient Variational Bayes (SGVB) estimator can be used for efficient approximate posterior inference in almost any model with continuous latent variables and/or parameters, and it is straightforward to optimize using standard stochastic gradient ascent techniques.
For the case of an i.i.d. dataset and continuous latent variables per datapoint, the paper proposes the Auto-Encoding VB (AEVB) algorithm. It uses the SGVB estimator to optimize a recognition model that allows us to perform efficient approximate posterior inference using simple ancestral sampling => makes inference and learning especially efficient, which in turn allows us to efficiently learn the model parameters.
Learned posterior inference model, can also be used for a host of tasks
- recognition
- denoising
- representation
- visualization
When a neural network is used for the recognition model, we arrive at the variational auto-encoder.
2 Method¶
In this paper, we restrict ourselves to the common case where
- We have an i.i.d. dataset with latent variables per datapoint,
- We would like to perform maximum likelihood (ML) or maximum a posteriori (MAP) inference on the global parameters, and variational inference on the latent variables
We can extend the scenario if we want. But the paper is focused on the common cases.
2.1 Problem scenario¶
review on representation learning in DL textbook at the beginning of chapter 15:
Feedforward networks trained by supervised learning: the last layer is a linear classifier or nearest neighbor classifier. The rest of the network learns to provide a representation to this classifier. Training with a supervised criterion naturally leads to the representation at every hidden layer taking on properties that make the classification easier.
Just like supervised networks, unsupervised deep learning algorithms have a main training objective but also learn a representation as a side effect. Most representation learning problems face a trade-off between
- preserving as much information about the input as possible
- attaining nice properties such as independence
Consider some dataset \(X = \{x^{(i)}\}_{i=1}^N\) consisting of N i.i.d. samples of some discrete or continuous variable x.
We assume that the data are generated by some random process involving an unobserved continuous random variable z. This data-generating process consists of two steps:
- a value \(z^{(i)}\) is generated from some prior distribution \(p_{\theta^*}(z)\);
- a value \(x^{(i)}\) is generated from some conditional distribution \(p_{\theta^*}(x|z)\). We assume that
- the prior \(p_{\theta^*}(z)\) and the likelihood \(p_{\theta^*}(x|z)\) come from parametric families \(p_{\theta}(z)\) and \(p_{\theta}(x|z)\)
- Their probability density functions (PDFs) are differentiable with respect to both \(\theta\) and z
Unfortunately, a lot of this process is hidden from our view: the true parameters \(\theta^*\) as well as the values of the latent variables \(z^{(i)}\) are unknown to us.
Very importantly, we do not make simplifying assumptions about
- the marginal \(p_{\theta}(x)\)
- the posterior \(p_{\theta}(z|x)\)
Assumptions of this kind are very commonly made to sidestep these challenges, but we are not going to use that strategy here. So without such simplifying assumptions, what kinds of challenges are we facing?
Intractability:
- the marginal likelihood \(p_{\theta}(x) = \int p_{\theta}(z) p_{\theta}(x|z) dz\) is intractable
- the true posterior \(p_{\theta}(z|x) = p_{\theta}(x|z) p_{\theta}(z) / p_{\theta}(x)\) is intractable
- the integrals required for any reasonable mean-field VB algorithm are also intractable.
A large dataset: we have so much data that batch optimization is too costly. We would like to make parameter updates using small minibatches or even single datapoints. Sampling-based solutions, e.g. Monte Carlo EM, would in general be too slow, since they involve a typically expensive sampling loop per datapoint.
Review on intractable inference problem at Chapter 19 of DL textbook:
We are interested in, and propose solutions to, three related problems in the above scenario:
- Efficient approximate maximum likelihood (ML) or maximum a posteriori (MAP) estimation for the parameter \(\theta\).
- Efficient approximate posterior inference of the latent variable z given an observed value of x for a choice of \(\theta\). This is useful for coding and data representation tasks.
- Efficient approximate marginal inference of the variable x. This allows us to perform all kinds of tasks where a prior over x is required. Common applications in computer vision include image denoising, inpainting, and super-resolution.
For the purpose of solving the above problems, we introduce a recognition model \(q_{\phi}(z|x)\): an approximation to the intractable true posterior \(p_{\theta}(z|x)\)
- \(q_{\phi}(z|x)\), the probabilistic encoder: given a datapoint x, it produces a distribution (e.g. a Gaussian) over the possible values of the code z from which the datapoint x could have been generated.
- \(p_{\theta}(x|z)\), the probabilistic decoder: given a code z, it produces a distribution over the possible corresponding values of x.
2.2 The variational bound¶
Now we have
Because \(KL[q_{\phi}(z|x) || p_{\theta}(z|x)] \geq 0\), we have
We want to differentiate and optimize the lower bound w.r.t. both the variational parameters \(\phi\) and the generative parameters \(\theta\).
To estimate the gradient of this objective w.r.t. \(\phi\) using the naive Monte Carlo (score-function) approach, we have:
where \(z^{(l)} \sim q_{\phi}(z)\).
This gradient estimator exhibits very high variance.
2.3 The SGVB estimator and AEVB algorithm¶
Under certain mild conditions, outlined in 2.4, for a chosen approximate posterior \(q_{\phi}(z|x)\) we can reparameterize the random variable \(\hat{z} \sim q_{\phi}(z|x)\) using a differentiable transformation \(g_{\phi}(\epsilon , x)\) of an auxiliary noise variable \(\epsilon\):
We can now form Monte Carlo estimates of expectation of some function f(z) w.r.t. \(q_{\phi}(z|x)\)
Now we apply this technique to variational lower bound
where \(z^{(i, l)} = g_{\phi}(\epsilon ^{(i, l)} , x^{(i)})\) and \(\epsilon \sim p(\epsilon)\)
Review that :
The KL divergence can be integrated analytically, such that only the expected reconstruction error \(E_{z \sim q_{\phi}(z|x)} \log p_{\theta}(x | z)\) requires estimation by sampling.
The KL divergence can then be interpreted as regularizing \(\phi\), encouraging the approximate posterior \(q_{\phi}(z|x)\) to be close to the prior \(p_{\theta}(z)\). This yields a second version of the SGVB estimator:
where \(z^{(i, l)} = g_{\phi}(\epsilon ^{(i, l)} , x^{(i)})\) and \(\epsilon \sim p(\epsilon)\)
Given multiple datapoints from a dataset X with N datapoints, we can construct an estimator of the marginal likelihood lower bound of the full dataset, based on minibatches:
where the minibatch \(X^M = \{x^{(i)}\}_{i=1}^M\) is a randomly drawn sample of M datapoints from the full dataset X with N datapoints.
The number of samples L per datapoint can be set to 1 as long as the minibatch size M is large enough.
Now we look at the equation again, this time connecting it with the autoencoder:
where \(z^{(i, l)} = g_{\phi}(\epsilon ^{(i, l)} , x^{(i)})\) and \(\epsilon \sim p(\epsilon)\)
- \(- KL[q_{\phi}(z|x^{(i)}) || p_{\theta}(z)]\) serves as regularizer
- \(\frac{1}{L} \sum_{l=1}^{L} \log p_{\theta}(x^{(i)} | z^{(i, l)})\) serves as an expected negative reconstruction error. \(g_{\phi}(\cdot)\) is chosen such that it maps a datapoint \(x^{(i)}\) and a random noise vector \(\epsilon^{(l)}\) to a sample from the approximate posterior for that datapoint. The sample \(z^{(i, l)} = g_{\phi}(x^{(i)}, \epsilon^{(i, l)})\) is then the input to the function \(\log p_{\theta}(x^{(i)} | z^{(i, l)})\), which equals the log probability density of datapoint \(x^{(i)}\) under the generative model, given \(z^{(i, l)}\). This term is a negative reconstruction error in autoencoder parlance.
2.4 The reparameterization trick¶
Alternative method for generating samples from \(q_{\phi}(z|x)\):
- Let z be a continuous random variable and \(z \sim q_{\phi}(z|x)\)
- Express the random variable as a deterministic variable \(z = g_{\phi}(\epsilon, x)\)
- \(\epsilon\) is an auxiliary variable with independent marginal \(p(\epsilon)\)
- \(g_{\phi}(\cdot)\) is some vector-valued function parameterized by \(\phi\)
It is useful because it can be used to rewrite an expectation w.r.t \(q_{\phi}(z|x)\) such that the Monte Carlo estimate of the expectation is differentiable w.r.t \(\phi\).
How to choose \(g_{\phi}(\cdot)\) and the auxiliary variable \(p(\epsilon)\):
- Tractable inverse CDF
- Analogous to the Gaussian example
- Composition
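A minimal sketch of two of these choices, assuming a univariate Gaussian and an exponential approximate posterior as illustrative examples (the function names below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tractable inverse CDF: for an Exponential(rate) posterior, z = -log(1 - eps) / rate
# with eps ~ Uniform(0, 1) reproduces exact samples from the exponential distribution.
def reparam_exponential(rate, eps):
    return -np.log(1.0 - eps) / rate

# Analogous to the Gaussian example: z = mu + sigma * eps with eps ~ N(0, 1).
def reparam_gaussian(mu, sigma, eps):
    return mu + sigma * eps

eps_u = rng.uniform(size=100000)
eps_n = rng.standard_normal(100000)
z_exp = reparam_exponential(rate=2.0, eps=eps_u)          # samples from Exponential(2)
z_gauss = reparam_gaussian(mu=1.0, sigma=0.5, eps=eps_n)  # samples from N(1, 0.25)
print(z_exp.mean(), z_gauss.mean(), z_gauss.std())        # ~0.5, ~1.0, ~0.5
```

Both transformations are differentiable in their distribution parameters, which is exactly what the SGVB estimator needs.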
3. Examples: Variational Auto-Encoder¶
In the VAE, we use a neural network for the probabilistic encoder \(q_{\phi}(z|x)\). Below is the setup of the VAE:
\(p_{\theta}(z) = N(z; 0, I)\), a multivariate Gaussian prior
Generator network / decoder \(p_{\theta}(x | z)\), whose distribution parameters are computed from z with an MLP:
- if x is real-valued data: \(p_{\theta}(x | z)\) is a multivariate Gaussian
- if x is binary data: \(p_{\theta}(x | z)\) is a Bernoulli
Inference network / encoder \(q_{\phi}(z|x^{(i)}) = N(z; \mu^{(i)}, \sigma^{2(i)}I)\), where the mean and standard deviation \(\mu^{(i)}\), \(\sigma^{(i)}\) are the outputs of the encoding MLP.
Note that we could use many other ways to approximate the true posterior. The approximation is \(q_{\phi}(z|x)\), and it would still be Auto-Encoding Variational Bayes if we used other choices. The variational autoencoder chooses \(q_{\phi}(z|x^{(i)}) = N(z; \mu^{(i)}, \sigma^{2(i)}I)\), and it also assumes that the parameters \(\mu^{(i)}\), \(\sigma^{(i)}\) can be reparameterized. Does it have to be a multivariate Gaussian? No, it does not; the VAE simply chooses to believe that. The same applies to the generator / decoder in the VAE.
Remember that the objective function is:
Because in this model we choose \(p_{\theta}(z)\) and \(q_{\phi}(z|x)\) to be Gaussians, the KL divergence can be computed and differentiated in closed form, without estimation.
Now we break down the last expression:
put them together:
Now we can rewrite the estimated lower bound as:
where \(z^{(i, l)} = \mu^{(i)} + \sigma^{(i)} \odot \epsilon^{(l)}\) and \(\epsilon^{(l)} \sim N(0, I)\)
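A minimal NumPy sketch of this estimator for a single datapoint, assuming a diagonal-Gaussian encoder output (mu, log_var) and a Bernoulli decoder; the function and variable names are illustrative, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu, log_var, decode, L=1):
    """Estimated lower bound for one datapoint x.

    mu, log_var : encoder MLP outputs for x (diagonal Gaussian posterior).
    decode      : function mapping z to Bernoulli means over x (decoder MLP).
    """
    sigma = np.exp(0.5 * log_var)
    # Closed-form KL[N(mu, sigma^2 I) || N(0, I)] for a diagonal Gaussian.
    kl = -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
    # Monte Carlo estimate of the expected reconstruction log-likelihood.
    recon = 0.0
    for _ in range(L):
        eps = rng.standard_normal(mu.shape)
        z = mu + sigma * eps                      # reparameterization trick
        p = decode(z)                             # Bernoulli mean for each pixel
        recon += np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return recon / L - kl                         # lower bound on log p(x)

# Toy usage with a random stand-in "decoder" (an affine map squashed to (0, 1)).
D, K = 8, 2
W = np.random.default_rng(1).standard_normal((K, D))
decode = lambda z: 1.0 / (1.0 + np.exp(-(z @ W)))
x = rng.integers(0, 2, size=D).astype(float)
print(elbo_estimate(x, mu=np.zeros(K), log_var=np.zeros(K), decode=decode))
```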
20 Deep Generative Models¶
20.1 Boltzmann Machines¶
Boltzmann machine definition:
- Defined over a d-dimensional binary random vector \(x \in \{0, 1\}^d\)
- Energy-based model: \(P(x) = \frac{\exp(-E(x))}{Z}\)
- \(E(x) = -x^T Ux - b^Tx\) , where U is the weight matrix of model parameters and b is the vector of bias parameters.
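A minimal sketch of this energy-based definition, using brute-force enumeration for the partition function (only feasible for tiny d, and purely illustrative):

```python
import itertools
import numpy as np

def energy(x, U, b):
    # E(x) = -x^T U x - b^T x
    return -x @ U @ x - b @ x

def partition_function(U, b):
    d = len(b)
    # Sum exp(-E(x)) over all 2^d binary states (feasible only for very small d).
    return sum(np.exp(-energy(np.array(s), U, b))
               for s in itertools.product([0, 1], repeat=d))

rng = np.random.default_rng(0)
d = 4
U = rng.standard_normal((d, d)); U = 0.5 * (U + U.T)   # symmetric weight matrix
b = rng.standard_normal(d)
Z = partition_function(U, b)
x = np.array([1, 0, 1, 1])
print(np.exp(-energy(x, U, b)) / Z)                     # P(x) under the model
```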
While this scenario is certainly viable, it limits the kinds of interactions between the observed variables to those described by the weight matrix. Specifically, it means that the probability of one unit being on is given by a linear model (logistic regression) from the values of the other units.
A Boltzmann machine with hidden units is no longer limited to modeling linear relationship between variables. Instead, the Boltzmann machine becomes a universal approximator of probability mass functions over discrete variables.
Formally, we decompose the units x into two subsets:
- Visible units v
- Latent / Hidden units h
Then the energy function becomes:
Learning algorithms for Boltzmann machines are usually based on maximum likelihood. All Boltzmann machines have an intractable partition function, so the maximum likelihood gradient must be approximated.
An interesting property of Boltzmann machines trained with learning rules based on maximum likelihood is that the update for a particular weight connecting two units depends only on the statistics of those two units, collected under the different distributions \(p_{model}(v)\) and \(p_{data}(v)p_{model}(h|v)\). The rest of the network participates in shaping those statistics, but the weight can be updated without knowing anything about the rest of the network or how those statistics were produced. This means that the learning rule is local.
20.2 Restricted Boltzmann Machines¶
RBMs are undirected probabilistic graphical models containing a layer of observable variables and a single layer of latent variables. RBMs may be stacked to form deeper models.
- Observed layer : vector v with \(n_v\) binary random variables.
- Latent layer : vector h with \(n_h\) binary random variables.
Review on 16.7.1
With the graph above, we can see that there is no triangle clique. So each edge forms one clique, and the factor for the clique \(\{v_i, h_j\}\) looks like this:
That is why the unnormalized distribution for the graph as a whole looks like:
The partition function Z is intractable => P(v) is intractable to evaluate.
20.2.1 Conditional Distribution¶
Since we are conditioning on the visible units v, we can treat these as constant with respect to the distribution P(h|v).
Now, reviewing equation 3.25, we have \(1 - \sigma(x) = \sigma(-x)\).
So now we have:
A similar derivation gives:
Because the RBM admits efficient evaluation and differentiation of \(\hat{P}(v)\) and efficient MCMC sampling in the form of block Gibbs sampling, it can readily be trained with any of the techniques described in Chapter 18 for training models that have intractable partition functions. These include:
- CD
- SML
- Ratio Matching
- and so on
Review :
Gibbs Sampling
Draw samples from a distribution q(x) by repeatedly updating \(x \leftarrow x' \sim T(x'|x)\).
Problem: how to ensure that q(x) is a useful distribution
Solution:
- Derive T from a given learned model \(p_{model}\)
- Directly parameterize T and learn it, so that its stationary distribution implicitly defines the \(p_{model}\) of interest.
In DL, we commonly use Markov chain to draw samples from an energy-based model defining a distribution \(p_{model}(\vec{x})\). In this case, we want the \(q(\vec{x})\) for the Markov chain to be \(p_{model}(\vec{x})\). To obtain the desired q(x), we must choose an appropriate \(T(x'|x)\)
Gibbs sampling: sampling from \(T(x'|x)\) is accomplished by selecting one variable \(x_i\) and sampling it from \(p_{model}\) conditioned on its neighbors in the undirected graph G defining the structure of the energy-based model. We can also sample several variables at the same time as long as they are conditionally independent given all their neighbors.
Contrastive Divergence (CD)
Stochastic Maximum Likelihood (SML), also known as Persistent Contrastive Divergence (PCD)
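A minimal sketch of block Gibbs sampling and a single CD-1 parameter update for a binary RBM, assuming the standard energy \(E(v, h) = -b^T v - c^T h - v^T W h\) (all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def sample_h_given_v(v, W, c):
    p = sigmoid(c + v @ W)        # P(h_j = 1 | v) = sigmoid(c_j + v^T W_{:,j})
    return (rng.uniform(size=p.shape) < p).astype(float), p

def sample_v_given_h(h, W, b):
    p = sigmoid(b + h @ W.T)      # P(v_i = 1 | h) = sigmoid(b_i + W_{i,:} h)
    return (rng.uniform(size=p.shape) < p).astype(float), p

def cd1_update(v0, W, b, c, lr=0.1):
    h0, ph0 = sample_h_given_v(v0, W, c)      # positive phase statistics from data
    v1, _ = sample_v_given_h(h0, W, b)        # one block Gibbs step gives the
    _, ph1 = sample_h_given_v(v1, W, c)       # negative phase statistics
    W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
    b += lr * (v0 - v1)
    c += lr * (ph0 - ph1)

n_v, n_h = 6, 3
W = 0.01 * rng.standard_normal((n_v, n_h))
b, c = np.zeros(n_v), np.zeros(n_h)
v = rng.integers(0, 2, size=n_v).astype(float)
cd1_update(v, W, b, c)
```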
20.3 Deep Belief Networks¶
Deep belief networks:
- Generative models
- Several layers of latent variables, typically binary.
- Visible units: may be binary or real
- No intralayer connection
- The connection between the top 2 layers are undirected
- Connections between all the other layers are directed, with arrows pointed toward the layer that is close to the data.
A DBN with l hidden layers contains l weight matrices \(W^{(1)}, W^{(2)}, \dots, W^{(l)}\) and l + 1 bias vectors \(b^{(0)}, b^{(1)}, \dots, b^{(l)}\). The probability represented by the DBN:
To draw a sample from a DBN:
- Run several steps of Gibbs sampling on the top two hidden layers, essentially drawing a sample from the RBM defined by the top two hidden layers.
- use a single pass of ancestral sampling through the rest of the model to draw a sample from the visible units.
Inference in a deep belief network is intractable because of the explaining-away effect within each directed layer and the interaction between the two hidden layers that have undirected connections.
To train a deep belief network:
Train an RBM to maximize \(E_{v\sim p_{data}}\log p(v)\) using contrastive divergence or stochastic maximum likelihood.
The parameters of the RBM then define the parameters of the first layer of the DBN.
A second RBM is trained to maximize: \(E_{v \sim p_{data}} E_{h^{(1)} \sim p^{(1)} (h^{(1)}|v)} \log p^{(2)}(h^{(1)})\).
- \(p^{(1)}\) is the probability distribution represented by the first RBM. It is driven by data.
- \(p^{(2)}\) is the second RBM trained to model the distribution defined by sampling the hidden units of the first RBM
Repeat the previous step, with each new RBM modeling the samples of the previous one. This procedure can be justified as increasing a variational lower bound on the log-likelihood of the data under the DBN.
In most applications, no effort is made to jointly train the DBN after the greedy layer-wise procedure is complete. However, it is possible to perform generative fine-tuning using the wake-sleep algorithm.
We can take the weights from the DBN and use them to define an MLP:
\(h^{(1)} = \sigma(b^{(1)} + v^T W^{(1)})\)
\(h^{(l)} = \sigma(b^{(l)} + h^{(l-1)T} W^{(l)}) \quad \forall l \in 2, \dots, m\)
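A minimal sketch of the resulting deterministic MLP; in practice the weights and biases would come from the greedily pretrained RBM stack, while here they are random placeholders:

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def dbn_mlp_forward(v, weights, biases):
    """weights = [W1, ..., Wm], biases = [b1, ..., bm] taken from the stacked RBMs."""
    h = sigmoid(biases[0] + v @ weights[0])      # h^(1) = sigmoid(b^(1) + v^T W^(1))
    for W, b in zip(weights[1:], biases[1:]):
        h = sigmoid(b + h @ W)                   # h^(l) = sigmoid(b^(l) + h^(l-1)T W^(l))
    return h                                     # features that can feed a classifier

rng = np.random.default_rng(0)
sizes = [784, 500, 250]
weights = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(dbn_mlp_forward(rng.random(784), weights, biases).shape)   # (250,)
```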
20.4 Deep Boltzmann Machines¶
Deep Boltzmann Machines
- Entirely undirected model
- Several layers of latent variables
- Within each layer, the variables are mutually independent, conditioned on the variables in the neighboring layers.
- Energy based model
Review on energy based model:
Many interesting theoretical results about undirected models depend on the assumption that \(\forall x, \hat{p}(x) > 0\). A convenient way to enforce this condition is to use an energy-based model (EBM), where
E(x) is known as the energy function. By learning the energy function, we can use unconstrained optimization. Any distribution of the form given by the equation above is an example of a Boltzmann distribution.
e.g. DBM with 1 layer of visible variables and 3 hidden layers:
The DBM offers an advantage similar to the RBM: its layers can be organized into a bipartite graph, with odd layers on one side and even layers on the other. When we condition on the variables in the even layers, the variables in the odd layers become conditionally independent, and vice versa.
E.g of 2 hidden layers:
The bipartite structure makes Gibbs sampling in the deep Boltzmann machine efficient. It is possible to update all units in only two iterations: Gibbs sampling can be divided into two blocks of updates, one including all even layers (including v) and the other including all odd layers. Because of the bipartite DBM connection pattern, given the even layers, the distribution over the odd layers is factorial and thus can be sampled simultaneously and independently as a block, and vice versa. Efficient sampling is especially important for training with the stochastic maximum likelihood algorithm.
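A minimal sketch of the two-block Gibbs update for a DBM with one visible layer v and two hidden layers h1, h2 (biases omitted for brevity; illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
bern = lambda p: (rng.uniform(size=p.shape) < p).astype(float)

def gibbs_step(v, h1, h2, W1, W2):
    # Block 1: the odd layer h1 is factorial given the even layers (v, h2).
    h1 = bern(sigmoid(v @ W1 + h2 @ W2.T))
    # Block 2: the even layers v and h2 are factorial given h1.
    v = bern(sigmoid(h1 @ W1.T))
    h2 = bern(sigmoid(h1 @ W2))
    return v, h1, h2

n_v, n_h1, n_h2 = 6, 4, 3
W1 = 0.01 * rng.standard_normal((n_v, n_h1))
W2 = 0.01 * rng.standard_normal((n_h1, n_h2))
v = bern(np.full(n_v, 0.5)); h1 = np.zeros(n_h1); h2 = np.zeros(n_h2)
for _ in range(100):                   # run the Markov chain
    v, h1, h2 = gibbs_step(v, h1, h2, W1, W2)
print(v, h1, h2)
```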
20.4.1 Interesting Properties¶
Compared to DBNs, the posterior distribution P(h|v) is simpler for DBMs. In DBMs, all the hidden units within a layer are conditionally independent given the other layers. This lack of intralayer interaction makes it possible to use fixed-point equations to optimize the variational lower bound and find the true optimal mean field expectations.
Unfortunate property of DBMs: sampling from them is relatively difficult.
- DBNs: MCMC sampling in their top pair of layers. The other layers are used at the end of the sampling process, in one efficient ancestral sampling pass.
- DBMs: it is necessary to use MCMC across all layers, with every layer of the model participating in every Markov chain transition.
20.4.3 DBM Mean Field Inference¶
- Conditional distribution over one DBM layer given the neighboring layers is factorial.
- The distribution over all hidden layers generally does not factorize because of interaction between layers.
The mean field approximation is a simple form of variational distribution, where we restrict the approximating distribution to fully factorial distributions.
Review Variational Inference and Learning:
Core idea of variational learning: we can maximize L over a restricted family of distributions q.
Common restriction:
This is called the mean field approach.
In the case of approximating the posterior distribution over the hidden units given the visible units, we restrict the approximating family to the set of distributions where the hidden units are conditionally independent. E.g., for a DBM with two hidden layers: let \(Q(h^{(1)}, h^{(2)} | v)\) be the approximation of \(P(h^{(1)}, h^{(2)} | v)\).
Importantly, the inference process must be run again to find a different distribution Q every time we use a new value of v.
Goal of the mean field approach: minimize
We associate the probability of each element of \(h^{(1)}\) with a parameter.
- For each j, \(\hat{h}^{(1)}_j = Q(h^{(1)}_j = 1 | v)\) where \(\hat{h}^{(1)}_j \in [0, 1]\)
- For each k, \(\hat{h}^{(2)}_k = Q(h^{(2)}_k = 1 | v)\) where \(\hat{h}^{(2)}_k \in [0, 1]\)
Now we have
For DBMs with more layers, the approximate posterior parameterization can be extended in the obvious way, updating all the even layers simultaneously and then updating all the odd layers simultaneously, following the same schedule as Gibbs sampling.
Now that we have specified our family of approximating distributions Q, it remains to specify a procedure for choosing the member of this family that best fits P. The most straightforward way is shown below.
Review on 19.4.3:
For the model with continuous latent variable. If we have mean field assumption:
and fix \(q(h_j|v)\) for all \(j \neq i\). Then the optimal \(q(h_i|v)\) may be obtained by normalizing the unnormalized distribution
Applying these general equations, we obtain the update rules
These fixed-point update equations define an iterative algorithm where we alternate between updates of \(\hat{h}^{(1)}_j\) and \(\hat{h}_k^{(2)}\).
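A minimal sketch of these alternating fixed-point updates for the two-hidden-layer case (biases omitted for brevity; illustrative only):

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def mean_field(v, W1, W2, n_iters=10):
    """Alternating fixed-point updates for Q(h1, h2 | v) in a 2-hidden-layer DBM."""
    h1_hat = np.full(W1.shape[1], 0.5)   # Q(h1_j = 1 | v), initialized at 0.5
    h2_hat = np.full(W2.shape[1], 0.5)   # Q(h2_k = 1 | v)
    for _ in range(n_iters):
        h1_hat = sigmoid(v @ W1 + h2_hat @ W2.T)   # update layer 1 given v and h2_hat
        h2_hat = sigmoid(h1_hat @ W2)              # update layer 2 given h1_hat
    return h1_hat, h2_hat

rng = np.random.default_rng(0)
W1 = 0.1 * rng.standard_normal((6, 4))
W2 = 0.1 * rng.standard_normal((4, 3))
v = rng.integers(0, 2, size=6).astype(float)
print(mean_field(v, W1, W2))
```

As noted above, this inference loop must be rerun for every new value of v.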
For a deep Boltzmann machine with two hidden layers, the ELBO is given by
The hardness results for computing the partition function that apply to restricted Boltzmann machines also apply to deep Boltzmann machines.
Evaluating the probability mass function of a Boltzmann machine requires approximate methods such as annealed importance sampling.
Training the model requires an approximation to the gradient of the log partition function. DBMs are typically trained using stochastic maximum likelihood.
- Pseudolikelihood requires the ability to evaluate the unnormalized probabilities, rather than only obtain a variational lower bound on them.
- Contrastive Divergence is slow for DBMs because they do not allow efficient sampling of hidden units given the visible units
Review on Annealed Importance Sampling: 18.7
20.4.4 Layer-Wise Pretraining¶
Training a DBM with stochastic maximum likelihood from a random initialization usually results in failure.
- In some cases, the model fails to learn to represent the distribution adequately.
- In other cases, the DBM may represent the distribution well, but with no higher likelihood than could be obtained with just an RBM.
Greedy layer-wise pretraining: each layer of the DBM is trained in isolation as an RBM. The first layer is trained to model the input data. Each subsequent RBM is trained to model samples from the previous RBM’s posterior distribution. After all the RBMs have been trained in this way, they can be combined to form a DBM. The DBM may then be trained with PCD. Typically, PCD will make only a small change in the model’s parameters and in its performance as measured by the log-likelihood it assigns to the data, or by its ability to classify inputs.
Before inclusion in the DBM, the RBM parameters must be modified. A layer in the middle of the stack of RBMs is trained with only bottom-up input, but after the stack is combined to form the DBM, the layer will have both bottom-up and top-down input. To account for this effect, divide the weights of all but the top and bottom RBMs in half before inserting them into the DBM.
20.4.5 Jointly training Deep Boltzmann Machines¶
Cause: Classic DBMs require greedy unsupervised pretraining and, to perform classification well, require a separate MLP-based classifier on top of the hidden features they extract.
Challenge: Hard to tell how our hyperparameters are working until quite late in the training process.
Two main solutions:
- The centered deep Boltzmann machine, which reparameterizes the model in order to make the Hessian of the cost function better conditioned at the beginning of the learning process. It can be trained without a greedy layer-wise pretraining stage, reaches excellent test-set log-likelihood, and produces high-quality samples. Unfortunately, it remains unable to compete with appropriately regularized MLPs as a classifier.
- The multi-prediction deep Boltzmann machine (MP-DBM), which uses an alternative training criterion that allows the use of the back-propagation algorithm to avoid the problems with MCMC estimates of the gradient. The new criterion does not lead to good likelihood or samples, but compared to the MCMC approach it leads to superior classification performance and the ability to reason well about missing inputs.
Review on Hessian matrices: textbook p84
Energy function of Boltzmann machines:
The centered Boltzmann machine introduces a vector \(\mu\) that is subtracted from all the states:
Typically \(\mu\) is a hyperparameter fixed at the beginning of the training. It is usually chosen to make sure that \(x - \mu \approx 0\) when the model is initialized. Improved conditioning of the Hessian matrix enables learning to succeed, even in difficult cases like training a deep Boltzmann machine with multiple layers.
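A minimal sketch of the centered energy, reusing the Boltzmann machine energy from 20.1 and assuming the centering simply replaces x by \(x - \mu\) in both terms (illustrative only):

```python
import numpy as np

def energy(x, U, b):
    return -x @ U @ x - b @ x               # standard Boltzmann machine energy

def centered_energy(x, U, b, mu):
    # E'(x; U, b) = -(x - mu)^T U (x - mu) - (x - mu)^T b
    xc = x - mu
    return -xc @ U @ xc - b @ xc

rng = np.random.default_rng(0)
d = 5
U = rng.standard_normal((d, d)); U = 0.5 * (U + U.T)
b = rng.standard_normal(d)
x = rng.integers(0, 2, size=d).astype(float)
mu = np.full(d, 0.5)                        # chosen so that x - mu is near 0 initially
print(energy(x, U, b), centered_energy(x, U, b, mu))
```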
The multi-prediction DBM views the mean field equations as defining a family of recurrent networks for approximately solving every possible inference problem. Rather than training the model to maximize the likelihood, the model is trained so that each recurrent network obtains an accurate answer to its corresponding inference problem. Process:
Final loss is typically based on the approximate conditional distribution that the approximate inference network imposes over the missing values.
Back-propagation through the inference graph has two main advantages:
It trains the model as it is really used, with approximate inference.
- The original DBM does not make an accurate classifier on its own; the best classification results with the original DBM were based on training a separate classifier to use features extracted by the DBM.
- The MP-DBM uses inference in the DBM to compute the distribution over the class labels. Mean field inference in the MP-DBM performs well as a classifier without special modifications.
Back-propagation computes the exact gradient of the loss. This is better for optimization than the approximate gradient of SML training, which suffers from both bias and variance.
A disadvantage of back-propagation through the approximate inference graph is that it does not provide a way to optimize the log-likelihood, but rather gives a heuristic approximation of the generalized pseudolikelihood.
Comparison to dropout:
- Dropout shares parameters among many different computational graphs, with the difference between the graphs being whether each unit is included or excluded.
- The MP-DBM also shares parameters across many computational graphs. The difference between the graphs is whether each input unit is observed or not. When a unit is not observed, the MP-DBM does not delete it entirely as dropout does. Instead, the MP-DBM treats it as a latent variable to be inferred.
20.5 Boltzmann Machines for Real-Valued Data¶
20.5.1 Gaussian-Bernoulli RBMs¶
- Binary hidden units
- Real-valued visible units
- Conditional distribution over the visible units is a Gaussian distribution whose mean is a function of the hidden units.
There are many ways to parameterize Gaussian-Bernoulli RBMs:
- covariance matrix
- precision matrix (below)
We wish to have the conditional distribution
f encapsulates all the terms that are a function only of the parameters and not of the random variables in the model. We can discard f because its only role is to normalize the distribution, and the partition function of whatever energy function we choose will carry out that role.
One way to define the energy function:
20.5.2 Undirected Models of Conditional Covariance¶
Challenge: Much of the information content present in natural images is embedded in the covariance between pixels rather than in the raw pixel values. Since the Gaussian RBM only models the conditional mean of the input given the hidden units, it cannot capture conditional covariance information.
Solutions:
- mean and covariance RBM (mcRBM)
- the mean product of Student t-distributions (mPoT) model
- the spike and slab RBM (ssRBM)
Mean and Covariance RBM: 2 groups of hidden units
- mean units. Binary. With v: Gaussian RBM
- covariance units. Binary. With v: covariance RBM
- \(r^{(j)}\) corresponds to the covariance weight vector associated with \(h_j^{(c)}\)
- \(b^{(c)}\) is a vector of covariance offset
It is difficult to train mcRBM via contrastive divergence or persistent contrastive divergence
Mean product of Student t-distributions
Spike and Slab RBMs: 2 sets of hidden units
- Binary spike units h
- Real-valued slab units s
The mean of visible units conditioned on the hidden units is given by \((h \odot s)W^T\).
- Each column \(W_{:, i}\) defines a component that can appear in the input when \(h_i = 1\).
- The corresponding spike variable \(h_i\) determines whether that component is present at all.
- The corresponding slab variable \(s_i\) determines the intensity of that component if it is present.
When the spike variable is active, the corresponding slab variable adds variance to the input along the axis defined by \(W_{:, i}\). Contrastive divergence and persistent contrastive divergence with Gibbs sampling are still applicable.
mcRBM and mPoT:
- use the activation of hidden units \(h_j>0\) to enforce constraints on the conditional covariance in the direction \(r^{(j)}\)
- An overcomplete representation would mean that to capture variation in a particular direction in the observation space would require removing potentially all constraints with positive projection in that direction. This would suggest that these models are less well suited to the overcomplete setting.
ssRBM:
- specifies the conditional covariance of the observations using the hidden spike activations \(h_i=0\) to pinch the precision matrix along the direction specified by the corresponding weight vector.
- In an overcomplete setting, sparse activation with the ssRBM parameterization permits significant variance only in the directions of the sparsely activated \(h_i\).
Disadvantage of the ssRBM: some settings of the parameters can correspond to a covariance matrix that is not positive definite. Such a covariance matrix places more unnormalized probability on values that are farther from the mean, causing the integral over all possible outcomes to diverge.
20.6 Convolutional Boltzmann Machines¶
Review on Conv Net Structure:
To simulate the pooling layer of feedforward convolutional networks, we could introduce a binary pooling unit p over n binary detector units d and enforce \(p = \max_i d_i\) by setting the energy function to be \(\infty\) whenever that constraint is violated.
Challenge: This does not scale well, as it requires evaluating \(2^n\) different energy configurations to compute the normalization constant.
Solution: probabilistic max pooling, which constrains the detector units so that at most one may be active at a time. This means there are only n + 1 total states. The state with all units off is assigned energy 0.
Drawback:
- Probabilistic max pooling does force the detector units to be mutually exclusive, which may be a useful regularizing constraint in some contexts or a harmful limit on model capacity in other contexts.
- It also does not support overlapping pooling regions.
- Usually does not perform as well as a classifier as traditional convolutional networks trained with supervised learning.
- For Boltzmann machines, it is difficult to change the input size for various reasons. Scaling the Boltzmann machine pooling region is awkward: for BMs, large pooling regions become too expensive for the naive approach.
- Pixels at the boundary of the image also pose some difficulty.
20.7 Boltzmann Machines for Structured or Sequential Outputs¶
In the structured output scenario, we wish to train a model that can map from some input x to some output y, and the different entries of y are related to each other and must obey some constraints.
In the sequence modeling scenario, the model must estimate a probability distribution over a sequence of variables \(p(x^{(1)}, \dots, x^{(n)})\). Conditional Boltzmann machines can represent factors of the form \(p(x^{(t)}| x^{(1)}, \dots, x^{(t-1)})\) in order to accomplish this task.
RNN-RBM sequence model: a generative model of a sequence of frames \(x^{(t)}\) consisting of an RNN that emits the RBM parameters for each time step. To train the model, we need to be able to back-propagate the gradient of the loss function through the RNN. The loss function is not applied directly to the RNN outputs; instead, it is applied to the RBM. This means that we must approximately differentiate the loss with respect to the RBM parameters using contrastive divergence or a related algorithm. This approximate gradient may then be back-propagated through the RNN using the usual back-propagation through time algorithm.
20.8 Other Boltzmann Machines¶
Boltzmann machines may be extended with different training criteria:
- Maximize the generative criterion \(\log p(v)\)
- Maximize the discriminative criterion \(\log p(y|v)\)
It is possible to train higher-order (>2) BMs, whose energy function terms involve products of many variables.
- 3-way interaction between a hidden unit and 2 different images can model spatial transformation from one frame of video to the next
- Multiplication by one-hot class variable can change the relationship between visible and hidden units depending on which class is present.
- BM with two groups of hidden units: one group interacts with both v and the class label y, the other interacts only with the v input values.
- Gate some features. Binary mask with variables associated with each visible unit.
20.9 Back-Propagation through Random Operations¶
When developing generative models, we often wish to extend neural networks to implement stochastic transformations of x. Strategy:
- Extra inputs z that are sampled from some simple probability distribution, e.g. uniform or Gaussian
- The neural network can then continue to perform deterministic computation internally
- The function f(x, z) will appear stochastic to an observer who does not have access to z.
We can build elements of the graph on top of the output of sampling distribution.
Reparameterization trick, also known as stochastic back-propagation: consider sampling y from a distribution \(p(y; w)\), with w containing both the parameters \(\theta\) and, if applicable, the inputs x. We can rewrite \(y \sim p(y; w)\) as \(y = f(z; w)\), where z is a source of randomness. We may then compute the derivatives of y w.r.t. w using traditional tools such as the back-propagation algorithm applied to f, as long as f is continuous and differentiable almost everywhere. Crucially, w must not be a function of z, and z must not be a function of w.
20.9.1 Back-Propagation through Discrete Stochastic Operations¶
REINFORCE algorithms: REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility.
Core idea of REINFORCE: even though J(f(z; w)) is a step function with useless derivatives, the expected cost \(E_{z \sim p(z)}[J(f(z;w))]\) is often a smooth function amenable to gradient descent. Although the expectation is typically not tractable when y is high-dimensional, it can be estimated without bias using a Monte Carlo average.
Assumption for the above equation: J does not reference w directly.
- Issue with the simple REINFORCE estimator: it has very high variance
- Solution: variance reduction methods. Idea: modify the estimator so that its expected value remains unchanged but its variance gets reduced. In the context of REINFORCE, this involves computing a baseline that is used to offset J(y). Any offset b(w) that does not depend on y does not change the expectation of the estimated gradient, because:
We can obtain the optimal b(w) by computing the variance of \((J(y) - b(w))\frac{\partial \log p(y)}{\partial w}\) and minimizing it with respect to b(w).
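A minimal sketch of the REINFORCE estimator with a constant baseline, for a single Bernoulli(sigmoid(w)) variable and an arbitrary cost J (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def reinforce_grad(w, J, n_samples=100000, baseline=0.0):
    """Estimate d/dw E_{y ~ Bernoulli(sigmoid(w))}[J(y)] with the score-function trick."""
    p = sigmoid(w)
    y = (rng.uniform(size=n_samples) < p).astype(float)
    score = y - p                 # d log p(y) / dw for a Bernoulli with mean sigmoid(w)
    return np.mean((J(y) - baseline) * score)

J = lambda y: (y - 1.0) ** 2      # step-like cost: 0 when y = 1, 1 when y = 0
w = 0.0                           # so p = 0.5 and the true gradient is -p(1-p) = -0.25
print(reinforce_grad(w, J, baseline=0.0))
print(reinforce_grad(w, J, baseline=0.5))   # the baseline changes the variance, not the mean
```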
20.10 Directed Generative Nets¶
Review: deep belief networks
Deep belief networks:
- Generative models
- Several layers of latent variables, typically binary.
- Visible units: may be binary or real
- No intralayer connection
- The connection between the top 2 layers are undirected
20.10.1 Sigmoid Belief Network¶
In general, we can think of a sigmoid belief network as having a vector of binary states s, with each element of the state influenced by its ancestors:
The most common structure of sigmoid belief network is one divided into many layers, with ancestral sampling proceeding through a series of many hidden layers and then ultimately generating the visible layer. Note that this is not a deep belief network. Such a structure is interesting because it is a universal approximator of probability distributions over the visible units.
- Generating a sample: very efficient
- Inference: Intractable. Mean field inference also intractable.
Solution:
- construct a different lower bound that is specialized for sigmoid belief networks
- Another approach is to use learned inference mechanisms, as described in 19.5.
Special case of a sigmoid belief network: no latent variables. See auto-regressive networks, which generalize this fully visible belief network to other kinds of variables besides binary variables and other structures of conditional distributions besides log-linear relationships.
20.10.2 Differentiable Generator Networks¶
Differentiable generator networks: transform samples of latent variables z into samples x, or into distributions over x, using a differentiable function \(g(z; \theta^{(g)})\), which is typically represented by a neural network. This model class includes
- Variational autoencoders: pair generator with an inference net
- Generative adversarial networks: pair a generator network with a discriminator network
Sampling x, or sampling a distribution over x, defines a distribution \(p_g(x)\) and allows us to train various criteria of \(p_g(x)\) using the reparameterization trick.
Review on 20.9 Reparameterization trick:
When developing generative models, we often wish to extend neural networks to implement stochastic transformations of x. Strategy:
- Extra inputs z that are sampled from some simple probability distribution, e.g. uniform or Gaussian
- The neural network can then continue to perform deterministic computation internally
- The function f(x, z) will appear stochastic to an observer who does not have access to z.
We can build elements of the graph on top of the output of sampling distribution.
Reparameterization trick, also known as stochastic back-propagation: consider sampling y from a distribution \(p(y; w)\), with w containing both the parameters \(\theta\) and, if applicable, the inputs x. We can rewrite \(y \sim p(y; w)\) as \(y = f(z; w)\), where z is a source of randomness. We may then compute the derivatives of y w.r.t. w using traditional tools such as the back-propagation algorithm applied to f, as long as f is continuous and differentiable almost everywhere. Crucially, w must not be a function of z, and z must not be a function of w.
- When the generator defines a conditional distribution over x, it is capable of generating discrete data as well as continuous data.
- When the generator net provides samples directly, it is capable of generating only continuous data. Advantage: we are no longer forced to use conditional distributions whose form can be easily written down and algebraically manipulated by a human designer.
Differentiable generator networks have sufficient model capacity to be good generative models, and optimization algorithms have the ability to fit them. The difficulty lies in determining how to train generator networks when the value of z for each x is not fixed and known ahead of time.
20.10.3 Variational Autoencoders¶
The variational autoencoder, or VAE, is a directed model that uses learned approximate inference and can be trained purely with gradient-based algorithms. The process of generating samples from the model:
- Draw a sample z from a distribution \(p_{model}(z)\)
- The sample z is run through a differentiable generator network g(z)
- x is sampled from distribution \(p_{model}(x;g(z)) = p_{model}(x|z)\)
During training, the approximate inference network (or encoder) q(z | x) is used to obtain z, and \(p_{model}(x | z)\) is then viewed as a decoder network.
Review on variational lower bound
Introduce q as an arbitrary distribution over h
Now we have
Because KL divergence is always nonnegative we have lower bound
For appropriate choices of q, L is tractable to compute. For q(h) (textbook: q(h|v)) that are better approximations of p(h|v), the lower bound is tighter, meaning closer to log p(v). When q(h|v) = p(h|v), the approximation is perfect and \(L(v, \theta, q) = \log p(v; \theta)\).
In the context of VAE
The main idea behind the variational autoencoder is to train a parametric encoder that produces the parameters of q. As long as z is a continuous variable, we can then back-prop through samples of z drawn from \(q(z|x) = q(z;f(x;\theta))\) to obtain a gradient with respect to \(\theta\). Learning then consists solely of maximizing L with respect to the parameters of the encoder and decoder. All the expectations in L may be approximated by Monte Carlo sampling.
Drawback:
- Samples from variational autoencoders trained on images tend to be somewhat blurry.
- Tend to use only a small subset of the dimensions of z, as if the encoder were not able to transform enough of the local distribution in input space where the marginal distribution matches the factorized prior.
VAE framework is straightforward to extend to a wide range of model architectures. This is a key advantage over Boltzman Machines
- The deep recurrent attention writer (DRAW) uses a recurrent encoder and recurrent decoder combined with an attention mechanism.
- Generate sequences by defining a variational RNN, using a recurrent encoder and decoder within the VAE framework.
- Importance-weighted autoencoder
Comparison:
VAE
- The variational autoencoder is defined for arbitrary computational graphs, which makes it applicable to a wider range of probabilistic model families because there is no need to restrict the choice of models to those with tractable mean field fixed-point equations.
- Has the advantage of increasing a bound on the log-likelihood of the model.
- Learns an inference net for only one problem: inferring z given x.
MP-DBM and other approaches that involve back-prop through the approximate inference
- Require an inference procedure, such as mean field fixed-point equations, to provide the computational graph.
- More heuristic and have little probabilistic interpretation beyond making the results of approximate inference accurate.
- Are able to perform approximate inference over any subset of variables given any other subset of variables, because the mean field fixed-point equations specify how to share parameters between the computational graphs for all the different problems.
Nice property of VAE: Simultaneously training a parametric encoder in combination with the generator network forces the model to learn a predictable coordinate system that the encoder can capture.
Example:
- KL Divergence
Consider two multivariate normal distributions
Now we have
Focus on a small part:
For the other term:
So,
Now we look back to variational auto encoder

The goal of the VAE: to find a distribution \(q_{\phi}(z| x)\) over some latent variable z which we can sample from, and then generate new samples from \(x \sim p_{\theta}(x|z)\).
Challenge with a traditional autoencoder: when we feed a trained encoder some unknown vector (not in the training dataset) as x, the decoder cannot decode the resulting z into something meaningful, because the decoder does not expect z to come from a distribution outside the training set. The VAE tries to solve this problem by making sure that encodings drawn from some known probability distribution can be decoded to a reasonable output even if they are not encodings of actual training examples. So instead of forcing the encoder to produce a single encoding, it forces the encoder to produce a probability distribution over the encodings.
- To sample x, we need to know p(x).
- \(p(x) = \int_z p(x | z)p(z)dz\).
- The integral is not in closed form, i.e., it is intractable due to the multiple integrals involved for the latent variable z.
- The alternative is to approximate p(z|x) by another distribution q(z|x), which is defined in such a way that it has a tractable solution.
- This is done by using variational inference.
- The main idea of variational inference is to pose the inference problem as an optimization problem, by modeling p(z|x) using q(z|x), where q(z|x) is a simple distribution such as a Gaussian.
- How to measure the distance between p(z|x) and q(z|x)? In the context of variational inference, KL divergence. If we minimize this distance, we are approximating q(z|x) to p(z|x).
Now we have
And this is VAE objective function
- The 1st term \(E_{z \sim Q_{\phi} (z|x)}(\log p_{\theta}(x|z))\) represents the reconstruction likelihood. Without this term, we are not penalizing the reconstruction error, and the network cannot learn the generation process.
- The 2nd term \(KL[Q_{\phi} (z|x) || p_{\theta} (z)]\) ensures that our learned distribution Q is similar to the prior distribution \(p_{\theta} (z)\). The KL divergence keeps the probability distribution function of the latent variable \(Q_{\phi} (z|x)\) from collapsing to zero variance by penalizing deviation from the prior \(p_{\theta} (z)\). If we kept only the 1st term \(E_{z \sim Q_{\phi} (z|x)}(\log p_{\theta}(x|z))\), the network \(Q_{\phi} (z|x)\) would cheat by learning a narrow distribution with small variance, effectively representing a single value from \(p_{\theta} (z)\). The variational autoencoder would then reduce to a normal autoencoder, because \(Q_{\phi} (z|x)\) becomes a single value instead of a distribution.
On the left side of the equation, we have
- 1st term: \(p_{\theta}(x)\): reconstruction, generator
- 2nd term: \(KL[Q_{\phi} (z|x) || p_{\theta}(z|x)]\): regularizer. It is not chosen by the human designer; instead, it is obtained naturally when deriving the loss function.
The loss function = - objective function. So we have
Now let’s talk more about the second term, \(KL[Q_{\phi} (z|x) || p_{\theta} (z)]\). We can assume
- \(Q_{\phi} (z|x)\) is a Gaussian \(N(\mu_{\phi}(x), \Sigma_{\phi}(x))\)
- \(p_{\theta} (z)\) as a Gaussian \(N(0, 1)\)
Review that we have this derived above.
So we have
And ladies and gentlemen, this is the final loss function:
During training, we need to alternate between two optimization steps:
- Minimize loss respect to \(\theta\), while keep \(\phi\) fixed.
- Minimize loss respect to \(\phi\), while keep \(\theta\) fixed
The second equation is using Monte Carlo Estimate. \(z^{(l)}\) is sampled from \(Q_{\phi} (z|x)\).
The second term is easy. The first term is very hard to estimate because \(\phi\) appears in the distribution with respect to which the expectation is taken. The solution to this problem is to design a transformation \(g_{\phi}\) so that \(z = g_{\phi}(\epsilon, x)\) and \(\epsilon \sim N(0, 1)\). Now we have:
Now we can back-propagate with respect to \(\phi\). Now we can change the first term as
Again, we can use Monte Carlo Estimation.
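A minimal PyTorch-style sketch of one training step with this reparameterized estimator, assuming a Bernoulli decoder; for brevity both \(\phi\) and \(\theta\) are updated in the same step, and all module names are illustrative:

```python
import torch
import torch.nn.functional as F

D, K = 784, 20
enc = torch.nn.Linear(D, 2 * K)     # outputs [mu, log_var] for Q_phi(z|x)
dec = torch.nn.Linear(K, D)         # outputs Bernoulli logits for p_theta(x|z)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def training_step(x):               # x: batch of binarized inputs, shape (B, D)
    mu, log_var = enc(x).chunk(2, dim=-1)
    eps = torch.randn_like(mu)
    z = mu + torch.exp(0.5 * log_var) * eps             # z = g_phi(eps, x)
    recon = F.binary_cross_entropy_with_logits(dec(z), x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    loss = recon + kl                                    # negative of the lower bound
    opt.zero_grad(); loss.backward(); opt.step()         # back-prop through phi and theta
    return loss.item()

x = torch.bernoulli(torch.rand(16, D))                   # toy batch
print(training_step(x))
```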
20.10.4 Generative Adversarial Networks¶
- Generator: directly produces samples \(x = g(z;\theta^{(g)})\)
- Discriminator: attempts to distinguish between samples from the training data and samples drawn from the generator. The discriminator emits a probability value given by \(d(x;\theta^{(d)})\), indicating the probability that x is a real training example rather than a fake sample drawn from the model.
The simplest way to formulate learning in generative adversarial networks is as a zero-sum game, in which a function \(v(\theta^{(g)}, \theta^{(d)})\) determines the payoff of the discriminator. The generator receives \(-v(\theta^{(g)}, \theta^{(d)})\) as its own payoff. During learning, each player attempts to maximize its own payoff, so that at convergence
The default choice for v is
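A minimal sketch of this game, assuming the standard choice \(v = E_{x \sim p_{data}} \log d(x) + E_{z} \log (1 - d(g(z)))\); the discriminator ascends v while the generator descends it (illustrative PyTorch-style code, not the textbook's):

```python
import torch

def value_v(d, g, x_real, z):
    # v(theta_g, theta_d) = E_data[log d(x)] + E_z[log(1 - d(g(z)))]
    return torch.log(d(x_real)).mean() + torch.log(1.0 - d(g(z))).mean()

g = torch.nn.Linear(8, 2)                                           # toy generator
d = torch.nn.Sequential(torch.nn.Linear(2, 1), torch.nn.Sigmoid())  # toy discriminator
opt_d = torch.optim.SGD(d.parameters(), lr=0.1)
opt_g = torch.optim.SGD(g.parameters(), lr=0.1)

x_real = torch.randn(64, 2) + 3.0     # stand-in "data" distribution
z = torch.randn(64, 8)                # noise fed to the generator

# Discriminator step: maximize v (equivalently, minimize -v).
loss_d = -value_v(d, g, x_real, z)
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# Generator step: its payoff is -v, so it minimizes v.
loss_g = value_v(d, g, x_real, z)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```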
The main motivation for the design of GANs is that the learning process requires neither approximate inference nor approximation of a partition function gradient.
Learning in GANs can be difficult in practice when g and d are represented by neural networks and \(\max_d v(g, d)\) is not convex. Note that the equilibria of a minimax game are not local minima of v. Instead, they are points that are simultaneously minima of both players' costs. This means that they are saddle points of v that are local minima with respect to the first player's parameters and local maxima with respect to the second player's parameters. It is possible for the two players to take turns increasing and then decreasing v forever, rather than landing exactly on the saddle point, where neither player is capable of reducing its cost.
Stabilization of GAN learning remains an open problem. Fortunately, GAN learning performs well when the model architecture and hyperparameters are carefully selected.
The GAN training procedure can fit probability distributions that assign zero probability to the training points. Rather than maximizing the log-probability of specific points, the generator net learns to trace out a manifold whose points resemble training points in some way.
Units in discriminator should be stochastically dropped while computing the gradient for the generator network to follow.
20.10.5 Generative Moment Matching Networks¶
Generative moment matching networks are trained with a technique called moment matching: train the generator in such a way that many of the statistics of samples generated by the model are as similar as possible to those of the examples in the training set. In this context, a moment is an expectation of different powers of a random variable.
Generative moment matching networks can be trained by minimizing a cost function called maximum mean discrepancy, or MMD. This cost function measures the error in the first moments in an infinite-dimensional space, using an implicit mapping to a feature space defined by a kernel function to make computations on infinite-dimensional vectors tractable. The MMD cost is zero if and only if the two distributions being compared are equal.
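A minimal sketch of an empirical (biased) squared-MMD estimate with an RBF kernel between a batch of data samples and a batch of model samples (illustrative only):

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)
    sq_dists = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * sq_dists)

def mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD between samples x (data) and y (model)."""
    return (rbf_kernel(x, x, gamma).mean()
            + rbf_kernel(y, y, gamma).mean()
            - 2.0 * rbf_kernel(x, y, gamma).mean())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(256, 2))       # "data" samples
y_same = rng.normal(0.0, 1.0, size=(256, 2))  # model samples from the same distribution
y_diff = rng.normal(3.0, 1.0, size=(256, 2))  # model samples from a different distribution
print(mmd2(x, y_same), mmd2(x, y_diff))       # the second value is much larger
```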
When the batch size is too small, MMD can underestimate the amount of variation in the distributions being sampled.
As with GANs, it is possible to train a generator net using MMD even if that generator net assigns 0 probability to the training points.
20.10.6 Convolutional Generative Networks¶
Conv nets for recognition: information flows from the image to some summarization layer at the top of the network, becoming more invariant to nuisance transformations. In a generator network, the opposite is true: rich detail must be added as the representation of the image to be generated propagates through the network, culminating in the final representation of the image, which is of course the image itself, with object positions, poses, textures, and lighting.
The primary mechanism for discarding information in a convolutional network is the pooling layer. We cannot put the inverse of the pooling layer into the generator network because most pooling functions are not invertible. An approach that seems to perform acceptably is to use an "un-pooling" layer. This layer corresponds to the inverse of the max-pooling operation under certain simplifying conditions (listed below, with a sketch after the list):
- The stride of the max-pooling operation is constrained to be equal to the width of the pooling region.
- The maximum input within each pooling region is assumed to be the input in the upper-left corner.
- All nonmaximal inputs within each pooling region are assumed to be 0.
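A minimal sketch of the un-pooling operation implied by these assumptions (illustrative):

```python
import numpy as np

def unpool(h, k=2):
    """Place each pooled value in the upper-left corner of its k x k region;
    all other entries of the region are set to 0."""
    out = np.zeros((h.shape[0] * k, h.shape[1] * k))
    out[::k, ::k] = h
    return out

h = np.array([[1.0, 2.0],
              [3.0, 4.0]])
print(unpool(h))
# [[1. 0. 2. 0.]
#  [0. 0. 0. 0.]
#  [3. 0. 4. 0.]
#  [0. 0. 0. 0.]]
```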
Even though the assumptions motivating the definition of the un-pooling operator are unrealistic, the subsequent layers are able to learn to compensate for its unusual output, so the samples generated by the model as a whole are visually pleasing.
20.10.7 Auto-regressive Networks¶
Auto-regressive networks are directed probabilistic models with no latent random variables. They decompose a joint probability over the observed variables using the chain rule of probability to obtain a product of conditionals of the form \(P(x_d|x_{d-1}, \dots, x_1)\). Such models have been called fully visible Bayes networks and used successfully in many forms.
20.10.8 Linear Auto-Regressive networks¶
The simplest form of auto-regressive network has no hidden units and no sharing of parameters or features. Each \(P(x_i|x_{i-1}, ... x_1)\) is parameterized as a linear model
- Linear regression: real valued data
- Logistic regression: binary data
- Softmax regression: discrete data
This model has \(O(d^2)\) parameters when there are d variables to model.
Linear auto-regression networks are essentially the generalization of linear classification models to generative modeling. They therefore have the same advantages and disadvantages as linear classification.
- They can be trained with convex loss functions and sometimes admit closed-form solutions.
- The model itself does not offer a way of increasing its capacity, so capacity must be raised using techniques like basis expansions of the input or the kernel trick.
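A minimal sketch of a logistic (binary) auto-regressive model, showing the \(O(d^2)\) parameterization, ancestral sampling, and exact log-probability evaluation (illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

d = 5
W = 0.1 * rng.standard_normal((d, d))   # only the strictly lower triangle is used
b = np.zeros(d)                         # plus one bias per conditional: O(d^2) parameters

def sample():
    x = np.zeros(d)
    for i in range(d):
        # P(x_i = 1 | x_{i-1}, ..., x_1) is a logistic regression on the previous values.
        p_i = sigmoid(b[i] + W[i, :i] @ x[:i])
        x[i] = float(rng.uniform() < p_i)
    return x

def log_prob(x):
    lp = 0.0
    for i in range(d):
        p_i = sigmoid(b[i] + W[i, :i] @ x[:i])
        lp += x[i] * np.log(p_i) + (1 - x[i]) * np.log(1 - p_i)
    return lp

x = sample()
print(x, log_prob(x))
```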
20.10.9 Neural auto-regressive network¶
Neural auto-regressive networks have the same left-to-right graphical model as logistic auto-regressive networks (figure 20.8 above). The new parametrization is more powerful in the sense that its capacity can be increased as much as needed, allowing approximation of any joint distribution. It can also improve generalization by introducing a parameter sharing and feature sharing principle common to deep learning in general. By using a neural network, two advantages are obtained:
- The parametrization of each \(P(x_i | x_{i-1}, \dots, x_1)\) by a neural network with (i - 1) × k inputs and k outputs (if the variables are discrete and take k values, encoded one-hot) enables one to estimate the conditional probability without requiring an exponential number of parameters (and examples), yet is still able to capture high-order dependencies between the random variables.
- Instead of having a different neural network for the prediction of each \(x_i\), a left-to-right connectivity allows one to merge all the neural networks into one. Equivalently, it means that the hidden layer features computed for predicting \(x_i\) can be reused for predicting \(x_{i+k}\).
You can extend the model to continuous variables or joint distributions involving both discrete and continuous variables.
20.10.10 NADE¶
The neural auto-regressive density estimator (NADE): the connectivity is the same as for the original neural auto-regressive network, but NADE introduces an additional parameter sharing scheme. The parameters of the hidden units of different groups j are shared.
Forward propagation in a NADE model loosely resembles the computations performed in mean field inference to fill in missing inputs in an RBM. This mean field inference corresponds to running a recurrent network with shared weights, and the first step of that inference is the same as in NADE. The only difference is that with NADE, the output weights connecting the hidden units to the output are parametrized independently from the weights connecting the input units to the hidden units. In the RBM, the hidden-to-output weights are the transpose of the input-to-hidden weights.
Another extension of the neural auto-regressive architectures gets rid of the need to choose an arbitrary order for the observed variables. Since many orderings of the variables are possible and each ordering o yields a different p(x | o), we can form an ensemble of models over many values of o:
This ensemble model usually generalizes better and assigns higher probability to the test set than does an individual model defined by a single ordering.
20.11 Drawing Samples from Autoencoders¶
- Some kinds of autoencoders, such as variational autoencoders, explicitly represent a probability distribution and admit straightforward ancestral sampling.
- Most other kinds of autoencoders require MCMC sampling.
20.11.1 Markov Chain Associated with Any Denoising Autoencoder¶
Generalized denoising autoencoders are specified by a denoising distribution for sampling an estimate of the clean input given the corrupted input. Each step of the Markov chain that generates from the estimated distribution consists of the following steps:
- Starting from the previous state x, inject corruption noise, sampling \(\hat{x}\) from \(C(\hat{x} | x)\)
- Encode \(\hat{x}\) into \(h=f(\hat{x})\)
- Decode h to obtain the parameters \(w=g(h)\) of \(p(x|w = g(h)) = p(x|\hat{x})\)
- Sample the next state x from \(p(x|w = g(h)) = p(x|\hat{x})\)
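A minimal sketch of this generative Markov chain, assuming a Gaussian corruption process C and a decoder whose output parameterizes the mean of a Gaussian \(p(x|\hat{x})\); the encoder/decoder stand-ins below are placeholders for a trained denoising autoencoder:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_chain(x0, encode, decode, corrupt_std=0.5, recon_std=0.1, n_steps=100):
    """Alternate corruption and probabilistic reconstruction to generate samples."""
    x = x0
    samples = []
    for _ in range(n_steps):
        x_hat = x + corrupt_std * rng.standard_normal(x.shape)  # sample from C(x_hat | x)
        h = encode(x_hat)                                        # h = f(x_hat)
        w = decode(h)                                            # parameters of p(x | x_hat)
        x = w + recon_std * rng.standard_normal(x.shape)         # sample the next state x
        samples.append(x)
    return np.array(samples)

encode = lambda x: np.tanh(x)        # stand-in encoder
decode = lambda h: h                 # stand-in decoder
print(sample_chain(np.zeros(3), encode, decode).shape)   # (100, 3)
```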
20.11.2 Clamping the Conditional Sampling¶
Similar to Boltzmann machines, denoising autoencoders and their generalizations can be used to sample from a conditional distribution \(p(x_f|x_o)\), simply by clamping the observed units \(x_o\) and resampling the free units \(x_f\) given \(x_o\) and the sampled latent variables (if any).
20.11.3 Walk-Back Training Procedure¶
The walk-back training procedure was proposed as a way to accelerate the convergence of generative training of denoising autoencoders. It consists of multiple stochastic encoder-decoder steps, initialized at a training example, and penalizes the last probabilistic reconstruction (or all the reconstructions along the way).
20.12 Generative Stochastic Networks¶
Generative stochastic networks (GSNs) are generalizations of denoising autoencoders that include latent variables h in the generative Markov chain, in addition to the visible variables.
A GSN is parameterized by two conditional probability distributions that specify one step of the Markov chain:
- \(p(x^{(k)} | h^{(k)})\) tells how to generate the next visible variable given the current latent state.
- \(p(h^{(k)} | h^{(k-1)}, x^{(k-1)})\) tells how to update the latent state variable, given the previous latent state and visible variable.
Denoising autoencoders and GSNs differ from classical probabilistic models (directed or undirected) in that they parameterize the generative process itself rather than the mathematical specification of the joint distribution of visible and latent variables. Instead, the latter is defined implicitly, if it exists, as the stationary distribution of the generative Markov chain.
20.12.1 Discriminant GSNs¶
The original formulation of the GSN was meant for unsupervised learning and implicitly modeling p(x) for observed data x, but it is possible to modify the framework to optimize p(y|x).
20.13 Other Generation Schemes¶
- MCMC sampling
- Ancestral sampling
These are the most popular approaches to generative modeling, but they are by no means the only ones. Other approaches:
- Diffusion inversion: based on the idea that the probability distribution we wish to sample from has structure. This structure can gradually be destroyed by a diffusion process that incrementally changes the probability distribution to have more entropy. To form a generative model, we can run the process in reverse, by training a model that gradually restores the structure, starting from an unstructured distribution. By iteratively applying a process that brings a distribution closer to the target one, we can gradually approach that target distribution. The model is defined to be the probability distribution produced by the iterative procedure.
- Approximate Bayesian Computation: samples are rejected or modified to make the moments of selected functions of the samples match those of the desired distribution.
20.14 Evaluating Generative Models¶
Often, we cannot evaluate log p(x) under the model, but only an approximation. In these cases, it is important to think and communicate clearly about what exactly is being measured.
Evaluation metrics are often hard research problems in and of themselves.
Other fields of machine learning usually allow for some variation in the preprocessing of the data. Example: the MNIST dataset; check the textbook for details.
Practitioners often evaluate generative models by visually inspecting the samples. Unfortunately, it is possible for a very poor probabilistic model to produce very good samples.
Since the visual quality of samples is not reliable, we may instead evaluate the log-likelihood the model assigns to the test data. Unfortunately, in some cases the likelihood seems not to measure any attribute of the model that we really care about. Example: a model can assign high likelihood to the MNIST dataset mostly because of the constant background pixels.
There are many different uses of generative models, and the choice of metrics must match the intended use of the model.