Welcome to Projects and Notes’s documentation!¶
IMAGENET-TRAINED CNNS ARE BIASED TOWARDS TEXTURE; INCREASING SHAPE BIAS IMPROVES ACCURACY AND ROBUSTNESS¶
Authors: Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, Wieland Brendel
Links:
Motivation¶
Two contradictory hypotheses: Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. However, some recent studies hint at a more important role for image textures:
- CNNs can still classify texturised images perfectly well, even if the global shape structure is completely destroyed (Gatys et al., 2017; Anonymous, 2018)
- Standard CNNs are bad at recognising object sketches where object shapes are preserved yet all texture cues are missing (Ballester & de Araujo, 2016)
- Two studies suggest that local information such as textures may actually be sufficient to “solve” ImageNet object recognition:
- Gatys et al. (2015) discovered that a linear classifier on top of a CNN’s texture representation (Gram matrix) achieves hardly any classification performance loss compared to the original network performance (a short sketch of the Gram-matrix representation follows this list).
- Anonymous (2018) demonstrated that CNNs with explicitly constrained receptive field sizes throughout all layers are able to reach surprisingly high accuracies on ImageNet, even though this effectively limits a model to recognising small local patches rather than integrating object parts for shape recognition.
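The Gram-matrix texture representation mentioned above is easy to write down. The following is a minimal, illustrative sketch (not Gatys et al.'s code), assuming the feature map of one CNN layer is available as a PyTorch tensor:

```python
# Illustrative sketch of the Gram-matrix texture descriptor from Gatys et al.:
# pairwise correlations between the channels of one CNN layer's feature map.
import torch

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """features: (channels, height, width) activations from one layer."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)    # each row is one channel, flattened spatially
    return flat @ flat.t() / (h * w)     # (channels, channels); spatial layout is discarded

# Because the spatial arrangement is averaged out, a classifier on top of these
# features only sees texture-like statistics, not global shape.
```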
Do we need to revise the way we think about how neural networks recognize objects?
Beyond the shape, objects typically have a more or less distinctive color, size and texture. All of these factors could be harnessed by a neural network to recognize objects. While color and size are usually not unique to a certain object category, almost all objects have texture-like elements if we look at small regions — even cars, for instance, with their tyre profile or metal coating.

We also know that neural networks happen to have a powerful texture representation without ever being trained to acquire one. This becomes evident in style transfer: networks trained only for object recognition capture image textures well enough to transfer them between images.
Experiments¶
The authors put these conflicting hypotheses to a quantitative test by comparing human behaviour with CNN predictions. The results show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes.

General settings¶
- 16 categories: airplane, bear, bicycle, bird, boat, bottle, car, cat, chair, clock, dog, elephant, keyboard, knife, oven and truck. The 1,000 ImageNet class predictions were mapped to these 16 categories using the WordNet hierarchy (a sketch of such a mapping follows this list).
- Four CNNs pre-trained on standard ImageNet, namely AlexNet (Krizhevsky et al., 2012), GoogLeNet (Szegedy et al., 2015), VGG-16 (Simonyan & Zisserman, 2015) and ResNet-50 (He et al., 2015).
- Only object and texture images that were correctly classified by all four networks were selected. This ensures that any mistakes the CNNs make on the manipulated stimuli can be attributed to the image manipulations themselves.
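The paper maps ImageNet classes to the 16 entry-level categories via the WordNet hierarchy; the sketch below illustrates one way such a mapping could be implemented with NLTK's WordNet interface. It is not the authors' code: the chosen category synsets, the example WordNet id and the NLTK setup (a recent NLTK with the WordNet corpus downloaded) are assumptions.

```python
# Illustrative sketch (not the authors' mapping code): assign an ImageNet class, given
# by its WordNet id, to an entry-level category by walking up the hypernym hierarchy.
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet") beforehand

# Hypothetical choice of one WordNet synset per entry-level category.
CATEGORY_SYNSETS = {
    "dog": wn.synset("dog.n.01"),
    "cat": wn.synset("cat.n.01"),
    "car": wn.synset("car.n.01"),
    "chair": wn.synset("chair.n.01"),
    # ... the remaining categories would be added analogously
}

def wnid_to_synset(wnid: str):
    """Convert an ImageNet WordNet id such as 'n02123045' into an NLTK synset."""
    return wn.synset_from_pos_and_offset("n", int(wnid[1:]))

def map_to_category(wnid: str):
    """Return the entry-level category whose synset is a hypernym of the given class."""
    synset = wnid_to_synset(wnid)
    hypernyms = set(synset.closure(lambda s: s.hypernyms())) | {synset}
    for name, cat_synset in CATEGORY_SYNSETS.items():
        if cat_synset in hypernyms:
            return name
    return None  # class is not covered by the 16 categories

print(map_to_category("n02123045"))  # tabby cat -> 'cat'
```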
Details¶
Results¶
When object outlines were filled in with black colour to generate silhouettes, CNN recognition accuracies were much lower than human accuracies. This was even more pronounced for edge stimuli, indicating that human observers cope much better with images that contain little to no texture information.
One confound in these experiments is that CNNs tend not to cope well with domain shifts, i.e. the large change in image statistics from natural images (on which the networks have been trained) to sketches (which the networks have never seen before).
Human observers show a striking bias towards responding with the shape category (95.9% of correct decisions). This pattern is reversed for CNNs, which show a clear bias towards responding with the texture category (VGG-16: 17.2% shape vs. 82.8% texture; GoogLeNet: 31.2% vs. 68.8%; AlexNet (which uses larger convolution kernels): 42.9% vs. 57.1%; ResNet-50: 22.1% vs. 77.9%).
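These percentages come from cue-conflict stimuli whose shape cue and texture cue point to different categories; only decisions matching one of the two cues count as correct. A minimal bookkeeping sketch, assuming a simple list of (prediction, shape category, texture category) triples rather than the paper's actual data format:

```python
# Illustrative bookkeeping for cue-conflict trials (assumed data format, not the
# paper's evaluation code): the shape bias is the shape share of correct decisions.
def shape_texture_fractions(trials):
    """trials: iterable of (predicted, shape_category, texture_category) triples."""
    shape_hits = texture_hits = 0
    for predicted, shape_cat, texture_cat in trials:
        if predicted == shape_cat:
            shape_hits += 1
        elif predicted == texture_cat:
            texture_hits += 1
        # predictions matching neither cue are ignored: only "correct" decisions count
    correct = shape_hits + texture_hits
    return shape_hits / correct, texture_hits / correct

trials = [("cat", "cat", "elephant"), ("elephant", "cat", "elephant"), ("cat", "cat", "dog")]
print(shape_texture_fractions(trials))  # roughly (0.67, 0.33): two shape vs. one texture decision
```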

OVERCOMING THE TEXTURE BIAS OF CNNS¶
The experiments suggest that ImageNet-trained CNNs, but not humans, exhibit a strong texture bias. One reason might be the training task itself: it might simply suffice to integrate evidence from many local texture features rather than going through the process of integrating and classifying global shapes.
In order to test this hypothesis, a ResNet-50 is trained on the Stylized-ImageNet (SIN) dataset, in which the object-related local texture information is replaced with the uninformative style of randomly selected artistic paintings.

However, the same standard architecture (ResNet-50) that learns a texture-based representation on ImageNet is able to learn a shape-based representation instead when trained on Stylized-ImageNet.


Shape-ResNet surpasses the vanilla ResNet in terms of top-1 and top-5 ImageNet validation accuracy as reported in Table 2. This indicates that SIN may be a useful data augmentation on ImageNet that can improve model performance without any architectural changes.
The representations of each model are tested as backbone features for Faster R-CNN (Ren et al., 2017) on Pascal VOC 2007. Incorporating SIN in the training data substantially improves object detection performance from 70.7 to 75.1 mAP50.
It is also systematically tested how model accuracies degrade when images are distorted by uniform or phase noise, contrast changes, high- and low-pass filtering, or eidolon perturbations. The SIN-trained network outperforms the IN-trained CNN on almost all image manipulations.
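A sketch of such a robustness evaluation, illustrative rather than the paper's protocol: the distortion implementations (uniform noise, Gaussian low-pass filtering), the parameter ranges and the assumed `model` and `val_loader` objects are all placeholders.

```python
# Illustrative sketch: measure how classification accuracy degrades under simple
# parametric distortions such as uniform noise and low-pass (Gaussian blur) filtering.
import torch
import torchvision.transforms.functional as TF

def add_uniform_noise(images, width):
    """Add uniform noise in [-width, width] and clamp back to the valid [0, 1] range."""
    noise = (torch.rand_like(images) * 2 - 1) * width
    return (images + noise).clamp(0, 1)

def low_pass(images, sigma):
    """Gaussian blur as a simple low-pass filter (odd kernel size derived from sigma)."""
    kernel = int(2 * round(3 * sigma) + 1)
    return TF.gaussian_blur(images, kernel_size=kernel, sigma=sigma)

@torch.no_grad()
def accuracy_under(model, loader, distort, device="cpu"):
    model.eval()
    correct = total = 0
    for images, labels in loader:
        images, labels = distort(images).to(device), labels.to(device)
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

# Usage (assumes `model` and `val_loader` already exist):
# for width in [0.0, 0.1, 0.2, 0.4]:
#     print(width, accuracy_under(model, val_loader, lambda x: add_uniform_noise(x, width)))
```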

Are All Training Examples Created Equal? An Empirical Study¶
Authors: Kailas Vodrahalli, Ke Li, Jitendra Malik
Links: Original paper: https://arxiv.org/pdf/1811.12569.pdf
Abstract¶
Modern computer vision algorithms often rely on very large training datasets. However, it is conceivable that a carefully selected subsample of the dataset is sufficient for training. In this paper, we propose a gradient-based importance measure that we use to empirically analyze relative importance of training images in four datasets of varying complexity.
We find that in some cases, a small subsample is indeed sufficient for training. For other datasets, however, the relative differences in importance are negligible. These results have important implications for active learning on deep networks. Additionally, our analysis method can be used as a general tool to better understand diversity of training examples in datasets.
Methods¶
We conduct this analysis by computing the gradient magnitude of the loss corresponding to each individual training image at the end of training to determine a relative importance score for each image. We then retrain our network on subsets of the data selected based on our importance measure to determine how well these subsets capture the distribution of the entire dataset.
The magnitude of change in the parameters is directly related to how important/informative the current batch of training data is. If the current batch is important, then the model should change significantly after seeing the current batch. On the other hand, if it is not important, the model should remain almost the same.
Analysis steps¶
- Train the network using the entirety of the training data, using validation data for early stopping. Log the test accuracy.
- Compute the gradient of the loss with respect to the network parameters for each individual training image. We will use the magnitudes of these gradients to subsample the data in the next step (see the sketch after this list).
- Now retrain the network from a random initialization using a subsampled portion of the data. Then log the test accuracy as a measure of how well the subset represents the entirety of the dataset.
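A minimal sketch of the importance-scoring step referenced in the list above, assuming a PyTorch classifier and a map-style dataset of (image, label) pairs; this illustrates the gradient-magnitude idea and is not the authors' implementation:

```python
# Per-example importance score: the norm of the gradient of the loss at the trained
# parameters, computed one training example at a time.
import torch
import torch.nn.functional as F

def example_importance(model, dataset, device="cpu"):
    """Return one gradient-magnitude score per training example."""
    model.to(device).eval()
    scores = []
    for image, label in dataset:
        model.zero_grad()
        logits = model(image.unsqueeze(0).to(device))
        loss = F.cross_entropy(logits, torch.tensor([label], device=device))
        loss.backward()  # gradients of this single example's loss w.r.t. all parameters
        grad_sq = sum((p.grad ** 2).sum() for p in model.parameters() if p.grad is not None)
        scores.append(float(grad_sq.sqrt()))
    return scores
```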
Batch Selection¶
- Random: (baseline) We randomly select the given number of images from all training images.
- Max-Gradient: (straight-forward) We select images in descending order by their gradient magnitude until we reach the given number of images.
- Non-extreme Max-Gradient: (remove outliers) We order images by their gradient magnitude in descending order. Then we discard the top 5% of images, and proceed to select images in order until we reach the given number of images.
- Gradient-CDF: (introduce randomness) Here, 'CDF' stands for 'cumulative distribution function'. We use the gradient magnitudes to induce a probability mass function (PMF) over the training images. We subsequently use the resulting distribution to sample, without replacement, the given number of images. (A sketch of all four selection rules follows this list.)
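The four selection rules can be summarised in a few lines. The sketch below assumes `scores` holds the per-example gradient magnitudes from the previous step and `k` is the subset size; it is illustrative rather than the authors' code:

```python
# Illustrative implementation of the four subset-selection strategies.
import numpy as np

def select(scores, k, strategy="random", discard_frac=0.05, rng=None):
    rng = rng or np.random.default_rng(0)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)                      # indices, descending gradient magnitude
    if strategy == "random":                         # baseline
        return rng.choice(len(scores), size=k, replace=False)
    if strategy == "max_gradient":                   # largest gradients first
        return order[:k]
    if strategy == "nonextreme_max_gradient":        # drop the top 5% as outliers, then as above
        return order[int(discard_frac * len(scores)):][:k]
    if strategy == "gradient_cdf":                   # gradient magnitudes induce a PMF
        pmf = scores / scores.sum()
        return rng.choice(len(scores), size=k, replace=False, p=pmf)
    raise ValueError(f"unknown strategy: {strategy}")
```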
Results¶
On MNIST, subsampling by maximum gradient gives higher performance, suggesting that there is indeed redundancy in the dataset. However, when we move to CIFAR-10 and CIFAR-100, random sampling performs better than Max-Gradient and is most closely matched by Gradient-CDF, which is simply weighted random sampling. CIFAR seems to be diverse in the sense that it is not redundant.
We also note that gradient-based sampling may not always be optimal. Sampling by gradient skews the class distribution when we order by gradient magnitude, which in turn makes generalization more difficult. Note that in CIFAR-100, Non-extreme Max-Gradient results in lower test accuracy, while in CIFAR-10 it achieved roughly the same performance as Random. This difference may be due to CIFAR-100 having 10 times as many classes, and so the issue of image distribution skew is exacerbated. We see the same behavior for ImageNet.
doc2vec¶
Papers¶
Distributed Representations of Sentences and Documents [1]
Authors: Quoc Le, Tomas Mikolov
Abstract: In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models.
The most common fixed-length vector representations for texts are bag-of-words and bag-of-n-grams. Disadvantage: the word order is lost. Bag-of-n-grams tries to solve this issue but suffers from data sparsity and high dimensionality, and both methods capture very little of the semantics of the words.
Paragraph Vector: the vector representation is trained to be useful for predicting words in a paragraph. The paragraph vector is concatenated with several word vectors from the paragraph to predict the following word in the given context. Both word vectors and paragraph vectors are trained by stochastic gradient descent and backpropagation. While paragraph vectors are unique among paragraphs, the word vectors are shared. At prediction time, the paragraph vector of a new paragraph is inferred by fixing the word vectors and training the new paragraph vector until convergence.
Paragraph Vector is capable of constructing representations of input sequences of variable length. Unlike some of the previous approaches, it is general and applicable to texts of any length: sentences, paragraphs, and documents.
Paragraph Vector: A distributed memory model. The paragraph vectors are also asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph.
The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph. For this reason, we often call this model the Distributed Memory Model of Paragraph Vectors (PV-DM).
After being trained, the paragraph vectors can be used as features for the paragraph.
Advantages: Paragraph Vectors inherit an important property of word vectors, namely the semantics of the words. They also take word order into consideration, at least within a small context.
Paragraph Vector without word ordering: Distributed bag of words (PV-DBOW). Ignore the context words in the input, but force the model to predict words randomly sampled from the paragraph in the output. In practice, this means that at each iteration of stochastic gradient descent we sample a text window, then sample a random word from that window and form a classification task given the Paragraph Vector.
PV-DM alone usually works well for most tasks (with state-of-the-art performance), but its combination with PV-DBOW is usually more consistent across many tasks and is therefore strongly recommended.
[1] Quoc Le and Tomas Mikolov. Distributed Representations of Sentences and Documents. http://arxiv.org/pdf/1405.4053v2.pdf
Implementation¶
gensim¶
https://radimrehurek.com/gensim/models/doc2vec.html
class gensim.models.doc2vec.Doc2Vec(documents=None, dm_mean=None, dm=1, dbow_words=0, dm_concat=0, dm_tag_count=1, docvecs=None, docvecs_mapfile=None, comment=None, trim_rule=None, callbacks=(), **kwargs)
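A minimal usage sketch on a toy corpus. The corpus contents and hyperparameter values are illustrative; parameter names follow recent gensim releases (older versions use `size` instead of `vector_size` and expose document vectors via `model.docvecs` rather than `model.dv`):

```python
# Minimal, illustrative doc2vec usage with gensim (toy corpus, arbitrary hyperparameters).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    "convolutional networks are biased towards texture",
    "paragraph vectors learn fixed length representations of documents",
    "gradient magnitudes can rank the importance of training examples",
]
# Each document becomes a TaggedDocument; integer tags index the paragraph vectors.
documents = [TaggedDocument(words=text.split(), tags=[i]) for i, text in enumerate(corpus)]

# dm=1 selects PV-DM (distributed memory); dm=0 would select PV-DBOW instead.
model = Doc2Vec(documents, vector_size=50, window=2, min_count=1, epochs=40, dm=1)

print(model.dv[0])  # trained paragraph vector of the first document

# At prediction time, a vector for an unseen paragraph is inferred by keeping the
# word vectors fixed and optimising only the new paragraph vector.
new_vec = model.infer_vector("distributed memory model of paragraph vectors".split())
print(new_vec[:5])
```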