We are exploring the nature of equivariance, a concept that is now closely associated with the capsules network architecture (see key papers Sabour et al, and Hinton et al). Machine learning representations that capture equivariance must learn the way that patterns in the input vary together, in addition to statistical clusters in the input (that a typical neural network discovers). This is very useful, because in theory it should improve generalisation.
In the process we have found hints that conventional discriminative training of networks causes a pathological form of “brittleness”. Capsules networks seem to avoid or at least mitigate this issue. As shown below, in some ways the output of conventionally trained networks is worse than that of an untrained network of the same geometry!
This article will describe the performance of a set of related architectures, derived from standard feed-forward convolutional networks. Some of the architectures are intended to be equivariant; others are not. By exploring which features give rise to equivariant representations, we hope to refine our understanding of how to achieve equivariance.
What is equivariance?
To understand equivariance, it is helpful to first consider invariance. Invariance is often a stated goal of representation-learners: Perhaps we want to output a ‘1’ value from our classifier whenever a dog is observed. Since the output is identical whenever a dog is present, we can say the classifier is “invariant” to changes in dog appearance or pose. But invariance doesn’t communicate much information about the entity.
By contrast, a ‘dog’ capsule won’t output the same value[s] whenever a dog is observed. Instead, it will output a set of varying parameters that describe the current pose, appearance or other salient features of the dog. These parameters are equivariant – they are not invariant. (You can find a formal definition of equivariance here). By varying systematically, an equivariant representation of a dog communicates much useful information about the dog being observed.
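To make the distinction concrete: a feature f is usually called equivariant to a transformation T if f(T(x)) = T'(f(x)) for some corresponding transformation T' of the output, and invariant in the special case where T' does nothing. The toy sketch below is an illustration only and is not part of the experiments; the 1-D "image" and the helper names `shift`, `total_intensity` and `centroid` are all made up for this example.

```python
def shift(signal, k):
    """Translate a 1-D signal k steps to the right (circularly)."""
    k %= len(signal)
    return signal[-k:] + signal[:-k] if k else list(signal)


def total_intensity(signal):
    """Invariant feature: translation does not change it."""
    return sum(signal)


def centroid(signal):
    """Equivariant feature: it moves by the same amount as the input.

    This holds only while the blob does not wrap around the edge.
    """
    return sum(i * v for i, v in enumerate(signal)) / sum(signal)


blob = [0, 1, 2, 1, 0, 0, 0, 0]
moved = shift(blob, 3)

print(total_intensity(blob), total_intensity(moved))  # same value twice
print(centroid(blob), centroid(moved))                # second is first + 3
```

The invariant feature tells us only that the blob is present; the equivariant one also tells us where it is.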
Prerequisites for equivariance
Our results suggest that four factors are necessary to enable a representation to capture equivariance:
- Sparsity. By only activating a subset of the cells for any given input, sparsity enables specialization. Cells can learn to represent only certain related input, and subsequently specialize in capturing the variation in the relevant subset.
- Group selection. It is possible that dynamic cliques of sparsely activated cells could discover and model equivariance, but we have not been able to achieve this. Instead, by dividing cells into arbitrary, fixed groups (aka capsules) and selecting at the group level, the cells in the group can learn to collectively model their common input subset and jointly represent all observations of the subset. Within the group, representational responsibility must be distributed – parameters should capture different aspects of the variation observed. This happens as a side-effect of training by gradient descent, because weight update magnitude is a function of hidden layer activity (for a single input, cells with larger activity values are changed faster).
- Consensus. A capsule should represent the possible configurations of an entity or class, such as a dog. We want each capsule to model different forms of the same thing, not different forms of different things, or even worse, the same forms of different things! But entities are like latent variables – they are not directly observable and must be inferred given evidence. We therefore face the problem of grouping observations into related subsets; well-chosen groupings will make it easier to find parameters that efficiently describe the subset. The purpose of consensus is to reliably identify entities in the input by exploiting information from throughout the network to make local decisions about which capsule[s] best represent current input. Selected capsules can then be trained to represent these specific entities, and specialize in capturing the observed variants of the entities they represent, rather than unrelated variation in the input in general. In capsules papers thus far, consensus is achieved by the routing mechanism, in which capsules “vote” for particular interpretations of the input by producing “predictions” of a consensus “pose” (aka parameter configuration) of entities in the next layer. As a result, in theory Capsule learning can be unsupervised.
- Competition. We tried non-competitive normalization between capsules (e.g. using the long-term “layer normalization” approach of Ba et al. (2016)), but this consistently removed the equivariance effect. We also note that the existing capsules papers all use a normalization step with a fixed total output over all competing capsules (e.g. Softmax). So without additional evidence to the contrary, it appears that explicitly competitive output normalization is necessary: capsules must participate in a zero-sum game. The underlying reason may be to force interpreting layers to consider the output of capsules relative to other capsules, as opposed to fixed thresholds.
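Two of the factors above, group selection and competition, can be sketched in a few lines. This is a generic illustration under stated assumptions, not the PredCaps algorithm or the routing of the capsules papers: capsule strength is assumed to be the L2 norm of the group, selection is assumed to be a simple top-k, and the function names are hypothetical.

```python
import math


def capsule_norms(cells, params_per_capsule):
    """Split a flat list of cell activities into fixed groups (capsules)
    and return the groups together with each group's L2 norm."""
    groups = [cells[i:i + params_per_capsule]
              for i in range(0, len(cells), params_per_capsule)]
    return groups, [math.sqrt(sum(v * v for v in g)) for g in groups]


def select_top_k(groups, norms, k):
    """Group-level sparsity: keep the k strongest capsules, zero the rest."""
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    keep = set(ranked[:k])
    return [g if i in keep else [0.0] * len(g) for i, g in enumerate(groups)]


def softmax(values):
    """Competitive normalization: the outputs always sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


cells = [0.1, 0.2, 0.9, 0.8, 0.0, 0.1, 0.5, 0.4]  # 4 capsules x 2 parameters
groups, norms = capsule_norms(cells, params_per_capsule=2)
sparse = select_top_k(groups, norms, k=2)
shares = softmax(norms)
print(sparse)
print(sum(shares))
```

Because the softmax total is fixed, raising one capsule's strength necessarily lowers the share of every other capsule, which is the zero-sum property described above. A per-capsule normalization would leave the other capsules' outputs unchanged.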
Why Capsules Generalize Better
From our results we believe – but have admittedly not yet proven – that capsules generalize better due to the representation capturing and encoding the ways that entities in the input vary. Samples outside the training data set may then be represented by more extreme configurations of the existing parameters, without further learning. This adjustment happens automatically on exposure to new input, without permanent weight adaptation. Since consumers of the representation must already have learnt how to interpret continuous variation in these parameters, it is possible that the learned interpretation includes generalization to more extreme values.
Our recent arXiv upload “Sparse Unsupervised Capsules generalize better” showed that a sparse, unsupervised capsules network can generalize effectively from MNIST training data to affNIST (affine transformed MNIST digits). We did not train our capsules network on affNIST at all, yet our capsules representation achieved better classification accuracy than all the other results we could find (except one very sophisticated network explicitly designed to model affine transformations, which had similar performance to ours). Many of the other, inferior results came from networks that were also trained on affNIST input and therefore weren’t even trying to generalize! So we take this as a good piece of evidence that Capsule networks might generalize better than conventional (non-capsule) networks.
Objective equivariance & performance metrics
This article will primarily use images of univariate perturbations of network hidden layer state to explore the variations captured by each representation. Unfortunately these are subjective, qualitative measures. It would clearly be preferable to use objective, quantitative metrics to evaluate the equivariance of a representation, but we have not researched this topic sufficiently to arrive at a good method.
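A univariate perturbation sweep can be sketched as follows. This is a minimal illustration, not the code used for the experiments: the perturbation sizes in `deltas` are assumptions (the article does not state them), and `toy_decode` is a stand-in for a trained decoder network.

```python
def univariate_perturbations(encoding, deltas):
    """Yield (index, delta, perturbed copy), changing one element at a time."""
    for i in range(len(encoding)):
        for d in deltas:
            perturbed = list(encoding)  # copy, so the original is untouched
            perturbed[i] += d
            yield i, d, perturbed


def sweep(encoding, decode, deltas):
    """Decode every univariate perturbation of a hidden-layer encoding.

    Returns a dict mapping (element index, delta) to the decoded output.
    """
    return {(i, d): decode(p)
            for i, d, p in univariate_perturbations(encoding, deltas)}


def toy_decode(z):
    """Hypothetical decoder used only to make the sketch runnable."""
    return [z[0] + z[1], z[1] - z[2]]


hidden = [0.5, 0.1, -0.2]
results = sweep(hidden, toy_decode, deltas=[-0.25, 0.25])
print(len(results))  # one decoding per (element, delta) pair
```

With a real decoder, each entry of `results` would be an image, and rows of images for a single element show how the output varies as that element alone changes.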
We will also use some numerical metrics such as classification accuracy and mean-square-error reconstruction loss to quantitatively measure the quality of representations, though on MNIST these are not very useful as the dataset is too simplistic. They mainly serve to detect algorithm defects.
We can expect stroke-based glyphs to have a common set of equivariances that result from the way the glyphs are formed. These are nicely illustrated in Sabour et al, fig. 4.
We tested several network architectures. Each architecture consisted of a 3-layer Encoder combined with a 3-layer Decoder network.
One decoder optimizes reconstruction loss, which makes the complete network an autoencoder. This tests the information content of the hidden layers. The other decoder optimizes classification accuracy, which results in a conventional feed-forward network (ignoring bidirectional connections inside some of the encoder layers, in the capsules algorithm).
To improve the fairness of the comparisons, the number of filters and other geometric properties of the encoders and decoders are kept constant. The training regimes, learning algorithms and nonlinearities such as group selection are varied.
One of the networks we tested is a new algorithm we have invented, called Predictive Capsules or PredCaps for short. There’s not enough time or space to explain PredCaps in this article but it could be substituted for other Capsules encoders such as the architectures in Sabour et al or Hinton et al. We believe the fundamental insights of this article would be unchanged; for example, the reported equivariances appear similar.
PredCaps was motivated by a set of biologically plausible design goals, which result in the following advantages:
- The architecture is homogeneous: All layers are PredCaps layers, unlike the capsules papers cited above where Relu and other conventional layers are used.
- Uses only a local, unsupervised learning rule.
- All layers are trained continuously and simultaneously.
- Replaces routing with self-prediction as the consensus mechanism.
- There are no squashing functions or other nonlinearities, except sparseness (for training) and scaling capsules’ total output to a unit value.
- Doesn’t need a slowly changing consensus; instead consensus is achieved as outputs are sent up and down the stack of PredCaps layers, a la Belief-Propagation.
- Robustly infers the entities being modelled using feedback and feed-forward input.
- Mimics, and is inspired by, the morphology of pyramidal neurons.
|Layer||With Dec-Recon||With Dec-Class|
|1 Encoder||Input: 28x28x1. Receptive fields: 6×6. Filters: 64 = 16 capsules x 4 parameters|
|2 Encoder||Input: 10x10x64. Receptive fields: 5×5. Filters: 256 = 16 capsules x 16 parameters|
|3 Encoder||Input: 10x10x64. Receptive fields: 5×5. Filters: 256 = 16 capsules x 16 parameters|
|Univariate perturbation between these layers, at testing time.|
|4 Decoder||Fully connected dense layer with 1x1x256 inputs and 512 cells and ReLU nonlinearity||Fully connected dense layer with 1x1x256 inputs and 1024 cells and ReLU nonlinearity|
|5 Decoder||Fully connected dense layer with 1024 cells and ReLU nonlinearity||Fully connected dense layer with 512 cells and ReLU nonlinearity|
|6 Decoder||Fully connected dense layer with 784 cells (outputs) and Sigmoid nonlinearity||Fully connected dense layer with 10 cells (outputs) and Sigmoid nonlinearity|
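The encoder geometry in the table can be written down as a small configuration, which makes the capsule arithmetic easy to check. The class and field names below are hypothetical; only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderLayer:
    receptive_field: int      # width and height of the square receptive field
    capsules: int             # number of fixed groups of cells
    params_per_capsule: int   # cells (parameters) in each group

    @property
    def filters(self):
        """Total filters in the layer = capsules x parameters per capsule."""
        return self.capsules * self.params_per_capsule


ENCODER = [
    EncoderLayer(receptive_field=6, capsules=16, params_per_capsule=4),
    EncoderLayer(receptive_field=5, capsules=16, params_per_capsule=16),
    EncoderLayer(receptive_field=5, capsules=16, params_per_capsule=16),
]

# Output sizes of the two alternative decoder heads.
DECODER_OUTPUTS = {"Dec-Recon": 28 * 28, "Dec-Class": 10}

print([layer.filters for layer in ENCODER])
```

The 256 filters of the third encoder layer are the bottleneck that both decoders read from.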
|Encoder Algorithms (representation learners)|
|Enc-Untrained||Untrained convolutional network, consisting of 3 ReLU layers. We use a TensorFlow StopGradient op to prevent any update to the weights in the encoder layers, and we verified that the weights never change.|
|Enc-Deep-BP||As above, but we train the entire encoder-decoder pair as a conventional deep feed-forward network (all 6 layers trained). Errors and weight updates are backpropagated through all layers.|
|Enc-PredCaps||3 layers of Predictive Capsules.|
|Decoder networks (alternate “heads”)|
|Dec-Recon||With this decoder, the network optimizes mean-square error between the output and the original input image.|
|Dec-Class||With this decoder, the network optimizes cross-entropy classification loss; classification accuracy is also reported.|
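For reference, the quantities named in the decoder table can be sketched in plain Python. This is a generic illustration, not the training code: the cross-entropy here assumes the ten outputs have been normalized into class probabilities, whereas the article does not say how the loss is computed over the Sigmoid outputs.

```python
import math


def mse(output, target):
    """Mean-square error between a reconstruction and the original input."""
    return sum((o - t) ** 2 for o, t in zip(output, target)) / len(target)


def cross_entropy(probs, label, eps=1e-12):
    """Negative log-probability assigned to the true class."""
    return -math.log(max(probs[label], eps))


def accuracy(batch_probs, labels):
    """Fraction of samples whose highest-scoring class is the true label."""
    hits = sum(1 for p, y in zip(batch_probs, labels)
               if max(range(len(p)), key=p.__getitem__) == y)
    return hits / len(labels)


print(mse([0.0, 1.0], [0.0, 0.0]))
print(cross_entropy([0.5, 0.5], 0))
print(accuracy([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], [0, 1, 1]))
```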
In total, 6 network architectures were tested. They were:
- Enc-Untrained + Dec-Recon
- Enc-Untrained + Dec-Class
- Enc-Deep-BP + Dec-Recon
- Enc-Deep-BP + Dec-Class
- Enc-PredCaps + Dec-Recon
- Enc-PredCaps + Dec-Class
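The six configurations are simply every pairing of the three encoders with the two decoder heads, which can be enumerated as below (the variable names are illustrative only).

```python
from itertools import product

ENCODERS = ["Enc-Untrained", "Enc-Deep-BP", "Enc-PredCaps"]
DECODERS = ["Dec-Recon", "Dec-Class"]

# Every encoder is paired with every decoder head.
architectures = [f"{enc} + {dec}" for enc, dec in product(ENCODERS, DECODERS)]

for name in architectures:
    print(name)
```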
The results are summarized in a table below. There are some surprising and provocative results.
First, all 3 encoders are able to produce output that permits very good classification accuracy. Do bear in mind that MNIST is sufficiently “easy” that excellent classification performance is expected. The decoder network is very powerful, so this isn’t a total surprise, but it does tell us some things:
- None of the encoders (including the untrained encoder) results in a significant loss of information from the original input image. It’s possible that the convolution process (even without trained filters) is actually useful for classification.
- Specifically, the new PredCaps encoder does not cause a pathological loss of information needed for classification despite also learning equivariances.
Reconstruction losses were more varied. Enc-Untrained is much worse than the other encoders – the difference is visible to the eye too, with the reconstructed digits being blurred and degraded. However, both Enc-Deep-BP and Enc-PredCaps produce very accurate reconstructions. This provides further evidence that both encodings preserve almost all information, despite sharing a small bottleneck of just 256 elements.
|Encoder||Classification Accuracy||Reconstruction loss (MSE)||Equivariance?|
|Enc-Untrained||99.4%||0.014||Arguable, but digit forms are not recognizably digits 0..9.|
|Enc-Deep-BP||99.6%||0.004||No. Perturbed digit forms are unconnected blobs.|
|Enc-PredCaps||99.7%||0.005||Yes, always plausible forms of digits 0..9 under varying transformations.|
Now we get to the interesting part. We individually perturbed each element of the encoder output and observed the resulting decodings, looking for evidence that the encoding is equivariant. Of course, when exposed to real data the parameters vary in combination, but this space is too large to explore easily. The same perturbations were made to all networks’ 3rd encoder layer output, prior to decoding. We made the following observations:
- The untrained encoder network produces smoothly varying output forms in response to univariate hidden layer perturbations, but they are not recognizably the numerals 0..9.
- After perturbation, Enc-Deep-BP reliably produces broken blobs and spots that do not look like digits at all. Not equivariant.
- Enc-PredCaps produces outputs that are almost always recognizably the digits 0..9 although with forms that vary in ways including skew, rotation, compression in some axis, digit morphing, stroke intensity and so on.
These results reveal something fascinating about the effect of conventional convolutional network training using gradient descent. A random, untrained encoder network can be decoded in a manner that is somewhat resistant to univariate perturbations, resulting in smoothly varying although not recognizable digit-forms. By contrast, univariate perturbations of the trained network result in spotty, broken forms that don’t look like digits at all. The training process has sensitized the complete network such that small changes in the deepest hidden layer can disrupt reconstruction, producing output that does not resemble the training data at all. This is a form of brittleness: only very close matches to the training data can be effectively encoded.
Overfitting and poor generalization are known to be common problems for deep networks trained in this manner. If the training process discovers statistical clusters in the input and then learns to identify the presence or absence of these, it may become less able to describe input that is not within these clusters. As a result, the network is only able to produce meaningful and useful results in a small subset of its possible configurations.
Obviously, not training networks is not the answer! (Extreme Learning Machines might disagree). Although it was still possible to classify the digit given an untrained network, image reconstruction was badly affected by the absence of training. Capsules algorithms such as PredCaps appear to offer the best of both worlds – an ability to capture almost all input information in a small number of parameters, while simultaneously being able to describe unseen variants of the input using different values of the same parameters. A wide variety of PredCaps hidden layer configurations produces recognizable numerical digit-forms under various transformations.
The ultimate test of equivariant representations is whether they achieve the core objective of improved generalization. We are now testing PredCaps on the MNIST → affNIST generalization task, and on more sophisticated datasets (including smallNORB).