3.5.4 MNIST Handwritten Digits
The MNIST dataset has pixel values in the range [0,255]. We therefore start with simple rescaling to shift the data into the range [0,1]. In practice, removing the per-example mean value can also help feature learning. Note: while one could also apply PCA/ZCA whitening to MNIST if desired, this is not often done in practice.
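As a minimal sketch of this preprocessing, assuming the images have been loaded into a NumPy array of shape (num_examples, num_pixels) with raw values in [0,255] (the array name `images` is illustrative, not from this text):

```python
import numpy as np

def preprocess_mnist(images):
    """Rescale raw MNIST pixels to [0, 1] and remove the per-example mean.

    `images` is assumed to be an array of shape (num_examples, num_pixels)
    with integer or float values in [0, 255].
    """
    # Simple rescaling into [0, 1].
    x = images.astype(np.float64) / 255.0

    # Optional: subtract each example's own mean value (per-example mean removal).
    x = x - x.mean(axis=1, keepdims=True)

    return x
```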
Chapter 4: Deep Networks
4.1 Overview
In the previous sections, you constructed a 3-layer neural network comprising an input layer, a hidden layer, and an output layer. While fairly effective for MNIST, this 3-layer model is a rather shallow network; by this, we mean that the features (the hidden layer activations a(2)) are computed using only "one layer" of computation (the hidden layer).
In this section, we begin to discuss deep neural networks, meaning ones in which we have multiple hidden layers; this will allow us to compute much more complex features of the input. Because each hidden layer computes a non-linear transformation of the previous layer, a deep network can have significantly greater representational power (i.e., can learn significantly more complex functions) than a shallow one.
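As a rough sketch of what such a forward pass looks like (the helper names here are illustrative, not part of this text), each hidden layer simply applies another affine map followed by a non-linearity to the previous layer's activations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute the activations of a feedforward network layer by layer.

    weights[l] is assumed to have shape (n_{l+1}, n_l) and biases[l] shape
    (n_{l+1},), following the usual W^{(l)}, b^{(l)} convention.
    """
    a = x
    for W, b in zip(weights, biases):
        # Each layer is a non-linear transformation of the previous layer's output.
        a = sigmoid(W @ a + b)
    return a
```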
Note that when training a deep network, it is important to use a non-linear activation function f(·) in each hidden layer. This is because multiple layers of linear functions would themselves compute only a linear function of the input (i.e., composing multiple linear functions together results in just another linear function), and would thus be no more expressive than a single layer of hidden units.
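To see this concretely, here is a small numerical check (with arbitrary, made-up weight shapes) that two stacked linear layers collapse into a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4)

# Two stacked "hidden layers" with identity (linear) activations.
W1, b1 = rng.standard_normal((5, 4)), rng.standard_normal(5)
W2, b2 = rng.standard_normal((3, 5)), rng.standard_normal(3)
two_layer = W2 @ (W1 @ x + b1) + b2

# The same mapping expressed as a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))  # True: the composition is still linear
```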
4.2 Advantages of deep networks
Why do we want to use a deep network? The primary advantage is that it can compactly represent a significantly larger set of functions than a shallow network. Formally, one can show that there are functions which a k-layer network can represent compactly (with a number of hidden units that is polynomial in the number of inputs), but which a (k−1)-layer network cannot represent unless it has an exponentially large number of hidden units.
To take a simple example, consider building a Boolean circuit/network to compute the parity (or XOR) of n input bits. Suppose each node in the network can compute either the logical OR of its inputs (or the OR of their negations), or the logical AND. If we have a network with only an input, a hidden, and an output layer, computing the parity function requires a number of nodes that is exponential in the input size n. If, however, we are allowed a deeper network, then the circuit size can be only polynomial in n.
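As an informal illustration (not a proof of the claim above), parity is easy to compute with a deep chain of pairwise XOR gates, using only about n gates, whereas a single shallow layer of AND/OR terms over the raw inputs must effectively enumerate exponentially many input patterns:

```python
from functools import reduce

def parity_deep(bits):
    """Compute parity with a chain of pairwise XORs: roughly n gates, depth ~ n."""
    return reduce(lambda a, b: a ^ b, bits, 0)

print(parity_deep([1, 0, 1, 1]))  # 1, since an odd number of bits are set
```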
By using a deep network, in the case of images, one can also start to learn part-whole decompositions. For example, the first layer might learn to group together pixels in an image in order to detect edges (as seen in the earlier exercises). The second layer might then group together edges to detect longer contours, or perhaps detect simple "parts of objects." An even deeper layer might then group together these contours or detect even more complex features.
Finally, cortical computations (in the brain) also have multiple layers of processing. For example, visual images are processed in multiple stages by the brain, by cortical area "V1", followed by cortical area "V2" (a different part of the brain), and so on.
4.3 Difficulty of training deep architectures
While the theoretical benefits of deep networks in terms of their compactness and expressive power have been appreciated for many decades, until recently researchers had little success training deep architectures.
The main learning algorithm that researchers used was to randomly initialize the weights of a deep network, and then train it on a labeled training set with a supervised learning objective, for example by applying gradient descent to try to drive down the training error. However, this usually did not work well. There were several reasons for this.