Backpropagation from scratch in Julia (part II: derivation and implementation)

Get the code from here

This is the second post in a series describing the backpropagation algorithm applied to feed-forward neural network training. In the last post we described what a neural network is and concluded that it is a parametrized mathematical function. We implemented neural network initialization (meaning creating a proper entity representing the network - not weight initialization) and an inference routine, but we never made any connection to the data itself. In this post we will make such a connection and express the "goodness" of a parametrization in terms of the training data and the network output.

More layers

Linear layer

To do so we need to focus on the output layer, as it is going to be the input to the function expressing how well the network fits the data. Our output layer is going to be a "softmax" layer. Softmax usually goes together with a fully connected linear layer placed right before it, so before introducing softmax let's explain and implement the linear layer.

A linear layer consists of a set of nodes whose purpose is to linearly combine the input with their parameters. In contrast to the sigmoid we won't 'squash' the output of the linear combination but leave it as it is. To stay consistent with the code we wrote so far, though, we will think of it as if we applied "identity" as the 'squashing' function. This interpretation will be needed later when we get to the core of this post - backpropagation (it will turn out that the squashing function and its derivative with respect to the input are important).

Let's quickly add the linear layer to our current implementation:


...
# first we introduce underlying function of a node/layer
function linearNeuronTransformFunction(params, input)
  return params * appendColumnOfOnes(input)
end
...
# then we add our new layer of nodes to existing network 
function addFullyConnectedLinearLayer(architecture::NetworkArchitecture, numberOfNeurons::Int64)
 lastNetworkLayer = architecture.layers[end]
 inputSize = lastNetworkLayer.numberOfNeurons
 linearLayer = FullyConnectedComputingLayer(inputSize, numberOfNeurons, linearNeuronTransformFunction)
 push!(architecture.layers, linearLayer)
end
...

Please note that we used our generic type FullyConnectedComputingLayer (in the graph below you can think of these nodes as "blue" computing neurons).

The linear layer is yet another function that can be added to the whole composition of functions that a neural network represents. Introducing it provides more flexibility: from now on we can use our neural network to represent a broader class of functions.

We can now introduce softmax - our next step towards connecting the neural network output to the data and, in the end, measuring the output goodness (or error).

Softmax layer

The purpose of our new layer is to normalize the network outputs so that they all contribute to one final probability vector. This kind of layer is called a softmax layer. Let's take a look at how it will be attached at the end of our current network template:

[figure: the network template with a softmax layer producing a probability vector at the output]

The last layer of orange nodes does not differ much from regular neurons. The only difference is its underlying function: it crunches the input data in a different way. Let's take a look at it:

\[
f_k(x) = \frac{\mathrm{e}^{x_k}}{\sum_{i=1}^{K} \mathrm{e}^{x_i}}
\]

(where \(k\) is the neuron index and \(x\) is the input vector of length \(K\))

If you look closer you will quickly conclude it is just a certain way of normalization. The denominator is the same for each neuron and equals the sum of all the numerators. Softmax is not parametrized: it behaves the same way no matter what, and its purpose is to exponentially normalize the input coming from the previous layer.
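A quick numeric sanity check may make the normalization concrete (a throwaway snippet, relying on the vectorized pre-1.0 Julia style used throughout this post):


x = [1.0, 2.0, 3.0]            # a hypothetical 3-node input
s = exp(x) / sum(exp(x))       # elementwise exp, then normalize
# s ≈ [0.0900, 0.2447, 0.6652] and sum(s) == 1.0
# shifting the input by a constant does not change the output:
exp(x + 10) / sum(exp(x + 10)) # the same vector

The shift invariance shown in the last line is also what standard numerical-stability fixes exploit (subtracting the maximum input before exponentiating).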

Let's add it to our current implementation:


function exponentialNormalizer(params, input)
  denominator = sum(exp(input),1)
  return exp(input) ./ denominator
end

type SoftMaxLayer <: Layer
  numberOfNeurons::Int64
  parameters::Any
  transform::Function

  function SoftMaxLayer(numberOfNeurons)
    return new(numberOfNeurons, [], exponentialNormalizer)
  end
end

function addSoftMaxLayer(architecture::NetworkArchitecture)
 lastNetworkLayer = architecture.layers[end]
 numberOfNeurons = lastNetworkLayer.numberOfNeurons
 softMaxLayer = SoftMaxLayer(numberOfNeurons)
 push!(architecture.layers, softMaxLayer)
end

We first added a function that performs exponential normalization of the input vector, component by component (please don't mind the numerical stability problems for now - we will fix them later when needed). It is then used as the transformation function of the newly created softmax layer. Adding the layer to a network is realized with the last function. Note that the layer does not require any parameters, as it is fully determined by the previous layer in the network. Let's build a network now and quickly test whether our softmax layer really normalizes the input.


firstLayer = FullyConnectedComputingLayer(784, 100, sigmoidNeuronTransformFunction);
architecture = NetworkArchitecture(firstLayer);
addFullyConnectedSigmoidLayer(architecture, 50);
addFullyConnectedSigmoidLayer(architecture, 30);
addFullyConnectedLinearLayer(architecture, 9);
addSoftMaxLayer(architecture);

outputVector = infer(architecture , randn(784,1)) # we try a single vector now
sum(outputVector)
# 1

After passing a random input vector through the network we got a probability vector at the output layer. Still, that is not really a success... unless we motivate it. The whole effort of normalizing the output lets us treat the neural network output as a probability distribution. That interpretation places the problem within a probabilistic framework and so lets us use tools from that framework - one such tool is the cross-entropy error function, which can be applied to normalized output vectors. Before talking specifically about cross-entropy, let's first focus on error functions in general.

Error function

An error function is a way to express, with one real number, how well a given fixed parametrized neural network is trained. It is a mathematical formulation of what the "best" neural network parametrization exactly means. One way of evaluating a parametrization is to check how it performs in terms of its original purpose (what a discovery!). In the case of the MNIST classification task that we have been referring to so far, the goal of the neural network is to classify (guess) the label of the input image. Evaluation then has to involve the training dataset, the corresponding labels and the model parameters (the parameters of the network layers). The error function should be very low or zero if all the training examples are classified correctly, and it should have a high value whenever classification is incorrect in most of the cases. This is the most basic requirement on an error function.

We will now present the cross-entropy function. It looks a bit complicated at first, but we will see why it looks the way it does in a minute. The choice of cross-entropy for this particular type of problem is motivated by experiments done on that matter before (like this one); training with it works in practice. There are other options though - another error function that people use is the squared error (usually without softmax, applied to the sigmoid neurons directly).

Please take a look at the form of the cross-entropy function:

\[
\mathrm{J}(x, y, \theta) = - \frac{1}{N} \sum_{z} \mathrm{log}( {o}_{y_z}^L(x_{z}))
\]

Let's focus on this function for a minute. Its arguments are:

  • \( x \) is a set of training examples (in the case of MNIST: a set of images)
  • \( y \) is the corresponding set of labels (in the case of MNIST: digit labels)
  • \(\theta\) represents all parameters of the given fixed neural network. In the case of our code, \( \theta \) would represent all the numbers kept in the layers' parameters

The error function computes the average of components of the form
\[
\mathrm{J}_z = - \mathrm{log}( {o}_{y_z}^L(x_{z}))
\]

Each component is computed for the \(z\)-th element of the training set (\(x_{z}\)) and its corresponding label \(y_{z}\). Please note \({o}_{y_z}^L\): this represents the output of the output neuron chosen to model the label given by \(y_z\) (where \(L\) is the layer index). This notation carries an implementation detail: output neuron \({o}_{y_z}^L\) is chosen upfront to model label \( y_{z} \), so for each training example we are practically interested in one particular output neuron when computing the total error*.

*one important assumption though: we assume a certain label coding scheme here. The given error function only makes sense when we assume crisp class membership and so a 1-of-K coding scheme

We still need a motivation for using such an error function and an intuition for why it even works. Formally it approximates the cross entropy between two discrete probability distributions (of which one is unknown). That might be digestible for a scientist, but without a certain background it does not directly give you any intuition. So let's try to see/visualize what it means. Let's focus on one specific \( \mathrm{J}_k \): we can assume \(k\) is given and known, so we only observe the one specific output neuron that was chosen to model labels of value \(y_k\) - more exactly, \( o_{y_k}^L \). Because we used softmax to compute it, we know its value lies in the range \((0,1)\). The plot of \( - \mathrm{log} \) on \((0,1)\) below may give you the simplest intuition of what is going on here:

[figure: plot of \(-\mathrm{log}(x)\) on the interval \((0,1)\)]

Once again, we are considering a single specific input vector \(x_k\) and so the contribution of output neuron \(o_{y_k}^L\) to the total error. Because we decided upfront that this output neuron is responsible for modeling labels of value \( y_k\), we are happiest when its value for the chosen input is 1 (and, because of softmax, the others are then 0, as they all add up to one by definition). If its value is closer to zero, it means our network cannot correctly guess the true label. This is what the plot expresses: the error function punishes incorrect probability values with high error values. More or less, this is the basic intuition behind it.
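To put rough numbers on that intuition:


-log(0.99) # ≈ 0.01 : confident and correct, almost no penalty
-log(0.5)  # ≈ 0.69 : undecided
-log(0.01) # ≈ 4.61 : confident and wrong, heavy penalty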

Let's implement error function evaluation in our existing code.


...
function crossEntropyError(architecture::NetworkArchitecture, input, labels)
 probabilitiesSparseMatrix = infer(architecture, input) .* labels
 probabilities = sum(probabilitiesSparseMatrix , 1)
 return -sum(log(probabilities)) / length(probabilities)
end
...

Having our function defined like that, we can compute the total error at once, assuming a certain representation of the label set. So let's assume we have 100 elements in the dataset, and so 100 labels from the range [1,10], and again 784-dimensional input.


firstLayer = FullyConnectedComputingLayer(784, 100, sigmoidNeuronTransformFunction)
architecture = NetworkArchitecture(firstLayer)
addFullyConnectedSigmoidLayer(architecture, 50)
addFullyConnectedSigmoidLayer(architecture, 10)
addFullyConnectedLinearLayer(architecture, 10)
addSoftMaxLayer(architecture)


dataBatch = randn(784,100)
labels = sparse(mapslices(x -> begin x[rand(1:10,1)] = 1; x end, zeros(10,100) , 1))
crossEntropyError(architecture, dataBatch, labels)
# 2.7185784703167353

The key is the labels representation: we want it to be a sparse matrix whose \(i\)-th column represents the label of the \(i\)-th input element by having the value 1 in the row that represents the correct label. For example


 Y[:,1]
#10x1 sparse matrix with 1 Float64 entries:
#        [5 ,  1]  =  1.0

means the first element in the dataset has label number 5. In the case of MNIST it could mean that the first element is a handwritten digit whose label we chose to code in row 5. The reason why we sum

infer(architecture, input) .* labels

by column is to make it represent the sequence
\[
o_{y_1}^L(x_1), o_{y_2}^L(x_2), \dots, o_{y_k}^L(x_k)
\]

so they can later be passed through the logarithm and averaged over the number of examples to form the final error value.
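A toy example with made-up numbers (3 classes, 2 examples) may make the mechanics clearer:


outputs = [0.7 0.2; 0.2 0.5; 0.1 0.3]              # softmax outputs, one column per example
labels  = sparse([1, 2], [1, 2], [1.0, 1.0], 3, 2) # example 1 -> class 1, example 2 -> class 2
picked = sum(outputs .* labels, 1)                 # 1x2 matrix [0.7 0.5], the "true class" probabilities
-sum(log(picked)) / length(picked)                 # ≈ 0.525, the cross-entropy error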

Minimizing error function

Our goal is to minimize the error. We are after a parametrization that results in the lowest error function value - indirectly implying high classification accuracy (on the training sample). So in fact we are going to search the parameter space to optimize the error function. This is a very generic problem, and any method that lets you search through a parameter space is a potential choice: we could consider simulated annealing, genetic algorithms and other optimization methods too. But because of the careful choice of activation functions and error function, our optimization problem is a bit simpler. The key is differentiability: the error function is differentiable in the parameter space - it is smooth. This is a very important property because it opens the door to the gradient descent method.

Gradient descent is an iterative method for finding a local minimum of a given function, assuming the function is differentiable. Differentiability of the error function implies that at each point of the error function surface we can compute the direction in which the function descends. After "guessing" or otherwise initializing the neural network weights, the fixed parametrization corresponds to a point on the error function surface, and the derivative at that point tells us the "direction" of steepest descent - where to look for the next solution candidate. All the new solution candidates are points on a certain line in parameter space. Updating the current solution (network parametrization) means choosing a candidate point that is "reasonably close" to the current solution. Pseudo code for gradient descent looks like


c = init() # random init is fine
while not convergence
  c = c - alpha * error_derivative(c)
  convergence = convergence_criteria()
end

The core element of the pseudo-algorithm above is the computation of the derivative of the error function at point c. One algorithm for computing the derivative of a neural network error function is called the backpropagation algorithm, and that brings us to the main point of this series.
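To see the loop in action on something trivial, here is a toy instance, assuming the function to minimize is \(f(c) = (c-3)^2\) with derivative \(2(c-3)\) (toyGradientDescent is a hypothetical helper, not part of the network code):


function toyGradientDescent(alpha, iterations)
  c = 10.0                       # "random" initialization
  for i in 1:iterations
    c = c - alpha * 2 * (c - 3)  # c = c - alpha * error_derivative(c)
  end
  return c
end

toyGradientDescent(0.1, 100) # ≈ 3.0, the minimum of f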

Backpropagation algorithm

Backpropagation is a method of calculating the partial derivatives, with respect to the model parameters, of the class of functions represented by neural network error functions. Let's focus on one specific error function that we already know: the cross-entropy error. The cross-entropy error function "sees" the neural network outputs and, based on them, evaluates how well the output fits/models the desired output value (the truth). The outputs produced by the neural network depend on the model parameters - the parameter values of each network layer. They do not depend on the input data and labels in the sense one might think, as these are considered fixed - constant.

Error function is given by:

\[
J(x,y,\theta) = \frac{1}{N} \sum_{z} - log(o_{y_z}^L(x_z))
\]

where \(x\) is a sequence of network inputs and \(y\) is a sequence of corresponding labels. A single \(x_z\) is a vector the neural network is fed with, \(y_z\) is the single corresponding label, \(\theta\) is a structure containing the model parameters, \(N\) is the number of training samples and \( o_{y_z}^L \) is the network output node chosen upfront to model labels of value \(y_z\)*.

*one important assumption though: we assume a certain label coding scheme here. The given error function only makes sense when we assume crisp class membership and so a 1-of-K coding scheme

Let's assume \(\theta\) is indexed in the following way:

\[
\theta^{l}_{ij}
\]

meaning a single model parameter value: the \(j\)-th parameter of the \(i\)-th node of the \((l+1)\)-th layer of the network. In fact it is the weight, within the weighted sum computed by node \(i\), applied to the output of the single node of index \(j\) from the previous layer \(l\). The ranges of these indices depend on the network architecture: \(l\) runs over the layers of the network, \(i\) over the nodes of the \((l+1)\)-th layer and \(j\) over the nodes of the \(l\)-th layer.

[figure: the parameter \(\theta^{l}_{ij}\) connecting node \(j\) of layer \(l\) with node \(i\) of layer \(l+1\)]

A quick look above: \(\theta^{l}_{ij} \) is the \(j\)-th parameter of the underlying crunching function of the \(i\)-th node of layer \(l+1\). Recall the functions that perform the actual computations in neural networks: in the case we explained, a weighted sum of the input signal is computed, to be later squashed by a sigmoid (or by identity - not squashed at all!). The input signal comes from the previous layer and is a vector; index \(j\) then points at a single value of that vector - the signal coming from the \(j\)-th node of the previous layer.

One more observation before we begin. The error function we chose has an interesting property that we already pointed out before: it can be represented as a mean of errors per observation pair.

\[
\mathrm{J}(x, y, \theta) = \frac{1}{N} \sum_{z} \mathrm{J}_z
\]

therefore

\[
\frac{\partial \mathrm{J}}{\partial \theta_{ij}^l} = \frac{1}{N} \sum_{z} \frac{\partial \mathrm{J}_z}{\partial \theta_{ij}^l}
\]

meaning the partial derivative of the total error function with respect to \( \theta_{ij}^l \) is the mean of the partial derivatives of the single-training-example error components. We can therefore focus on that single case and solve it. So our new goal is to compute
\[
\frac{\partial \mathrm{J_z}}{\partial \theta_{ij}^l}
\] for a single training example pair \((x_z , y_z)\). In other words we want to know how to tweak a single parameter \(\theta_{ij}^l\) to decrease the error component \( \mathrm{J_z} \). There is not much we can do now but write it down and see what emerges. We will assume a multi-layer network with a softmax output layer and sigmoid inner layers - please be aware there is no linear layer right before the softmax here; it is a bit unusual, but the final derivation will be generic, so any 'squashing' function - including identity, of course - works the same way. The upcoming derivation also assumes we deal with a parameter that is not a bias \(\beta\); the derivation for a bias parameter is similar and will be referred to later.

\[
\frac{\partial \mathrm{J}_z}{\partial \theta_{ij}^l} = \frac{\partial \mathrm{J_z}}{\partial o_{y_z}^L} \frac{\partial o_{y_z}^L}{\partial \theta_{ij}^l}
\]

The error function relies on only one output node value, \(o_{y_z}^L\); the derivation above is the result of applying the chain rule. The first component \(\frac{\partial \mathrm{J}_z}{\partial o_{y_z}^L} \) is known: it is the derivative of the error function with respect to the output neuron that models the given label.

\[
\frac{\partial \mathrm{J_z}}{\partial o_{y_z}^L} = - \frac{1}{o_{y_z}^L}
\]

Let's take a closer look at the other component, \( \frac{\partial o_{y_z}^L}{\partial \theta_{ij}^l} \): it expresses how the underlying crunching function of the output node (softmax) changes with respect to the parameter in consideration. Let's try to focus on that

\[
\begin{align}
\frac{\partial o_{y_z}^L}{\partial \theta_{ij}^l} = \sum_{k} \frac{\partial o_{y_z}^L}{\partial o_{k}^{L-1}} \frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l}
\end{align}
\]

What we see here is that the derivative with respect to \( \theta_{ij}^l \) now relies on all the input nodes. Therefore we again used the chain rule, this time for a multivariate function. Take a look at the sum
\[
\sum_{k} \frac{\partial o_{y_z}^L}{\partial o_{k}^{L-1}} \frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l}
\]

and keep in mind once again that \( o_{y_z}^L \) is a softmax node. Let's consider two cases now

  • Index \( k \) of a component corresponds to the numerator of the softmax unit's underlying function (i.e. \(k = y_z\)). Then
    \[
    \frac{\partial o_{y_z}^L}{\partial o_{k}^{L-1}} \frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l} = o_{y_z}^L(1 - o_{y_z}^L) \frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l}
    \]
  • Index \( k \) of a component corresponds only to the denominator of the softmax unit's underlying function (i.e. \(k \neq y_z\)). Then
    \[
    \frac{\partial o_{y_z}^L}{\partial o_{k}^{L-1}} \frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l} = - o_{y_z}^L o_{k}^L \frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l}
    \]

Let's now combine it all together to get the following:

\[
\begin{align}
\frac{\partial \mathrm{J_z}}{\partial \theta_{ij}^l} &= - \frac{1}{o_{y_z}^L} \sum_{k} o_{y_z}^L ( \mathrm{1}(k = y_z) - o_{k}^L ) \frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l} \\
& = \sum_{k} (o_{k}^L - y_k) \frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l}
\end{align}
\]

(where \( \mathrm{1}(k = y_z) \) is the indicator function: it equals one if the logical expression is true and zero otherwise)
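The factor \((o_k^L - y_k)\) that appeared is the well-known gradient of cross-entropy with respect to the softmax input, and it is easy to verify numerically with finite differences (a standalone sanity check, not part of the network code):


function softmaxCrossEntropy(x, t)  # t is the index of the true class
  o = exp(x) / sum(exp(x))
  return -log(o[t])
end

x = [0.5, -1.0, 2.0]
t = 1
o = exp(x) / sum(exp(x))
y = zeros(3); y[t] = 1.0
analytic = o - y                    # the (o_k - y_k) formula
numeric = zeros(3)
h = 1e-6
for k in 1:3
  xp = copy(x); xp[k] += h
  numeric[k] = (softmaxCrossEntropy(xp, t) - softmaxCrossEntropy(x, t)) / h
end
# numeric ≈ analytic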

Wow, we got through one layer! Phew. Unfortunately we need to continue this mathematical journey for some time, but a pattern will hopefully emerge so we can express all of this in a much simpler form.

One important note now: we have pushed the error formulation down through the network, and our error derivative now relies on components in the \((L-1)\)-th layer. Let's focus on one such component
\[
\frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l}
\]

What does it say exactly? We would now like to know how the underlying functions of the nodes in layer \((L-1)\) change with respect to the one parameter we consider. Let's find out! We know what to do: we apply the chain rule again, for the same reason as before

\[
\frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l} = \sum_{u} \frac{\partial o_{k}^{L-1}}{\partial o_{u}^{L-2}} \frac{\partial o_{u}^{L-2}}{\partial \theta_{ij}^l}
\]

The observation now is that \( o_{k}^{L-1} \) has a sigmoid activation function that relies on its input nodes from layer \( (L-2) \), therefore:

\[
\frac{\partial o_{k}^{L-1}}{\partial o_{u}^{L-2}} = o_{k}^{L-1}(1 - o_{k}^{L-1}) \theta_{ku}^{L-2}
\]

and

\[
\frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l} = \sum_{u} o_{k}^{L-1}(1 - o_{k}^{L-1}) \theta_{ku}^{L-2} \frac{\partial o_{u}^{L-2}}{\partial \theta_{ij}^l}
\]

Please notice the recursive structure of the sum above: the term \( \frac{\partial o_{u}^{L-2}}{\partial \theta_{ij}^l} \) has exactly the same form, just one layer further down.

\[
\frac{\partial o_{u}^{L-2}}{\partial \theta_{ij}^l} = \sum_{k} o_{u}^{L-2}(1 - o_{u}^{L-2}) \theta_{uk}^{L-3} \frac{\partial o_{k}^{L-3}}{\partial \theta_{ij}^l}
\]

At some point we will get down to the \(l\)-th layer and 'find' the \(\theta_{ij}^l\) parameter. What happens then? Let's assume \(l = L - 4\). Then there is only one node in layer \((L-3)\) that is influenced by \(\theta_{ij}^l\): the node \(o_{i}^{L-3}\). The rest are independent of \(\theta_{ij}^l\) and can be considered fixed, so we can exclude them from the sum, as their derivatives with respect to \(\theta_{ij}^l\) are \(0\).

\[
\begin{align}
\frac{\partial o_{u}^{L-2}}{\partial \theta_{ij}^l} &= o_{u}^{L-2}(1 - o_{u}^{L-2}) \theta_{ui}^{L-3} \frac{\partial o_{i}^{L-3}}{\partial \theta_{ij}^l} \\
& = (o_{u}^{L-2}(1 - o_{u}^{L-2}) \theta_{ui}^{L-3}) ( o_{i}^{L-3}(1 - o_{i}^{L-3}) o_{j}^{L-4} )
\end{align}
\]

One important note now: please take a look at the last element, \(o_{j}^{L-4}\), used in the multiplication. It is there because we assumed the parameter we are after is not a bias, so it is involved in computing the weighted sum of its neuron's inputs (the last derivative in the chain is therefore \(o_{j}^{L-4}\)). If it were a bias, \(o_{j}^{L-4}\) would be replaced by 1 and would simply vanish from the product.

We can now write:

\[
\begin{align}
\frac{\partial o_{k}^{L-1}}{\partial \theta_{ij}^l} &= \sum_{u} o_{k}^{L-1}(1 - o_{k}^{L-1}) \theta_{ku}^{L-2} (o_{u}^{L-2}(1 - o_{u}^{L-2}) \theta_{ui}^{L-3}) ( o_{i}^{L-3}(1 - o_{i}^{L-3}) o_{j}^{L-4} ) \\
&= o_{k}^{L-1}(1 - o_{k}^{L-1}) ( o_{i}^{L-3}(1 - o_{i}^{L-3}) o_{j}^{L-4} ) \sum_{u} \theta_{ku}^{L-2} (o_{u}^{L-2}(1 - o_{u}^{L-2}) \theta_{ui}^{L-3}) \\
\end{align}
\]

to finally get

\[
\begin{align}
\frac{\partial \mathrm{J_z}}{\partial \theta_{ij}^l} &= ( o_{i}^{L-3}(1 - o_{i}^{L-3}) o_{j}^{L-4} ) \sum_{k} ( o_{k}^L - y_k ) o_{k}^{L-1}(1 - o_{k}^{L-1}) \sum_{u} \theta_{ku}^{L-2} (o_{u}^{L-2}(1 - o_{u}^{L-2}) \theta_{ui}^{L-3}) \\
&= ( o_{i}^{L-3}(1 - o_{i}^{L-3}) o_{j}^{L-4} ) \sum_{u} (o_{u}^{L-2}(1 - o_{u}^{L-2}) \theta_{ui}^{L-3}) \sum_{k} \theta_{ku}^{L-2} ( o_{k}^L - y_k ) o_{k}^{L-1}(1 - o_{k}^{L-1})
\end{align}
\]

Because of the recursive nature of the chain rule applied to the network, our final partial derivative may now be expressed as:

\[
\begin{align}
\frac{\partial \mathrm{J_z}}{\partial \theta_{ij}^l} &= o_{j}^l \delta_{i}^{l+1} \quad \text{(for parameter other than bias)} \\
\frac{\partial \mathrm{J_z}}{\partial \theta_{ij}^l} &= \delta_{i}^{l+1} \qquad \text{(for bias parameter)}
\end{align}
\]

where \(\delta_{k}^{j}\) is recursively given by

\[
\begin{align}
& \delta_{k}^{j} = \frac{\partial o_k^j}{\partial z_k} \sum_i^{\mathrm{s}(j+1)} \theta_{ik}^{j} \delta_{i}^{j+1} \\
& \delta_{i}^{L-1} = \frac{\partial o_i^{L-1}}{\partial z_i} (o_{i}^L - y_i)
\end{align}
\]

where \( \frac{\partial o_k^j}{\partial z_k} \) is the derivative of \(o_k^j\) with respect to the weighted sum of its input, and \( \mathrm{s}(i) \) is the number of nodes in the \(i\)-th layer.

One important thing to observe here: please take a look at the (rather cumbersome, for which I'm sorry upfront) derivation of the final formula. All of the components of the form \( \theta_{ab}^{c} \) are the result of the weighted sum being involved. That means that, as long as we keep the weighted sum as part of the underlying neuron function and only replace the 'squashing' function, the final recursive formula does not change at all. We can therefore swap the sigmoid for other variants and the formula stays the same: assuming the weighted sum component is kept unchanged, the formula is (to a certain extent) independent of the squashing function.

Finally, we can conclude that computing the partial derivative of the error component \(\mathrm{J}_z \) with respect to a single parameter \(\theta_{ij}^l \) requires

  • storing all neuron outputs from the forward pass procedure
  • computing \( \frac{\partial o_k^j}{\partial z_k} \) for each computing node (except softmax)
  • and finally computing \(\delta_{i}^{l+1}\) for each node

Therefore, to compute all the partial derivatives and so update the parameters with gradient descent, we need to compute \(\delta\) for all nodes. That can be done with a single backward pass.
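In vector form, which is exactly what the implementation below computes layer by layer, the recursion reads:

\[
\begin{align}
& \delta^{j} = \frac{\partial o^{j}}{\partial z^{j}} \odot \left( (\theta_{w}^{j})^{T} \delta^{j+1} \right) \\
& \delta^{L-1} = \frac{\partial o^{L-1}}{\partial z^{L-1}} \odot (o^{L} - y)
\end{align}
\]

where \(\odot\) denotes elementwise multiplication and \(\theta_{w}^{j}\) is the parameter matrix of layer \(j+1\) with the bias column dropped.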

Implementation

First of all, our derivation involved only one error component, for a specific data point. To perform a full parameter update we need all of these components, or more precisely their mean value. We will use a certain convenience coming from linear algebra operations and remember the deltas of each layer in a separate matrix: for each layer there will be a matrix storing the delta of every node for every training example (this is where the 2D form comes from).

Let's get back to our implementation and prepare the data needed to compute the \(\delta\)s layer by layer. As stated, we need to store all neuron output values (for all training examples). We also need a function to compute \(\frac{\partial o_k^j}{\partial z_k}\) in each layer, and finally we need \( \delta \) for each node and training example.


...
type FullyConnectedComputingLayer <: Layer
  inputSize::Int64
  numberOfNeurons::Int64
  parameters::Array{Float64,2}
  transform::Function
  derivative::Function # derivative added here

  function FullyConnectedComputingLayer(inputSize, numberOfNeurons, transform::Function, derivative::Function)
    parameters = randn(numberOfNeurons, inputSize + 1)  * 0.1 # adding one param column for bias
    return new(inputSize, numberOfNeurons, parameters, transform, derivative)
  end
end
...
function addFullyConnectedSigmoidLayer(arch::NetworkArchitecture, numberOfNeurons::Int64)
 lastNetworkLayer = arch.layers[end]
 inputSize = lastNetworkLayer.numberOfNeurons
 #derivative added
 sigmoidLayer = FullyConnectedComputingLayer(inputSize, numberOfNeurons, sigmoidNeuronTransformFunction, x -> x .* (1 - x))
 push!(arch.layers, sigmoidLayer)
end
...

function addFullyConnectedLinearLayer(architecture::NetworkArchitecture, numberOfNeurons::Int64)
 lastNetworkLayer = architecture.layers[end]
 inputSize = lastNetworkLayer.numberOfNeurons
 linearLayer = FullyConnectedComputingLayer(inputSize, numberOfNeurons, linearNeuronTransformFunction, x -> 1)
 push!(architecture.layers, linearLayer)
end
...

First we update our computing layer definition to hold the information about how to compute the derivative with respect to the weighted sum of its input. In the case of the linear layer the derivative is simply 1. In the case of the sigmoid the derivative takes a convenient form
\[
\frac{\partial o_k^j}{\partial z_k} = o_k^j ( 1 - o_k^j)
\]
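A one-line numeric check of that identity (standalone, values rounded):


sigmoid(z) = 1 / (1 + exp(-z))
o = sigmoid(0.7)                             # ≈ 0.6682
(sigmoid(0.7 + 1e-6) - sigmoid(0.7)) / 1e-6  # ≈ 0.2217 (finite difference)
o * (1 - o)                                  # ≈ 0.2217, the same value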

We will now introduce an entity called the backpropagation batch learning unit (not sure if that is a good name). Its main purpose is to store the outputs and deltas produced during the backpropagation routine. Later on it will also serve as a container for the whole batch of data that we will use for training (it will also be helpful when we get to the stochastic gradient descent training method).


type BackPropagationBatchLearningUnit
  networkArchitecture::NetworkArchitecture
  dataBatch::Array{Float64,2}
  labels::AbstractSparseMatrix
  outputs::Array{Array{Float64,2}} # outputs remembered now
  deltas::Array{Array{Float64,2}} # deltas kept here

  function BackPropagationBatchLearningUnit(arch::NetworkArchitecture, dataBatch::Array{Float64,2}, labels::AbstractSparseMatrix)
     outputs = [ zeros(l.numberOfNeurons, size(dataBatch,2)) for l in arch.layers ]
     deltas = [ zeros(l.numberOfNeurons, size(dataBatch,2)) for l in arch.layers ]
     return new(arch, dataBatch, labels, outputs, deltas)
  end
end

The backpropagation routine can be seen as a continuous repetition of a forward pass and a backward pass through the network. The forward pass is needed to compute all output values for each node/neuron. The backward pass is needed to compute all the deltas (which rely on the outputs - hence the order). BackPropagationBatchLearningUnit will be our memory that stores these values.


function forwardPass!(learningUnit::BackPropagationBatchLearningUnit)
  currentResult = learningUnit.dataBatch
  for i in 1:length(learningUnit.networkArchitecture.layers)
     layer = learningUnit.networkArchitecture.layers[i]
     currentResult = layer.transform(layer.parameters, currentResult)
     learningUnit.outputs[i]  = currentResult
  end
end


function backwardPass!(learningUnit::BackPropagationBatchLearningUnit)

  layer = learningUnit.networkArchitecture.layers[end-1]
  learningUnit.deltas[end-1]  = layer.derivative(learningUnit.outputs[end-1]) .*  (learningUnit.outputs[end] - learningUnit.labels)

  for i in 2:(length(learningUnit.networkArchitecture.layers) - 1)
      higherLayer = learningUnit.networkArchitecture.layers[end - i + 1]
      currentLayer = learningUnit.networkArchitecture.layers[end - i]
      learningUnit.deltas[end-i] = currentLayer.derivative(learningUnit.outputs[end-i]) .* (transpose(higherLayer.parameters[:,(1:end-1)]) * learningUnit.deltas[end - i + 1])
  end
end

The forward pass does not differ much from the infer function; the only difference is that it stores all the outputs in the learning unit. Please note the exclamation mark at the end of the name (a Julia convention indicating a function that mutates its input).

The backward pass implements the computation of the deltas. There is a simplifying assumption: we assume the last layer is a softmax (that is why we don't touch the last layer, as it is parameterless), but let's live with it. The backward pass computes the deltas for all examples in the batch and all nodes.

Now we need a procedure that will update the network parameters. Just to recall, the parameter update is realized with the gradient descent method, so it is given as below:
\[
\theta_t = \theta_{t-1} - \alpha \nabla J(\theta_{t-1})
\]

\( \alpha \) is called the learning rate and it expresses how rapidly we want the parameters to change. It is a hyperparameter: we need to know a proper value of \( \alpha \) to learn a good parametrization. Choosing \(\alpha\) is a problem in itself and many techniques have been proposed to solve it (recently people tend to use adaptive gradient descent techniques like AdaGrad, AdaDelta or Adam).
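The hypothetical toyGradientDescent helper from the gradient descent section illustrates why \(\alpha\) matters: for \(f(c) = (c-3)^2\) each update scales the distance to the minimum by \(1 - 2\alpha\), so a too large rate overshoots and diverges.


toyGradientDescent(0.1, 100) # ≈ 3.0  : converges
toyGradientDescent(1.1, 10)  # ≈ 46.3 : each step overshoots farther, diverging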

function updateParameters!(unit::BackPropagationBatchLearningUnit, learningRate)
  forwardPass!(unit)
  backwardPass!(unit)
  derivativeW= (unit.deltas[1] * transpose(unit.dataBatch)) / size(unit.dataBatch,2);
  unit.networkArchitecture.layers[1].parameters[:,1:(end-1)] = unit.networkArchitecture.layers[1].parameters[:,1:(end-1)] - learningRate * derivativeW;
  derivativeB = mean(unit.deltas[1],2);
  unit.networkArchitecture.layers[1].parameters[:,end] =  unit.networkArchitecture.layers[1].parameters[:,end] - learningRate * derivativeB;
  for i in 2:(length(unit.networkArchitecture.layers) - 1)
    derivativeW = (unit.deltas[i] * transpose(unit.outputs[i-1])) / size(unit.dataBatch,2);
    unit.networkArchitecture.layers[i].parameters[:,1:(end-1)] = unit.networkArchitecture.layers[i].parameters[:,1:(end-1)] - learningRate * derivativeW;
    derivativeB = mean(unit.deltas[i],2);
    unit.networkArchitecture.layers[i].parameters[:,end] =  unit.networkArchitecture.layers[i].parameters[:,end] - learningRate * derivativeB;
  end
end

A quick explanation: first we perform the forward and backward passes so the learning unit stores the output values and deltas needed for the derivatives. Then we compute the derivatives of the parameters \((W,\beta)\). Each layer is treated separately: we first update the first layer (it is special in that its input comes from outside the layers), then we iterate over the remaining layers and update their parameters. \(\beta\) is treated differently from the other parameters because its formula differs (that could be fixed to make the code more elegant, but let's leave it this way for now). Multiplying the deltas by the inputs only sums the derivatives over the examples; that is why we divide by the size of the training sample. The same holds for the bias derivatives being averaged over training examples.
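Before moving on, one handy way to convince yourself an implementation like this is correct is a gradient check: compare one analytically computed derivative against a finite difference of the error function. Below is a sketch of such a check built on the functions above (checkOneGradient is a hypothetical helper, not part of the post's codebase; i, j must point at a non-bias parameter, i.e. j must not be the last column):


function checkOneGradient(arch::NetworkArchitecture, data, labels, layerIdx, i, j)
  unit = BackPropagationBatchLearningUnit(arch, data, labels)
  forwardPass!(unit)
  backwardPass!(unit)
  inputs = layerIdx == 1 ? data : unit.outputs[layerIdx - 1]
  analytic = (unit.deltas[layerIdx] * transpose(inputs))[i, j] / size(data, 2)
  h = 1e-6
  arch.layers[layerIdx].parameters[i, j] += h
  errPlus = crossEntropyError(arch, data, labels)
  arch.layers[layerIdx].parameters[i, j] -= 2 * h
  errMinus = crossEntropyError(arch, data, labels)
  arch.layers[layerIdx].parameters[i, j] += h  # restore the original value
  numeric = (errPlus - errMinus) / (2 * h)
  return (analytic, numeric)                   # the two should agree closely
end

If the two returned numbers disagree by more than a tiny relative error, something in the forward or backward pass is off.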

Into the wild

Not really into the wild, because the problem we will try to solve is known and rather easy (the images are nicely aligned). Anyway, we are now ready to apply our solution to a real dataset and update neural network parameters: we can now train our network to predict class labels. As promised, we will try to classify MNIST digits. We need a function to load the training and test data; the test data will let us see if our network "generalizes" well. Let's use the MNIST package and write such a function:


using MNIST
...
function loadMnistData()

  a,b = MNIST.traindata()
  t,l = MNIST.testdata()
  t = (t .- mean(t,2)) / std(t .- mean(t,2))
  a = (a .- mean(a,2)) / std(a .- mean(a,2))
  b = sparse(convert(Array{Int64}, b + 1),1:60000, [ 1 for i in 1:60000])
  l = sparse(convert(Array{Int64}, l + 1),1:10000, [ 1 for i in 1:10000])
  return(a,b,t,l)
end
...

Now we need to build a network and start learning! We don't want to repeat the same architecture-building code over and over, so we will use helpers:


# helper to build SoftMax architecture
function buildNetworkArchitectureSoftMax(sizes)
  firstLayer = FullyConnectedComputingLayer(sizes[1], sizes[2], linearNeuronTransformFunction, x -> 1);
  architecture = NetworkArchitecture(firstLayer);
  addSoftMaxLayer(architecture)
  return(architecture)
end

# helper to build an architecture with hidden sigmoid layers 
function buildNetworkArchitectureWithOneHiddenSigmoids(sizes)
  firstLayer = FullyConnectedComputingLayer(sizes[1], sizes[2], sigmoidNeuronTransformFunction, x -> x .* (1 - x));
  architecture = NetworkArchitecture(firstLayer);
  for i in 3:(length(sizes)-1)
    addFullyConnectedSigmoidLayer(architecture, sizes[i]);
  end
  addFullyConnectedLinearLayer(architecture, sizes[end]);
  addSoftMaxLayer(architecture)
  return(architecture)
end

Now, to build a network architecture with 784-dimensional input, followed by one linear layer of 10 neurons and a softmax layer, we just write

architecture = buildNetworkArchitectureSoftMax([784,10])

Because we have not improved our learning algorithm in any way yet, let's start with that simple network as the model. Let's load the dataset and start learning.


using Gadfly
trainingData, trainingLabels, testData, testLabels = loadMnistData()
architecture = buildNetworkArchitectureSoftMax([784,10])
crossEntropies = Float64[]
for i = 1:100
  learningUnit = BackPropagationBatchLearningUnit(architecture, trainingData, trainingLabels);
  updateParameters!(learningUnit, 0.3)
  push!(crossEntropies, crossEntropyError(architecture, trainingData, trainingLabels))
end
plot(x = 1:length(crossEntropies), y = crossEntropies, Geom.line, Guide.xlabel("iterations"), Guide.ylabel("error"))

[figure: cross-entropy error vs. iterations for the linear+softmax network]

We can see how the error function value changes after each iteration: it decreases fast at the beginning, then training becomes harder and the improvement less significant.
Let's now find out what portion of unseen data our classifier is able to label correctly.


inferedOutputs = infer(architecture, testData)
mean(mapslices(x -> indmax(x), inferedOutputs ,1)[:]  .==  mapslices(x -> indmax(x), full(testLabels),1)[:])
# 0.9052

In my case the classifier correctly labels 90.5% of all examples. That is actually a poor result: MNIST has been classified with more than 99% accuracy (years ago). Don't worry, we will get somewhere near there too!

For now let's keep playing with what we have. The next step is to increase the number of iterations and add expressiveness to our network by adding a hidden layer.


architecture = buildNetworkArchitectureWithOneHiddenSigmoids([784, 50, 10]) # 50 neurons in a hidden layer now
crossEntropies = Float64[]
for i = 1:500 
  learningUnit = BackPropagationBatchLearningUnit(architecture, trainingData, trainingLabels);
  updateParameters!(learningUnit, 0.3) 
  push!(crossEntropies, crossEntropyError(architecture, trainingData, trainingLabels))
end
plot(x = 1:length(crossEntropies), y = crossEntropies, Geom.line, Guide.xlabel("iterations"), Guide.ylabel("error"))
inferedOutputs = infer(architecture, testData)
mean(mapslices(x -> indmax(x), inferedOutputs ,1)[:]  .==  mapslices(x -> indmax(x), full(testLabels),1)[:])
# 0.928

[figure: cross-entropy error vs. iterations, one hidden layer, 500 iterations]

It is better now, but it still looks like the error may decrease further. What about 1000 iterations?


architecture = buildNetworkArchitectureWithOneHiddenSigmoids([784,50, 10]) # 50 neurons in a hidden layer now 
crossEntropies = Float64[]
for i = 1:1000
  learningUnit = BackPropagationBatchLearningUnit(architecture, trainingData, trainingLabels);
  updateParameters!(learningUnit, 0.3) 
  push!(crossEntropies, crossEntropyError(architecture, trainingData, trainingLabels))
end
plot(x = 1:length(crossEntropies), y = crossEntropies, Geom.line, Guide.xlabel("iterations"), Guide.ylabel("error"))
inferedOutputs = infer(architecture, testData)
mean(mapslices(x -> indmax(x), inferedOutputs ,1)[:]  .==  mapslices(x -> indmax(x), full(testLabels),1)[:])
# 0.9401

[figure: cross-entropy error vs. iterations, one hidden layer, 1000 iterations]

After 1000 iterations we get better performance. The problem, which you don't see or experience here, is that training is very slow: each iteration involves the whole training set. Therefore it does not make much sense to keep increasing the number of iterations, especially when resources are limited; searching through the hyperparameter space would be slow too. We need an alternative that speeds up our training process. One such alternative is stochastic gradient descent, explained (roughly) below.

Improving learning with Stochastic Gradient Descent

What would happen if we "lost" one of our training examples? Let's imagine our younger brother deleted one row from the file containing the training data. If you think about it, it does not change much: we did not have the "complete" dataset in the first place. Recall that our error function is defined as if the input and labels were fixed, so deleting one row of data changes the error function. The thing is that even though we miss one data example, we still believe the new error function reflects the same phenomenon. What if we deleted two rows? What if we deleted 80% of the data? Training a neural network with a limited number of observations at a time is the main concept behind stochastic gradient descent.

It turns out that keeping only a very limited amount of data at a time is enough for good network training. The original stochastic gradient descent computes derivatives based on one example only; some variants instead involve small mini-batches. The algorithm shuffles the dataset, iterates through it composing mini-batches, and updates the parameters based on those mini-batches. The only thing that changes each time is the error function (it depends on the input training data). This is where the "stochastic" part of the algorithm's name comes from: each time the error function is different, and in a way random too.
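To make the batching concrete, here is a minimal mini-batch index generator (a sketch; the training loop below inlines an equivalent modulo computation):


function minibatchIndices(datasetSize, batchSize, iteration)
  start = ((iteration - 1) * batchSize) % datasetSize
  return [ (start + k) % datasetSize + 1 for k in 0:(batchSize - 1) ]
end

minibatchIndices(60000, 20, 1) # [1, 2, ..., 20]
minibatchIndices(60000, 20, 2) # [21, 22, ..., 40]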

We won't spend much more time on stochastic gradient descent; fortunately we can apply it directly. We only need a different learning unit at each iteration, each time fed with a different batch.


architecture = buildNetworkArchitectureWithOneHiddenSigmoids([784,50, 10]) # 50 neurons in a hidden layer now 
crossEntropies = Float64[]
batchSize = 20
for i = 1:30000 
   minibatch = collect((batchSize*(i-1)):(batchSize*i - 1)) % size(trainingLabels,2) + 1 # take the next 20 elements
   learningUnit = BackPropagationBatchLearningUnit(architecture, trainingData[:,minibatch ], trainingLabels[:,minibatch]);   
   updateParameters!(learningUnit, 0.3)  
   if i % 100 == 0  # computing the error over the full training set is costly, so store it every 100 iterations
     push!(crossEntropies, crossEntropyError(architecture, trainingData, trainingLabels))   
   end                 
end   
plot(x = 1:length(crossEntropies), y = crossEntropies, Geom.line, Guide.xlabel("iterations"), Guide.ylabel("error"))
inferedOutputs = infer(architecture, testData)
mean(mapslices(x -> indmax(x), inferedOutputs ,1)[:]  .==  mapslices(x -> indmax(x), full(testLabels),1)[:])
# 0.9655 

[figure: cross-entropy error during SGD training, sampled every 100 iterations]

The error now decreases much faster. After 30000 iterations (using mini-batches of 20 examples) it is much lower than the error we got after 1000 iterations using the whole dataset. Please note the difference: the iterations are not the same thing. With SGD we perform multiplications that require far fewer resources (the output/delta matrices are 3000 times smaller). We also get much better performance: our error rate is around 3.5% now. It is still far from today's best results, but at least we got a big improvement. Another thing to observe is the oscillating nature of the error: because we are now in a stochastic setting, the error may increase after a parameter update.

A practical advantage of stochastic gradient descent is that it allows for online learning: we don't need to keep the whole dataset in memory, only a single mini-batch. Therefore, when learning on a very big dataset, we can fetch each mini-batch when it is needed instead of storing everything in memory at once.

Stochastic gradient descent is the tip of the iceberg; there is much more to explore. People use many different techniques to train neural networks, with backpropagation very often being the main building block. If you want to explore the topic yourself, it might be a good idea to look at weight decay and momentum; these two are practically standard today. People also change the network structure during training to achieve better generalization - the technique is called "dropout". Some more advanced topics include adaptive training such as AdaGrad, AdaDelta, Adam or RMSProp. Another thing to look at is pre-training: basically the application of unsupervised learning to initialize neural network weights.

What's next ?

That was fun. I highly recommend implementing backpropagation from scratch; it is a really good way of learning it. In the next posts we will look at some more advanced training techniques. In this post we only introduced a limited number of neuron types; others exist, though, and we will try to cover them too.

---