## Backpropagation from scratch in Julia (part I)

get the code from here

Neural networks have been going through a renaissance recently. After exploding computational power availability (often GPU based), recent improvements in neural networks initialization (pre-training with RBMs or autoencoders), overcoming vanishing gradient problem in recurrent networks (LSTM) and advances in optimization techniques (AdaGrad, AdaDelta, Adam and others) neural networks are in the centre of attention, again. And the attention is tremendous in fact. Neural networks have won most of machine learning challenges during the last couple of years. Their bio-inspired nature and so their human-like roots caused neural networks popularity growth among regular people, too. Friends of mine with no scientific background whatsoever happen to poke me asking if I heard of “singularity” or neural networks by any chance?

In other words, neural networks can no longer be ignored, and the goal of this post is to catch up a bit and learn basics about them.

Let’s start.

In this series of posts we will describe step by step how to train a feedforward neural network with backpropagation algorithm. However, if this is your first attempt to understand backpropagation I think it is good idea to take a look at ELI5/nomath tutorial by Andrej Karpathy, the next step might be to check out this material by Alex Graves (Neural Networks chapter). Another important thing, if you are looking for near-production ready implementation please take a look at Mocha.jl or MxNet.jl if it has to be Julia. Otherwise one might check out recently released TensorFlow, Torch7 or keras

Here, we will focus on backpropagation algorithm in the context of multi-label classification problem. It means we will be given a dataset that is labeled with more than two classes and our goal will be to classify each dataset entry correctly. Sometimes, to provide an example, we will refer to the MNIST dataset – dataset of images of handwritten digits. Finally we will run our experiments on MNIST.

One digression: we will use a term “neural network”, but in fact we will talk about feedforward neural network – specific type of NN suitable for multi-label classification problem.

#### What is a neural network?

Neural network is a class of mathematical functions. The functions underneath are parametrized – meaning they behave in a slightly different manner depending on parameter values, but still they share same properties and generic behaviour. When these parameters are fixed and known neural network turns out to be a deterministic composition of functions. In the essence such a network with fixed parameters is a bunch of numbers interpreted in a specific way.

Please take a look at the pseudo code (but don’t commit to it too much – it is just to show certain duality of the term neural network itself)


NeuralNetwork = function(params)
instance = x -> f(x,params) #  Instance of a neural network is the set of function parameters [params] + some constant behaviour [f].
return instance
end



In this abstract (and practically useless) code NeuralNetwork is a higher order function that returns an instance of neural network given some parameters. Instance of neural network is a function with fixed parameters that takes input from nonfunctional domain and return output from nonfunctional domain. Such nonfunctional domain based input could be an image of a digit from MNIST dataset and nonfunctional output would be a corresponding label. Most of the time we will be dealing with fixed instance of neural network, but it might be beneficial to realize that neural network can be interpreted as abstract entity and therefore can be analyzed in a certain parameter space. This space is what we will try to search at some point of this series of posts. And in fact mysterious backpropagation is just a way of searching such parameter space.

From more formal perspective neural network is a mathematical model. It models certain tiny aspects of reality. How well it models that part of reality depends on its parameters. From that perspective, it does not differ much from 1D gaussian distribution (with parameters $$(\sigma\, \mu)$$) that may be used to model simple distribution of age across given population.

Before we go deeper we need to understand what class of mathematical function neural network expresses. Picture below represents very first attempt to explain what math hides behind the neural network:

The picture is far from being self explanatory so let’s try to explain it a bit going from left to right. This is the right direction as the network final output is computed that way.

On the very left side of the picture we can see component called “input vector”. Input vector is a raw data example entering the neural network – it is what neural network initially see – its input. In case of the aforementioned problem of MNIST digits classification let’s say it is a vector of length 784 (28×28) with each dimension describing single pixel intensity value.

The next layer of nodes consists of green nodes that represent input vector dimensions. Input vector in our case will be simply put onto the green nodes dimension by dimension to be crunched later by the next layers. In case of MNIST dataset we have 784 green nodes – each for a sigle pixel intensity value. Next layer of blue nodes represents computing units – these units are called neurons. Neurons process their inputs and produce single output that is transfered to the neurons in the next layer. Arrows in the picture above represent information flow. They should be read as directed arrows going from left to right. If there is an arrow from node A to node B – it means that whatever comes out from A is an input for B.

#### Neurons

Neuron is nothing but a function that takes multidimensional input and returns single real value. That multidimensional input is also an output from previous network layer – in our toy case from the picture above each neuron from second layer will collect inputs from all green input nodes and produce certain output that is later transferred as input to the next layer of neurons.

Picture above shows how one neuron gathers input to be processed. When all inputs are ready neuron executes its underlying function and produces output that is the input to the next neurons in the next layer. We will refer to the aforementioned function as to “neuron activation function” or simply “activation function“.

After outputs in the middle layer are all computed they become input to the next layer consisting of computing nodes again. Result of these computations are finally components of our desired output vector.

One puzzle that is still not solved is a neuron underlying activation function. Let’s take a closer look at it. First of all there is no single correct activation function. We will therefore cover the most popular one called sigmoid activation. Sigmoid activation is parametrized by pair $$\omega, \beta$$ and is given by

$f(x,(\omega, \beta)) = \sigma(w \cdot x + \beta)$

where $$x$$ is functions input (from previous layer possibly), $$\omega$$ is a vector of parameters of the same size as input and dot operation is a dot product (equivalently we can say that $$\omega \cdot x$$ is linear combination of input, with coefficient given by $$\omega$$). $$\beta$$ is a real number called bias (free parameter) and $$\sigma$$ is a sigmoid function – special case of logistic function known from logistic regression. Let’s try to interpret what happens here in details and focus on the $$\sigma$$ input first

$\omega \cdot x + \beta = \omega_{1} x_1 + \omega_2 x_2 + \dots + \omega_n x_n + \beta$

Result of such combination of input vector dimensions is one real (unlimited in range) number. In case of input coming from raw MNIST example we would deal with 784 sum components of form
$\omega_{k} x_{k}$

where $$k$$ represents one specific pixel from 28×28 grid. It means that for each image pixel neuron has corresponding weight to determine how “important” the pixel value is (or how it stimulates given neuron). The whole combination then expresses sum of improtance of all pixels. This number is then “squashed” to the range $$(0,1)$$ by sigmoid function and serves as input to the next neural network layer. Please note here that if input for given neuron strongly stimulates it (this happens when neuron detects desired pattern), the output value will be high and so the information of the pattern will be passed further along the network.

Parameters needed for each neuron consists of vector $$\omega$$ of size of the neuron input and single value $$\beta$$ called bias. It is in fact input size plus one.

It might be a good idea to think of recursive nature of function modeled with our neural network. Each single output value is a composition of activation functions. We will get there soon.

#### Simple neural network in Julia language

Time to write some code. We will only model blue nodes now – computing nodes – those with sigmoid activation functions. For each node we will define an input size – so it will “know” what to expect. We will not model “green” input nodes themselves as they will be implicitly present as inputs for first layer of computing nodes. Our goal then is to implement sigmoid function applied for a weighted sum of a vector of observations and to store coefficients used to perform that weighted sum – these coefficients are called parameters. Each node will be then parametrized entity with underlying “behaviour” definition (how to crunch its input).

We could start with something like type definition of a neuron. Such neuron would have a “parameters” array, input size and underlying transform function defined. That would work, but we want to take advantage of Julia being a language for scientific computing so we will be rather going towards representing nodes as whole layers of nodes. It is because computations within single layer (especially weighted sums!) can be easily enclosed in matrix vector multiplications and these operations are highly optimized in Julia. It’s going to be much faster when we think of a layer as our atomic component.

Lets define sigmoid nodes behaviour first:


function appendColumnOfOnes(a::Array{Float64,2})
vcat(a,ones(1,size(a ,2)))
end

function sigmoidNeuronTransformFunction(params, input)
return 1.0 ./ (1.0 .+ exp(-params * appendColumnOfOnes(input)))
end



First we define a helper function that adds a row of ones with the purpose to not treat a bias parameter in any special way, sigmoidNeuronTransformFunction is responsible for input transformation inside sigmoid neuron. This transformation takes two arguments “input” and “params” – params is a vector of weighted sum coefficients and input is an input vector. In fact its arguments could be matrices too – representing many input vectors entering the whole layer of neurons. That is very convenient way of treating neural network input/output as it is faster and (for some) more elegant.

Instead of dealing with bias separately, constant value(s) of one(s) is added to the input vector(s) and simple matrix vector (matrix) multiplication is computed for each layer. Again, this is mainly for speed we get by using basic linear algebra routines.

Quick definition of abstract layer


abstract Layer



We want our types to be abstract as we expect different (than those consisting of sigmoid neurons) types of layers may exist. For now we will only define “computing layer” – please note that this is generic w.r.t underlying crunching function.


type FullyConnectedComputingLayer <: Layer
inputSize::Int64
numberOfNeurons::Int64
parameters::Array{Float64,2}
transform::Function # we don't define the function directly to get flexibility later

function FullyConnectedComputingLayer(inputSize, numberOfNeurons, transform::Function)
parameters = randn(numberOfNeurons, inputSize + 1) # adding one param column for bias
return new(inputSize, numberOfNeurons, parameters, transform)
end
end



We defined a type that represents fully connected computing layer. It constist of set of nodes (numberOfNeurons) that are connected to all of the nodes from its input layer (inputSize) and it is meant to compute the output and to pass along the result. Parameters are required for later transformation while transform function is the transformation definition. Please note that parameters is a 2D structure - it is a matrix that holds each single neuron parameters set as one column. This is where we go from neuron to layer perspective.


type NetworkArchitecture
layers::Array{Layer}
function NetworkArchitecture(firstLayer::Layer)
return new([firstLayer])
end
end



Network architecture defines layer composition. It is important to note that network architecture has only one constructor that always requires initial layer - convention we use to initialize the architecture.


lastNetworkLayer = architecture.layers[end]
inputSize = lastNetworkLayer.numberOfNeurons
sigmoidLayer = FullyConnectedComputingLayer(inputSize, numberOfNeurons, sigmoidNeuronTransformFunction)
push!(architecture.layers, sigmoidLayer)
end



Given network architecture we will need to add consecutive layers of neurons. This function adds fully connected sigmoid to the network architecture - this is by far the only neuron type we know.

Let's now create out first neural network architecture to see what parameters it holds and possibly explain what they mean.


sigmoidLayer = FullyConnectedComputingLayer(784, 100, sigmoidNeuronTransformFunction)
architecture = NetworkArchitecture(sigmoidLayer)
size(architecture.layers[end].parameters)
# (100,785)



After adding single layer of 100 neurons with simgoid activation function we end up with parameters matrix for that layer of size (100,785). We deal with 100 vectors of size 785 organized in a matrix. Each vector represents a single neuron - we requested 100 such neurons. Number 785 is the number of parameters for single neuron. Why 785? As you recall each neuron is parametrized with a pair $$\omega, \beta$$ to be able to compute
$f(x ,(\omega, \beta)) = \sigma(\omega \cdot x + \beta)$

so indeed - in this case we need 784 weights (one for each pixel) and one free parameter - bias. After adding one more layer to the existing network we get the following parameter matrix size


size(architecture.layers[end].parameters)
# (10,101)



Again, we requested 10 neurons with sigmoid activation function, but this layer will not "operate on" initial raw input. Input for our new layer is the output from the previous layer. That is why we have (10,101) size of the parameter matrix. It is because each neuron will only "see" output from previous layer - 100 values - as previously we requested 100 neurons. And again, following the same reasoning we add one extra paramater or each neuron as a bias.

Let's now make our network compute stuff - we need a function for that:


function infer(architecture::NetworkArchitecture, input)
currentResult = input
for i in 1:length(architecture.layers)
layer = architecture.layers[i]
currentResult = layer.transform(layer.parameters, currentResult)
end
return currentResult
end



Here, we iterate over layers (note that input is attached at the beginning of the network) and compute each layer output until there is no layer left - meaning we are at the end of the network and our result is final network output. Our computing layers are now capable of computing its all neuron outputs at once.
Lets try it out.


firstLayer = FullyConnectedComputingLayer(784, 100, sigmoidNeuronTransformFunction)
architecture = NetworkArchitecture(firstLayer)
infer(architecture , randn(784,1)) # we try a single vector now
# 5x1 Array{Float64,2}:
# 0.217054
# 0.681068
# 0.149886
# 0.772127
# 0.458145



As you can see output vectors consists of 5 components from $$(0,1)$$.

Finally, let's directly write what is the final resursive approximate formula for our neural network function:

\begin{align} o_{i}(X) &= o_{i}(o_{i-1}(X)) \\ o_{0}(X) &= X \end{align}

where $$o_i$$ is the output from $$i$$=th layer of the network and $$X$$ is its initial input. As you can see neural network is recursive in nature. This will be important when deriving backpropagation algorithm in the next post.

#### Conclusions/What next?

Obviously what we have computed now does not make much sense yet. Our neural network can only produce some random output for now, but please be aware right now there probably exists one specific parametrization that - given a dataset - could produce "correct" labelling. This "one" parametrization is something we are after. Looking for such a "best" parametrization is called "training". In the next post we will focus on one specific (and the most popular in case of NN) training technique called backpropagation - backpropagation will turn out to be certain realization of gradient descent optimization. Before we introduce backpropagation though we will explain what such a "best parametrization" means. We will use our training dataset and propose a error function that we will try to minimize to achieve the best parametrization.
go to the next post

---