
Chess position evaluation with convolutional neural network in Julia

In this post we will tackle the problem of chess position evaluation using a convolutional neural network (CNN) – a type of neural network designed to deal with spatial data. We will first explain why we need CNNs, and then present the two fundamental CNN layer types. With some knowledge of what happens inside the black box, we will apply a CNN to the binary classification problem of chess position evaluation using the Julia deep learning library Mocha.jl.

Introduction – data representation

One of the challenges that frequently occurs in machine learning is proper representation of the input data. Ideally, data should be represented in a way that carries as much information as possible while remaining digestible for ML algorithms. Digestibility means fitting into existing mathematical frameworks where known abstract tools can be applied.

A common and convenient representation of a single observation is a vector in \(\mathbb{R}^n\). Assuming such a representation, ML problems may be seen from many different angles – with the benefit of using well-known abstractions and interpretations. One very common perspective is the algebraic one – having the input data as a matrix (one vector per column), its eigendecomposition or various factorizations may be considered, both of which yield important results in the context of machine learning. A set of vectors in \(\mathbb{R}^n\) forms a point cloud – when the geometry of such a cloud is considered, manifold learning methods emerge. A linear model with least squares error has a closed-form solution in the algebraic framework. In all of these cases, representing the input data as vectors makes a broad range of tools available to handle the problem effectively.
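
To make that last point concrete, here is a minimal sketch (with made-up toy data) of the closed-form least-squares solution in Julia:

```julia
# Minimal sketch: closed-form least-squares fit of a linear model.
# X holds one observation per row, y the corresponding targets (toy data).
X = [1.0 2.0;
     1.0 3.0;
     1.0 5.0;
     1.0 7.0]            # first column of ones models the intercept
y = [3.9, 6.1, 10.2, 13.8]

# Normal equations w = (XᵀX)⁻¹Xᵀy; Julia's backslash solves this
# least-squares problem directly (and more stably than forming the inverse).
w = X \ y
println(w)               # ≈ [0.0, 2.0] for this nearly linear toy data
```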

For some domains, though, it is not obvious how to represent the input as vectors while preserving the original information contained in the data. An example of such a domain is text. A text document is rich in various types of information – there are the semantics and syntax of the text, or even the personal style of the writer. It is not clear how to represent this unnamed information contained in text. People tend to simplify it and use the Bag of Words (BoW) approach to represent text, which completely ignores the ordering of words in a document – it treats the document as a set.
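
To make the simplification concrete, here is a minimal bag-of-words sketch in Julia, assuming plain whitespace tokenization and made-up documents:

```julia
# Minimal bag-of-words sketch: each document becomes a vector of word counts,
# and any information about word order is thrown away.
docs = ["the cat sat on the mat", "the dog sat"]

# Build the vocabulary from all documents.
vocab = unique(vcat([split(d) for d in docs]...))

# One count vector per document (one column per document).
bow = [count(==(w), split(d)) for w in vocab, d in docs]
```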

Another domain that suffers from a similar problem is images. The spatiality of the data is lost when images are represented as vectors whose dimensionality equals the total number of pixels. When an image is represented that way, the algorithm that later consumes the input vectors is usually not aware that the original structure of the image is a set of 2-dimensional grids (one matrix for each channel).

So far our neural network has not been aware of the two-dimensional nature of the input data (MNIST). It could of course discover it by itself, learning relations between neighboring pixels, but the fact is that, so far, it has had no clue.
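
A minimal sketch of that flattening, assuming a made-up 28×28 grayscale image like an MNIST digit:

```julia
# A 28×28 grayscale image, e.g. one MNIST digit (random values stand in for pixels).
img = rand(28, 28)

# Flattened representation: a 784-dimensional vector.
# The values are all still there, but the consumer of x no longer "knows"
# which entries were vertical or horizontal neighbours in the original grid.
x = vec(img)

size(img)   # (28, 28)
size(x)     # (784,)
```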


---

Optimization techniques comparison in Julia: SGD, Momentum, Adagrad, Adadelta, Adam

In today’s post we will compare five popular optimization techniques: SGD, SGD with momentum, Adagrad, Adadelta and Adam – methods for finding a local optimum (global when dealing with a convex problem) of certain differentiable functions. In the experiments conducted later in this post, these functions will all be error functions of feedforward neural networks of various architectures for the problem of multi-class classification of MNIST (a dataset of handwritten digits). In our considerations we will refer to what we know from previous posts. We will also extend the existing code.

Stochastic gradient descent and momentum optimization techniques

Let’s recall the stochastic gradient descent optimization technique that was presented in one of the recent posts. Last time we pointed out its speed as the main advantage over batch gradient descent (where the full training set is used). There is one more advantage though: stochastic gradient descent tends to escape from local minima. This is because the error function changes from mini-batch to mini-batch, pushing the solution to be continuously updated (a local minimum of the error function given by one mini-batch may not be present for another mini-batch, implying a non-zero gradient).

Traversing the surfaces of differentiable error functions efficiently is a big research area today. Some of the recently popular techniques (especially in application to NNs) take the gradient history into account, so that each “move” on the error function surface relies a bit on the previous ones. A childish intuition for that phenomenon might involve a snowball rolling down a mountain. Snow keeps attaching to it, increasing its mass and making it resistant to getting stuck in small holes on the way down (because of both its speed and its mass). Such a snowball does not teleport from one point to another but rather rolls down as a continuous process. This infantile snowy intuition may be applied to the gradient descent method too.
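
As a minimal sketch of the difference (the learning rate, momentum coefficient and function names are illustrative, not taken from the posts’ code):

```julia
# Plain SGD step: move against the gradient of the mini-batch error.
sgd_step(w, grad; lr = 0.1) = w .- lr .* grad

# Momentum step: the velocity accumulates past gradients (the "snowball"),
# so a single shallow hole in the error surface is not enough to stop it.
function momentum_step(w, v, grad; lr = 0.1, mu = 0.9)
    v = mu .* v .- lr .* grad   # keep a fraction of the previous move
    return w .+ v, v
end
```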


---

Neural Networks in Julia – Hyperbolic tangent and ReLU neurons

get the code from here

Our goal for this post is to introduce and implement new types of neural network nodes using the Julia language. These nodes are called ‘new’ because this post builds loosely on the existing code.

So far we have introduced sigmoid and linear layers, and today we will describe another two types of neurons. First we will look at the hyperbolic tangent, which will turn out to be similar (in shape at least) to the sigmoid. Then we will focus on ReLU (rectified linear unit), which on the other hand is slightly different, as it represents a function that is not differentiable everywhere (it has a kink at zero). Both have strong practical implications (with ReLU considered more important recently – especially in the context of networks with many hidden layers).

Most importantly though, adding different types of neurons to a neural network changes the function it represents and therefore its expressiveness – let’s emphasize this as the main reason they are added.

Hyperbolic tangent layer

From the biological perspective, the purpose of the sigmoid activation function as a single node’s ‘crunching function’ is to model passing an electrical signal from one neuron to another in the brain. The strength of that signal is expressed by a number from \((0,1)\) and it depends on the signals from the input neurons connected to the one under consideration. The hyperbolic tangent is yet another way of modelling it.

Let’s first take a look at the form of hyperbolic tangent:

\[
f(x) = \frac{\mathrm{e}^x - \mathrm{e}^{-x}}{\mathrm{e}^x + \mathrm{e}^{-x}}
\]
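
The formula above is exactly Julia’s built-in tanh. A minimal sketch of this activation (and of the ReLU mentioned earlier) as element-wise functions, together with the derivatives that backpropagation will need – a subgradient is used for ReLU at zero:

```julia
# Element-wise hyperbolic tangent activation and its derivative.
tanh_act(x)   = tanh.(x)
tanh_deriv(x) = 1 .- tanh.(x).^2     # d/dx tanh(x) = 1 - tanh(x)^2

# ReLU for comparison: zero for negative inputs, identity otherwise.
relu(x)       = max.(0, x)
relu_deriv(x) = float.(x .> 0)       # a subgradient; 0 is used at x = 0
```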

---

Backpropagation from scratch in Julia (part II: derivation and implementation)

get the code from here

This is the second post of the series describing the backpropagation algorithm applied to feedforward neural network training. In the last post we described what a neural network is, and we concluded that it is a parametrized mathematical function. We implemented neural network initialization (meaning creating a proper entity representing the network – not weight initialization) and an inference routine, but we never made any connection to the data itself. In this post we will make such a connection, and we will express the meaning of parametrization “goodness” in terms of the training data and the network output.
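
As a hedged sketch of what such a “goodness” measure might look like – a simple quadratic error averaged over the training set; `predict` stands in for the inference routine from part I, and the post’s actual cost may differ:

```julia
# One way to express how good a parametrization is: mean squared error
# between network outputs and the known training targets.
# `predict(net, x)` is a placeholder for the inference routine from part I.
function quadratic_cost(net, xs, ys)
    total = 0.0
    for (x, y) in zip(xs, ys)
        total += 0.5 * sum((predict(net, x) .- y).^2)
    end
    return total / length(xs)
end
```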


---

Backpropagation from scratch in Julia (part I)

get the code from here

Neural networks have been going through a renaissance recently. After the explosion in available computational power (often GPU based), recent improvements in neural network initialization (pre-training with RBMs or autoencoders), overcoming the vanishing gradient problem in recurrent networks (LSTM) and advances in optimization techniques (AdaGrad, AdaDelta, Adam and others), neural networks are at the centre of attention, again. And the attention is tremendous, in fact. Neural networks have won most machine learning challenges during the last couple of years. Their bio-inspired nature, and so their human-like roots, have made neural networks popular among regular people, too. Friends of mine with no scientific background whatsoever happen to poke me asking if I have heard of the “singularity” or neural networks by any chance.

In other words, neural networks can no longer be ignored, and the goal of this post is to catch up a bit and learn the basics about them.

Let’s start.

In this series of posts we will describe step by step how to train a feedforward neural network with the backpropagation algorithm. However, if this is your first attempt to understand backpropagation, I think it is a good idea to take a look at the ELI5/no-math tutorial by Andrej Karpathy; the next step might be to check out this material by Alex Graves (Neural Networks chapter). Another important thing: if you are looking for a near-production-ready implementation, please take a look at Mocha.jl or MxNet.jl if it has to be Julia. Otherwise one might check out the recently released TensorFlow, Torch7 or Keras.

Here, we will focus on the backpropagation algorithm in the context of a multi-class classification problem. It means we will be given a dataset labeled with more than two classes, and our goal will be to classify each dataset entry correctly. Sometimes, to provide an example, we will refer to the MNIST dataset – a dataset of images of handwritten digits. Finally, we will run our experiments on MNIST.
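
To make “more than two classes” concrete, here is a minimal sketch (not the post’s code) of one-hot encoding a digit label for a ten-class problem like MNIST:

```julia
# One-hot encoding of a digit label for a ten-class problem like MNIST.
function onehot(label::Integer, nclasses::Integer = 10)
    y = zeros(nclasses)
    y[label + 1] = 1.0       # labels run 0–9, Julia arrays are 1-indexed
    return y
end

onehot(3)    # 10-element vector with a single 1.0 at position 4
```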


---

Logistic regression (Part II – evaluation)

(get code directly)

This post is a continuation of a recent post where we implemented the gradient descent method for solving the logistic regression problem. Today we will try to evaluate our algorithm to find out how it works in the wild.

Last time the logistic regression problem was stated strictly as an optimization problem, which suggests our evaluation should consider the goal function itself – which is to some extent correct; indeed, it makes sense to check how J changes while the algorithm runs. But it is not the best idea one can have. Please recall our goal. We aimed for a parametrized labelling machine producing labels for new upcoming observations/objects based on the pre-defined features they are characterized by. The word new is crucial here. We want our model to fit to new data – to predict labels correctly for unseen data.
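
A minimal sketch of that idea – a random train/test split with accuracy measured only on the held-out part; `train_logreg` and `predict_labels` are hypothetical stand-ins for the routines from the previous post:

```julia
using Random, Statistics

# Hold out part of the data and measure accuracy only on the unseen part.
# `train_logreg` and `predict_labels` are placeholders for the previous post's
# training and prediction routines.
function holdout_accuracy(X, y; test_fraction = 0.3)
    n     = size(X, 1)
    perm  = randperm(n)
    ntest = round(Int, test_fraction * n)
    test, train = perm[1:ntest], perm[ntest+1:end]

    θ = train_logreg(X[train, :], y[train])
    ŷ = predict_labels(θ, X[test, :])
    return mean(ŷ .== y[test])
end
```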


---

Solving logistic regression problem in Julia

In this post we will try to implement a gradient descent approach to solve the logistic regression problem.

Please note this is mainly for educational purposes and the aim here is to learn. If you want to “just apply” logistic regression in Julia, please check out this one.

Let’s start with some basic background.

Logistic regression is a mathematical model serving to predict a binary outcome (here we consider the binary case) based on a set of predictors. The binary outcome (meaning an outcome with two possible values, like gender, spam/not-spam email, or fraud/legit transaction) is very often called the target variable, and the predictors can also be called features. The latter name is much more intuitive as it is closer to Earth, so to say – it suggests we deal with some sort of objects with known features, and we try to predict a binary feature that is unknown at the time of prediction. The main idea is to predict the target for a given set of features based on some training data. Training data is a set of objects for which we know the true target.

So our task is to derive a function that will later serve as a 0/1 labelling machine – a mechanism to produce the (hopefully) correct target for a given object described by features. The class of functions we choose to derive the target from is called the model. Such a class is parametrized by a set of parameters, and the task of learning is in fact to derive these parameters so that our model performs well. Performing well is, on the other hand, measured by yet another function called the goal function. The goal function is what we optimize when we learn the parameters of a model; it is a function of the model parameters and the training data with known labels. So the problem of machine learning can be (ignorantly, of course) reduced to the problem of estimating the set of parameters of a given class of functions (the model) that minimizes a given goal function.
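
As a hedged sketch of these three ingredients for logistic regression (the names are illustrative; the post derives its own notation):

```julia
# Model: sigmoid of a linear combination of features, parametrized by θ.
sigmoid(z) = 1.0 ./ (1.0 .+ exp.(-z))
hypothesis(θ, X) = sigmoid(X * θ)      # one observation per row of X

# Goal function: negative log-likelihood (cross-entropy) over training data
# (X, y) – the quantity gradient descent will minimize to learn θ.
function cost(θ, X, y)
    h = hypothesis(θ, X)
    return -sum(y .* log.(h) .+ (1 .- y) .* log.(1 .- h)) / length(y)
end
```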


---