This is the second post of the series describing backpropagation algorithm applied to feed forward neural network training. In the last post we described what neural network is and we concluded it is a parametrized mathematical function. We implemented neural network initialization (meaning creating a proper entity representing the network – not weight initialization) and inference routine but we never made any connection to the data itself. In this post we will make such a connection and we will express meaning of parametrization “goodness” in terms of training data and network output.

## Backpropagation from scratch in Julia (part I)

Neural networks have been going through a renaissance recently. After exploding computational power availability (often GPU based), recent improvements in neural networks initialization (pre-training with RBMs or autoencoders), overcoming vanishing gradient problem in recurrent networks (LSTM) and advances in optimization techniques (AdaGrad, AdaDelta, Adam and others) neural networks are in the centre of attention, again. And the attention is tremendous in fact. Neural networks have won most of machine learning challenges during the last couple of years. Their bio-inspired nature and so their human-like roots caused neural networks popularity growth among regular people, too. Friends of mine with no scientific background whatsoever happen to poke me asking if I heard of “singularity” or neural networks by any chance?

In other words, neural networks can no longer be ignored, and the goal of this post is to catch up a bit and learn basics about them.

Let’s start.

In this series of posts we will describe step by step how to train a feedforward neural network with backpropagation algorithm. However, if this is your first attempt to understand backpropagation I think it is good idea to take a look at ELI5/nomath tutorial by Andrej Karpathy, the next step might be to check out this material by Alex Graves (Neural Networks chapter). Another important thing, if you are looking for near-production ready implementation please take a look at Mocha.jl or MxNet.jl if it has to be Julia. Otherwise one might check out recently released TensorFlow, Torch7 or keras

Here, we will focus on backpropagation algorithm in the context of multi-label classification problem. It means we will be given a dataset that is labeled with more than two classes and our goal will be to classify each dataset entry correctly. Sometimes, to provide an example, we will refer to the MNIST dataset – dataset of images of handwritten digits. Finally we will run our experiments on MNIST.

## Random walk vectors for clustering (final)

Hi there. I have finally manage to be finishing the long series of posts about how to use random walk vectors for clustering problem. It’s been a long series and I am happy to finish it as the whole blog suddenly moved away from being about Julia and turned into a random walk weirdo… Anyways lets finish what has been started.

In the previous post we saw how to use many random walks to cluster given dataset. Presented approach was evaluated on toy datasets and our goal for this post is to try it on some more serious one. We will then apply same approach onto MNIST dataset – set of hardwritten digits images. We will compare the results to the state-of-the art k-means algorithm.

For the reminder of what is going to be applied – please take a step back to the previous post.

## Random walk vectors for clustering (III)

This is the third post about how to use random walk vectors for clustering. The main idea as was stated before is to represent point cloud as a graph based on similarities between points. Similarities between points are encoded in the form of a matrix and the matrix is then treated as a weight matrix of a graph. Such a graph is then traversed randomly resulting in set of random walk vectors (with seed vectors being focused on one different starting point each walk). Each random walk vector represents similarities between points once again – but this time it encodes global dataset shape around given starting point.

In this post we will try to combine many random walk vectors into one matrix that will be used as a data matrix. We will represent each original point as a sequence of numbers that reflect given point different clusters memberships. Having that representation we will use NMF clustering technique to cluster these new data points. (I think it might be considered as a type of consensus clustering as well)

## Random walk vectors for clustering (part II – perspective switch)

This post is a second part of the series of posts that will result in combining random walk vectors for clustering. So far we have understood what the similarity is, how to build similarity matrix based on distance matrix. We know the similarity matrix is a structure that encodes similarities between all objects in our dataset.

Today we will further motivate our quest for similarity matrix started in the previous post. We will tell a little bit about what the graph is and how to switch from point cloud perspective to graph perspective. The bridge between these two worlds is the similarity matrix introduced last time.

Basics first.

A wide range of structures occurring in nature are naturally expressed as networks. Whenever a set of objects and relations among them appears, the whole setting can be interpreted this way. A typical example of such network is – becoming hot topic in last decade – social network. A node in that type of a network is an abstraction of a social unit (mainly a person) and an edge represents social relation between pairs of nodes like friendship, business relation, co-interest etc. Another type of network is the internet itself where websites are represented as nodes and links between websites as connections between them. In fact, any set of objects with relation between them can be modeled as a network.

Formally a network is called **graph **and it consists of set of nodes and set of edges and is usually formalized as a pair \( G = (V,E) \) where \( V \) is a set of \(N\) objects indexed with integers \( \{1, 2, \dots, N \} \) and \( E \) is a function:

\[

E : V^2 \rightarrow \mathcal{R}

\]

## Random walk vectors for clustering (part I – similarity between objects)

This post opens a series of short posts that discuss how to use many random walk vectors (vectors describing probability distributions of random surfer on a graph – like pagerank) to find homogeneous groups of objects in a dataset. The main idea here is to combine the existing work of many researchers into one method. The mathematics behind these ideas might be complex, but we will try to be as ELI5-oriented as possible – so anyone can read, understand (at least basic bits) and implement it himself/herself. As the whole thing here relies on components that exist on different abstractions levels we will start from basic atomic terms that will later be enclosed in a building blocks for the final solution.

We will start from explaining notion of similarity between objects. It is essential for many machine learning problems.

Among various similarity measures the most popular is euclidean distance (in fact it is dissimilarity measure – we will soon tranform it into similarity), it is a length of a line segment between two points in euclidean space spanned by object features dimensions. Euclidean distance aggregates similarities between each of the features. It is defined as follows:

\[

d(a,b) = \sqrt{\sum_{i}^{D} (a_{i} – b_{i})^2}

\]

If two objects do not differ at any of the dimensions euclidean distance is zero, otherwise it is a positive number. Distances between all objects in a dataset are usually expressed in a form of a matrix called distance matrix.

## Logistic regression (Part II – evaluation)

This post is a continuation of recent post where we implemented gradient descent method for solving logistic regression problem. Today we will try to evaluate our algorithm to find out how it works in the wild.

Last time the logistic regression problem was stated strictly as optimization problem which suggests our evaluation should consider the goal function itself – which is to some extend correct, indeed it makes sense to check how J changes while algorithm runs. But it is not the best idea one can have. Please recall our goal. We aimed for parametrized labelling machine producing labels for new upcoming obveravations/objects based on pre-defined features they are characterized by. The word **new** is crucial here. We want our model to fit to new data – to predict labels correctly for unseen data.

## Basic visualization in Julia – Gadfly

In this post we will walk through basics of Gadfly – visualization package written in Julia. Gadfly is Julia implementation of layered grammar of graphics proposed by Hadley Wickham who implemented his idea into ggplot2 package being the main visualization library in R. One spicy note, the original inventor of “grammar of graphics” (the one who was inspiration for Wickham) is now hired by Tableau Software – leading company in data visualization.

The main motivation for grammar of graphics is to formalize visualization for statistics. Authors use word “grammar” so one can think of set of rules that let you build “correct” (with respect to given grammar) sentences. In this case though sentence is graphical so one can see the output in a form of a plot.

Lets now try to provide declarative description of what a plot is and then use this knowledge to actually plot stuff.

Plot consists of:

**Aesthetics**– it can be understood as plot interface for data. Data is binded to aesthetics. Different aesthetics are expected for different kinds of plots. For example to plot set of points one can use**geometry**Geom.point (don’t worry yet – geometry is explained in a minute) that**requires**aesthetics**x and y**. In other words these aesthetics are always known at the time of plot creation. Knowing what you want to plot there is always specification of what aethetics chosen geometry requires – so it is not an art but rather a craft to choose proper aesthetics.**Geometries**– geometry is what defines what will be plotted, what is the geometry of your data. Each geometry requires set of aesthetics to work. Please take a look at specification of Geom.point – it requires aethetics x and y as was noted above. Different kind of geometries define different plots. The geometry is then a central point of your plot – geometries and aesthetics define**what**you want to plot while other components specify**how**you want to do it.**Statistics**– It is a middle layer between aesthetics provided and geometry. So whenever you provide aesthetics for given geometry there is corresponding statistics in the middle – very often that statistics is simply “identity” (like in case of Geom.point)**Scales**– to transform axes of your plot, (to land with log-scale of x for scatterplot one can use Scale.x_log10)**Guides**– elements responsible for plotting axis labels, titles etc.

## Solving logistic regression problem in Julia

In this post we will try to implement gradient descent approach to solve logistic regression problem.

Please note this is mainly for educational purpose and the aim here is to learn. If you want to “just apply” logistic regression in julia please check out this one

Lets start with basic background.

Logistic regression is a mathematical model serving to predict binary outcome (here we consider binary case) based on set of predictors. Binary outcome (meaning outcome with two possible values, like physiognomical gender spam/not spam email or fraud/legit transaction) is very often called target variable and predictors can be also called features. The latter name is much more intuitive as it is closer to Earth, so to say, it suggests we deal with some sort of objects with known features and we try to predict binary feature that is unknown at the time of prediction. The main idea is to predict the target given set of features based on some training data. Training data is a set of objects for which we know the true target.

So our task is to derive a function that will later serve as a 0/1 labelling machine – mechanism to produce (hopefully) correct target for given object described with features. The class of functions we choose to derive the target is called the model. Such a class is parametrized with set of parameters and the task of learning is in fact to derive these parameters to let our model perform well. Performing well is on the other hand measured with yet another function called goal function. The goal function is what we optimize when we learn parameters for a model. It involves model parameters and training data. It is a function of model parameters and training data with known labels. So the problem of machine learning can be (ignorantly of course) reduce to the problem of estimating set of parameters of given class of functions (model) to minimize given goal function.

## Reading CSV file into Julia

As for someone experienced in R I naturally look for data.frame-like structure in Julia to load csv file into it. And luckily it is present and seems to work pretty well. You need to install package called “DataFrames” to operate on R-like dataframes:

Pkg.add("DataFrames")

and load it after installation:

using DataFrames;

The whole documentation is available here. For now we will try to load simple CSV file and play with it. You can use iris dataset. It is a toy dataset meant for various machine learning tasks. Lets download it and read into a variable called iris

iris = readtable("iris.csv")