## Random walk vectors for clustering (part II – perspective switch)

This post is a second part of the series of posts that will result in combining random walk vectors for clustering. So far we have understood what the similarity is, how to build similarity matrix based on distance matrix. We know the similarity matrix is a structure that encodes similarities between all objects in our dataset.

Today we will further motivate our quest for similarity matrix started in the previous post. We will tell a little bit about what the graph is and how to switch from point cloud perspective to graph perspective. The bridge between these two worlds is the similarity matrix introduced last time.

Basics first.

A wide range of structures occurring in nature are naturally expressed as networks. Whenever a set of objects and relations among them appears, the whole setting can be interpreted this way. A typical example of such network is – becoming hot topic in last decade – social network. A node in that type of a network is an abstraction of a social unit (mainly a person) and an edge represents social relation between pairs of nodes like friendship, business relation, co-interest etc. Another type of network is the internet itself where websites are represented as nodes and links between websites as connections between them. In fact, any set of objects with relation between them can be modeled as a network.

Formally a network is called graph and it consists of set of nodes and set of edges and is usually formalized as a pair $$G = (V,E)$$ where $$V$$ is a set of $$N$$ objects indexed with integers $$\{1, 2, \dots, N \}$$ and $$E$$ is a function:

$E : V^2 \rightarrow \mathcal{R}$

---

## Random walk vectors for clustering (part I – similarity between objects)

This post opens a series of short posts that discuss how to use many random walk vectors (vectors describing probability distributions of random surfer on a graph – like pagerank) to find homogeneous groups of objects in a dataset. The main idea here is to combine the existing work of many researchers into one method. The mathematics behind these ideas might be complex, but we will try to be as ELI5-oriented as possible – so anyone can read, understand (at least basic bits) and implement it himself/herself. As the whole thing here relies on components that exist on different abstractions levels we will start from basic atomic terms that will later be enclosed in a building blocks for the final solution.

We will start from explaining notion of similarity between objects. It is essential for many machine learning problems.

Among various similarity measures the most popular is euclidean distance (in fact it is dissimilarity measure – we will soon tranform it into similarity), it is a length of a line segment between two points in euclidean space spanned by object features dimensions. Euclidean distance aggregates similarities between each of the features. It is defined as follows:

$d(a,b) = \sqrt{\sum_{i}^{D} (a_{i} – b_{i})^2}$

If two objects do not differ at any of the dimensions euclidean distance is zero, otherwise it is a positive number. Distances between all objects in a dataset are usually expressed in a form of a matrix called distance matrix.

---

## Logistic regression (Part II – evaluation)

(get code directly)

This post is a continuation of recent post where we implemented gradient descent method for solving logistic regression problem. Today we will try to evaluate our algorithm to find out how it works in the wild.

Last time the logistic regression problem was stated strictly as optimization problem which suggests our evaluation should consider the goal function itself – which is to some extend correct, indeed it makes sense to check how J changes while algorithm runs. But it is not the best idea one can have. Please recall our goal. We aimed for parametrized labelling machine producing labels for new upcoming obveravations/objects based on pre-defined features they are characterized by. The word new is crucial here. We want our model to fit to new data – to predict labels correctly for unseen data.

---

## Basic visualization in Julia – Gadfly

In this post we will walk through basics of Gadfly – visualization package written in Julia. Gadfly is Julia implementation of layered grammar of graphics proposed by Hadley Wickham who implemented his idea into ggplot2 package being the main visualization library in R. One spicy note, the original inventor of “grammar of graphics” (the one who was inspiration for Wickham) is now hired by Tableau Software – leading company in data visualization.

The main motivation for grammar of graphics is to formalize visualization for statistics. Authors use word “grammar” so one can think of set of rules that let you build “correct” (with respect to given grammar) sentences. In this case though sentence is graphical so one can see the output in a form of a plot.

Lets now try to provide declarative description of what a plot is and then use this knowledge to actually plot stuff.

Plot consists of:

• Aesthetics – it can be understood as plot interface for data. Data is binded to aesthetics. Different aesthetics are expected for different kinds of plots. For example to plot set of points one can use geometry Geom.point (don’t worry yet – geometry is explained in a minute) that requires aesthetics x and y . In other words these aesthetics are always known at the time of plot creation. Knowing what you want to plot there is always specification of what aethetics chosen geometry requires – so it is not an art but rather a craft to choose proper aesthetics.
• Geometries – geometry is what defines what will be plotted, what is the geometry of your data. Each geometry requires set of aesthetics to work. Please take a look at specification of Geom.point – it requires aethetics x and y as was noted above. Different kind of geometries define different plots. The geometry is then a central point of your plot – geometries and aesthetics define what you want to plot while other components specify how you want to do it.
• Statistics – It is a middle layer between aesthetics provided and geometry. So whenever you provide aesthetics for given geometry there is corresponding statistics in the middle – very often that statistics is simply “identity” (like in case of Geom.point)
• Scales – to transform axes of your plot, (to land with log-scale of x for scatterplot one can use Scale.x_log10)
• Guides – elements responsible for plotting axis labels, titles etc.
---

## Solving logistic regression problem in Julia

In this post we will try to implement gradient descent approach to solve logistic regression problem.

Please note this is mainly for educational purpose and the aim here is to learn. If you want to “just apply” logistic regression in julia please check out this one

Logistic regression is a mathematical model serving to predict binary outcome (here we consider binary case) based on set of predictors. Binary outcome (meaning outcome with two possible values, like physiognomical gender spam/not spam email or fraud/legit transaction) is very often called target variable and predictors can be also called features. The latter name is much more intuitive as it is closer to Earth, so to say, it suggests we deal with some sort of objects with known features and we try to predict binary feature that is unknown at the time of prediction. The main idea is to predict the target given set of features based on some training data. Training data is a set of objects for which we know the true target.

So our task is to derive a function that will later serve as a 0/1 labelling machine – mechanism to produce (hopefully) correct target for given object described with features. The class of functions we choose to derive the target is called the model. Such a class is parametrized with set of parameters and the task of learning is in fact to derive these parameters to let our model perform well. Performing well is on the other hand measured with yet another function called goal function. The goal function is what we optimize when we learn parameters for a model. It involves model parameters and training data. It is a function of model parameters and training data with known labels. So the problem of machine learning can be (ignorantly of course) reduce to the problem of estimating set of parameters of given class of functions (model) to minimize given goal function.

---

## Reading CSV file into Julia

As for someone experienced in R I naturally look for data.frame-like structure in Julia to load csv file into it. And luckily it is present and seems to work pretty well. You need to install package called “DataFrames” to operate on R-like dataframes:

Pkg.add("DataFrames")


using DataFrames;


The whole documentation is available here. For now we will try to load simple CSV file and play with it. You can use iris dataset. It is a toy dataset meant for various machine learning tasks. Lets download it and read into a variable called iris

iris = readtable("iris.csv")

---

## Warming up

Here it goes. This is my first blog entry with the great purpose to learn Julia language from scratch. The motivation for such is knowledge, of course, but some sort of convenience as well. From what I read already I conclude Julia is promising tool for scientific computing being described as fast and elegant and my goal here is to find out how it’s gonna suit me. Throughout the blog entries you read I will try to solve many small problems you may encounter yourself, so I really hope someone will find whole thing useful.

---

Pages:12