Posts tagged with: similarity matrix

## Random walk vectors for clustering (part II – perspective switch)

This post is a second part of the series of posts that will result in combining random walk vectors for clustering. So far we have understood what the similarity is, how to build similarity matrix based on distance matrix. We know the similarity matrix is a structure that encodes similarities between all objects in our dataset.

Today we will further motivate our quest for similarity matrix started in the previous post. We will tell a little bit about what the graph is and how to switch from point cloud perspective to graph perspective. The bridge between these two worlds is the similarity matrix introduced last time.

Basics first.

A wide range of structures occurring in nature are naturally expressed as networks. Whenever a set of objects and relations among them appears, the whole setting can be interpreted this way. A typical example of such network is – becoming hot topic in last decade – social network. A node in that type of a network is an abstraction of a social unit (mainly a person) and an edge represents social relation between pairs of nodes like friendship, business relation, co-interest etc. Another type of network is the internet itself where websites are represented as nodes and links between websites as connections between them. In fact, any set of objects with relation between them can be modeled as a network.

Formally a network is called graph and it consists of set of nodes and set of edges and is usually formalized as a pair $$G = (V,E)$$ where $$V$$ is a set of $$N$$ objects indexed with integers $$\{1, 2, \dots, N \}$$ and $$E$$ is a function:

$E : V^2 \rightarrow \mathcal{R}$

---

## Random walk vectors for clustering (part I – similarity between objects)

This post opens a series of short posts that discuss how to use many random walk vectors (vectors describing probability distributions of random surfer on a graph – like pagerank) to find homogeneous groups of objects in a dataset. The main idea here is to combine the existing work of many researchers into one method. The mathematics behind these ideas might be complex, but we will try to be as ELI5-oriented as possible – so anyone can read, understand (at least basic bits) and implement it himself/herself. As the whole thing here relies on components that exist on different abstractions levels we will start from basic atomic terms that will later be enclosed in a building blocks for the final solution.

We will start from explaining notion of similarity between objects. It is essential for many machine learning problems.

Among various similarity measures the most popular is euclidean distance (in fact it is dissimilarity measure – we will soon tranform it into similarity), it is a length of a line segment between two points in euclidean space spanned by object features dimensions. Euclidean distance aggregates similarities between each of the features. It is defined as follows:

$d(a,b) = \sqrt{\sum_{i}^{D} (a_{i} – b_{i})^2}$

If two objects do not differ at any of the dimensions euclidean distance is zero, otherwise it is a positive number. Distances between all objects in a dataset are usually expressed in a form of a matrix called distance matrix.

---