Posts tagged with: clustering

## Large Scale Spectral Clustering with Landmark-Based Representation (in Julia)

In this post we will implement and play with a clustering algorithm of a mysterious name Large Scale Spectral Clustering with Landmark-Based Representation (or shortly LSC – corresponding paper here). We will first explain the algorithm step by step and then map it to Julia code (github link).

### Spectral Clustering

Spectral clustering (wikipedia entry) is a term that refers to many different clustering techniques. The core of the algorithm does not differ though. In essence, it is a method that relies on spectrum (eigendecomposition) of input data similarity matrix (or its transformations). Given input dataset encoded in a matrix $$X$$ (such that each single data entry is a column of that matrix) – spectral clustering requires a similarity (adjacency in case of graphical interpretation) matrix $$A$$ where

$A_{ij} = f(X_{\cdot i}, X_{\cdot j})$

and function $$f$$ is some measure of similarity between data points.

One specific spectral clustering algorithm (Ng, Jordan, and Weiss 2001) relies on matrix $$W$$ given by

$W = D^{-1/2} A D^{-1/2}$

---

## Random walk vectors for clustering (final)

Hi there. I have finally manage to be finishing the long series of posts about how to use random walk vectors for clustering problem. It’s been a long series and I am happy to finish it as the whole blog suddenly moved away from being about Julia and turned into a random walk weirdo… Anyways lets finish what has been started.

In the previous post we saw how to use many random walks to cluster given dataset. Presented approach was evaluated on toy datasets and our goal for this post is to try it on some more serious one. We will then apply same approach onto MNIST dataset – set of hardwritten digits images. We will compare the results to the state-of-the art k-means algorithm.

For the reminder of what is going to be applied – please take a step back to the previous post.

---

## Random walk vectors for clustering (III)

This is the third post about how to use random walk vectors for clustering. The main idea as was stated before is to represent point cloud as a graph based on similarities between points. Similarities between points are encoded in the form of a matrix and the matrix is then treated as a weight matrix of a graph. Such a graph is then traversed randomly resulting in set of random walk vectors (with seed vectors being focused on one different starting point each walk). Each random walk vector represents similarities between points once again – but this time it encodes global dataset shape around given starting point.

In this post we will try to combine many random walk vectors into one matrix that will be used as a data matrix. We will represent each original point as a sequence of numbers that reflect given point different clusters memberships. Having that representation we will use NMF clustering technique to cluster these new data points. (I think it might be considered as a type of consensus clustering as well)

---

## Random walk vectors for clustering (part II – perspective switch)

This post is a second part of the series of posts that will result in combining random walk vectors for clustering. So far we have understood what the similarity is, how to build similarity matrix based on distance matrix. We know the similarity matrix is a structure that encodes similarities between all objects in our dataset.

Today we will further motivate our quest for similarity matrix started in the previous post. We will tell a little bit about what the graph is and how to switch from point cloud perspective to graph perspective. The bridge between these two worlds is the similarity matrix introduced last time.

Basics first.

A wide range of structures occurring in nature are naturally expressed as networks. Whenever a set of objects and relations among them appears, the whole setting can be interpreted this way. A typical example of such network is – becoming hot topic in last decade – social network. A node in that type of a network is an abstraction of a social unit (mainly a person) and an edge represents social relation between pairs of nodes like friendship, business relation, co-interest etc. Another type of network is the internet itself where websites are represented as nodes and links between websites as connections between them. In fact, any set of objects with relation between them can be modeled as a network.

Formally a network is called graph and it consists of set of nodes and set of edges and is usually formalized as a pair $$G = (V,E)$$ where $$V$$ is a set of $$N$$ objects indexed with integers $$\{1, 2, \dots, N \}$$ and $$E$$ is a function:

$E : V^2 \rightarrow \mathcal{R}$

---

## Random walk vectors for clustering (part I – similarity between objects)

This post opens a series of short posts that discuss how to use many random walk vectors (vectors describing probability distributions of random surfer on a graph – like pagerank) to find homogeneous groups of objects in a dataset. The main idea here is to combine the existing work of many researchers into one method. The mathematics behind these ideas might be complex, but we will try to be as ELI5-oriented as possible – so anyone can read, understand (at least basic bits) and implement it himself/herself. As the whole thing here relies on components that exist on different abstractions levels we will start from basic atomic terms that will later be enclosed in a building blocks for the final solution.

We will start from explaining notion of similarity between objects. It is essential for many machine learning problems.

Among various similarity measures the most popular is euclidean distance (in fact it is dissimilarity measure – we will soon tranform it into similarity), it is a length of a line segment between two points in euclidean space spanned by object features dimensions. Euclidean distance aggregates similarities between each of the features. It is defined as follows:

$d(a,b) = \sqrt{\sum_{i}^{D} (a_{i} – b_{i})^2}$

If two objects do not differ at any of the dimensions euclidean distance is zero, otherwise it is a positive number. Distances between all objects in a dataset are usually expressed in a form of a matrix called distance matrix.

---