
Are you OK, Cyberpunk? – Transformers’ diagnosis.

At the end of 2020, eight years after its announcement, the Polish game development studio CDPR released its flagship game, Cyberpunk 2077. The big success of CDPR’s previous game, The Witcher 3, and their “gamers-first” approach led to CDPR being perceived as the golden child of the gaming industry. CDPR was seen as one of the few healthy apples in a basket of rotten fruit. Of course, there was only one emotion towards CDPR – love.

All of this raised expectations for CDPR’s new game very high. The announcement of Keanu Reeves – a persona absolutely loved by the internet – “playing” one of the characters, Johnny Silverhand, pushed the hype to its limits. The studio surpassed 8 million pre-orders worldwide, potentially beating GTA 5. Right after launch, over 1 million people were playing it on Steam.

The sweetness unfortunately came with a bit of bitterness. During the first 24 hours, Cyberpunk’s reviews on Steam reached only 73% positive. Given Steam’s binary review system, this means that 27% of reviewers expressed negative feelings about the game. The goal of this post is to figure out the reasons behind this via a data-driven approach.

(image from the CyberpunkGame official Twitter account)


---

Chess position evaluation with convolutional neural network in Julia

In this post we will tackle the problem of chess position evaluation using a convolutional neural network (CNN) – a type of neural network designed to deal with spatial data. We will first explain why we need CNNs, then present two fundamental CNN layers. Equipped with some knowledge of the inside of the black box, we will apply a CNN to the binary classification problem of chess position evaluation using the Julia deep learning library Mocha.jl.

Introduction – data representation

One challenge that frequently occurs in machine learning is proper representation of the input data. Ideally, data should be represented in a way that carries as much information as possible while remaining digestible for ML algorithms. Digestibility means fitting into existing mathematical frameworks where known abstract tools can be applied.

A common and convenient representation of a single observation is a vector in \(\mathbb{R}^n\). With such a representation, ML problems can be viewed from many different angles – with the benefit of well-known abstractions and interpretations. One very common perspective is the algebraic one: treating the input data as a matrix (one vector per column), one may consider its eigendecomposition or various factorizations, both of which yield important results in the context of machine learning. A set of vectors in \(\mathbb{R}^n\) forms a point cloud – when the geometry of such a cloud is considered, manifold learning methods emerge. A linear model with least squares error has a closed-form solution in the algebraic framework. In all of these cases, representing input data as vectors opens up a broad range of tools to handle the problem effectively.
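As a quick illustration of that closed-form solution, here is a minimal Julia sketch with made-up data (not code from the original post):

X = randn(100, 3)                  # 100 observations, 3 features, one observation per row
w_true = [1.0, -2.0, 0.5]
y = X * w_true + 0.1 * randn(100)  # noisy linear targets
w = X \ y                          # least squares solution, equivalent to inv(X'*X)*(X'*y)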

For some domains, though, it is not obvious how to represent the input as vectors while preserving the original information contained in the data. One example of such a domain is text. A text document is rich in various types of information – the semantics and syntax of the text, or even the personal style of the writer. It is not clear how to represent this unnamed information contained in text. People tend to simplify and use the Bag of Words (BoW) approach to represent text, which completely ignores the ordering of words in a document – it treats the document as an unordered bag of word counts.
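A minimal BoW sketch in Julia (an illustration of the idea, not code from the original post):

function bag_of_words(doc)
    counts = Dict{String,Int}()          # word => number of occurrences
    for word in split(lowercase(doc))
        counts[word] = get(counts, word, 0) + 1
    end
    return counts
end

bag_of_words("the cat sat on the mat")   # "the" => 2; all ordering information is gone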

Another domain that suffers from a similar problem is images. The spatiality of the data is missing when images are represented as vectors with dimensionality equal to the total number of pixels. When an image is represented that way, the spatial information is lost – the algorithm that later consumes the input vectors is usually not aware that the original structure of the image is a set of 2-dimensional grids (one matrix per channel).
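To see what gets lost, consider flattening a single-channel image (a sketch with random pixel values):

img = rand(28, 28)   # one channel of an MNIST-sized image
v = vec(img)         # column-major flattening into a 784-vector
# pixels (i, j) and (i, j+1) are horizontal neighbors in the grid,
# yet they land 28 positions apart in v – whatever consumes v gets
# no hint of the original 2-D neighborhood structure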

So far our neural network has not been aware of the two-dimensional nature of the input data (MNIST). It could, of course, figure that out itself by learning relations between neighboring pixels, but the fact is, up to now it has had no clue.
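For a taste of what is coming, this is roughly how a convolution + pooling pair is declared in Mocha.jl (a sketch based on Mocha.jl’s layer API; the names and parameter values here are placeholders, not the configuration used later in the post):

using Mocha

# convolution: 20 learned 5x5 filters slid over the 2-D input grid
conv = ConvolutionLayer(name="conv1", n_filter=20, kernel=(5,5),
                        bottoms=[:data], tops=[:conv1])
# pooling: downsample each feature map by taking the max over 2x2 windows
pool = PoolingLayer(name="pool1", kernel=(2,2), stride=(2,2),
                    bottoms=[:conv1], tops=[:pool1])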


---

Random walk vectors for clustering (final)

Hi there. I have finally managed to finish the long series of posts about how to use random walk vectors for the clustering problem. It’s been a long series and I am happy to wrap it up, as the whole blog suddenly moved away from being about Julia and turned into a random walk weirdo… Anyway, let’s finish what has been started.

In the previous post we saw how to use many random walks to cluster a given dataset. The presented approach was evaluated on toy datasets, and our goal for this post is to try it on a more serious one. We will apply the same approach to the MNIST dataset – a set of handwritten digit images – and compare the results to the standard k-means algorithm.

For a reminder of what is going to be applied, please take a step back to the previous post.
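For the k-means baseline, Julia’s Clustering.jl package provides a ready implementation. A minimal usage sketch with made-up data (the actual MNIST setup follows in the post itself):

using Clustering

X = randn(2, 300)          # 300 two-dimensional points, one per column
result = kmeans(X, 3)      # partition the points into 3 clusters
assignments(result)        # cluster index assigned to each point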


---

Reading CSV file into Julia

As someone experienced in R, I naturally looked for a data.frame-like structure in Julia to load a CSV file into. Luckily, one is present and seems to work pretty well. You need to install the package called “DataFrames” to operate on R-like data frames:

Pkg.add("DataFrames")

and load it after installation:

using DataFrames

The whole documentation is available here. For now, we will try to load a simple CSV file and play with it. You can use the iris dataset – a toy dataset meant for various machine learning tasks. Let’s download it and read it into a variable called iris:

iris = readtable("iris.csv")
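A few quick ways to inspect the freshly loaded data frame (assuming the read succeeded; these helpers come from the DataFrames API of that era):

size(iris)    # (number of rows, number of columns)
head(iris)    # first few rows of the data frame
names(iris)   # column names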


---