Attention mechanism in NLP – beginners guide

The field of machine learning has been changing extremely fast over the last couple of years. A growing number of tools and libraries, fully-fledged academic programmes, MOOCs, great market demand, and also the somewhat sacred, magical nature of the field itself (calling it Artificial Intelligence is pretty much standard right now) all drive enormous motivation and progress. As a result, well-established ML techniques become outdated rapidly. Indeed, methods known from 10 years ago can often be called classical.

This sort of revolution has happened recently. The default architectural choice for NLP-related problems, the recurrent neural network, has been seriously challenged, to say the least. This very solid architecture is quickly being replaced by networks based solely on the attention mechanism, which drop the RNN entirely while achieving at least comparable (and often better) performance both in NLP and Computer Vision.

This post is an attempt to go through the most significant papers related to the attention mechanism, with the goal of grasping basic knowledge and intuition about it. We will start by looking at its very first NLP application, where it was introduced to solve neural machine translation in 2015. Then we will go through the improvements to attention introduced in the Transformer, a neural network architecture that uses one specific variant of the attention mechanism as its main building block and skips the recurrent connections that had so far seemed necessary.

 


---

Are you OK, Cyberpunk? – Transformers’ diagnosis.

At the end of 2020, 8 years after its announcement, the Polish game development studio CDPR released its flagship game, Cyberpunk. The big success of CDPR’s previous game, Witcher 3, and the studio’s “gamers-first” approach led to CDPR being perceived as the golden child of the gaming industry. CDPR was seen as one of the few healthy apples in a basket of rotten fruit. Of course, there was only one emotion towards CDPR: love.

All of this raised expectations for CDPR’s new game very high. The announcement that Keanu Reeves, a persona absolutely loved by the internet, would be “playing” one of the characters, Johnny Silverhand, pushed the hype to the limit. The studio surpassed 8 million pre-orders worldwide, potentially beating GTA5. Right after launch there were over 1 million people playing it on Steam.

The sweetness unfortunately came with a little bit of bitterness. During the first 24 hours, Cyberpunk reviews on Steam reached only 73% positive. Given Steam’s binary review system, this means that 27% of reviewers expressed negative feelings about the game. The goal of this post is to figure out the reasons behind it via a data-driven approach.

(image from CyberpunkGame official twitter account)


---

Bellman Equations, Dynamic Programming and Reinforcement Learning (part 1)

Reinforcement learning has recently been on the radar of many. It has proven its practical value in a broad range of fields: from robotics through Go, chess, video games and chemical synthesis, down to online marketing. While very popular, reinforcement learning seems to require much more time and dedication before one actually gets any goosebumps. Playing around with neural networks in PyTorch for an hour for the first time gives instant satisfaction and further motivation. A similar experience with RL is rather unlikely. If you are new to the field, you are almost guaranteed to get a headache instead of fun while trying to break in.


---

Counterfactual Regret Minimization – the core of Poker AI beating professional players

code in python | code in go

Introduction

The last 10 years have been full of unexpected advances in artificial intelligence. Among great improvements in image processing and speech recognition, the thing that got lots of media attention was AI winning against humans in various kinds of games. With OpenAI playing Dota 2 and DeepMind playing Atari games in the background, the most significant achievement was AlphaGo beating a Korean master in Go. It was the first time a machine showed super-human performance in Go, marking, next to the Deep Blue vs. Kasparov chess match in 1997, a historic moment in the field of AI.

Around the same time, a group of researchers from the USA, Canada, the Czech Republic and Finland had already been working on another game to solve: Heads-Up No-Limit Texas Hold’em.

Over the years (their first papers about poker date back to 2005), researchers from the University of Alberta (now in collaboration with Google DeepMind) and Carnegie Mellon University have been patiently working on advances in game theory with the ultimate goal of solving poker.


---

Monte Carlo Tree Search – beginners guide

code in python | code in go

For quite a long time, a common opinion in the academic world was that a machine achieving human master level in the game of Go was far from realistic. It was considered a ‘holy grail’ of AI, a milestone we were quite far from reaching within the upcoming decade. Deep Blue had its moment more than 20 years ago, and since then no Go engine had come close to human masters. The opinion about ‘numerical chaos’ in Go became so well established that it was referenced in movies, too.

Surprisingly, in March 2016 an algorithm developed by Google DeepMind called AlphaGo defeated the Korean world champion in Go 4-1, proving fictional and real-life skeptics wrong. Around a year after that, AlphaGo Zero, the next generation of AlphaGo Lee (the one that beat the Korean master), was reported to have destroyed its predecessor 100-0, reaching a level very unlikely to be attainable by humans.


---

Variational Autoencoder in Tensorflow – facial expression low dimensional embedding

Digest

The main motivation of this post is to use a Variational Autoencoder model to embed unseen faces into the space of pre-trained, single-actor-centric facial expressions. Two datasets are used in the experiments later in this post. They are based on YouTube videos passed through the OpenFace feature extraction utility.

The datasets are:

  • Donald Trump faces

    because of the recent presidential election in the USA, it was very easy to get videos of frontally positioned faces of Donald Trump and use them as an input dataset

  • Edward Snowden faces

    because he held a long Q&A session for internet users, which is a good source of faces

The high-level idea is to build a VAE facial expression model for a single actor only and then embed a new, unseen face into the VAE latent space, from which the original actor with a similar facial expression is reconstructed. The code in Python (using Google TensorFlow) is available on GitHub.

Example videos showing the results of embedding my face into the latent facial expression space of different actors are presented below:


---

Large Scale Spectral Clustering with Landmark-Based Representation (in Julia)

In this post we will implement and play with a clustering algorithm with the mysterious name Large Scale Spectral Clustering with Landmark-Based Representation (LSC for short; corresponding paper here). We will first explain the algorithm step by step and then map it to Julia code (github link).

Spectral Clustering

Spectral clustering (wikipedia entry) is a term that refers to many different clustering techniques. The core of the algorithm does not differ, though. In essence, it is a method that relies on the spectrum (eigendecomposition) of the similarity matrix of the input data (or of its transformations). Given an input dataset encoded in a matrix \(X\) (such that each data entry is a column of that matrix), spectral clustering requires a similarity matrix \(A\) (an adjacency matrix in the graph interpretation) where

\[
A_{ij} = f(X_{\cdot i}, X_{\cdot j})
\]

and function \(f\) is some measure of similarity between data points.

One specific spectral clustering algorithm (Ng, Jordan, and Weiss 2001) relies on the matrix \(W\), built from \(A\) and the diagonal degree matrix \(D\) with \(D_{ii} = \sum_j A_{ij}\), given by

\[
W = D^{-1/2} A D^{-1/2}
\]
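To make the two formulas above concrete, here is a minimal NumPy sketch (in Python rather than the Julia used in the post). It assumes a Gaussian kernel as the similarity function \(f\) and takes \(D\) to be the diagonal matrix of row sums of \(A\). The function name `normalized_affinity` and the parameter `sigma` are ours, chosen only for illustration.

```python
import numpy as np

def normalized_affinity(X, sigma=1.0):
    """Return (A, W) with W = D^{-1/2} A D^{-1/2}; data points are the columns of X."""
    # Squared Euclidean distances between all pairs of columns of X.
    sq_norms = np.sum(X ** 2, axis=0)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative round-off

    # Assumed similarity measure f: a Gaussian kernel with bandwidth sigma.
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)  # no self-similarity, as in Ng, Jordan and Weiss

    # D is diagonal with D_ii = sum_j A_ij, so we only keep its diagonal.
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)

    # Scaling rows and columns is equivalent to D^{-1/2} A D^{-1/2}.
    W = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return A, W

# Usage: six random 2-dimensional points, one per column.
X = np.random.default_rng(0).normal(size=(2, 6))
A, W = normalized_affinity(X)
```

Scaling by the vector `d_inv_sqrt` avoids building the dense diagonal matrices explicitly, which matters once the number of data points grows.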


---

Automatic differentiation for machine learning in Julia

Automatic differentiation is a term I first heard of while working on a (as it turns out now, somewhat cumbersome) implementation of the backpropagation algorithm. It caused lots of headaches, as I had to handle all the derivatives myself with an almost pen-and-paper approach. Obviously, I made many mistakes before I got my final solution working.

At that time, I was aware that some libraries like Theano or TensorFlow handle derivatives in a certain “magical” way for free. I never knew exactly what happens deep in the guts of these libraries, though, and I somehow suspected it would be more painful than fun to grasp (apparently, I was wrong!).

I decided to take a shot and directed my first steps towards the official TensorFlow documentation to quickly find out what the magic is. The term I was looking for was automatic differentiation.


---

Chess position evaluation with convolutional neural network in Julia

In this post we will try to tackle the problem of chess position evaluation using a convolutional neural network (CNN), a type of neural network designed to deal with spatial data. We will first explain why we need CNNs, and then we will present two fundamental CNN layers. Having some knowledge of the inside of the black box, we will apply a CNN to the binary classification problem of chess position evaluation using the Julia deep learning library Mocha.jl.

Introduction – data representation

One of the challenges that frequently occurs in machine learning is the proper representation of the input data. Ideally, data should be represented in a way that carries as much information as possible while being digestible for ML algorithms. Digestibility means fitting into existing mathematical frameworks where known abstract tools can be applied.

A common, convenient representation of a single observation is a vector in \(\mathbb{R}^n\). Assuming such a representation, ML problems may be seen from many different angles, with the benefit of using well-known abstractions and interpretations. One very common perspective is the algebraic one: with the input data as a matrix (one vector per column), its eigendecomposition or various factorizations may be considered, and both yield important results in the context of machine learning. A set of vectors in \(\mathbb{R}^n\) forms a point cloud; when the geometry of such a cloud is considered, manifold learning methods emerge. A linear model with least squares error has a closed-form solution in the algebraic framework. In all of these cases, representing the input data as vectors provides a broad range of tools to handle the problem effectively.
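As a small illustration of the closed-form least squares solution mentioned above, the sketch below solves the normal equations with NumPy. It assumes the one-observation-per-column layout, so the solution reads \(w = (XX^T)^{-1}Xy\). The synthetic data, the variable names and the noiseless targets are our own assumptions, made only to keep the example checkable.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50 observations in R^3, stored one per column (assumed layout).
X = rng.normal(size=(3, 50))

# Hypothetical ground-truth weights and noiseless targets y_i = w_true . x_i.
w_true = np.array([1.0, -2.0, 0.5])
y = X.T @ w_true

# Normal equations: (X X^T) w = X y. Solving the linear system is
# preferable to forming the inverse explicitly.
w_hat = np.linalg.solve(X @ X.T, X @ y)
```

With noiseless targets `w_hat` recovers `w_true` up to floating-point error; with noisy targets it would be the least squares estimate instead.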

For some domains, though, it is not obvious how to represent the input as vectors while preserving the original information contained in the data. An example of such a domain is text. A text document is rich in various types of information: there are the semantics and syntax of the text, or even the personal style of the writer. It is not clear how to represent this unnamed information contained in text. People tend to simplify it and use the Bag of Words (BoW) approach to represent text (which completely ignores the ordering of words in a document and treats it as a set).
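The loss of word order in Bag of Words is easy to see in a few lines of Python. This is only a toy sketch: the helper `bag_of_words` is hypothetical, it splits on whitespace, and it keeps word counts (a multiset) rather than a plain set.

```python
from collections import Counter

def bag_of_words(text):
    """Map a text to its word counts, discarding word order."""
    return Counter(text.lower().split())

doc_a = bag_of_words("the dog bites the man")
doc_b = bag_of_words("the man bites the dog")

# Two sentences with opposite meanings get identical representations.
print(doc_a == doc_b)  # True
```

In practice the counts are then laid out along a fixed vocabulary to obtain a vector in \(\mathbb{R}^n\), which is exactly the kind of representation discussed above.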

Another domain that suffers from a similar problem is the domain of images. The spatial structure of the data is lost when images are represented as vectors of dimensionality equal to the total number of pixels: the algorithm that later consumes the input vectors is usually not aware that the original structure of the images is a set of 2-dimensional grids (one matrix for each channel).
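A tiny NumPy sketch of what flattening does to neighbouring pixels is given below. It assumes a toy 3x4 single-channel image and row-major flattening; the helper `flat_index` is our own name.

```python
import numpy as np

height, width = 3, 4
image = np.arange(height * width).reshape(height, width)  # toy single-channel image
flat = image.reshape(-1)                                   # row-major flattening

def flat_index(row, col):
    """Position of pixel (row, col) in the flattened vector."""
    return int(np.ravel_multi_index((row, col), image.shape))

# Horizontal neighbours stay adjacent, vertical neighbours end up `width` apart.
right_gap = flat_index(1, 2) - flat_index(1, 1)
down_gap = flat_index(2, 1) - flat_index(1, 1)
print(right_gap, down_gap)  # 1 4
```

Nothing in the flattened vector tells the consuming algorithm that positions 5 and 9 used to be direct neighbours, which is the information a convolutional layer is designed to exploit.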

So far our neural network has not been aware of the two-dimensional nature of the input data (MNIST). It could of course discover it by itself, learning the relations between neighboring pixels, but the fact is that it has had no clue so far.


---

Optimization techniques comparison in Julia: SGD, Momentum, Adagrad, Adadelta, Adam

In today’s post we will compare five popular optimization techniques: SGD, SGD+momentum, Adagrad, Adadelta and Adam, methods for finding a local optimum (global when dealing with a convex problem) of certain differentiable functions. In the experiments conducted later in this post, these functions will all be error functions of feed-forward neural networks of various architectures for the problem of multi-class classification of MNIST (a dataset of handwritten digits). In our considerations we will refer to what we know from previous posts. We will also extend the existing code.

Stochastic gradient descent and momentum optimization techniques

Let’s recall the stochastic gradient descent optimization technique that was presented in one of the last posts. Last time we pointed out its speed as its main advantage over batch gradient descent (where the full training set is used). There is one more advantage, though. Stochastic gradient descent tends to escape from local minima. This is because the error function changes from mini-batch to mini-batch, pushing the solution to be continuously updated (a local minimum of the error function given by one mini-batch may not be present for another mini-batch, implying a non-zero gradient).
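The mini-batch argument can be checked on a toy problem. The sketch below assumes a one-parameter error function, the mean of \((w - x_i)^2\) over a mini-batch, whose minimum is simply the batch mean. The data and the function name `grad` are made up for illustration, and since this toy error is convex its batch minimum is global rather than merely local.

```python
import numpy as np

def grad(w, batch):
    """Gradient of mean((w - x)^2) over the batch: 2 * (w - mean(x))."""
    return 2.0 * (w - np.mean(batch))

batch_a = np.array([1.0, 2.0, 3.0])
batch_b = np.array([4.0, 5.0, 6.0])

# The minimum of the error function defined by batch_a ...
w_star_a = np.mean(batch_a)

print(grad(w_star_a, batch_a))  # 0.0
# ... is not a stationary point of the error function defined by batch_b.
print(grad(w_star_a, batch_b))  # -6.0
```

A point where one mini-batch gives a zero gradient still gets pushed by the next mini-batch, which is the mechanism described above.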

Efficiently traversing the surfaces of differentiable error functions is a big research area today. Some of the recently popular techniques (especially in application to NNs) take the gradient history into account, so that each “move” on the error function surface relies a bit on the previous ones. A childish intuition for that phenomenon might involve a snowball rolling down a mountain. Snow keeps attaching to it, increasing its mass and making it resistant to getting stuck in small holes on the way down (because of both its speed and mass). Such a snowball does not teleport from one point to another but rather rolls down in a continuous process. This infantile snowy intuition may be applied to the gradient descent method too.
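The snowball intuition corresponds to the classical momentum update, sketched below in Python. This is one common formulation and not necessarily the exact variant used later in the post: the velocity `v` accumulates past gradients, and the parameters move along `v` instead of along the current gradient alone. The function name, the hyperparameter values and the toy quadratic are our assumptions, and the exact gradient stands in for a mini-batch gradient.

```python
import numpy as np

def sgd_momentum(grad, w0, lr=0.1, momentum=0.9, steps=300):
    """Gradient descent with classical momentum on a user-supplied gradient."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)  # velocity: decaying sum of past gradient steps
    for _ in range(steps):
        v = momentum * v - lr * grad(w)  # keep part of the previous move
        w = w + v                        # move along the accumulated velocity
    return w

# Toy error function f(w) = ||w||^2 / 2, whose gradient is simply w.
w_final = sgd_momentum(lambda w: w, w0=[5.0, -3.0])
```

Setting `momentum=0.0` recovers plain gradient descent, so the same function can be used to compare the two behaviours on a given error surface.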


---
