The field of machine learning has been changing extremely fast over the last couple of years. A growing number of tools and libraries, fully-fledged academic programs, MOOCs, strong market demand, and also the somewhat sacred, magical aura of the field itself (calling it Artificial Intelligence is pretty much standard right now): all of these drive enormous motivation and progress. As a result, well-established ML techniques become outdated rapidly. Indeed, methods from 10 years ago can often be called classical.
One such revolution has happened recently. The default architectural choice for NLP-related problems, the recurrent neural network, has been seriously challenged, to say the least. This very solid architecture is quickly being replaced by networks based solely on the attention mechanism, which drop the RNN entirely while achieving at least comparable (and often better) performance in both NLP and Computer Vision.
This post is an attempt to go through the most significant papers related to the attention mechanism, with the goal of building basic knowledge and intuition about it. We will start by looking at its very first NLP application, where it was introduced for neural machine translation in 2015. Then we will go through the improvements to attention introduced in the Transformer, a neural network architecture that uses one specific variant of the attention mechanism as its main building block and drops the recurrent connections that had so far seemed necessary.