Variational Autoencoder in TensorFlow – facial expression low-dimensional embedding

Digest

The main motivation of this post is to use a Variational Autoencoder model to embed unseen faces into a latent space of face expressions trained on a single actor. Two datasets are used in the experiments later in this post. They are based on YouTube videos passed through the OpenFace feature extraction utility.

The datasets are:

  • Donald Trump faces

    because of the recent presidential election in the USA it was very easy to find videos of frontal-positioned Donald Trump faces and use them as an input dataset

  • Edward Snowden faces

    because he hosted a long Q&A session for internet users, which made it a good source of faces

The high-level idea is to build a VAE face expression model for a single actor only and then embed a new, unseen face into the VAE latent space – from which the original actor with a similar face expression is reconstructed. The code, written in Python (using Google TensorFlow), is available on GitHub.

Example videos showing my face embedded into the latent face expression space of different actors are presented below:

Intro

My adventure with VAE actually started with a fascination with a slightly different model called GAN (Generative Adversarial Networks). I remember the time I first read about it – it was just after I had learnt the basics of neural networks (backpropagation with different optimization techniques). It was late at night and I was about to fall asleep, as usual staring at some (potentially complex) paper without really understanding the details – my personal remedy for insomnia. That time, though, the idea presented in the paper seemed simple at first glance (yet with great potential) – I immediately woke up and stayed awake in excitement for some time. During the following days and weeks I dug a bit deeper, to finally discover the VAE model.

The high-level idea of GAN is to map samples from a simple prior distribution to a very complex data distribution. The prior is often chosen to be uniform or normal – a distribution you can easily draw from. The mapping itself is modelled with a neural network (with all the benefits – like SGD training – that implies). Another interesting aspect of GAN is its adversarial nature: two neural networks compete during training. One network tries to generate samples similar to the data, and the other tries to get better and better at discriminating fake samples from real ones. At the end of training the generator gets very good at creating artificial samples – to the point where the discriminator is unable to tell them apart.

The promise GAN carries is to generate artificial objects that do not exist in your training data (while still being very similar to those that do). And the promise is being fulfilled: people are able to generate artificial bedroom images, manga characters, Chinese characters or album covers.

One cool implication of the GAN prior being simple is that you can easily traverse the latent space and watch the corresponding samples morph. Having trained your GAN, you can sample two points A and B from the latent space, generate intermediate points on the line segment between them and observe the corresponding artificial data morphing from point A to B.
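A minimal sketch of such a traversal (with `generator` standing in for a hypothetical trained generator network and a 100-dimensional latent space chosen purely for illustration):

import numpy as np

latent_dim = 100                      # illustrative latent space dimensionality
z_a = np.random.randn(latent_dim)     # point A sampled from the prior
z_b = np.random.randn(latent_dim)     # point B sampled from the prior

# Decode points along the line segment from A to B; the outputs should
# morph gradually from sample A to sample B.
for alpha in np.linspace(0.0, 1.0, num=10):
    z = (1.0 - alpha) * z_a + alpha * z_b
    image = generator(z)              # `generator` is a hypothetical trained network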

The main disadvantage of a classical GAN is that you cannot easily (if at all) embed your data sample into the latent space. The mapping goes one way only: given a data sample, you are not able to tell what coordinates it has in the latent space. One remedy for that is a model called the Variational Autoencoder.

In fact, VAE (2013) was introduced before GAN (2014); the story presented in this post follows my line of thought, not the chronological order of events. Thanks to u/hgjhghjgjhgjd for pointing it out.

Variational Autoencoder – basics

First of all, the Variational Autoencoder model may be interpreted from two different perspectives. The first component of the name, “variational”, comes from Variational Bayesian Methods; the second term, “autoencoder”, has its interpretation in the world of neural networks. VAE is a marriage between these two worlds. In this post we will focus only on the neural network perspective, as the probabilistic interpretation of the VAE model is still – I have to humbly admit – a bit of a mystery to me (you can take a shot though and look at these two).

The high-level interface of the Variational Autoencoder is similar to GAN (although the model interpretation differs) – it also allows you to generate artificial data samples from samples drawn from a simple prior distribution. Additionally, VAE provides a mapping between input data samples and the latent space: given an input data sample and the model, you can easily embed it into the latent space.

Variational Autoencoder – neural networks perspective

An autoencoder, in general, is a function that tries to model the identity mapping on the input data with purposely limited expressive capacity. It is a function that, given an input data vector, tries to reconstruct it. It is represented by a neural network that consists of two subnetworks called the “encoder” and the “decoder”. Input data is crunched through the encoder first, and the resulting vector is a new representation of the input – the encoder output usually has much lower dimensionality than the original input data, so it has to capture the most important data features. From these features the original input vector is reconstructed with the “decoder” network. In its simplest form, training an autoencoder reduces to minimizing the reconstruction error (the difference between the original input and its reconstruction).
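To make that concrete, here is a minimal fully-connected autoencoder sketch in TensorFlow (layer sizes are illustrative, not taken from the repository):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 784])              # flattened input vector

# “Encoder”: project the input to a much lower-dimensional code.
w_enc = tf.Variable(tf.random_normal([784, 32], stddev=0.01))
b_enc = tf.Variable(tf.zeros([32]))
code = tf.nn.relu(tf.matmul(x, w_enc) + b_enc)

# “Decoder”: reconstruct the original input from the code.
w_dec = tf.Variable(tf.random_normal([32, 784], stddev=0.01))
b_dec = tf.Variable(tf.zeros([784]))
x_hat = tf.nn.sigmoid(tf.matmul(code, w_dec) + b_dec)

# Training reduces to minimizing the reconstruction error.
reconstruction_error = tf.reduce_mean(tf.square(x - x_hat))
train_op = tf.train.AdamOptimizer(1e-3).minimize(reconstruction_error)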

Variational Autoencoder

Let’s focus on the encoder a bit. The encoder output is a K-dimensional vector – one of the questions people find interesting is: “Can we randomly choose a K-dimensional vector and feed it to the decoder part of the network? Would we get a result that looks like a data sample?”

The short answer is no. In fact, in a classical autoencoder setting you are not able to control the hidden representation distribution in any way. Moreover, it is rather difficult to sample from it, as the encoder output distribution is unknown.

The remedy for that is to try to control the encoder output distribution – forcing it to be similar to a chosen prior. This is where the “variational” part of the name comes from.

One possible realization of the Variational Autoencoder error function is defined as the sum of the two components below:

\[
\begin{align}
& \left\| X - \mathrm{Decoder}_{\theta}(f(\mathrm{Encoder}_{\phi}(X))) \right\|^2 \\
& \\
& \mathcal{KL}\left(\mathcal{N}(\mathrm{Encoder}_{\phi}(X)) \,||\, \mathcal{N}(0, I)\right)
\end{align}
\]

where
\[
\mathrm{Encoder}_\phi(X) = (\mu(X), \Sigma(X))
\]

and

\[
f(\mu(X), \Sigma(X)) = \mu(X) + \Sigma(X) * \epsilon
\]

where \(\epsilon\) is a sample from \( \mathcal{N}(0,I) \). The first component of the error function is the reconstruction error. The second one (implicitly) forces our encoder distribution to be close to the chosen prior (a Normal distribution in this case). This component emerges from variational Bayesian inference, which is the core of the method.
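To make the pieces concrete, below is a minimal sketch of how these two components could be wired together in TensorFlow. It assumes `x` is the input batch, `mu` and `stddev_log_sq` are the two encoder outputs (matching the encoder code further down), and `decoder` is the decoder network; it illustrates the technique, it is not the exact code from the repository.

import tensorflow as tf

# Reparametrization trick: sample epsilon from N(0, I) and shift/scale it,
# so gradients can flow through mu and stddev_log_sq.
epsilon = tf.random_normal(tf.shape(mu))
z = mu + tf.sqrt(tf.exp(stddev_log_sq)) * epsilon
x_hat = decoder(z)                                   # decoder network (assumed)

# Reconstruction error: squared difference between input and reconstruction.
reconstruction_loss = tf.reduce_sum(tf.square(x - x_hat))

# KL divergence between N(mu, diag(exp(stddev_log_sq))) and N(0, I),
# written in its closed form for diagonal Gaussians.
kl_loss = 0.5 * tf.reduce_sum(tf.exp(stddev_log_sq) + tf.square(mu) - 1.0 - stddev_log_sq)

total_loss = reconstruction_loss + kl_loss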

To better understand the ugly equations above, let’s take a look at a visualization of the network, starting with the encoder part (this is highly inspired by – if not copied from! – Figure 4 in this tutorial):

Variational Autoencoder - Encoder

As you can see, the Encoder is designed to output a random sample from \( \mathcal{N}(\mu(X), \Sigma(X)) \) (the reparametrization trick – using a pre-defined \( \epsilon \) instead of sampling directly from \( \mathcal{N}(\mu(X), \Sigma(X)) \) – makes backpropagation possible). Please note that this is also the place where part of the error function is calculated:

\[
\mathcal{KL}\left(\mathcal{N}(\mathrm{Encoder}_{\phi}(X)) \,||\, \mathcal{N}(0, I)\right)
\]

The Kullback–Leibler divergence is a function that measures dissimilarity (asymmetric, though) between two distributions: the closer the two input distributions are, the lower the value it returns. In other words, this error function component tries to force the encoder output distribution to match the chosen prior – in our simplest case \(\mathcal{N}(0,I) \).
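For the diagonal Gaussian produced by the encoder, this divergence has a simple closed form (a standard result, stated here for completeness), which is what makes it convenient as a loss term:

\[
\mathcal{KL}\left(\mathcal{N}(\mu, \Sigma) \,||\, \mathcal{N}(0, I)\right) = \frac{1}{2} \sum_{k=1}^{K} \left( \Sigma_{kk} + \mu_k^2 - 1 - \log \Sigma_{kk} \right)
\]

where \( \Sigma \) is the diagonal covariance matrix with entries \( \Sigma_{kk} \).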

Let’s take a look at the decoder now:

Variational Autoencoder - Decoder

The input of the decoder is generated by the Encoder network (a sample from \(\mathcal{N}(\mu(X), \Sigma(X)) \)). The remaining error function component – the reconstruction error – is attached at the end of the decoder network:

\[
\left\| X - \mathrm{Decoder}_{\theta}(f(\mathrm{Encoder}_{\phi}(X))) \right\|^2
\]

This part of the error function is responsible for proper reconstruction of the input vectors (as we aim at minimizing a component that can be interpreted as the reconstruction error).

TensorFlow

Implementing a neural network can be very tricky. If you want to do it from scratch you need to take care of the forward pass, differentiation of your network with respect to the chosen error function, optimization techniques (there are many!), dropout, batch normalization and much more. Doing that in a generic way is an error-prone and rather difficult task, even for a team of people. Therefore it makes perfect sense to use an existing tool. I chose TensorFlow, but many alternatives exist.

TensorFlow is an open source library released in November 2015 by Google. TensorFlow is especially handy when one implements machine learning algorithms (deep neural networks in particular). Why is that so? TensorFlow, as the name suggests, is all about tensors flowing around. A tensor is a certain generalization of a matrix (a matrix is a 2D tensor). Please recall that many machine learning models can be represented as such tensors flowing. In particular, neural networks (and so autoencoders) can be interpreted as sequences of tensor-level operations – very often a linear combination followed by an element-wise non-linearity.

One of the TensorFlow core concepts is the computational graph. You want your model (and your training procedure) to be represented as such a computational graph. Once you have your model represented this way, you can easily train it via gradient descent techniques – as automatic differentiation is implemented in TensorFlow.
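A minimal sketch of that workflow (a toy linear model in the TF 1.x style API, not code from the repository): build the graph once, then let TensorFlow differentiate and optimize it for you.

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])

w = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
prediction = tf.matmul(x, w) + b

loss = tf.reduce_mean(tf.square(prediction - y))
# Gradients are derived automatically from the graph structure.
train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_op, feed_dict={x: [[1.0]], y: [[2.0]]})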

Another important thing about TensorFlow is the set of functions that are already implemented there for you. You do not need to implement a convolutional layer yourself or handle its derivative, as this is already in place. Optimization algorithms such as Adam, Adadelta, Adagrad and many others are implemented too.

TensorFlow runs on GPU – you don’t need to write CUDA kernels yourself, though – the only thing you need to do is to ‘choose’ the backend you want the computations to run on (like the GPU backend).

A high-level Python API is available, making TensorFlow friendly for non-experts as well.

All of this lets you focus on modelling your problem directly, making your life a bit easier.

Implementation

You can find my VAE model implementation on GitHub here. I copied the prettytensor deconvolutional layer deconv2.py from this repository.

The code is not perfect and has its limitations. The main problem is that TensorFlow’s built-in save function does not (of course) remember the higher-level custom abstractions built on top of TensorFlow. Therefore, if you want to load your pre-trained model and, for example, use a method to walk through the latent space, you need to make sure your encoder and decoder implementations are exactly the ones used during training. The network implementations are stored in a separate file, but it is still cumbersome and I do realize this is not how it should be handled. The reason for such a bizarre solution is, as usual, lack of knowledge – one day I will figure it out.

The code depends on a set of external libraries – they are listed in the README file on GitHub.

Experiments

To run my experiments I used an AWS g2.2xlarge EC2 instance with an Nvidia K520 Kepler GPU. The cost is about 0.6 USD/h, so it cost just a few bucks to train my models.

Data gathering and preprocessing

For every actor I found a video available online with the actor’s face in frontal position. The videos were about 1 hour long each. Then I used the FeatureExtraction utility from OpenFace to get a set of images of the actor’s face, resized the images to 64×64 and further preprocessed them manually. During manual preprocessing I filtered out obvious outliers: images with no face in them and those with the face occluded. The manually preprocessed dataset of images was packed into HDF5 format (you can use images_to_hdf5 from io_utils to do it), which later served as input to the VAE model.
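For illustration, packing a directory of face crops into HDF5 with h5py might look roughly like the sketch below (an assumption about the format, not the repository’s actual images_to_hdf5 implementation; the file names are hypothetical):

import glob
import h5py
import numpy as np
from PIL import Image

# Illustrative paths; the repository utility may use different names and layout.
paths = sorted(glob.glob("trump_faces/*.png"))
images = np.stack([np.asarray(Image.open(p), dtype=np.float32) / 255.0 for p in paths])

with h5py.File("trump_faces.hdf5", "w") as f:
    f.create_dataset("images", data=images)   # shape: (N, 64, 64, 3)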

HDF5 files are available to download here: Donald Trump, Edward Snowden

Model and training

The Encoder and Decoder implementations currently in the repository are the ones that produced the results presented in this post.

import prettytensor as pt
import tensorflow as tf

# VAEEncoderBase and VAEDecoderBase are base classes defined elsewhere in the repository.

class ConvolutionalEncoder(VAEEncoderBase):

    def __init__(self, input_tensor_size, representation_size, batch_size):
        VAEEncoderBase.__init__(self, input_tensor_size, representation_size, batch_size)

    def guts(self):
        # Four strided convolutions progressively reduce the spatial resolution
        # of the 64x64 input while increasing the number of feature maps.
        conv_layers = (pt.wrap(self.input_data).
                        conv2d(4, 32, stride=2, name="enc_conv1").
                        conv2d(4, 64, stride=2, name="enc_conv2").
                        conv2d(4, 128, stride=2, name="enc_conv3").
                        conv2d(4, 256, stride=2, name="enc_conv4").
                        flatten())

        # Two linear heads output the latent mean and the log of the variance.
        mu = conv_layers.fully_connected(self.representation_size, activation_fn=None, name="mu")
        stddev_log_sq = conv_layers.fully_connected(self.representation_size, activation_fn=None, name="stddev_log_sq")
        return mu, stddev_log_sq

class DeconvolutionalDecoder(VAEDecoderBase):

    def __init__(self, representation_size, batch_size):
        VAEDecoderBase.__init__(self, representation_size, batch_size)

    def guts(self, batch_size=None):
        batch_size = self._determine_batch_size(batch_size)

        # The latent vector is projected up, reshaped, and upsampled with
        # strided deconvolutions; the final sigmoid keeps pixels in [0, 1].
        return (pt.wrap(self.latent_var).
                fully_connected(4*256, activation_fn=None, name="dec_fc1").
                reshape([batch_size, 1, 1, 4*256]).
                deconv2d(5, 128, stride=2, edges='VALID', name="dec_deconv2").
                deconv2d(5, 64, stride=2, edges='VALID', name="dec_deconv3").
                deconv2d(6, 32, stride=2, edges='VALID', name="dec_deconv4").
                deconv2d(6, 3, stride=2, edges='VALID', activation_fn=tf.nn.sigmoid, name="dec_deconv5")).tensor

Training lasts 100 epochs (with dataset shuffling after every epoch), batch normalization is used, and the model is trained with the Adam optimization technique. The output of the decoder is modelled with a sigmoid; the other activation functions (except those modelling mu and stddev_log_sq) are ReLU.
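A rough sketch of what such a training loop might look like (assuming `total_loss` is the VAE loss sketched earlier, `input_placeholder` is the input tensor, and `dataset` and `batch_size` are defined; this is an illustration, not the repository’s exact training code):

import numpy as np
import tensorflow as tf

train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(total_loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for epoch in range(100):
        np.random.shuffle(dataset)                     # reshuffle the dataset after every epoch
        for start in range(0, len(dataset), batch_size):
            batch = dataset[start:start + batch_size]
            sess.run(train_op, feed_dict={input_placeholder: batch})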

Reconstruction results – quick summary

To generate the resulting videos (at the very top of this post) I captured a short recording of my face trying different face expressions (some of them could be considered funny!). I kept the processing flow the same: first I recorded the videos, then passed them through OpenFace to get the images. Then I used the reconstruct command to reconstruct the original actor’s face based on the latent space representation of my face. I used the standard ImageMagick and ffmpeg utilities to compose a video with a border from the resulting images.

The quality of the reconstruction depends on the quality and length of the video and the variability of the expressions presented in it. For example, the Donald Trump faces are captured from one of his speeches during a rally, where face expressions are rich. The video of the Edward Snowden Q&A session was rather emotionless, as he tried to answer serious questions.

Because of EC2 costs, I never really experimented with the network architecture – that could be room for potential improvement. Another thing that comes to my mind is to generate extra data samples by taking horizontally mirrored copies of the images – some actors tend to turn their head in one direction more often. The latent space dimensionality could be increased as well; experiments where I decreased it to 30 failed, though.
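A one-line numpy sketch of that mirror augmentation (assuming `images` is an (N, height, width, channels) array of face crops):

import numpy as np

mirrored = images[:, :, ::-1, :]                        # horizontal mirror of every image
augmented = np.concatenate([images, mirrored], axis=0)  # doubled training set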

---