There are many tasks in image processing that can be solved with Convolutional Neural Networks (CNNs). One of these tasks is image style transfer. The goal of image style transfer is to apply the style of one image to the content of another image. This way you can, for example, create a drawing of yourself in the style of Van Gogh.
In this article, I am going to explain how the style can be extracted from one image and transferred to the content of another.
I also wrote an overview paper on Image Style Transfer using Convolutional Neural Networks for a computer vision seminar at my university.
The first paper that uses CNNs for style transfer is called Image Style Transfer Using Convolutional Neural Networks and was published by Leon A. Gatys, Alexander S. Ecker and Matthias Bethge at CVPR 2016. If you don’t have access to the paper, you can also read the pre-print on arXiv. This article is mainly based on the paper by Gatys et al.
Convolutional Neural Networks
To understand how style transfer works, you have to understand CNNs. These are a special kind of Artificial Neural Network and are heavily used in many image processing tasks such as image classification, object detection, depth estimation, semantic segmentation and style transfer.
CNNs consist of multiple convolutional layers which apply filters to the output of the previous layer. In contrast to classical image processing, these filters do not have to be designed by hand but are learned end-to-end using back-propagation. By stacking multiple convolutional layers, the network can learn different features. The filters in the first layers learn simple patterns like edges or corners, while the filters in the last layers learn complex patterns like prototypes of faces, cars, buildings, etc. The increasing complexity along the layers is caused by the growing receptive field of the neurons in deeper layers.
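To make the idea of stacked convolutional layers a bit more concrete, here is a minimal sketch in PyTorch. The framework, layer sizes and kernel sizes are my own illustrative choices, not the architecture used by Gatys et al.:

```python
import torch
import torch.nn as nn

# A minimal stack of convolutional layers (illustrative only).
# Each 3x3 convolution and each pooling step enlarges the receptive
# field of the neurons in the following layer.
layers = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),    # early layer: edges, corners
    nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),  # deeper layer: more complex patterns
    nn.ReLU(),
    nn.MaxPool2d(2),                               # pooling further increases the receptive field
)

x = torch.randn(1, 3, 224, 224)  # a dummy RGB image
features = layers(x)             # feature maps of shape (1, 128, 112, 112)
```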
Extracting style and content from CNN feature maps
The style of an image (color distribution, brush stroke style, …) can be separated from its content in a simple way. As stated in the previous section, the filters in the last layers of the CNN learn more complex patterns and abstract away from raw pixel values. Put simply, they learn where people, cats, dogs, cars, etc. are located in the image. So to extract the content of an image, the feature maps of the last layers are relevant.
The first layers, on the other hand, capture more of the local structures, colors and other stylistic properties of an image. In contrast to the content of an image, the style cannot be extracted directly. Instead, one has to calculate the correlations between the feature maps of a number of lower convolutional layers. These correlations are captured by Gram matrices \(G^{l}\).
$$G_{ij}^{l} = \sum_{k}F_{ik}^{l}F_{jk}^{l}$$
Here, \(F_{ik}^{l}\) refers to the activation value of the \(i\)th filter at position \(k\) in layer \(l\). Note that \(k\) is a single scalar value even though the image (and every feature map) is two-dimensional. That’s because, in order to calculate the Gram matrix, each 2D feature map is flattened into a 1D vector by concatenating its rows. This results in a vector of dimension \(M^{l} = W^{l}H^{l}\) (width \(W^l\) times height \(H^l\) of layer \(l\)) for every feature map in this layer \(l\). When all these vectors are written as rows of a matrix \(F^{l}\), we have \(N^{l}\) rows, each of length \(M^l\), so the matrix \(F^{l}\) has dimension \(N^{l} \times M^{l}\) and stores the activation values of all filters in layer \(l\).
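As a small illustration of how \(F^l\) and \(G^l\) are computed, here is a short PyTorch sketch. The function name and the assumption that the batch dimension has already been removed are mine, not from the paper:

```python
import torch

def gram_matrix(feature_maps: torch.Tensor) -> torch.Tensor:
    """Compute the Gram matrix G^l for the feature maps of one layer.

    `feature_maps` is assumed to have shape (N_l, H_l, W_l), i.e. N_l filters
    with spatial size H_l x W_l (batch dimension omitted for clarity).
    """
    n_l, h_l, w_l = feature_maps.shape
    # Flatten each 2D feature map into a row of length M_l = H_l * W_l,
    # giving the matrix F^l of shape (N_l, M_l).
    f_l = feature_maps.reshape(n_l, h_l * w_l)
    # G^l_{ij} = sum_k F^l_{ik} F^l_{jk}  ->  a single matrix product F^l (F^l)^T
    return f_l @ f_l.t()
```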
Remember that the style of an image is represented by the correlations of different filters. Two filters are highly correlated when their activation values are high at the same positions. That’s exactly what \(G_{ij}^{l}\) measures: for every position \(k \in \{1, \dots, M^l\}\), the activation value of filter \(i\) is multiplied with the value of filter \(j\). If there is a high correlation between two different filters \(i\), \(j\) in a layer \(l\), the Gram matrix \(G^l\) will have a high value at row \(i\), column \(j\). The whole matrix then represents the correlations between all filters in a given layer.
To sum up: the content of an image is represented by the feature maps \(F^l\) of a high-level layer \(l\) (because of its large receptive field), while the style is represented by the correlations of the feature maps on one or more layers \(l\), each layer described by its Gram matrix \(G^l\).
Gatys et al. use the VGG-19 network to extract the feature maps, but you could use any other CNN that was trained for object recognition.
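If you want to follow along in code, the feature maps can be extracted from a pre-trained VGG-19 roughly like this. This is a sketch using torchvision; the layer indices are my own mapping of the layer names conv1_1–conv5_1 (style) and conv4_2 (content) used in the paper onto torchvision's vgg19, so treat them as an assumption:

```python
import torch
import torchvision.models as models

# Load the pre-trained VGG-19 and freeze its weights; we only use it as a
# fixed feature extractor.
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

# Layer indices in torchvision's vgg19 that (to my understanding) correspond
# to conv1_1, conv2_1, conv3_1, conv4_1, conv5_1 and conv4_2.
STYLE_LAYERS = {0, 5, 10, 19, 28}
CONTENT_LAYERS = {21}

def extract_features(image: torch.Tensor):
    """Run `image` through VGG-19 and collect the relevant feature maps."""
    content_feats, style_feats = {}, {}
    x = image
    for idx, layer in enumerate(vgg):
        x = layer(x)
        if idx in STYLE_LAYERS:
            style_feats[idx] = x
        if idx in CONTENT_LAYERS:
            content_feats[idx] = x
    return content_feats, style_feats
```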
Applying the style of one image to the content of another image
Now we can extract the content and style of an image. In order to transfer the style to another image, Gatys et al. use an optimization procedure driven by backpropagation, the same algorithm that is normally used to train neural networks. When you train a network with backpropagation, you have fixed training data and initial weights, and you optimize the weights so that the error the network makes gets smaller. But as I already wrote in the previous section, we use a pre-trained network (VGG-19) to extract the features, so we do not want to change the network’s weights. Instead, we want to transfer the style onto an image.
To achieve this, Gatys et al. define an error function that is differentiated not w.r.t. the weights but w.r.t. the pixel values of the image \(x\) that should be generated (combining the content of one image \(c\) with the style of another image \(s\)).
$$L_{total}(c,s,x) = \alpha L_{content}(c,x) + \beta L_{style}(s,x)$$
The total loss is a linear combination of the content loss and the style loss, weighted by \(\alpha\) and \(\beta\) so that the user can control how important content and style are.
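Putting the pieces together, a minimal optimization loop could look like the sketch below. It reuses the `extract_features` and `gram_matrix` sketches from above, assumes `content_image` and `style_image` are preprocessed tensors of shape (1, 3, H, W), and uses illustrative values for \(\alpha\), \(\beta\) and the number of steps; the per-layer style weights and normalization constants from the paper are omitted for brevity.

```python
import torch
import torch.nn.functional as F

alpha, beta = 1.0, 1e3  # illustrative weights, not the values from the paper

# Fixed targets: content features of c and style Gram matrices of s.
content_targets, _ = extract_features(content_image)
content_targets = {l: f.detach() for l, f in content_targets.items()}
_, style_feats_s = extract_features(style_image)
style_targets = {l: gram_matrix(f.squeeze(0)).detach() for l, f in style_feats_s.items()}

# The generated image x is the variable being optimized, not the network
# weights. It is initialized from the content image for this sketch.
x = content_image.clone().requires_grad_(True)
optimizer = torch.optim.LBFGS([x])

def closure():
    optimizer.zero_grad()
    content_feats_x, style_feats_x = extract_features(x)
    content_loss = sum(F.mse_loss(content_feats_x[l], content_targets[l])
                       for l in content_targets)
    style_loss = sum(F.mse_loss(gram_matrix(style_feats_x[l].squeeze(0)), style_targets[l])
                     for l in style_targets)
    loss = alpha * content_loss + beta * style_loss
    loss.backward()
    return loss

for _ in range(50):
    optimizer.step(closure)
```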
To be continued… In the meantime, you can read my seminar paper that I wrote last semester at university.