Many problems in computer graphics and computer vision can be framed as translating an input image into a corresponding output image: colorizing a black-and-white photo, converting horses to zebras, assigning a class to every pixel (image segmentation), and so on. This article covers two very interesting approaches based on generative adversarial networks - GANs [1, 2, 3]. We assume the reader is familiar with GANs, so make sure to refresh your memory on the topic [1].
Traditionally, the image translation problem has been tackled with many different techniques depending on the problem setting, the available data, and the domain (RGB to semantic map, black-and-white to RGB, and so on), but in essence the goal of image translation is to predict output pixels from input pixels [2].
The following image shows some examples of image translation problems.
This article will mainly cover the following two papers:
- Image-to-Image Translation with Conditional Adversarial Networks [2] - Pix2pix
- Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks [3] - Cycle GAN
Images used in this article are taken from [2, 3] unless otherwise stated.
Image-to-Image Translation with Conditional Adversarial Networks - Pix2pix
The paper tackles the image translation problem with GANs [1] by developing a common framework that can be applied to many different problems for which paired training data is available. As in any GAN, there is a generator $G$ and a discriminator $D$: the generator learns to transform an image $a$ from domain $A$ into an image $b$ from domain $B$, while the discriminator tries to classify whether its input image was created by the generator or came from the available dataset.
Pix2pix assumes that paired data is available for the image translation problem being solved. In contrast, Cycle GAN, discussed later in this article, is designed to work with unpaired data.
Having paired data also allows Pix2pix to further constrain the learning process by conditioning the discriminator on the input image. To accomplish this, conditional GANs (cGANs) are used [4].
Training is accomplished by optimizing the following objective:
$$
\mathcal{L}_{cGAN} (G, D) = \mathbb{E}_{x, y} \Big[ \log D(x, y) \Big] + \mathbb{E}_{x, z} \Big[\log (1 - D(x, G(x, z))) \Big]
$$
The generator $G$ tries to minimize this objective against an adversarial $D$ that tries to maximize it:
$$
G^* = \arg \min_G \max_D \mathcal{L}_{cGAN} (G, D)
$$
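To make the objective more concrete, here is a minimal sketch of its two sides, assuming PyTorch and a discriminator that is conditioned by channel-wise concatenating the input $x$ with the real or generated image; `netG` and `netD` are placeholder modules, and the noise $z$ is omitted (in the paper it is injected only through dropout in the generator).

```python
import torch
import torch.nn.functional as F

def cgan_d_loss(netD, netG, x, y):
    """Discriminator side of L_cGAN: real pairs (x, y) vs. fake pairs (x, G(x))."""
    fake = netG(x).detach()                           # do not backprop into G here
    logits_real = netD(torch.cat([x, y], dim=1))      # D(x, y)
    logits_fake = netD(torch.cat([x, fake], dim=1))   # D(x, G(x))
    loss_real = F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
    loss_fake = F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake))
    return loss_real + loss_fake

def cgan_g_loss(netD, netG, x):
    """Generator side: fool D into labeling (x, G(x)) as real (non-saturating form)."""
    fake = netG(x)
    logits_fake = netD(torch.cat([x, fake], dim=1))
    return F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
```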
For research purposes, the paper also evaluates a variant of this loss without conditioning the discriminator on the input image.
$$
\mathcal{L}_{GAN} (G, D) = \mathbb{E}_{y} \Big[ \log D(y) \Big] + \mathbb{E}_{x, z} \Big[\log (1 - D(G(x, z))) \Big]
$$
Some papers have shown that mixing the GAN objective with a traditional loss, such as the $\ell_1$ or $\ell_2$ distance, can be beneficial during learning [5]. This forces the generator not only to try to fool the discriminator, but also to stay close to the ground-truth output at the pixel level in the $\ell_1$ or $\ell_2$ sense. This makes sense when paired data is available, because input and output share much of their structure - a black-and-white photo compared to its colorized version, for example.
$$
\mathcal{L}_{L1} (G) = \mathbb{E}_{x, y, z} \Big[ ||y - G(x, z)||_1 \Big]
$$
Putting it all together, the final objective for optimization is:
$$
G^* = \arg \min_{G} \max_{D} \mathcal{L}_{cGAN} (G, D) + \lambda \mathcal{L}_{L1} (G)
$$
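As a rough sketch of the generator's side of this objective (same PyTorch assumptions as the earlier snippet; $\lambda = 100$ is the value commonly used with Pix2pix, not something fixed by the formula itself):

```python
import torch
import torch.nn.functional as F

def generator_objective(netD, netG, x, y, lambda_l1=100.0):
    """Generator loss for a single batch: adversarial term + lambda * L1 term."""
    fake = netG(x)
    logits_fake = netD(torch.cat([x, fake], dim=1))
    adv = F.binary_cross_entropy_with_logits(logits_fake, torch.ones_like(logits_fake))
    l1 = F.l1_loss(fake, y)                           # E[||y - G(x)||_1]
    return adv + lambda_l1 * l1
```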
Model and architecture
The generator and discriminator architectures are adapted from [6]. The most important points to mention are:
- Generator uses skip connections
- Discriminator works on patches
Generator
Image-to-image translation problems often map a high-resolution image to a different high-resolution image. A traditional bottleneck layer in the generator forces it to learn a complex mapping between input and output, even though this is not required when the two images share overall structure. For example, in image colorization (converting black-and-white images to color) there is no need to encode and decode the structure and texture of the image, as they are almost identical in input and output.
The importance of skip connections is best illustrated in the following image.
Adding skip connections to the encoder-decoder structure of the generator drastically improves the quality of generated images. The image also shows the impact of using only the $\ell_1$ loss compared to the full Pix2pix objective.
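A heavily simplified sketch of a U-Net-style generator with skip connections, assuming PyTorch; the real Pix2pix generator is deeper and uses dropout, so treat this only as an illustration of how encoder features are concatenated into the decoder.

```python
import torch
import torch.nn as nn

class TinyUNetGenerator(nn.Module):
    """Much-reduced U-Net-style generator: encoder-decoder with skip connections."""
    def __init__(self, in_ch=3, out_ch=3, base=64):
        super().__init__()
        # Encoder: each block halves the spatial resolution.
        self.down1 = nn.Sequential(nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2))
        self.down2 = nn.Sequential(nn.Conv2d(base, base * 2, 4, 2, 1),
                                   nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2))
        self.down3 = nn.Sequential(nn.Conv2d(base * 2, base * 4, 4, 2, 1),
                                   nn.BatchNorm2d(base * 4), nn.LeakyReLU(0.2))
        # Decoder: each block doubles the resolution; inputs include concatenated skips.
        self.up1 = nn.Sequential(nn.ConvTranspose2d(base * 4, base * 2, 4, 2, 1),
                                 nn.BatchNorm2d(base * 2), nn.ReLU())
        self.up2 = nn.Sequential(nn.ConvTranspose2d(base * 4, base, 4, 2, 1),
                                 nn.BatchNorm2d(base), nn.ReLU())
        self.up3 = nn.Sequential(nn.ConvTranspose2d(base * 2, out_ch, 4, 2, 1), nn.Tanh())

    def forward(self, x):
        d1 = self.down1(x)                            # skip 1
        d2 = self.down2(d1)                           # skip 2
        d3 = self.down3(d2)                           # bottleneck
        u1 = self.up1(d3)
        u2 = self.up2(torch.cat([u1, d2], dim=1))     # skip connection
        return self.up3(torch.cat([u2, d1], dim=1))   # skip connection

# Example: a (1, 3, 256, 256) input maps to a (1, 3, 256, 256) output.
# out = TinyUNetGenerator()(torch.randn(1, 3, 256, 256))
```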
Discriminator and PatchGAN
The general idea is that, in order to enforce proper structure and texture, it is enough for the discriminator to judge patches of the image rather than the whole image. This also reduces the computational complexity of the discriminator, which can improve overall GAN training time. The discriminator is convolutional: its output is a downsampled feature map in which every pixel classifies one patch of the original image as real or fake, and the patch size (receptive field) is a hyperparameter introduced by the Patch GAN.
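A minimal sketch of a PatchGAN-style discriminator, assuming PyTorch; the layer count here does not reproduce the exact $70\times70$ receptive field from the paper, but it illustrates that the output is a grid of per-patch logits rather than a single score.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Convolutional discriminator that outputs a grid of real/fake logits,
    one per patch of the input, rather than a single scalar."""
    def __init__(self, in_ch=6, base=64):   # in_ch=6 when conditioned on the input image
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(base, base * 2, 4, 2, 1), nn.BatchNorm2d(base * 2), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 2, base * 4, 4, 2, 1), nn.BatchNorm2d(base * 4), nn.LeakyReLU(0.2),
            nn.Conv2d(base * 4, 1, 4, 1, 1),   # 1-channel map of patch logits
        )

    def forward(self, xy):
        # xy: input and real/fake image concatenated along the channel dimension
        return self.net(xy)   # shape (N, 1, H', W'); each logit scores one patch
```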
Patch GAN was evaluated on images of size $286\times286$ and patches of size:
- $1\times1$ (called Pixel GAN)
- $16\times16$
- $70\times70$
- $286\times286$ (called Image GAN)
The following image illustrates the influence of different patch sizes.
A patch size of $1\times1$ increases the colorfulness of results but does not influence spatial sharpness. Increasing the patch size to $16\times16$ improves spatial sharpness, but artifacts also start appearing. Moving to $70\times70$ improves results quite a bit - artifacts are reduced and spatial sharpness is increased. Increasing the patch size to the full image size, $286\times286$, gives very similar results.
More details on overall model architecture can be found in the original paper.
Training
The paper reports nothing new concerning the training phase compared to classic GAN training techniques. The loss is optimized by alternating updates of the generator $G$ and the discriminator $D$. The classic trick from [1] is also used: instead of training $G$ to minimize $\log \Big(1 - D(x, G(x, z))\Big)$, it is trained to maximize $\log D(x, G(x, z))$. The objective is divided by 2 while optimizing $D$ in order to slow down the rate at which $D$ learns relative to $G$. The Adam optimizer is used with a learning rate of $0.0002$, $\beta_1 = 0.5$ and $\beta_2 = 0.999$.
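A minimal sketch of one training iteration under these settings, assuming PyTorch and reusing the loss helpers sketched earlier; `netG`, `netD`, and `dataloader` are placeholders, not the authors' code.

```python
import torch

# Assumes netG, netD, dataloader, cgan_d_loss and generator_objective exist
# (see the earlier sketches); hyperparameters follow the values reported above.
opt_g = torch.optim.Adam(netG.parameters(), lr=2e-4, betas=(0.5, 0.999))
opt_d = torch.optim.Adam(netD.parameters(), lr=2e-4, betas=(0.5, 0.999))

for x, y in dataloader:                      # paired samples (input, target)
    # 1) Discriminator step: objective halved to slow D down relative to G.
    opt_d.zero_grad()
    d_loss = 0.5 * cgan_d_loss(netD, netG, x, y)
    d_loss.backward()
    opt_d.step()

    # 2) Generator step: non-saturating GAN term plus lambda * L1 term.
    opt_g.zero_grad()
    g_loss = generator_objective(netD, netG, x, y, lambda_l1=100.0)
    g_loss.backward()
    opt_g.step()
```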
Experiments and results
Some problems examined in the paper are:
- Semantic labels to photo (inverse of semantic segmentation)
- Architectural labels to photo
- Map to an aerial photo
- BW to color photos
- Edges to photo
- Sketch to photo
- Day to night
- Thermal to color photos
- Photo with missing pixels - inpainted photo
The paper also gives an ablation study of the components of the loss function and analyzes how each component influences the final model.
$$
G^* = \arg \min_{G} \max_{D} \mathcal{L}_{cGAN} (G, D) + \lambda \mathcal{L}_{L1} (G)
$$
From the image it can be seen that $\ell_1$ alone gives reasonable but blurry results. With the $\mathcal{L}_{cGAN}$ loss alone, results get sharper but artifacts appear. Combining the two reduces the frequency and intensity of artifacts.
Detailed results are available in the original paper. Nice results can be obtained even on small datasets:
- The facade training set consists of just 400 images and was trained in under 2 hours on a single Pascal Titan X GPU
- The day-to-night training set consists of only 91 images
Map to the aerial photo (and inverse)
Image colorization
Cityscapes labels to photo
Facade labels to photo
Sketch to photo
Sketch to photo
Sketch to photo
Image inpainting
Thermal to photo
Removing background
Examples of failure
Because the discriminator works on fixed-size patches, the system can be applied to images of varying size. For example, it can be trained on $256\times256$ images and tested on $512\times512$ images.
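Since such a discriminator is fully convolutional, it accepts larger inputs directly and simply produces a larger grid of patch logits. A small illustration, reusing the `PatchDiscriminator` sketch from above (output sizes are illustrative):

```python
import torch

netD = PatchDiscriminator(in_ch=6)
print(netD(torch.randn(1, 6, 256, 256)).shape)   # e.g. torch.Size([1, 1, 31, 31])
print(netD(torch.randn(1, 6, 512, 512)).shape)   # e.g. torch.Size([1, 1, 63, 63])
```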
Evaluation of synthesized images is an open and difficult problem, because traditional metrics such as pixel-wise mean squared error do not capture structure and semantics. The paper employs two strategies for evaluation:
- Holistic approach
- Using externally trained classifiers on synthesized images
The holistic approach relies on humans performing real-vs-fake studies on Amazon Mechanical Turk (AMT). The other approach uses a pre-trained semantic segmentation model such as FCN-8s [7] to classify synthesized images when ground-truth labels are available for comparison.
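For illustration, the segmentation-based evaluation boils down to running a pre-trained segmenter on synthesized images and comparing its predictions to the ground-truth label maps. A minimal sketch of a per-pixel accuracy helper follows; the `segmenter` in the usage comment is a hypothetical pre-trained model, not the paper's exact FCN-score code.

```python
import torch

def per_pixel_accuracy(pred_labels, true_labels):
    """Fraction of pixels where the segmenter's prediction on a synthesized image
    matches the ground-truth label map (one ingredient of an FCN-score-style metric)."""
    return (pred_labels == true_labels).float().mean().item()

# Hypothetical usage: `segmenter` is a pre-trained semantic segmentation network
# (e.g. an FCN-8s-like model) and `fake_photo` is a generator output.
# pred = segmenter(fake_photo).argmax(dim=1)
# acc = per_pixel_accuracy(pred, ground_truth_labels)
```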
Pix2pix and the community
Since the publication of the paper, the community has had fun! Some nice and funny use-cases were introduced:
- Transforming edges to cats
- Transforming sketches to pokemon
- Transforming sketches to portraits
- Generating color palettes
- Removing background
- Learning to see: Gloomy Sunday - video
Conclusion
Conditional GANs have performed well on paired image-to-image translation problems. The approach Pix2pix introduced is general enough to tackle most paired image-to-image translation problems. Working on patches has also reduced the computational complexity, and using a U-Net-style generator has drastically improved the quality of generated images. It is important to note that applying the Pix2pix approach requires paired data, which is sometimes not available. In the next part we cover the Cycle GAN model, which improves on Pix2pix by working with unpaired data.
Cycle GAN - Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
Having paired data available is actually rather rare, and collecting it can require a large amount of resources. There are special cases where paired data is naturally available, such as season change, segmentation problems, or image colorization, but in the general case it would be good to have an approach similar to Pix2pix that can also work with unpaired data.
The goal of this model is to produce results comparable to Pix2pix while learning without paired data. This is exactly what Cycle GAN does [3]. It builds on the work from Pix2pix by introducing a cycle-consistency constraint into the optimization process.
Cycle GAN approach
In Pix2pix, model $G$ was trained to translate images from domain $X$ to domain $Y$. Cycle GAN does the same, but additionally it also trains a model $F$ that translates images in the opposite direction - from domain $Y$ to domain $X$.
This introduces a cycle, hence the name, Cycle GAN.
Back-translation is the concept of translating from A to B and then translating the result from B back to A, in order to check how close the reconstruction is to the original content. There is also an interesting story about Mark Twain and his work The Celebrated Jumping Frog of Calaveras County.
After discovering the French translation of his text and noticing how much of his signature humor and style were lost, Twain re-translated the French version word for word with intentional incoherency back into English with a new title The Jumping Frog: In English, Then in French, and Then Clawed Back Into A Civilized Language Once More by Patient, Unremunerated Toil to illustrate the problem of losing deep and subtle semantics during the translation process.
Cycle GAN consists of:
- 2 generators - $F$ and $G$
- 2 discriminators - $D_X$ and $D_Y$
- 2 additional losses:
- forward cycle-consistency loss: $x \rightarrow G(x) \rightarrow F(G(x)) \approx x$
- backward cycle-consistency loss: $y \rightarrow F(y) \rightarrow G(F(y)) \approx y$
Cycle GAN learns mapping functions between two domains, $X$ and $Y$, given training samples $\{x_i\}_{i=1}^{N}$, $x_i \in X$, and $\{y_j\}_{j=1}^{M}$, $y_j \in Y$. Generator $G$ converts samples from domain $X$ to $Y$, while generator $F$ does the opposite, converting samples from $Y$ to $X$. Discriminators $D_X$ and $D_Y$ perform binary classification, trying to determine whether samples belong to domain $X$ and $Y$, respectively.
The objective consists of:
- Adversarial loss: $\mathcal{L}_{GAN} (G, D_Y, X, Y)$ and $\mathcal{L}_{GAN} (F, D_X, Y, X)$
- Cycle consistency loss: $\mathcal{L}_{cyc} (G, F)$
Adversarial loss
Adversarial loss is applied to both $G$ and $F$. For generator $G$ and discriminator $D_Y$, it can be formulated as:
$$
\mathcal{L}_{GAN} (G, D_Y, X, Y) =
\mathbb{E}_{y \sim p_{data}(y)} \Big[ \log D_{Y}(y) \Big]
+ \mathbb{E}_{x \sim p_{data}(x)} \Big[ \log \big(1 - D_Y (G(x)) \big) \Big]
$$
$G$ tries to minimize this loss while the discriminator $D_Y$ tries to maximize it. This can be expressed as:
$$
\min_G \max_{D_Y} \mathcal{L}_{GAN} (G, D_Y, X, Y)
$$
It is similar for $F$ and $D_X$ and can be formulated as:
$$
\min_F \max_{D_X} \mathcal{L}_{GAN} (F, D_X, Y, X)
$$
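A minimal sketch of this adversarial term, assuming PyTorch; it follows the log-loss formulation above, although the released Cycle GAN implementation replaces it with a least-squares loss for more stable training. `G` and `D_Y` are placeholder modules, and the symmetric loss for $F$ and $D_X$ is obtained by swapping the roles of the two domains.

```python
import torch
import torch.nn.functional as F

def adversarial_losses(G, D_Y, x, y):
    """L_GAN(G, D_Y, X, Y) split into its discriminator and generator parts."""
    fake_y = G(x)
    # Discriminator D_Y: real samples from Y vs. translated samples G(x).
    logits_real = D_Y(y)
    logits_fake = D_Y(fake_y.detach())
    d_loss = (F.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))
    # Generator G: try to make D_Y label G(x) as real.
    g_logits = D_Y(fake_y)
    g_loss = F.binary_cross_entropy_with_logits(g_logits, torch.ones_like(g_logits))
    return d_loss, g_loss
```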
Cycle consistency loss
The cycle consistency loss captures how different the reconstructed sample is from the original sample. For example, how different is Twain's original story from the version that was translated from English to French and then back to English?
The cycle consistency loss consists of:
- Forward cycle consistency: $x \rightarrow G(x) \rightarrow F(G(x)) \approx x$
- Backward cycle consistency: $y \rightarrow F(y) \rightarrow G(F(y)) \approx y$
The loss is then formulated as:
$$
\mathcal{L}_{cyc} (G, F) =
\mathbb{E}_{x \sim p_{data}(x)} \big[ \| F(G(x)) - x \|_1 \big] +
\mathbb{E}_{y \sim p_{data}(y)} \big[ \| G(F(y)) - y \|_1 \big]
$$
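A minimal sketch of this term, assuming PyTorch; the second generator is named `F_net` only to avoid clashing with `torch.nn.functional`.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G, F_net, x, y):
    """L_cyc(G, F): reconstruct x via F(G(x)) and y via G(F(y)), penalized in L1."""
    forward_cyc = F.l1_loss(F_net(G(x)), x)    # x -> G(x) -> F(G(x)) ~ x
    backward_cyc = F.l1_loss(G(F_net(y)), y)   # y -> F(y) -> G(F(y)) ~ y
    return forward_cyc + backward_cyc
```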
Full objective
The final loss function is expressed as:
$$
\mathcal{L} (G, F, D_X, D_Y) =
\mathcal{L}_{GAN} (G, D_Y, X, Y) + \mathcal{L}_{GAN} (F, D_X, Y, X) + \lambda \mathcal{L}_{cyc} (G, F)
$$
And the full objective is:
$$
G^*, F^* = \arg \min_{G, F} \max_{D_X, D_Y} \mathcal{L} (G, F, D_X, D_Y)
$$
Training details
Training is executed in the traditional GAN style, alternating generator and discriminator updates. The paper uses $\lambda = 10$ along with the Adam optimizer, an initial learning rate of $0.0002$, and a batch size of 1. The learning rate is fixed for the first 100 epochs and then linearly decayed to 0 over the next 100 epochs. The discriminators use a $70\times70$ Patch GAN.
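A sketch of this learning-rate schedule using PyTorch's `LambdaLR`; the optimizer parameters here are placeholders, and $\beta_1 = 0.5$ is an assumption taken from common GAN practice rather than quoted from the paper.

```python
import torch

# Keep the learning rate at 2e-4 for the first 100 epochs,
# then decay it linearly to 0 over the next 100 epochs.
def linear_decay(epoch, constant_epochs=100, decay_epochs=100):
    if epoch < constant_epochs:
        return 1.0
    return max(0.0, 1.0 - (epoch - constant_epochs) / decay_epochs)

params = [torch.nn.Parameter(torch.zeros(1))]           # placeholder parameters
optimizer = torch.optim.Adam(params, lr=2e-4, betas=(0.5, 0.999))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_decay)
# During training: call optimizer.step() per batch, then scheduler.step() once per epoch.
```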
Evaluation
Cycle GAN seems to work rather well in experiments given in the paper. The following two images demonstrate how Cycle GAN performs compared to the earlier methods.
The paper reports the results of multiple experiments:
- unpaired image-to-image translation problems
- paired image-to-image translation problems
- ablations focusing on the importance of the cycle and adversarial losses
- experiments focusing on generality across different datasets
An experiment done via Amazon Mechanical Turk (AMT) showed that images generated by Cycle GAN fooled participants into thinking they were real around 25% of the time.
On paired image-to-image translation problems, Cycle GAN is comparable to Pix2pix:
All parts of the loss function influence the quality of the model. The following image shows the classification performance of the photo-to-labels model for different losses, evaluated on Cityscapes.
This can also be seen by visualizing the generated images.
Cycle GAN can be applied to the same problems as Pix2pix. The following few images show the obtained results.
Style transfer
Object transfiguration (horse to zebra)
Season transfer
Photo enhancement
Failure cases
Results are impressive in many of the previously shown images, but things can also fail for many reasons - lack of data, mode collapse, special cases that are not present in the dataset, and so on. The authors conclude that the model works rather well when it needs to learn transformations involving texture and color changes, but mostly breaks when it needs to learn geometric transformations, such as transforming a cat into a dog.
Conclusion
Cycle GAN can generate very good results on both unpaired and paired data. One practically relevant use case is learning a model that bridges the distribution gap between synthetic and real data, in situations where synthetic data is cheap and easy to obtain compared to real data. It is also interesting how learning the inverse mapping allows the model to learn from unpaired datasets.
Although the results are really impressive, the usual GAN pain points - mode collapse, unstable training, and the large amount of computational resources required - still remain.
Resources
- [1] Generative Adversarial Networks, Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
- [2] Image-to-Image Translation with Conditional Adversarial Networks, Phillip Isola and Jun-Yan Zhu and Tinghui Zhou and Alexei A. Efros
- [3] Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks, Jun-Yan Zhu, Taesung Park, Phillip Isola, Alexei A. Efros
- [4] Conditional Generative Adversarial Nets, Mehdi Mirza, Simon Osindero
- [5] Context Encoders: Feature Learning by Inpainting, Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor Darrell, Alexei A. Efros
- [6] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, Alec Radford, Luke Metz, Soumith Chintala
- [7] Fully Convolutional Networks for Semantic Segmentation, Jonathan Long, Evan Shelhamer, Trevor Darrell