
Unsupervised translation of Grand Theft Auto V images to real urban scenes


Image-to-image translation can be used to transfer artistic style (e.g., to transform a photo to look like a painting by a famous painter) or even to bridge the gap between synthetic and real images. This thesis focuses on the latter and aims to transform images from the open-world game Grand Theft Auto V (GTA V) to look like realistic urban scenes using state-of-the-art image-to-image translation methods. Image quality evaluation using semantic segmentation has proven to be more reliable in such cases, as metrics such as the Fréchet Inception Distance (FID) cannot detect misaligned scene structures.

Figure 4.3: Example semantic segmentation predictions from DeepLabv3+ on an image of the Cityscapes validation set, including overlay predictions from DeepLabv3+ trained with the translated GTA V images of the corresponding models; the performance when trained with the original Cityscapes and GTA V images is also provided for comparison.

Motivation

Domain adaptation is a field of computer vision and machine learning that aims to solve tasks in a target domain whose distribution is different from, but related to, that of the source domain on which a model was trained. Domain adaptation can also serve as a solution for small annotated datasets that do not contain enough data to effectively train a deep learning model. Annotated images from the game GTA V are well suited to this setting, since the game also simulates realistic driving scenarios.

Goals

A semantic segmentation model must therefore be adapted to new weather conditions or even to new driving scenarios in cities with different layouts. At the same time, creating semantic segmentation datasets is very expensive, which leads to smaller amounts of available annotated data. Although the GTA V dataset consists of synthetic annotated images that simulate driving scenarios in a virtual city, it can be used to adapt a model for semantic segmentation and scene understanding on real-world data.

Thesis Outline

Generative Adversarial Networks (GANs)

To do this, the GAN framework uses two neural networks: the generator and the discriminator. This process is called adversarial training and is based on game theory, where the generator competes against an adversary (i.e., the discriminator). D(x) is the scalar output of the discriminator and represents the probability that x came from the data rather than from the generator's distribution p_g.
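In the standard formulation of this game (consistent with the notation above, with p_data the data distribution and z ~ p_z the noise fed to the generator), the two networks optimize the minimax objective:

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]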

Deep Convolutional GANs

Conditional GANs

Image-to-Image Translation Types

Supervised Translation

It allows the generation of realistic-looking landscapes from simple brushstrokes that represent semantic label maps.

Unsupervised Translation

Multi-Domain Unsupervised Translation

Related Work Using GTA V Images

GTA V

The motivation behind the collection of the GTA V images in [38] is to create a very large dataset with pixel-accurate semantic labels. As mentioned in Chapter 1, such a task is very challenging due to the amount of human effort required to trace accurate object boundaries; as a result, the annotation of a single image requires 60 and 90 minutes for the CamVid [5] and Cityscapes [13] datasets, respectively. However, the authors extracted 25 thousand images from GTA V and automatically annotated them in just 49 hours.

Manually labeling this dataset following the approach used for CamVid or Cityscapes would take 12 person-years. However, GTA V is not an open-source game, which means that the source code is not available. Consequently, the creation of semantic labels is not straightforward, and the authors use a technique called detouring [6] together with a standard graphics debugging tool called RenderDoc [26].

Specifically, they create a wrapper around the game's graphics library (i.e., Direct3D 11) and use it to intercept communications with the graphics hardware in order to access the game's resources and store all necessary information. After annotating the first image, their annotation tool automatically propagates the labels to all image patches that share the same mesh, texture, and shader combination across all images. In this way, as the annotation progresses, more and more image patches are pre-labeled, significantly reducing the annotation time per image.
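As a rough illustration of this propagation rule (a hypothetical sketch with made-up identifiers, not the authors' actual tooling, which operates on intercepted Direct3D 11 resources), patches can be keyed by their mesh–texture–shader combination so that one manual annotation pre-labels every later patch drawn with the same combination:

```python
# Minimal, illustrative sketch of MTS-based label propagation.
from typing import Dict, List, Optional, Tuple

# An MTS key identifies a patch by the mesh, texture and shader that drew it.
MTSKey = Tuple[int, int, int]

label_by_mts: Dict[MTSKey, str] = {}  # labels assigned in earlier frames

def annotate(key: MTSKey, label: str) -> None:
    """Store a manual annotation so later frames inherit it automatically."""
    label_by_mts[key] = label

def propagate_labels(frame_patches: List[MTSKey]) -> List[Optional[str]]:
    """Return a label for every patch that reuses an already-annotated MTS
    combination; combinations not seen before still need manual annotation."""
    return [label_by_mts.get(key) for key in frame_patches]

# Usage: annotating one patch pre-labels every later patch that shares
# the same mesh/texture/shader combination.
annotate((101, 7, 3), "road")
print(propagate_labels([(101, 7, 3), (55, 2, 9)]))  # ['road', None]
```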

Each frame has a resolution of 1914×1052 pixels and there are labels for 19 classes, which are compatible with the Cityscapes dataset described below.

Cityscapes

In this thesis, 18,785 frames were used to train the image-to-image translation models, while the remaining 6,181 frames, corresponding to the official test split, were used to evaluate their performance. The videos were recorded over several months in spring, summer and autumn to ensure diversity of data. Images were captured with an automotive-grade stereo camera at a frame rate of 17 Hz.

The sensors, mounted behind the windshield, produced high dynamic range (HDR) images with 16-bit linear color depth. These images are also provided at 8-bit low dynamic range (LDR) for comparability and compatibility with other existing datasets. In addition, they provide 20,000 images with coarse annotations for the remaining 23 cities, captured every 20 s or 20 m of driving distance.

The data can be found at https://www.cityscapes-dataset.com/ and Figure 3.2 shows some examples. The 20,000 coarsely annotated images are used for the image-to-image translation task to transform GTA V to look like Cityscapes. The densely annotated images are then used to evaluate the translated results as explained in section 3.4.

The official train/validation split is followed for the densely annotated images, consisting of 2975 and 500 images, respectively.

Mapillary Vistas

Preprocessing

Image-to-Image Translation Models

  • CycleGAN
  • AttentionGAN
  • CUT
  • DCLGAN

In this way, when an image of the target domain is used as input, the generator is expected to produce almost the same image (an identity mapping). For the discriminators, the authors use the 70×70 PatchGAN [24], which consists of five convolutional layers. The authors introduced two architectural schemes for AttentionGAN, the second being an improvement of the first.

Specifically, each of the two generators G and F has two separate subnetworks for generating content masks and attention masks. Unlike CycleGAN and AttentionGAN, CUT omits the cycle consistency loss, considering it too restrictive; instead, it preserves content at the patch level (for example, when translating a horse into a zebra, the horse's posture should not change).
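For reference, a minimal PyTorch-style sketch of the cycle-consistency and identity terms that CycleGAN relies on (and that CUT drops) is given below; the generators G: X→Y and F: Y→X, the L1 formulation, and the loss weights are illustrative assumptions, not the thesis's exact training code.

```python
# Illustrative sketch of CycleGAN's cycle-consistency and identity terms
# (assumes G: X->Y and F: Y->X are callable generator modules).
import torch.nn.functional as F_loss

def cycle_and_identity_losses(G, F, real_x, real_y,
                              lambda_cyc=10.0, lambda_idt=5.0):
    # Cycle consistency: x -> G(x) -> F(G(x)) should recover x, and vice versa.
    loss_cyc = F_loss.l1_loss(F(G(real_x)), real_x) + \
               F_loss.l1_loss(G(F(real_y)), real_y)
    # Identity: feeding a target-domain image to G should change it very little.
    loss_idt = F_loss.l1_loss(G(real_y), real_y) + \
               F_loss.l1_loss(F(real_x), real_x)
    return lambda_cyc * loss_cyc + lambda_idt * loss_idt
```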

The official code of the model can be found at https://github.com/taesungp/contrastive-unpaired-translation. The objective is to maximize the mutual information between the positive pair, while minimizing that of the negative pairs. To project the patches into the embedding space, they use the first half of the generator as an encoder.
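A minimal sketch of this patchwise contrastive (InfoNCE-style) objective is shown below; the tensor shapes, temperature value, and function name are illustrative assumptions and are not taken from the official repository linked above.

```python
# Illustrative PatchNCE-style contrastive loss (a sketch of the idea behind
# CUT's objective, not the official implementation).
import torch
import torch.nn.functional as F

def patch_nce_loss(query, positive, negatives, tau=0.07):
    """query:     (N, C)    features of output-image patches
       positive:  (N, C)    features of the corresponding input patches
       negatives: (N, K, C) features of other (non-corresponding) input patches
    """
    query = F.normalize(query, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Similarity of each query with its positive and with the K negatives.
    l_pos = (query * positive).sum(dim=-1, keepdim=True)           # (N, 1)
    l_neg = torch.bmm(negatives, query.unsqueeze(-1)).squeeze(-1)  # (N, K)

    logits = torch.cat([l_pos, l_neg], dim=1) / tau                # (N, 1+K)
    # The positive sits at index 0 of every row of logits.
    labels = torch.zeros(query.size(0), dtype=torch.long, device=query.device)
    return F.cross_entropy(logits, labels)
```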

This loss can be thought of as a learnable, domain-specific version of the identity loss used by CycleGAN.

Evaluation Protocol

GAN Metrics

Evaluating generated images in the context of unsupervised image-to-image translation is not an easy task, as there is no ground truth for how an image should be translated. Three metrics are commonly used: the Inception Score (IS), the Fréchet Inception Distance (FID) and the Kernel Inception Distance (KID). The Inception Score [40] is a metric used to evaluate the quality and diversity of images generated by GANs.

Furthermore, the marginal label distribution shows how much variety there is in the generator's output. A high score means that each image contains a distinct object that is classified with high probability and that the collection of generated images covers a variety of labels (i.e., the images are different from one another). However, since IS never compares the generated images against real data, other metrics such as FID are usually preferred over IS for evaluating generated images.
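Concretely, the Inception Score combines these two requirements in its standard formulation, where p(y|x) is the Inception-predicted label distribution for a generated image x and p(y) is the marginal over generated images:

\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_g}\big[D_{\mathrm{KL}}\big(p(y \mid x)\,\|\,p(y)\big)\big]\Big)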

This metric is designed to solve the limitation of IS and compares the generated image statistics with the statistics of a real-world image collection, being more consistent and reliable. Similarly, the mean m_w and the covariance matrix C_w are calculated for the generated images. A lower FID indicates better quality images, as it means that the statistics of the real and generated images are similar and thus the synthetic images look more realistic.
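With m and C denoting the mean and covariance of the Inception features of the real images, and m_w and C_w those of the generated images, the standard definition of the score is:

\mathrm{FID} = \lVert m - m_w \rVert_2^2 + \mathrm{Tr}\big(C + C_w - 2\,(C\,C_w)^{1/2}\big)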

Like FID, it measures the difference between the distribution of generated images and that of real-world images in the representation space of a pre-trained Inception v3 network, as a measure of image quality.
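In its usual formulation, KID is the squared maximum mean discrepancy (MMD) between the two sets of Inception features, computed with a polynomial kernel of degree three (with d the feature dimension):

\mathrm{KID} = \mathrm{MMD}^2\big(\{x_i\}, \{y_j\}\big), \qquad k(x, y) = \Big(\tfrac{1}{d}\, x^{\top} y + 1\Big)^{3}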

Semantic Segmentation

The authors find that separable atrous convolution significantly reduces the computational complexity of the model. This chapter presents the qualitative and quantitative results of the image-to-image translation task, evaluating the performance of four tested models. In the experiments, the four models were trained using 256 × 256 random image crops and tested on 512 × 256 full-size images.
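The segmentation-based evaluation presumably reports per-class intersection-over-union; the following is a minimal sketch of the standard mean IoU computation (the exact metric, the 19 Cityscapes-compatible classes, and the ignore label of 255 are assumptions, not confirmed details of the thesis protocol):

```python
# Minimal mIoU sketch over a confusion matrix (illustrative assumption).
import numpy as np

def mean_iou(preds, targets, num_classes=19, ignore_index=255):
    """preds, targets: iterables of HxW integer label maps."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    for p, t in zip(preds, targets):
        mask = t != ignore_index
        conf += np.bincount(
            num_classes * t[mask].astype(np.int64) + p[mask].astype(np.int64),
            minlength=num_classes ** 2,
        ).reshape(num_classes, num_classes)
    inter = np.diag(conf)
    union = conf.sum(0) + conf.sum(1) - inter
    # Classes absent from both prediction and ground truth count as 0 IoU here
    # (a simplification; real evaluators usually skip them).
    iou = inter / np.maximum(union, 1)
    return iou.mean()
```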

All models can capture the color style of 8-bit LDR images from the Cityscapes dataset. It is interesting to note that models based on cycle consistency learning (i.e., CycleGAN and AttentionGAN) can better preserve the geometry of objects in images. Another common feature of the Cityscapes dataset is the Mercedes star on the hood of the car used to capture the images.

Unfortunately, for this reason all models tend to hallucinate star logos at the bottom of the images. In this case, the models capture the bright and vibrant colors of the latter dataset. Appendix A shows some examples of side-by-side translations of the same GTA V images to Cityscapes and Mapillary Vistas using CycleGAN.

Additionally, Appendix C shows translation examples of two models tested at the beginning of the experiments, but not included in the analysis.

Semantic Segmentation Results

In this way they manage to better approximate the statistics of the Cityscapes dataset, but at the cost of changing the semantic content of the images. This happens because, as mentioned earlier, CUT and DCLGAN make more errors that change the semantic content of the images than CycleGAN and AttentionGAN do. This can improve the quality of the results and can also serve as a method for domain adaptation.

In this work, four state-of-the-art unsupervised image-to-image translation models were trained to translate GTA V images to match the style of two different real datasets. Contrastive learning allows more variety in the translated images and performs much better when images from the two domains are not very similar. When evaluating the quality of the images in such cases, semantic segmentation performance proved to be a more reliable measure than GAN metrics such as FID and KID.

However, even CycleGAN and AttentionGAN make errors due to the mismatch between the structure of GTA V landscapes in Los Santos (the fictional city in which GTA V is set) and images from real-world datasets. Since Los Santos is based on the real city of Los Angeles, it would be interesting to create a dataset with images of the latter and retrain the image-to-image translation models in an unsupervised manner. Feeding the networks data that is closer to the ideal result could possibly improve the realism of the translated GTA V images.

Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 214–223. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 820–828, Red Hook, NY, USA, 2016. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1989–1998.
