Recently, a paper swept through tech media with its eye-catching title: “The GAN is dead; long live the GAN! A Modern Baseline GAN”. I’m not a fan of such extravagant titles—truly valuable research doesn’t need a flashy headline to attract attention. After reading the paper with a hint of resentment, I found that it indeed did not present any particularly significant innovations.
This paper proposes a baseline GAN model called R3GAN (pronounced “Re-GAN”). R3GAN combines the RpGAN loss and a special gradient penalty (GP) loss, and redesigns the GAN architecture based on the state-of-the-art convolutional network ConvNeXt. Experiments show that R3GAN achieves FID scores comparable to those of diffusion models on FFHQ and low-resolution ImageNet image generation. The work mainly contributes through engineering experiments and does not propose many scientific innovations. In this blog post, I will briefly introduce the main implementation details of R3GAN and provide references for each aspect without delving too deeply. Interested readers can refer to the references summarized at the end.
A Review of GANs
In this section, we will review the necessary knowledge related to Generative Adversarial Networks (GANs) that is essential for understanding R3GAN.
Fundamentals of GANs
Like most other generative models, the training objective of a GAN is to learn a mapping from an easily sampled distribution (typically a Gaussian) to a distribution we only have access to through samples (the training dataset). Specifically, a GAN uses a Generator to transform noise $z$ drawn from a Gaussian distribution into images $x$. While most other generative models have their own theoretical foundations that define the generator's learning objective, a GAN employs a second neural network, the Discriminator, to define the training objective for the generator.
The two models learn via a game: the discriminator attempts to distinguish whether an image is “fake” (i.e., generated) or real, while the generator strives to improve the quality of the generated images so that the discriminator cannot tell the difference. They share the same optimization objective, though one aims to minimize it while the other aims to maximize it.
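Written out explicitly (a reconstruction in common notation rather than the paper's exact formula, with generator $G_\theta$, discriminator $D_\psi$ producing a raw score, data distribution $p_\mathcal{D}$, and noise distribution $p_z$), this shared objective takes the form

$$
L(\theta, \psi) = \mathbb{E}_{x \sim p_\mathcal{D}}\bigl[f\bigl(-D_\psi(x)\bigr)\bigr] + \mathbb{E}_{z \sim p_z}\bigl[f\bigl(D_\psi(G_\theta(z))\bigr)\bigr],
$$

where $f$ is an activation function: the discriminator minimizes $L$ while the generator maximizes it.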
In the loss function above, there are various choices for the activation function $f$. R3GAN opts for the softplus function $f(t) = \log(1 + e^t)$, as illustrated in the figure above.
Two Classic Architectures: DCGAN and StyleGAN
The pioneering GANs were implemented with fully connected networks. In the subsequent development of GANs, two classic architectures emerged: DCGAN in 2016 and StyleGAN in 2019.
DCGAN is a GAN whose generator is based on convolutional neural networks (CNNs). Its hallmark is that it starts from a low-resolution, high-channel feature map and gradually upsamples it while reducing the number of channels, until a three-channel image of the target size is produced.
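For intuition, a minimal sketch of such a generator might look like the following (PyTorch; layer sizes are illustrative for a 64×64 output rather than taken from the DCGAN paper):

```python
import torch
import torch.nn as nn

class DCGANGenerator(nn.Module):
    """Minimal DCGAN-style generator: 100-dim noise -> 64x64 RGB image."""
    def __init__(self, z_dim=100, base_channels=512):
        super().__init__()
        c = base_channels
        self.net = nn.Sequential(
            # 1x1 -> 4x4: project the noise vector to a small, high-channel feature map
            nn.ConvTranspose2d(z_dim, c, kernel_size=4, stride=1, padding=0),
            nn.BatchNorm2d(c), nn.ReLU(inplace=True),
            # each step doubles the resolution and halves the channel count
            nn.ConvTranspose2d(c, c // 2, 4, stride=2, padding=1),       # 4 -> 8
            nn.BatchNorm2d(c // 2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c // 2, c // 4, 4, stride=2, padding=1),  # 8 -> 16
            nn.BatchNorm2d(c // 4), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c // 4, c // 8, 4, stride=2, padding=1),  # 16 -> 32
            nn.BatchNorm2d(c // 8), nn.ReLU(inplace=True),
            # the final upsampling also projects down to 3 output channels
            nn.ConvTranspose2d(c // 8, 3, 4, stride=2, padding=1),       # 32 -> 64
            nn.Tanh(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

# usage: g = DCGANGenerator(); x = g(torch.randn(8, 100))  # x.shape == (8, 3, 64, 64)
```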
StyleGAN, on the other hand, is known for its stable training and suitability for image editing. Unlike traditional GAN generators, StyleGAN first preprocesses the noise vector $z$ with a mapping network and feeds the result in through a bypass, injecting the information via the AdaIN operation borrowed from style transfer. Because the noise is injected this way rather than fed in at the bottom of the network, the original low-resolution feature-map input is replaced by a learned constant.
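As a rough sketch of that style-injection step (simplified; the real StyleGAN implementation differs in several details, such as the per-pixel noise inputs), AdaIN normalizes each channel of a feature map and then rescales it with statistics predicted from the mapped latent:

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: modulate a feature map with a style vector."""
    def __init__(self, num_channels, w_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        # learned affine map from the mapped latent w to per-channel scale and bias
        self.affine = nn.Linear(w_dim, num_channels * 2)

    def forward(self, x, w):
        # x: (B, C, H, W) feature map, w: (B, w_dim) output of the mapping network
        scale, bias = self.affine(w).chunk(2, dim=1)
        x = self.norm(x)  # zero mean, unit variance per channel and sample
        return x * (1 + scale[:, :, None, None]) + bias[:, :, None, None]
```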
Two Major Challenges: Difficult Convergence and Mode Collapse
Compared to other generative models, GANs are often criticized for being “difficult to train”. This difficulty shows up as two issues: poor convergence and mode collapse. Poor convergence means the model does not fit the dataset well, and we can use FID to measure the similarity between the model's outputs and the training set. Mode collapse refers to the phenomenon where, on a multi-category dataset, the model only ever generates a few of the categories, as illustrated below. To detect mode collapse, we can have the network generate a large number of random images and use a separate classification network to count how many categories appear, or use the generation recall metric to roughly assess the diversity of the model's samples.
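A rough mode-coverage check along those lines might look like this (a sketch only; `classifier` stands for any pretrained classifier on the target dataset, and the returned KL term is the same kind of statistic as the $D_{KL}$ reported later in this post):

```python
import torch

@torch.no_grad()
def mode_coverage(generator, classifier, num_classes, n_samples=10_000, z_dim=100, batch=200):
    """Generate many samples, classify them, and report how many classes appear,
    plus the KL divergence between the empirical class histogram and the uniform one."""
    counts = torch.zeros(num_classes)
    for _ in range(n_samples // batch):
        z = torch.randn(batch, z_dim)
        preds = classifier(generator(z)).argmax(dim=1)
        counts += torch.bincount(preds, minlength=num_classes).float()
    p = counts / counts.sum()                  # empirical class distribution
    q = torch.full_like(p, 1.0 / num_classes)  # uniform reference distribution
    mask = p > 0                               # treat 0 * log(0) as 0
    kl = (p[mask] * (p[mask] / q[mask]).log()).sum().item()
    return int((counts > 0).sum().item()), kl
```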
R3GAN Implementation
In the introduction, R3GAN criticizes the various small tricks StyleGAN uses to stabilize training and advocates for an architecture that is as simple as possible. Despite how the paper frames this, R3GAN is in fact based on the earlier DCGAN, with an updated loss function and the latest CNN design elements, and has almost nothing to do with the StyleGAN architecture. Let's examine R3GAN from these two aspects: the loss function and the model architecture.
Loss Function
Regarding the GAN loss that defines the adversarial game, R3GAN replaces the standard GAN loss with the one from the RpGAN (Relativistic Pairing GAN) paper. Whereas the standard loss feeds the discriminator's outputs on real and fake samples into the activation function $f$ separately, the RpGAN loss feeds in the difference between the discriminator outputs for a paired real and fake sample.
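In the same notation as before (again a reconstruction rather than the paper's exact formula), the RpGAN loss reads

$$
L_{\mathrm{Rp}}(\theta, \psi) = \mathbb{E}_{z \sim p_z,\, x \sim p_\mathcal{D}}\Bigl[f\bigl(D_\psi(G_\theta(z)) - D_\psi(x)\bigr)\Bigr],
$$

with the same roles as before: the discriminator minimizes this quantity and the generator maximizes it.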
Based on previous research findings, the authors briefly explain the benefits of the RpGAN loss both intuitively and theoretically:
- Traditional GAN losses only require the discriminator to distinguish between real and fake samples, without enforcing that the gap between real and fake samples be as large as possible. By feeding the difference between a pair of real and fake samples into the loss function, the RpGAN loss encourages this gap to be maximized.
- According to theoretical analyses from previous work, in some simple settings the standard GAN loss can have exponentially many suboptimal local minima, whereas every local minimum of the RpGAN loss is a global minimum.
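As a concrete illustration, here is a minimal PyTorch sketch of the discriminator-side losses with $f$ chosen as softplus (it follows the formulas above rather than the paper's reference code; `d_real` and `d_fake` are assumed to be the discriminator's raw scores on batches of real and generated images):

```python
import torch.nn.functional as F

def standard_gan_d_loss(d_real, d_fake):
    """Classic discriminator loss with f = softplus: each score is judged on its own."""
    return F.softplus(-d_real).mean() + F.softplus(d_fake).mean()

def rpgan_d_loss(d_real, d_fake):
    """RpGAN discriminator loss: only the *difference* between paired real and fake
    scores enters f, which encourages the real-fake gap to be large."""
    return F.softplus(d_fake - d_real).mean()

def rpgan_g_loss(d_real, d_fake):
    """Generator side of the relativistic game: the sign of the difference flips."""
    return F.softplus(d_real - d_fake).mean()
```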
R3GAN also uses ablation experiments to re-examine which gradient penalty (GP) loss works best. An n-GP term pushes the norm of the model's gradient with respect to its input to be as close as possible to the constant $n$, thereby stabilizing training (the formula is written out after the list below). The commonly used GPs are 0-GP and 1-GP:
- 0-GP: in the optimal case, the gradient norm is zero everywhere, so the model produces essentially the same output no matter how the input changes.
- 1-GP: in the optimal case, the model's output changes smoothly with the input; roughly speaking, if the input moves by a unit distance, the output also changes by a unit amount.
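Written out (again a reconstruction of the standard definition rather than a formula copied from the paper), an $n$-GP term penalizes the deviation of the input-gradient norm from $n$:

$$
L_{n\text{-GP}} = \mathbb{E}_{x}\Bigl[\bigl(\lVert \nabla_x D_\psi(x) \rVert_2 - n\bigr)^2\Bigr].
$$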
The authors argue that 0-GP is better suited to the GAN discriminator: once the generator's output distribution matches the training data exactly, the discriminator should no longer be able to distinguish any inputs and should give the same output for all of them.
For applying GP to the discriminator, there are two forms: $R_1$ and $R_2$, which apply the penalty to real and fake data respectively. The authors found that using both $R_1$ and $R_2$ yields better results.
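A sketch of the two penalties in PyTorch might look like this (the only difference between $R_1$ and $R_2$ is whether the gradient is taken at real or at generated samples; the helper name and the weight `gamma` are mine, not values from the paper):

```python
import torch

def zero_centered_gp(d_out, x, gamma=1.0):
    """Zero-centered gradient penalty: push ||dD/dx|| toward 0.
    Applied to real images it is R1; applied to generated images it is R2."""
    grad, = torch.autograd.grad(outputs=d_out.sum(), inputs=x, create_graph=True)
    return gamma / 2 * grad.pow(2).flatten(1).sum(dim=1).mean()

# usage in the discriminator step (sketch):
#   x_real.requires_grad_(True)
#   x_fake = generator(z).detach().requires_grad_(True)
#   r1 = zero_centered_gp(discriminator(x_real), x_real)
#   r2 = zero_centered_gp(discriminator(x_fake), x_fake)
#   d_loss = rpgan_d_loss(d_real, d_fake) + r1 + r2
```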
To summarize, R3GAN uses the loss combination RpGAN + $R_1$ + $R_2$. The authors demonstrate through a simple experiment that this configuration is the best one. As shown in the figure below, on a simple dataset with 1000 categories, the full loss configuration generates data for all categories and achieves a smaller distribution distance $D_{KL}$ (like FID, lower is better). Dropping the RpGAN loss reduces both output diversity and convergence quality, while dropping $R_2$ makes training fail completely.
Modernized Convolutional Networks
After identifying a simple yet effective loss function, the R3GAN paper further explores improved convolutional network architectures. The paper mentions five configurations:
- A: The original StyleGAN2.
- B: Removing most of the design elements from StyleGAN2, making the model nearly identical to DCGAN.
- C: Replacing the loss function with the new one discussed in the previous section.
- D: Adding ResNet-style residual connections to the VGG-like network.
- E: Updating ResNet with modules from ConvNeXt.
Let’s skip configuration A and look directly at the differences between configuration B and the early DCGAN. According to the authors, the key differences in configuration B are:
- a) Using the $R_1$ loss.
- b) Employing a smaller learning rate and disabling momentum in the Adam optimizer.
- c) Eliminating normalization layers in all parts of the network.
- d) Replacing transposed convolution with bilinear upsampling.
Notably, if changes a), b), or c) are not implemented, training fails. Item d) is the standard configuration for upsampling in modern neural networks and helps prevent checkerboard artifacts.
The new loss function in configuration C has already been discussed in the previous section.
Prior to this work—including in StyleGAN—most GAN architectures used VGG-like structures without residual blocks. Configuration D introduces the standard 1-3-1 residual blocks from ResNet into the network.
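A minimal sketch of such a block (channel counts are illustrative, and there are no normalization layers, in line with the change already made in configuration B):

```python
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Classic ResNet-style 1x1 -> 3x3 -> 1x1 residual block."""
    def __init__(self, channels, bottleneck_channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, bottleneck_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck_channels, channels, kernel_size=1),
        )

    def forward(self, x):
        return x + self.body(x)  # residual (skip) connection
```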
Configuration E further updates the design of the convolutional layers. It first introduces grouped convolution (the channels are divided into groups, and only channels within the same group are connected; with a single group this reduces to ordinary convolution, and when the number of groups equals the number of channels it becomes depthwise convolution). Because this operation is more efficient, the network can take on more parameters without increasing overall runtime. Additionally, configuration E adopts the inverted bottleneck block from ConvNeXt, whose design is inspired by the fully connected (MLP) layers in Transformers.
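A rough sketch of an inverted bottleneck block with a grouped convolution in the middle (sizes are again illustrative; the actual R3GAN block may arrange these operations differently):

```python
import torch.nn as nn

class InvertedBottleneckBlock(nn.Module):
    """ConvNeXt-style inverted bottleneck: widen, mix spatially with a grouped conv, narrow back."""
    def __init__(self, channels, expansion=4, groups=16):
        super().__init__()
        hidden = channels * expansion  # must be divisible by `groups`
        self.body = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),   # expand, like the first layer of a Transformer MLP
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=groups),  # cheap grouped spatial mixing
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),   # project back down
        )

    def forward(self, x):
        return x + self.body(x)
```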
Let’s review the simple ablation study results for each configuration once more. It appears that the new loss function does not offer much improvement; ultimately, the modifications to the network architecture prove to be more effective. The best configuration, model E, slightly outperforms StyleGAN2.
Quantitative Experimental Results
Finally, let’s examine the quantitative results presented in the paper. As mentioned earlier, we mainly care about two metrics for GANs: diversity and convergence/image quality. The former can be reflected by the number of classes or recall, and the latter can be assessed using FID (and the $D_{KL}$ used in this post).
Diversity
On small multi-class datasets, R3GAN is able to generate all classes and exhibits the best similarity to the training set, whereas StyleGAN2 fails to generate some classes.
Another metric that reflects image diversity is recall, which roughly indicates how much of the training set’s content can be found in the generated set. The paper does not provide detailed tables but merely notes that on CIFAR-10, StyleGAN-XL achieves a recall of 0.47, while R3GAN reaches 0.57. However, overall, R3GAN’s recall is still lower than that of diffusion models.
Convergence
A major highlight touted by this work is that, on certain datasets, its FID scores surpass those of diffusion models. Let’s look at the FID results on both single-class and multi-class datasets.
First, consider the classic FFHQ face dataset. On this dataset, which has relatively low diversity, GANs have generally performed very well. R3GAN achieves a better FID than StyleGAN2 and most diffusion models—and it does so with only a single inference pass (NFE=1). However, its FID does not surpass that of the best previous GAN models. (But those earlier GANs employed a trick to improve FID without enhancing image quality, which R3GAN did not use.)
Next, consider the more diverse CIFAR-10 and ImageNet datasets. R3GAN’s performance is superior to that of all diffusion models and most GANs. However, R3GAN has not been tested on higher-resolution ImageNet. Nowadays, state-of-the-art generative models are typically evaluated on ImageNet-256, but R3GAN does not provide corresponding experimental results.
Summary and Comments
R3GAN is essentially a modernized version of DCGAN. It introduces improvements in two main aspects: the loss function and the model architecture. On the loss function side, R3GAN employs the RpGAN + $R_1$ + $R_2$ loss; on the architecture side, it replaces the original VGG-like structure with the latest convolutional design from ConvNeXt. Experiments indicate that R3GAN surpasses all diffusion models and most GANs in terms of FID on FFHQ-256 and ImageNet-64, although it falls slightly short of the best previous GANs. In terms of generation diversity, however, R3GAN still does not match diffusion models.
In terms of research contribution, this paper does not introduce any new theories or ideas; it entirely repurposes methods proposed in previous work. Its main contribution lies in offering engineering insights that may help us build better CNN-based GANs. In terms of experimental results, R3GAN has not been tested on the current mainstream benchmark, ImageNet-256, and there is no evidence that it can outperform diffusion models there. From the results on the other datasets, one can infer that R3GAN's best performance is roughly on par with earlier GANs, without any fundamental improvement to the GAN framework. In summary, I believe this paper is a mediocre work that just meets top-conference standards, making its acceptance as a poster at NeurIPS 2024 quite reasonable.
References
- DCGAN: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks
- StyleGAN: A Style-Based Generator Architecture for Generative Adversarial Networks
- StyleGAN2: Analyzing and Improving the Image Quality of StyleGAN
- GP (WGAN-GP): Improved Training of Wasserstein GANs
- RpGAN: The Relativistic Discriminator: A Key Element Missing from Standard GAN
- RpGAN Landscape Explanation: Towards a Better Global Loss Landscape of GANs
- ConvNeXt: A ConvNet for the 2020s
- ImageNet FID Trick: The Role of ImageNet Classes in Fréchet Inception Distance