
At the beginning of this year, the multiscale autoregressive model VAR opened a new direction for image generation: by modeling image generation as next-scale prediction and generating all tokens at a given scale in parallel in each round, VAR achieves high-quality image generation at extremely fast speeds. Subsequently, many works have attempted to improve upon it. To compensate for the information loss introduced by the VQ (Vector Quantization) operation in VAR, HART (Hybrid Autoregressive Transformer) represents the lost information as a residual image and uses a lightweight diffusion model to generate it. With these improvements, the authors use HART to perform text-to-image generation at a high resolution of $1024 \times 1024$. In this blog post, we will learn about the core methods of HART and analyze its experimental results on text-to-image tasks.

Paper link: https://arxiv.org/abs/2410.10812

Previous Work

All the autoregressive image generation methods involved in this paper originate from VQVAE and VQGAN. Before reading this paper, it is recommended that readers familiarize themselves with these two classic works.

HART is developed directly based on VAR (Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction), and some of its ideas are similar to MAR (Masked Autoregressive models, from the paper Autoregressive Image Generation without Vector Quantization). You are welcome to read my previous posts on these.

VAR Explained

On top of the two-stage generation method in VQGAN, VAR makes the encoder output multiple scales of image tokens (instead of only the highest-scale tokens). During generation, VAR autoregressively generates token maps of different scales, and the token map at each scale is generated all at once in a single round of Transformer inference.

The VQ operation causes loss of information in the encoder output, so all image generation models using VQ-based autoencoders end up with slightly reduced quality. Methods like VAR and VQGAN have no choice but to use VQ because they model the distribution of tokens with a categorical distribution. To remove VQ entirely, MAR replaces the categorical distribution with a diffusion model, which allows a more precise VAE to be used for image compression.

Compensating for VQ’s Information Loss

To mitigate the quality degradation caused by VQ in VAR, HART uses a straightforward approach: since VQ inevitably causes information loss, we can treat that lost information as a residual image. After generating the image with the standard VAR, we use a diffusion model to generate the residual image. Adding the residual image to the original output yields a higher-quality final image.

Let’s get a direct feel for this idea from the figures in the paper. The first row shows reconstruction results from the VAR autoencoder and from HART’s hybrid autoencoder. Due to the VQ operation, the VQ autoencoder struggles to reconstruct the input image. The second row shows the original output from VAR and the residual image. We can see that after adding the residual image, the details become richer, no longer blurry as before.

In the next two sections, we will learn how HART respectively improves the token generation model of VAR and its autoencoder.

Generating Residual Image Using a Diffusion Model

To understand the entire method, we first need to see how HART’s “residual image” comes about. Therefore, let’s look at the modifications on the token generation model, then see the corresponding modifications in the autoencoder.

First, let’s review how VQ errors are introduced in VAR. VAR borrows the classic Laplacian pyramid idea to model token maps at multiple scales.

In other words, VAR does not split the full image into token maps that share the same content at different resolutions. Instead, it splits it into the lowest-resolution token map plus the information lost at each scale. This “information loss” includes not only what comes from downsampling but also what results from VQ.

Even though the multiscale decomposition takes into account the information loss from VQ, the final reconstructed features (i.e., the decoder inputs, obtained by summing up the token lookups) still cannot perfectly match the encoder output features. The “residual image” HART wants to generate with a diffusion model is precisely the difference between the reconstructed features and the encoder output features shown in the figure above.

Unlike the discrete token maps, the residual image is continuous. To generate this continuous image, HART refers to MAR, employing an image-conditioned diffusion model. The goal of this diffusion model can be interpreted as: given the decoded image from the discrete token map, how do we use a diffusion model to generate additional details to improve image quality?

A schematic of HART’s generation model is shown below. The generation process before the last step is exactly the same as VAR. In the final step, the intermediate hidden state of the Transformer is fed into an MLP diffusion model. The diffusion model predicts a residual value independently for each token. In other words, this is not an image diffusion model but rather a per-token pixel diffusion model. Tokens are sampled independently from each other. Thanks to this independence assumption, HART can use a lightweight diffusion model to generate the residual image, adding almost no extra time to the overall generation process.
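To make this more concrete, here is a minimal sketch of what such a per-token diffusion head could look like. The class name, dimensions, and the update rule in the sampling loop are illustrative assumptions rather than HART's exact implementation; the point is simply that each token's residual is denoised by a small MLP conditioned only on that token's Transformer hidden state, so all tokens can be denoised in parallel.

```python
import torch
import torch.nn as nn

class TokenDiffusionHead(nn.Module):
    """Hypothetical per-token MLP diffusion head (names and sizes are illustrative)."""
    def __init__(self, token_dim=32, cond_dim=1536, hidden_dim=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(token_dim + cond_dim + 1, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.SiLU(),
            nn.Linear(hidden_dim, token_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t:  (N, token_dim)  noisy residual for each token
        # t:    (N, 1)          normalized diffusion timestep
        # cond: (N, cond_dim)   Transformer hidden state of the same token
        return self.net(torch.cat([x_t, t, cond], dim=-1))  # predicted noise

@torch.no_grad()
def sample_residuals(head, cond, steps=8, token_dim=32):
    """Denoise every token independently, and therefore in parallel."""
    x = torch.randn(cond.shape[0], token_dim)
    for i in reversed(range(steps)):
        t = torch.full((cond.shape[0], 1), (i + 1) / steps)
        eps = head(x, t, cond)
        x = x - eps / steps  # placeholder update rule, not the paper's exact sampler
    return x
```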

HART also changes VAR’s class conditioning to text conditioning. We will discuss this later in the experiments section.

AE + VQVAE Hybrid Autoencoder

Now that we know where HART’s residual image comes from, we can go back and look at the corresponding modifications in the autoencoder. Currently, the autoencoder decoder has two types of inputs: (1) the approximate reconstructed features formed by summing up the discrete tokens from VAR, and (2) the precise reconstructed features (equal to the encoder output features) when the residual image from HART is added. To handle both types of input simultaneously, the decoder is trained such that half of the time it takes the encoder’s output, and the other half it takes the reconstructed features from the discrete tokens. Of course, during generation, since the residual image is added, the decoder’s input can be considered the same as the encoder output.
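The alternating training of the decoder can be sketched as follows; the function and variable names are hypothetical, and the real training step of course also computes the usual reconstruction losses.

```python
import torch

def pick_decoder_input(encoder_feats, quantized_feats, p_continuous=0.5):
    """Choose the decoder input for one training step of the hybrid autoencoder.

    encoder_feats:   exact continuous features from the encoder
    quantized_feats: approximate features reconstructed from the multi-scale discrete tokens
    """
    if torch.rand(()) < p_continuous:
        return encoder_feats    # half of the time: train on exact features
    return quantized_feats      # otherwise: train on VQ-reconstructed features
```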

The figure below uses “token” terminology differently from VAR. VAR calls both the encoder outputs and decoder inputs “feature maps,” and calls the index map after the VQ operation the “token map.” HART, however, calls the encoder outputs “continuous tokens” and the reconstructed features “discrete tokens”. In this blog post, we follow VAR’s naming. Likewise, what HART calls “residual token” is referred to here as the “residual image.”

In this sense, HART’s hybrid autoencoder is like a VAE without KL loss (i.e., an ordinary autoencoder) and also like a VQVAE.

High-Resolution Text-to-Image Implementation Details

Let’s briefly see how HART extends the class-conditioned ImageNet $256 \times 256$ VAR to a $1024 \times 1024$ text-to-image model.

  • Text conditioning: Instead of using cross-attention to incorporate the text condition, as many T2I models do, HART follows VAR’s treatment of class embeddings: the text embedding serves both as the input at the first scale and as the condition fed to the AdaLN layers.
  • Positional encoding: For the scale and spatial position indices, VAR uses learnable absolute position embeddings. HART instead uses a sinusoidal encoding for the scale index and 2D RoPE (rotary position encoding) for the image coordinates (see the sketch after this list).
  • Larger scales: In the original VAR, the largest token map side length is 16; HART appends additional side lengths of 21, 27, 36, 48, and 64.
  • Lightweight diffusion model: Since the diffusion model only needs to model the distribution of single tokens, it has only 37M parameters and needs just 8 steps to achieve high-quality sampling.
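For reference, here is a minimal sketch of the common 2D RoPE recipe: one half of each attention head’s channels is rotated according to the row index and the other half according to the column index. Whether HART pairs the channels exactly this way is my assumption; the function names are illustrative.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding applied to the last dim of x, indexed by pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = pos[:, None].float() * freqs[None, :]                     # (n, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_2d(x, rows, cols):
    """2D RoPE: rotate half the channels by the row index, the other half by the column index."""
    d = x.shape[-1]
    return torch.cat([rope_1d(x[..., : d // 2], rows),
                      rope_1d(x[..., d // 2 :], cols)], dim=-1)
```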

Quantitative Results

Let’s first look at the most popular “benchmark” metric: ImageNet $256 \times 256$ class-conditioned generation. The authors did not include results for the best MAR model, so I’ve added them here.

In this task, the main difference between HART and VAR is whether a diffusion model is used to produce the residual image. As we can see, the residual diffusion model hardly increases the inference time, yet it noticeably improves the FID metric (and the lower FID already is, the harder it is to improve further). Moreover, comparing the speeds of different models, we see that the greatest advantage of VAR-like models lies in their fast inference.

Next, let’s look at the text-to-image generation metrics, which are the main focus of this paper. In addition to the commonly used GenEval (mainly measuring text-image alignment), the authors also show two metrics introduced this year: metrics on the MJHQ-30K dataset and DPG-Bench.

These metrics may not be very convincing. According to user-voted rankings at https://imgsys.org/rankings, Playground v2.5 is the best, while SD3 and PixelArt-Σ are about the same. However, the MJHQ FID and DPG-Bench metrics do not reflect that ranking. In particular, since the FID uses the Inception V3 network trained on ImageNet $299 \times 299$, FID does not accurately capture high-resolution image similarity, nor does it capture similarity in more complex images.

In summary, the experimental results do not yet convincingly establish HART’s performance on high-resolution text-to-image tasks. According to some community feedback (https://www.reddit.com/r/StableDiffusion/comments/1glig4u/mits_hart_fast_texttoimage_model_you_need_to_see/), HART has issues generating high-frequency details. Looking back at HART’s method, we can infer that this might be caused by the suboptimal design of the residual diffusion model.

Summary

To mitigate the information loss caused by the VQ operation in VQ-based autoencoders, HART treats the lost information as a residual image and uses a lightweight pixel diffusion model to generate each pixel of that residual image independently. HART applies this improvement directly to VAR and improves the FID metric on ImageNet. However, HART still cannot compete with diffusion models on high-resolution text-to-image tasks, and since diffusion models have various acceleration tricks, HART does not have an advantage in generation speed either.

VQ operations transform complex images into image tokens. This makes the distribution of image tokens easy to learn, but sacrifices autoencoder reconstruction quality. Many works have tried to improve the original nearest-neighbor VQ operation in VQVAE; regardless, the error introduced by VQ is always present. From a different angle, HART alleviates VQ reconstruction errors by generating a residual image with a separate model. This design idea is promising: it may eventually be possible to eliminate VQ errors entirely. But there is no free lunch: improving generation quality typically means increasing training and inference time. Although HART’s approach of using a lightweight pixel diffusion model to generate the residual image does not slow down the model, its effectiveness is still not sufficient. Perhaps replacing it with a diffusion model with a larger receptive field could improve the quality of the residual image without significantly increasing generation time.

This past April, Peking University (PKU) and ByteDance published a paper on arXiv titled Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction, introducing a brand-new paradigm for image generation called Visual Autoregressive Modeling (VAR). This autoregressive generation method represents high-definition images as multi-scale token maps and replaces the previously popular next-token prediction with a next-scale prediction approach. On the ImageNet $256 \times 256$ image generation task, VAR outperforms DiT. Our research group was quick to read the paper and found that the work offers notable innovations, though whether it can fully replace diffusion models remains to be seen. Typically, attention on such a paper would gradually decrease over time, but two recent events have catapulted VAR’s popularity to an unprecedented level: serious violations by the paper’s first author resulted in ByteDance suing him for 8 million RMB, and the paper was selected as a best paper at NeurIPS 2024. Taking this opportunity, I decided to thoroughly study this paper and share my findings.

In this post, I will first review early works closely related to VAR: VQVAE and VQGAN. Then, I will introduce the methodological details and experimental results of the paper, and finally, I will share my own tests and theoretical investigations of the VAR approach. While reading the VAR paper, I noticed a design flaw, and my experimental results suggest that the paper does not fully analyze why the approach is effective. I encourage everyone to read that section carefully and share your own insights.

Paper link: https://arxiv.org/abs/2404.02905

Review of VQGAN

VAR can be considered an improved version of VQGAN, which, in turn, builds upon VQVAE. To better understand VAR, the most direct approach is to revisit these two classic works, VQVAE and VQGAN. We will start with the autoregressive generation paradigm, then move on to image autoregressive generation, and finally review the implementation details of VQVAE, VQGAN, and Transformer.

Autoregressive Image Generation

Autoregressive (AR) Generation is a straightforward sequence-generation paradigm: given the first $n$ elements of a sequence, the model outputs the $(n+1)$-th element; we then append the newly generated element to the input sequence and again output the $(n+2)$-th element, and so forth. Below is an example of text autoregressive generation:

```
(empty)        -> Thank
Thank          -> you
Thank you      -> very
Thank you very -> much
```

Strictly speaking, the model does not output what the next element should be, but rather what it could be, i.e., the probability distribution of the next element. By repeatedly sampling from the distribution of each next element, the model can generate a wide variety of sentences.

AR only applies to sequential data whose order is clearly defined. To use it for images, we need to do two things: 1) break the image into a set of elements, and 2) assign an order to these elements. The simplest approach is to split the image into pixels and generate them in a left-to-right, top-to-bottom order. For instance, the following is a schematic of the classic image AR model PixelCNN. Suppose the image has $3 \times 3$ pixels, labeled from top-left to bottom-right. When generating the 5th pixel, the model can only use the previously generated 4 pixels. The model outputs a probability distribution over the pixel’s possible grayscale values 0, 1, …, 255.
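A minimal sketch of this raster-order sampling loop is shown below. The `model` and its output shape (per-pixel logits over 256 grayscale values) are assumptions for illustration; any network whose prediction for pixel $(y, x)$ depends only on already-generated pixels would fit here.

```python
import torch

@torch.no_grad()
def sample_image(model, height=3, width=3):
    """Generate one grayscale image pixel by pixel in raster order."""
    img = torch.zeros(1, 1, height, width)
    for y in range(height):
        for x in range(width):
            logits = model(img)[0, :, y, x]                 # (256,) distribution for pixel (y, x)
            value = torch.multinomial(logits.softmax(dim=0), 1)
            img[0, 0, y, x] = value.item() / 255.0          # write the sampled grayscale value
    return img
```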

Incidentally, there are many ways to model probability distributions. Here, a categorical distribution is used. Its advantage is simplicity and ease of sampling; its downside is that the elements must take discrete values. Even though, theoretically, a pixel’s grayscale value could be any real number between 0 and 1 (assuming it has been normalized), using PixelCNN means we are restricted to the 256 discrete values 0, 1/255, 2/255, ..., 1 and cannot represent more precise values.

VQVAE

Although PixelCNN can do image generation, it is extremely slow: since pixels are generated one by one, the number of network inference rounds equals the number of pixels. Is there a way to speed up generation? Ideally, the images we need to generate would be smaller.

To accelerate PixelCNN, VQVAE introduced a two-stage image generation method that leverages an image compression network: first generate a compressed image, then reconstruct it into a realistic image via the compression network. Because the compressed image contains fewer pixels and can be reconstructed quickly, the entire generation process speeds up considerably.

Below is a generation example using VQVAE. Based on the categorical distribution output by PixelCNN, we can sample a compressed image made up of discrete values. These discrete values are analogous to words in natural language processing (NLP): each discrete value holds a special meaning. We can interpret each discrete value as representing the color of a pixel patch in the original image. By using the image compression network’s decoder, we can reconstruct a clear original image from the compressed image.

The training process for VQVAE is the reverse of the generation process. We start by training an image compression network. This network, composed of an encoder and a decoder, is called an autoencoder. The compressed images are referred to as latent images. Once the autoencoder is trained, we convert all the training images into latent images and train PixelCNN to generate these latent images. Interestingly, only the encoder is used when training PixelCNN, while only the decoder is used at generation time.
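Schematically, the two stages (and the generation step) look like this; the function and method names are hypothetical and the bodies are elided:

```python
def train_autoencoder(autoencoder, images):        # stage 1: reconstruction training
    ...

def train_prior(prior, autoencoder, images):       # stage 2: only the encoder is needed
    latents = [autoencoder.encode(img) for img in images]
    ...                                            # fit PixelCNN (or another AR model) on latents

def generate(prior, autoencoder):                  # generation: only the decoder is needed
    latent = prior.sample()                        # compressed latent image of discrete values
    return autoencoder.decode(latent)
```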

In the above discussion, we skipped an implementation detail: how do we make the network take discrete values as input or produce them as output? Inputting discrete values is relatively straightforward: just as NLP uses an embedding layer to convert discrete words into continuous vectors, we can do the same for image tokens. But how do we get the network to output discrete values? This is where vector quantization (VQ) comes into play.

We are all familiar with quantization, for instance rounding a decimal to the nearest integer. By the same logic, for vectors, if we have a predefined set of vectors (analogous to the “integers”), vector quantization transforms an arbitrary vector into the nearest known vector, where “nearest” refers to Euclidean distance.

A concrete example is shown below. The encoder outputs continuous vectors arranged in a 2D feature map. By searching for the nearest neighbor in the embedding layer (also known as the “codebook”), each continuous vector is converted into an integer that is the index of that nearest neighbor. This index can be treated like a token in NLP, so the encoder’s output features are turned into a token map. Then, when feeding the token map into the decoder, we look up the table (the embedding layer) using those indices to convert them back into embedding vectors. Strictly speaking, an autoencoder that compresses images into discrete latent images is called “VQVAE,” but the term “VQVAE” is sometimes used to refer to the entire two-stage generation method.
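The nearest-neighbor lookup at the heart of VQ takes only a few lines; tensor shapes here are illustrative.

```python
import torch

def vector_quantize(feats, codebook):
    """Nearest-neighbor vector quantization.

    feats:    (N, D) continuous encoder vectors
    codebook: (K, D) embedding table
    returns:  token indices (N,) and the corresponding embeddings (N, D)
    """
    dists = torch.cdist(feats, codebook)   # Euclidean distance to every codebook entry, (N, K)
    tokens = dists.argmin(dim=1)           # index of the nearest codebook vector
    return tokens, codebook[tokens]        # token map and its embedding lookup
```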

These objects go by different names in different papers, and most authors refer to them only with mathematical symbols. Here, we use the terminology from the VAR paper: “encoder features,” “token map,” and “embeddings.”

We will not dive into the specific learning process of the embedding layer here. If you are not familiar with vector quantization, I recommend carefully studying the original VQVAE paper.

VQGAN

VQVAE’s results are not great, mainly because both its compression network and its generation network are underpowered. Consequently, the VQGAN method improves both of VQVAE’s networks:

  • The autoencoder is replaced with VQGAN (as with VQVAE, the name of the autoencoder, VQGAN, is sometimes used to denote the entire method). On top of VQVAE, VQGAN adds a perceptual loss and a GAN loss during training, which substantially improves the reconstruction quality of the autoencoder.
  • Instead of using PixelCNN as the generation model, the VQGAN method uses a Transformer.

Transformer

Transformer is currently the most dominant backbone network. Compared with other network architectures, its biggest hallmark is that the elements of a sequence communicate information only through attention operations. To handle the text autoregressive generation task, the earliest Transformers used two special designs:

  • Because attention alone cannot reflect the order of input elements, each token embedding is combined with positional encoding before being fed into the network.
  • AR requires that earlier tokens not see information from subsequent tokens. Therefore, a mask is added to self-attention to control information flow between tokens.

VQGAN uses the exact same design, treating image tokens like text tokens and generating them with a Transformer.

From Next-Token Prediction to Next-Scale Prediction

Traditional image autoregressive generation employs next-token prediction:

  • An autoencoder compresses the image into discrete tokens.
  • Tokens are generated one by one in a left-to-right, top-to-bottom order.

Even though the number of tokens is drastically reduced by the autoencoder, generating tokens one at a time is still too slow. To address this, VAR proposes a faster, more intuitive AR strategy:

  • An autoencoder compresses the image into multiple scales of discrete token maps. For example, if a single latent image used to be $16 \times 16$, we now represent that same latent image using a series of token maps with scales $1 \times 1, 2 \times 2, …, 16 \times 16$.
  • Start from the smallest token map and generate larger token maps in ascending order of scale.

Given this strategy, we must modify both the autoencoder and the generation model. Let us check how VAR accomplishes these modifications.

Multi-Scale Residual Quantization Autoencoder

First, let us look at the changes in the autoencoder. Now, the token maps are not just a single map but multiple maps at different scales. Since the definition of token map has changed, so must the definitions of encoder features and embeddings, as shown below.

We can still use VQVAE’s vector quantization approach. The new question is: how do we merge multiple scales of token maps into a single image, given that the encoder output and the decoder input are each just one image?

The simplest approach is to leave the encoder and decoder unchanged, letting them still take and produce the largest-scale image. Only in the middle (at the vector quantization / embedding lookup stage) do we downsample the image.

VAR, however, uses a more advanced approach: a residual pyramid to represent these latent features. Let us briefly recall the classic Laplacian Pyramid algorithm in image processing. We know that each downsampling step loses some information. If that is the case, we can represent a high-resolution image as a low-resolution image plus its “losses” at each resolution scale. As illustrated below, the rightmost column represents the output of the Laplacian Pyramid.

When computing the Laplacian Pyramid, we repeatedly downsample the image and compute the residual between the current-scale image and the upsampled version of the next-scale image. Then, by upsampling the lowest-resolution image and adding the residuals from each layer, we can accurately reconstruct the high-resolution original image.
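As a concrete reference, here is a minimal sketch of building and inverting such a pyramid with bilinear resampling; the choice of resampling filter is illustrative.

```python
import torch.nn.functional as F

def build_laplacian_pyramid(img, levels=3):
    """img: (B, C, H, W). Returns [finest residual, ..., coarsest residual, lowest-res image]."""
    pyramid, current = [], img
    for _ in range(levels - 1):
        down = F.interpolate(current, scale_factor=0.5, mode="bilinear", align_corners=False)
        up = F.interpolate(down, size=current.shape[-2:], mode="bilinear", align_corners=False)
        pyramid.append(current - up)   # information lost by this downsampling step
        current = down
    pyramid.append(current)            # lowest-resolution image
    return pyramid

def reconstruct_from_pyramid(pyramid):
    """Upsample the lowest-resolution image and add back the residuals, coarse to fine."""
    current = pyramid[-1]
    for residual in reversed(pyramid[:-1]):
        current = F.interpolate(current, size=residual.shape[-2:], mode="bilinear", align_corners=False)
        current = current + residual
    return current
```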

Our goal is to do something analogous for the encoder features. How do we split the largest-scale encoder features into an accumulation of features at different scales?

In constructing the Laplacian Pyramid, we rely on two operations: degradation and restoration. For images, degradation is downsampling, and restoration is upsampling. For the latent features output by the encoder, we need to define analogous degradation and restoration. Rather cleverly, instead of simply defining them as downsampling/upsampling, VAR references the paper Autoregressive Image Generation using Residual Quantization, regarding the quantization error introduced by vector quantization as a part of degradation. In other words, our new goal is not to ensure that the sum of all scale features equals the encoder features exactly, but rather that the sum of all scale embeddings is as similar as possible to the encoder features, as depicted below.

Given this, we define degradation as downsampling + vector quantization/embedding lookup, and restoration as upsampling + a learnable convolution. Let us see how VQVAE’s original vector quantization and embedding lookup should be applied in this new context.

First, consider the new multi-scale vector quantization operation. Its input is the encoder features; its output is a series of token maps at different scales. The algorithm starts from the lowest scale, and each loop iteration outputs the token map at the current scale, then passes the remaining residual features on to the next scale.

As for the multi-scale embedding lookup operation, its input is the multi-scale token maps, and its output is a single feature image at the largest scale, to be fed into the decoder. For this step, we only need to do an embedding lookup and restoration (upsampling + convolution) on each scale’s token map separately, then sum up the outputs from all scales to get features similar to the encoder’s. Note that for simplicity, these diagrams omit some implementation details, and some numerical values may not be perfectly rigorous.
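Putting the two operations together, the multi-scale residual quantization loop can be sketched as follows. The interpolation modes, tensor layouts, and the per-scale restoration convolutions `phi` (e.g. a list of `nn.Conv2d` layers, one per scale) are illustrative assumptions rather than VAR's exact implementation.

```python
import torch
import torch.nn.functional as F

def multiscale_residual_quantize(feats, codebook, scales, phi):
    """Decompose encoder features into multi-scale residual token maps (schematic).

    feats:    (B, D, H, W) encoder features
    codebook: (K, D) shared embedding table
    scales:   side lengths, e.g. [1, 2, 3, ..., 16]
    phi:      per-scale restoration convolutions (applied after upsampling)
    """
    B, D = feats.shape[:2]
    residual, token_maps = feats, []
    reconstructed = torch.zeros_like(feats)
    for s, conv in zip(scales, phi):
        r = F.interpolate(residual, size=(s, s), mode="area")      # degrade to scale s
        flat = r.permute(0, 2, 3, 1).reshape(-1, D)                # (B*s*s, D)
        tokens = torch.cdist(flat, codebook).argmin(dim=1)         # VQ at this scale
        token_maps.append(tokens.reshape(B, s, s))
        emb = codebook[tokens].reshape(B, s, s, D).permute(0, 3, 1, 2)
        restored = conv(F.interpolate(emb, size=feats.shape[-2:], mode="bilinear"))
        reconstructed = reconstructed + restored                   # running embedding-lookup sum
        residual = feats - reconstructed                           # what later scales must explain
    return token_maps, reconstructed    # `reconstructed` is what the decoder receives
```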

In summary, to implement scale-wise autoregressive generation, we must encode an image into multi-scale token maps. VAR employs a multi-scale residual quantization approach: it decomposes the encoder features into the smallest-scale features plus residual features at each scale, and applies vector quantization to the features at each scale. This not only effectively splits the features into multiple scales but also has another benefit: in standard VQVAE, only the largest-scale features are quantized, making the quantization error large; in VAR, the quantization error is distributed across multiple scales, thereby reducing the total quantization error and improving the autoencoder’s reconstruction accuracy.

Next-Scale Autoregressive Generation

Once we have compressed the image into multi-scale token maps, the rest is straightforward. We simply flatten all tokens into a one-dimensional sequence and train a Transformer on that sequence. Since the task is now “next-scale prediction,” the model outputs the probability distributions for all tokens at the same scale in one loop, rather than merely the next token. Thus, even though the sequence becomes longer, the model is still faster overall because it can generate all tokens of a given scale in parallel. Meanwhile, the attention masking changes accordingly. Now, tokens at the same scale can see each other, but tokens at previous scales cannot see those at later scales. The following diagram illustrates the difference in attention masks and generation procedures for a
$3 \times 3$ token image under the “next-token” vs. “next-scale” approach.
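The block-wise causal mask itself is easy to write down; the helper below is an illustrative sketch (in practice the mask is added to the attention logits inside the Transformer).

```python
import torch

def next_scale_attention_mask(scales=(1, 2, 3)):
    """Boolean mask where entry [i, j] is True if token i may attend to token j.

    Tokens attend to every token at their own scale and at earlier scales,
    but never to tokens at later scales.
    """
    counts = torch.tensor([s * s for s in scales])
    scale_id = torch.repeat_interleave(torch.arange(len(scales)), counts)
    return scale_id[:, None] >= scale_id[None, :]
```

For scales $(1, 2, 3)$, this yields a $14 \times 14$ mask in which, for example, the four second-scale tokens see each other and the single first-scale token, but not the nine third-scale tokens.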

In addition, the VAR Transformer also includes a few other modifications: 1) besides adding a 1D positional embedding to each token, tokens at the same scale share a scale embedding. All embeddings are learnable; 2) the Transformer and VQVAE decoder share the same embedding layer. Moreover, to reuse information from already generated images when creating a new scale, the initial embeddings for the new scale are obtained by performing bicubic upsampling on the previously generated results.

All other design elements of this Transformer are the same as those in VQGAN. For example, the Transformer has a decoder-only structure, and a special token encoding the ImageNet class is used as the start of the input sequence for class-conditional generation. The loss function is the cross-entropy loss.

Quantitative Experiments on ImageNet

We have now seen most of the core aspects of VAR’s method. Let us briefly look at its experimental results. The paper claims that VAR performs well in both image generation and model scaling experiments. Specifically, VAR beats DiT in terms of FID (Fréchet Inception Distance), and its generation speed is more than 45 times that of DiT. Let us focus on VAR’s results on the ImageNet $256 \times 256$ generation task. Below is a table from the paper. I have also included results from MAR (Autoregressive Image Generation without Vector Quantization) by Kaiming He’s group.

First, let’s compare DiT and VAR. In terms of speed, VAR is clearly much faster than DiT at any model size. In terms of FID as a measure of image quality, at the ~600M parameter scale, DiT still outperforms VAR. However, as the model size increases, DiT’s FID stops improving, whereas VAR’s FID keeps dropping. Eventually, VAR’s FID even surpasses that of the ImageNet validation set; at that point, there is little meaning in pushing FID any lower.

Then, let’s compare MAR and VAR. MAR achieves an even lower FID (1.55) with a 943M model. But according to the MAR paper, MAR is only about 5 times faster than DiT-XL, meaning VAR is still faster, by a factor of around 9 compared to MAR.

On ImageNet, the FID of most SOTA models has essentially saturated. The main takeaway from the FID results is that VAR exhibits strong generative capabilities, on par with or better than DiT. However, for more challenging tasks like text-to-image generation, VAR’s performance remains to be verified. Moreover, while DiT used 250 sampling steps to produce these benchmark numbers, in practice people usually sample in about 20 steps, and with distillation the number of sampling steps can be reduced to 4. Factoring in these acceleration techniques, VAR might not be faster than DiT.

Visualizing VAR’s Multi-Scale Generation

Having covered the main points of the paper, I will share some of my own theoretical analyses and experimental findings on VAR.

Let us look first at random sampling results. I used the largest VAR model with depth d=30. Under the default settings of the official sampling script, the outputs for two random seeds (0 and 15) are as shown below. The chosen ImageNet classes here are volcano, lighthouse, eagle, and fountain, with two images generated for each class. The generation is very fast, taking only about one second to produce all 8 images.

We can also inspect the intermediate images decoded at each scale after generation is complete. As expected, the image progresses from coarse to fine detail:

To further investigate which image components each scale is responsible for, we can do the following experiment: from a certain scale onward, switch the random number generator to a different seed. In the GIF for each scale, the unchanged portions come from the earlier scales, and the varying parts come from the subsequent scales. As we can see, from around the third scale onward, the overall content of the image is essentially fixed; that is, structural information is determined in the first two scales. The further we go, the more the image details are refined.

These results are rather surprising: does a $2 \times 2$ token map already determine the overall content of the image? Let us examine this in more detail.

Flaws in Single-Scale Generation

Some of you may feel something is off when studying VAR’s sampling algorithm: when generating the token map at a given scale, each token is sampled independently from a probability distribution.

According to the paper, VAR’s scale-autoregressive approach is a new autoregressive probabilistic model:

$$p(r_1, r_2, \ldots, r_K) = \prod_{k=1}^{K} p(r_k \mid r_1, r_2, \ldots, r_{k-1}),$$

where $r_k$ denotes the token map at the $k$-th scale (from smallest to largest), and there are $K$ scales in total. Tokens in the same scale $r_k$ are generated in parallel. This means that during training (with a cross-entropy loss) and sampling, the method treats the probability of a token map as the product of the probabilities of all its tokens, assuming independence:

$$p(r_k \mid r_1, \ldots, r_{k-1}) = \prod_{i=1}^{I_k} p(r_k^i \mid r_1, \ldots, r_{k-1}),$$

where $r_k^i$ is the $i$-th token at the $k$-th scale, and $I_k$ is the total number of tokens at that scale. I believe the above equation does not hold in principle. Even with the conditions imposed by previous scales, it is unlikely that the distributions for each token in the same scale are mutually independent. And as $I_k$ gets larger, the error introduced by that assumption grows.

Normally, independently sampling tokens at the same scale could result in inconsistencies in the generated image (like a “seam” every 16 pixels if each token represents a
$16 \times 16$ block). But why does VAR’s output still look coherent? On closer inspection of the VAR generation algorithm, we may find two features that greatly reduce discontinuities:

  • VAR’s autoencoder uses vector quantization. This ensures the decoder’s input is always “reasonable,” and that the decoded image remains coherent.
  • When generating a new scale, the model’s input is initialized with a bicubic upsampling of the previous scale’s result. This bicubic upsampling ensures continuity among token embeddings.

Moreover, to further mitigate the negative effects of independent sampling, VAR effectively determines the image’s overall content by the time it completes the second or third scale; subsequent scales only refine image details. (Because as the number of tokens increases, the error from independent sampling grows). We have already verified this through the visualizations above. To show that only the first few scales matter, I did a bold experiment: after the Transformer generates the first two scales of tokens, I randomly generate all subsequent tokens. In the figure below, I froze the outputs of the first two scales and then generated multiple images with different random seeds for the later scales. The results show that if the first two scales are generated well, the rest of the tokens (even if sampled randomly) hardly affect the final image quality.

Based on these experiments, I think the real reason VAR works cannot be merely summarized as “next-scale prediction is a superior new generation paradigm.” The core success factor may be its multi-scale residual quantization autoencoder, which at least accomplishes the following:

  • Uses vector quantization to ensure the decoder input is always valid.
  • Adopts a multi-scale residual design, where each new scale’s token map not only records the information lost by downsampling but also the precision lost through vector quantization. Compared to a simple, human-interpretable Laplacian pyramid, this learnable degradation process may be more powerful.
  • Performs bicubic upsampling of the lower-scale tokens, ensuring continuity in the generated images.

Of course, these components are entangled with one another. Without more in-depth experiments, we cannot pinpoint the single most crucial design element in VAR.

Multi-scale generation itself is not new—prior works like StyleGAN and Cascaded Diffusion have adopted similar strategies. However, VAR makes a bold choice: tokens within the same scale are sampled independently. Surprisingly, this mathematically questionable design does not severely degrade image quality. Thanks to this design, VAR can sample tokens within the same scale in parallel, drastically boosting generation speed.

Conclusions and Commentary

Previously, AR methods like VQGAN fell short in sampling speed and generation quality. The fundamental reason is that next-token prediction on image tokens is both somewhat misguided and slow. To address this, VAR proposes a new AR strategy: decompose the latent into multiple scales and generate it via next-scale prediction. To accommodate this, VAR modifies both the autoencoder and the Transformer used in VQGAN. In the autoencoder, images are encoded into multi-scale residual token maps, and in the Transformer, tokens within each scale are assumed to have independent distributions. Experiments show that on ImageNet, VAR surpasses diffusion models like DiT in image quality and is at least 45 times faster. Moreover, experiments suggest that VAR follows a scaling law.

From my perspective, as with other cutting-edge generative models, VAR has essentially maxed out ImageNet FID scores. Its performance on more challenging image-generation tasks remains to be proven. Recently, ByteDance released a text-to-image version of VAR called Infinity, but that model has not been open-sourced yet. We can continue to follow up on subsequent VAR-related work. As for speed, VAR may not be significantly faster than DiT once techniques such as reduced sampling steps and model distillation (for DiT) are applied. Of course, it is possible that VAR can be further accelerated in ways that have not been explored as extensively as with diffusion models.

Mathematically, the VAR approach has flaws: the token map’s distribution should not be the product of the independent distributions of its tokens. At least, the paper does not provide any analysis on this (nor does MAR, which uses a similar approach). Yet, simple generation experiments show that due to other designs that enforce continuity, the model outputs coherent images even if tokens at the same scale are sampled independently or even randomly. We need deeper experiments to really uncover why VAR is so effective.

I think if a research project could clearly explain which parts of VAR are making the biggest difference, retain those and discard the rest to propose a superior generation model, it would be a significant contribution. Potential directions for exploration include:

  • Only the first few scales of tokens seem crucial in VAR. Perhaps we could generate those earlier scales using a more refined approach—for instance, a diffusion model—to ensure quality, while using a more efficient (maybe faster than a Transformer) model for the higher-scale token images. This could further enhance both quality and speed.
  • VAR still relies on a VQ autoencoder, and no matter how you improve it, vector quantization reduces reconstruction accuracy. On the other hand, VQ can regularize the decoder’s input. Is it possible to replace VQ with VAE for its higher accuracy? And if so, how would we design a multi-scale encoding algorithm without VQ?

Video diffusion models generally suffer from a problem: video quality decreases as video length increases. To address this, the authors of Diffusion Forcing propose a new sequence generation paradigm: when training sequence diffusion models, noise of different levels is added independently to each element of the sequence. The effectiveness of this paradigm is verified on simple video generation and decision-making tasks. I will introduce this work mainly from the perspective of video generation.

Paper arxiv: https://arxiv.org/abs/2407.01392

Previous work

As we will see later, Diffusion Forcing is closely related to the two previous mainstream sequence generation paradigms: autoregressive generation (AR) and full-sequence diffusion models.

In autoregressive sequence generation, the model repeatedly predicts the $n$-th element based on the first $n-1$ elements of the sequence. AR is the most common paradigm in NLP and is used by both RNNs and Transformers.

Diffusion models can directly generate data of any shape. If we treat a video not as a sequence of images but as a “3D image,” we can directly extend a 2D image diffusion model into a 3D video diffusion model. This approach is referred to in the paper as the “full-sequence diffusion model.” Early works using this approach include Video Diffusion Models by the authors of DDPM. The authors of Stable Diffusion also proposed a similar work based on LDM, Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models (Video LDM).

Full-sequence diffusion models can only generate fixed-length videos. To extend them to long video generation, they have to be combined with AR. However, the frames generated this way do not match the distribution of the training set, causing continual quality degradation over the autoregressive process. Inspired by Cascaded Diffusion Models, Stable Video Diffusion and other works try to mitigate this problem by adding noise to the conditioning image or video frames.

AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation further explores the combination of AR and full-sequence diffusion models: when generating text with diffusion models, the noise for each token varies, and the earlier the token, the less the noise. Coincidentally, FIFO-Diffusion: Generating Infinite Videos from Text without Training shows how different video frames can be generated with different levels of noise on a pre-trained video diffusion model. Perhaps inspired by these works, Diffusion Forcing systematically explores how to independently add noise to sequence elements during training.

Research Motivation

The authors of this paper identify the shortcomings of AR and full-sequence diffusion models and argue that the two generative paradigms are complementary:

  • AR cannot incorporate new optimization objectives at inference time, and it suffers a quality degradation problem caused by the mismatch between training and inference samples.
  • Full-sequence diffusion models cannot generate sequences of varying lengths.

Conversely,

  • Full-sequence diffusion models can apply classifier guidance at inference time, and there is only a little degradation within the training sequence length.
  • AR can generate sequences of varying lengths.

Then, is it possible to combine the two? During sampling, we want the sequence to be generated autoregressively. Meanwhile, in terms of noise levels, we want each element to be gradually generated from full noise to full clarity. Here is what Diffusion Forcing does: when generating a sequence, earlier elements carry less noise and newer elements carry more noise, and all elements at their different noise levels are denoised simultaneously by the diffusion model. For example, if sampling requires 3 DDIM steps and we want to generate 3 frames, one denoising step would take $[x_1^{1/3\cdot T}, x_2^{2/3\cdot T}, x_3^{T}]$ to $[x_1^0, x_2^{1/3\cdot T}, x_3^{2/3\cdot T}]$.

To implement this kind of sampling, we must modify the training approach of diffusion models so that the model can correctly denoise even if the noise levels of inputs are not uniform. Unlike previous works, the authors find that we do not have to fix the noise levels of each element in training as we do in sampling, but can independently sample the noise levels of each frame.

Simple Video Generation Models

The idea of this paper is very concise: based on a video DDPM, it simply changes the noise levels of each frame in training and sampling. To understand the method further, let’s look at the paper’s video generation method and experiments.

Overall, the training method is the same as DDPM’s epsilon-prediction, except that each frame has its own noise level.
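A minimal sketch of such a training step is shown below, assuming a hypothetical `model(noisy, t)` that predicts the per-frame noise; the schedule handling is simplified.

```python
import torch

def diffusion_forcing_loss(model, video, alphas_cumprod):
    """One training step with an independently sampled noise level per frame (schematic).

    video:          (B, F, C, H, W) clean video clip
    alphas_cumprod: (T,) standard DDPM cumulative-alpha schedule
    """
    B, num_frames = video.shape[:2]
    t = torch.randint(0, len(alphas_cumprod), (B, num_frames))   # independent timestep per frame
    a = alphas_cumprod[t].view(B, num_frames, 1, 1, 1)
    eps = torch.randn_like(video)
    noisy = a.sqrt() * video + (1 - a).sqrt() * eps              # standard DDPM forward process
    return torch.nn.functional.mse_loss(model(noisy, t), eps)    # epsilon-prediction loss
```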

In terms of inter-frame relationships, Diffusion Forcing models causal relationships, meaning that the current frame can only see information from previous frames.

Specifically, this work uses the hidden state of an RNN (a GRU, to be exact) to model the information passed on from previous frames. After introducing the RNN, the paper complicates the simple formulas of DDPM; I do not recommend that readers delve into the RNN part.

Because different frames have different noise levels, we now need to define a two-dimensional noise schedule table over frames and denoising steps. To create the staggered noise levels at the start of sampling, the denoising timesteps of later frames initially stay in place. The details of the simultaneous denoising algorithm are given in the appendix of the paper.
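For intuition, here is one illustrative way to build such a staggered schedule table; it reproduces the 3-frame, 3-step example given earlier but is not necessarily the exact table used in the paper.

```python
def pyramid_schedule(num_frames, num_steps):
    """Entry [s][f] is frame f's noise level after s denoising updates
    (0 = fully denoised, num_steps = pure noise)."""
    total = num_steps + num_frames - 1   # later frames start denoising later
    return [[min(num_steps, max(0, num_steps - s + f)) for f in range(num_frames)]
            for s in range(total + 1)]
```

With `num_frames=3` and `num_steps=3`, the rows go $[3,3,3] \to [2,3,3] \to [1,2,3] \to [0,1,2] \to [0,0,1] \to [0,0,0]$: later frames stay at full noise at first, and every frame ends fully denoised, matching the example above.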

The authors find that Diffusion Forcing can be extended to generating videos of unbounded length: when generating the next segment, the RNN’s initial hidden state is set to the RNN’s final hidden state from the previous segment, without using a sliding window.

The authors train two baseline models with the same RNN architecture: an autoregressive model and a full-sequence causal diffusion model. The qualitative results show that Diffusion Forcing outperforms both baselines, whether within or beyond the training video length. The results can be viewed on the official project website:

https://boyuan.space/diffusion-forcing/

Critical Analysis

At the beginning of the paper, the authors say that AR lacks the capacity to add conditions at inference time. But this is not fatal for video generation, because conditions can usually be added during training and applied with classifier-free guidance at inference.

The authors say they implemented Diffusion Forcing with an RNN for simplicity. But a 3D U-Net would clearly be the easiest and most intuitive choice; after all, the earliest video diffusion models used 3D U-Nets. In the official repository, an undergraduate student helped them implement a 3D U-Net with temporal attention that works better than the original video model.

I think the video generation baselines in the paper are not strong enough. Most autoregressive video generation / image-to-video models employ the noise augmentation method proposed in Cascaded Diffusion, which adds noise to the condition image and feeds the noise level as an additional condition to the denoising model. This design is similar in spirit to Diffusion Forcing. To demonstrate the benefits of the new approach, it is necessary to compare Diffusion Forcing with these stronger AR baselines.

The design of the full-sequence video diffusion baseline also looks strange. The motivation of this type of video diffusion model is to treat the video as a 3D image, allowing information exchange between frames, and only ensuring the coherence of the video within the training length. The authors instead implement a causal version of the full-sequence video model using an RNN, which is certainly not as good as the non-causal version. Although the authors say that Diffusion Forcing is always more coherent than the full-sequence diffusion models, I doubt whether Diffusion Forcing can beat non-causal full-sequence diffusion models.

The main benefit of Diffusion Forcing in video generation should be long video generation beyond the training length. Therefore, it doesn’t matter much if full-sequence diffusion models perform better within the training length. The authors should instead compare against methods that directly combine autoregressive generation with full-sequence diffusion models to show the superiority of Diffusion Forcing in long video generation.

To sum up, I think the authors’ experiments on the video generation task are not sufficient. To be fair, half the paper focuses on decision-making tasks, not just video generation. I believe Diffusion Forcing will mitigate the degradation in long video generation, and we may see large companies build better long-video diffusion models using it. But the fundamental problem with long video generation is the loss of memory, an essential issue that Diffusion Forcing cannot solve.

My biggest takeaway from this work is that we tend to treat videos as complete 3D data and forget that a video can also be treated as an image sequence. If the video is treated as 3D data, different frames can only see each other’s information at the current denoising step through temporal attention. But for sequential data, we can design richer dependencies between frames, such as the different denoising levels used in this work. I have long been thinking about a sequence-generation paradigm with stronger sequence dependencies: can we condition the current element on all information (including intermediate denoising outputs and intermediate variables of the denoising network) from all other elements at all denoising steps? Such a strongly conditioned sequence model might help the consistency of multi-view generation and video segment generation. Since the generation is conditioned on another denoising process, any edits we make to that denoising process naturally propagate to the current element. For example, in video generation, if the entire video is conditioned on the denoising process of the first frame, we can edit the first frame with any diffusion-based image editing method and propagate the changes to the whole video. Of course, this is just a rough idea without detailed consideration; you are welcome to think along this direction.

One might also wonder if Diffusion Forcing could be extended to model pixel relationships. I don’t think there’s a problem with training at all. The problem is with inference: Diffusion Forcing needs to predefine the denoising schedule table for different elements and timesteps. For sequential data such as video, it is natural that the earlier the frame, the lower the noise level. However, how to define the denoising schedule of different pixels is not trivial.