Playing with Diffusion Models
In this project, we use the DeepFloyd IF diffusion model, a two-stage model trained by Stability AI.
The first stage takes a text prompt and outputs a 64x64 image. The second stage takes that 64x64 image and upsamples it to 256x256.
The main parameter that can be adjusted is num_inference_steps, which controls how many "denoising steps" to take. At a high level,
a diffusion model is trained to remove noise from an image, and it "generates" novel images by taking some number of denoising steps starting from pure noise.
(You can think of a noisier image as a grainier image.)
More steps generally means higher quality, at the cost of more computation time.
Below, you can see three different prompts given to the diffusion model, with an increasing number of denoising steps.
We use seed=18010 (which is also used for the rest of the project).
[Figure: three prompts, each generated with steps = 5, 10, 20, and 40]
At the lowest step count (5), you can see a repetitive texture that hasn't been smoothed out. Diffusion models generate from pure noise,
but the fact that the texture covers the entire image suggests that this pattern comes not from the noise itself, but from unrefined textures introduced
by the model.
The higher step counts do increase quality, but the gains quickly reach diminishing returns. Interestingly, at the highest number of steps,
the rocket ship prompt seems to regress, looking more cartoon-like. The other two prompts, however, become sharper and more detailed.
As discussed earlier, diffusion models are trained to remove noise from an image. An important part of diffusion is adding noise. The forward process
can be described by:
\(
\begin{equation}
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon
\end{equation}
\)
Here \( \epsilon \sim N(0, I) \) is standard Gaussian noise, and the \( \bar{\alpha}_t \) are hyperparameters that control how the mean and variance of \( x_t \) change over time.
In this project, we use DeepFloyd's hyperparameters.
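As a rough illustration of the forward process, here is a minimal pure-Python sketch. The linear beta schedule, the helper names `make_alphas_cumprod` and `forward`, and the four-value toy "image" are all assumptions made for this example; they are not DeepFloyd's actual hyperparameters or code.

```python
import math
import random


def make_alphas_cumprod(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for an assumed linear beta schedule."""
    alphas_cumprod = []
    running = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alphas_cumprod.append(running)
    return alphas_cumprod


def forward(x0, t, alphas_cumprod, rng):
    """Sample x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps and return (x_t, eps)."""
    a_bar = alphas_cumprod[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
           for x, e in zip(x0, eps)]
    return x_t, eps


rng = random.Random(18010)
alphas_cumprod = make_alphas_cumprod()
x0 = [0.5, -0.25, 1.0, 0.0]  # toy flattened "image" (hypothetical values)

for t in (250, 500, 750):
    x_t, _ = forward(x0, t, alphas_cumprod, rng)
    print(t, round(alphas_cumprod[t], 4), [round(v, 3) for v in x_t])
```

As t grows, \( \bar{\alpha}_t \) shrinks, so the signal term fades and the noise term dominates.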
Below, a visual example of the forward noise is shown.
We can use the off-the-shelf DeepFloyd model to remove noise. We can denoise in a single step
by rearranging the forward process equation:
\(
\begin{align}
x_t &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \nonumber \\
\sqrt{\bar{\alpha}_t}\, x_0 &= x_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon \nonumber \\
x_0' &= \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, \tilde{\epsilon}}{\sqrt{\bar{\alpha}_t}}
\end{align}
\)
Here, in equation (2), \( \tilde{\epsilon} \) denotes the model's estimate of the noise,
and \( x_0' \) denotes the resulting estimate of the clean image,
that is, of the image at time 0 of the forward noising process.
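The one-step estimate can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `estimate_x0` is a hypothetical helper, `a_bar = 0.3` is an arbitrary value of \( \bar{\alpha}_t \), and the "model" is replaced by the true noise plus a fixed error, since no real network is called here.

```python
import math
import random


def estimate_x0(x_t, eps_hat, a_bar):
    """One-step clean-image estimate: (x_t - sqrt(1 - a_bar) * eps_hat) / sqrt(a_bar)."""
    return [(x - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar)
            for x, e in zip(x_t, eps_hat)]


rng = random.Random(18010)
a_bar = 0.3                      # hypothetical value of alpha-bar at some timestep
x0 = [0.5, -0.25, 1.0, 0.0]      # toy flattened "image" (hypothetical values)
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e for x, e in zip(x0, eps)]

# With the exact noise, the rearranged equation recovers x0 up to rounding.
exact = estimate_x0(x_t, eps, a_bar)

# Stand-in for an imperfect model: every noise estimate is off by 0.1.
eps_hat = [e + 0.1 for e in eps]
approx = estimate_x0(x_t, eps_hat, a_bar)

print([round(v, 3) for v in exact])
print([round(v, 3) for v in approx])
```

Note that any error in the noise estimate is scaled by \( \sqrt{1 - \bar{\alpha}_t} / \sqrt{\bar{\alpha}_t} \), which grows as the image gets noisier, so one-step estimates degrade at high noise levels.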
Notice in the images below that you can hardly see anything!
(cf. 'king' - 'man' + 'woman' = 'queen'; here, 'conditioned' - 'unconditioned' = 'high quality photo', so 'conditioned' = 'some_base_noise' + 'high quality photo').
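The arithmetic in this analogy resembles what is usually called classifier-free guidance, where the conditioned and unconditioned noise estimates are combined by extrapolating along their difference. The sketch below assumes that standard formulation; the function name `cfg_noise` and the toy values are hypothetical, and the guidance scale used in this project is not stated here.

```python
def cfg_noise(eps_uncond, eps_cond, scale):
    """Combine noise estimates: eps_uncond + scale * (eps_cond - eps_uncond)."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]   # toy unconditioned noise estimate (hypothetical)
eps_cond = [1.0, 3.0]     # toy conditioned noise estimate (hypothetical)

# scale = 0 gives the unconditioned estimate, scale = 1 the conditioned one,
# and scale > 1 pushes further in the "conditioned - unconditioned" direction.
for scale in (0.0, 1.0, 7.0):
    print(scale, cfg_noise(eps_uncond, eps_cond, scale))
```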
i_start = [1, 3, 5, 7, 10, 20]. A larger i_start corresponds to less added noise and fewer strided denoising steps.
In the examples below, you can see that the earlier (smaller) i_start values produce essentially random generations, because there is too much noise. This algorithm is called
SDEdit.
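To make the role of i_start concrete, here is a small sketch. It assumes the strided schedule runs from timestep 990 down to 0 in strides of 30; that schedule, along with the helper names `strided_timesteps` and `sdedit_plan`, is an assumption for illustration and is not stated in this writeup.

```python
def strided_timesteps(start=990, stop=0, stride=30):
    """Timesteps from most noisy to clean, e.g. 990, 960, ..., 30, 0 (assumed schedule)."""
    return list(range(start, stop - 1, -stride))


def sdedit_plan(i_start, timesteps):
    """Return (timestep to noise the input image to, number of denoising steps left)."""
    t_start = timesteps[i_start]
    steps_remaining = len(timesteps) - 1 - i_start
    return t_start, steps_remaining


timesteps = strided_timesteps()
for i_start in [1, 3, 5, 7, 10, 20]:
    t_start, steps_remaining = sdedit_plan(i_start, timesteps)
    print(f"i_start={i_start:2d}: noise to t={t_start}, then {steps_remaining} denoising steps")
```

Under this assumed schedule, i_start = 1 noises the input almost completely, so little of the original image survives, while i_start = 20 starts from a lightly noised image and stays close to the original.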
[Figures: SDEdit results at i_start = 1, 3, 5, 7, 10, 20 for the Campanile, Cat, and Dog images, followed by six further rows of results at the same i_start values]
"an oil painting of an old man"
"an oil painting of people around a campfire"
"an oil painting of an old man"
"an oil painting of people around a campfire"
"a lithograph of waterfalls"
"a lithograph of a skull"
"a lithograph of a sunset"
"a lithograph of a skull"