Concept of Stable Diffusion
1. Latent Diffusion Process:
Diffusion Probabilistic Model: We can model image generation as the reversal of a gradual noising process. Let x_0 be the original clean image and x_t the image after t steps of added noise (t runs from 0 to T, where T is the total number of steps). In Stable Diffusion this process runs not on raw pixels but on a compressed latent representation produced by a variational autoencoder, which is what makes it a latent diffusion model. The forward diffusion process defines the probability of transitioning from the image at step t-1 (x_t-1) to a slightly noisier version (x_t):
q(x_t | x_t-1) = N(x_t; sqrt(1 - beta_t) * x_t-1, beta_t * I)
Here:
beta_t is a noise-level hyperparameter between 0 and 1, set by a predefined schedule that typically increases with t.
N(mean, variance) represents a Gaussian distribution: its mean sqrt(1 - beta_t) * x_t-1 slightly shrinks the previous image, and its variance beta_t * I injects fresh Gaussian noise. In the forward process both quantities are fixed by the schedule, so nothing is learned here.
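A useful property of this forward process is that x_t can be sampled directly from x_0 in closed form: q(x_t | x_0) = N(x_t; sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I), where alpha_bar_t is the cumulative product of (1 - beta_s) up to step t. Below is a minimal NumPy sketch of this; the linear schedule values and image shape are illustrative assumptions, not Stable Diffusion's actual settings.

```python
import numpy as np

# Assumed linear noise schedule (illustrative, not SD's exact values)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)     # alpha_bar_t = product of (1 - beta_s) for s <= t

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot instead of t sequential steps."""
    noise = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return x_t, noise

# Toy usage: noise a stand-in "image" to step t = 500
rng = np.random.default_rng(0)
x0 = rng.standard_normal((64, 64, 3))
x_t, eps = q_sample(x0, 500, rng)
```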
Reverse Process (Denoising): Stable Diffusion aims to achieve the opposite: denoising the image starting from a highly noisy version (x_T). This is done by learning the reverse transition p(x_t-1 | x_t) = N(x_t-1; mu_theta(x_t, t), Sigma_theta(x_t, t)). A neural network, in Stable Diffusion a U-Net, is trained to predict the noise present in x_t, from which the mean mu_theta of this distribution is computed; the covariance Sigma_theta is usually fixed by the noise schedule.
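To make this concrete, here is a minimal DDPM-style sampling loop in PyTorch. It assumes a trained noise-prediction network model(x_t, t) (a stand-in for the U-Net; the name and signature are assumptions) and schedule tensors betas and alpha_bars like those in the previous snippet, converted to torch tensors.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, shape, betas, alpha_bars):
    """Start from pure noise x_T and iteratively denoise down to x_0."""
    alphas = 1.0 - betas
    x = torch.randn(shape)                        # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps_hat = model(x, t)                     # network's estimate of the noise in x_t
        # Mean of p(x_t-1 | x_t), computed from the predicted noise
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps_hat) / torch.sqrt(alphas[t])
        if t > 0:
            x = mean + torch.sqrt(betas[t]) * torch.randn_like(x)   # inject schedule noise
        else:
            x = mean                              # the last step is deterministic
    return x
```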
2. Text Encoding and Conditioning:
Text Encoder: A separate neural network (often a Transformer model) takes your text description as input and encodes it into a latent representation, z. This essentially captures the semantic meaning of your words.
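As a concrete example, Stable Diffusion v1 uses a frozen CLIP text encoder. A minimal sketch using the Hugging Face transformers library (the checkpoint name matches SD v1.x; other versions use different encoders):

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# The CLIP text encoder used by Stable Diffusion v1.x
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

prompt = "a watercolor painting of a fox in the snow"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    z = text_encoder(**tokens).last_hidden_state   # shape (1, 77, 768): one vector per token
```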
Conditioning the Denoising Process: The latent representation z is incorporated into the denoising process to guide the model towards generating an image that aligns with your text description. This can be achieved through various techniques, such as concatenating z with the noisy image representation at each step or using an attention mechanism.
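In Stable Diffusion the attention route is the one used: cross-attention layers inside the U-Net let image features (queries) attend to the text embedding z (keys and values). A minimal single-head PyTorch sketch, with illustrative dimensions:

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Image features attend to text embeddings (single-head, for clarity)."""
    def __init__(self, img_dim, txt_dim, attn_dim):
        super().__init__()
        self.to_q = nn.Linear(img_dim, attn_dim)   # queries from image features
        self.to_k = nn.Linear(txt_dim, attn_dim)   # keys from text embedding z
        self.to_v = nn.Linear(txt_dim, attn_dim)   # values from text embedding z
        self.out = nn.Linear(attn_dim, img_dim)

    def forward(self, img_feats, z):
        q, k, v = self.to_q(img_feats), self.to_k(z), self.to_v(z)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        attn = scores.softmax(dim=-1)              # which text tokens each position attends to
        return img_feats + self.out(attn @ v)      # residual connection

# Toy usage: 64 spatial positions of U-Net features, 77 text tokens
layer = CrossAttention(img_dim=320, txt_dim=768, attn_dim=320)
out = layer(torch.randn(1, 64, 320), torch.randn(1, 77, 768))
```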
3. Loss Functions:
During training, the model learns by comparing its outputs with real images and their corresponding text descriptions. The denoising U-Net itself is typically trained with a simple mean squared error between the predicted and actual noise; the image-space losses below are used chiefly when training the autoencoder that maps between pixels and latents (a code sketch follows the list):
Pixel-wise Loss Functions: These measure the difference between the generated image pixels (x_0) and the corresponding pixels in a real image. Examples include Mean Squared Error (MSE) and L1 loss.
Perceptual Loss Functions: These compare the generated and real images through the feature activations of a pre-trained image recognition model. Rather than demanding exact pixel agreement, this encourages the generated image to activate similar neurons in a pre-trained convolutional neural network, which correlates better with perceived visual and semantic similarity.
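A minimal sketch of both loss families in PyTorch. The VGG16 feature extractor is one common choice for perceptual losses; real implementations typically compare several layers and apply ImageNet normalization first.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

# Pixel-wise losses: direct per-pixel comparison
def pixel_loss(generated, real):
    return F.mse_loss(generated, real)            # L1 alternative: F.l1_loss(generated, real)

# Perceptual loss: compare activations of a frozen pre-trained network
vgg_features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)                       # the feature extractor is never trained

def perceptual_loss(generated, real):
    return F.mse_loss(vgg_features(generated), vgg_features(real))
```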
4. Optimization:
The entire model is optimized using gradient descent, in practice usually an adaptive variant such as Adam. The gradients of the chosen loss function with respect to the model parameters (the weights and biases of the U-Net, and of the text encoder when it is not frozen, as in Stable Diffusion it typically is) are computed by backpropagation and used to update those parameters so as to minimize the loss. This iterative process continues for a large number of training steps, allowing the model to learn how to generate realistic images conditioned on text descriptions.
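Putting the pieces together, one training step might look like the following PyTorch sketch. The unet, T, and alpha_bars names refer to the assumed network and schedule from the earlier snippets; everything here is illustrative, not Stable Diffusion's exact training code.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-4)   # unet assumed defined above

def train_step(x0, text_emb):
    """One optimization step on the noise-prediction objective."""
    t = torch.randint(0, T, (x0.shape[0],))                  # random timestep per sample
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)                  # broadcast over image dims
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1 - a_bar) * noise

    pred = unet(x_t, t, text_emb)                            # predict the noise that was added
    loss = F.mse_loss(pred, noise)                           # simple MSE training objective

    optimizer.zero_grad()
    loss.backward()                                          # backpropagate gradients
    optimizer.step()                                         # gradient-descent update
    return loss.item()
```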