GAN 2: Build Better GANs

Some notes on the GAN (Generative Adversarial Network) Specialization Course, by Yunhao Cao (Github@ToiletCommander)


Note: These notes cover weeks 1-3 (version as of 2022/6/16)

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Acknowledgements: Some of the resources (images, animations) are taken from:

Sharon Zhou & Andrew Ng and DeepLearning.AI’s GAN Specialization Course Track

articles on towardsdatascience.com

images from Google image search (from other online sources)

Since this article is not for commercial use, please let me know if you would like your resource taken down by creating an issue on my GitHub repository.

Last updated: 2022/6/16 14:15 CST

Evaluating a GAN

Evaluating a GAN is challenging because:

GANs are learning to compose, not to sing a given tune: there is no single correct output, so it is hard to say whether a generated sample is right or wrong
1. Classifiers, in contrast, have labeled data to compare against

Although a GAN does label images as real or fake, its discriminator always overfits to its own generator, so there is no universal discriminator that can compare two different generators

But we can compare images

We compare several images generated by GANs in terms of their fidelity and diversity. But it can be challenging.

Pixel Distance

Works, but is not reliable

Imagine the fake image is the real image shifted one pixel to the right: the pixel distance becomes large even though the two images look almost identical.
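
The one-pixel shift problem can be reproduced in a few lines. This is a minimal sketch, assuming NumPy and a hypothetical striped test image; `pixel_distance` is an illustrative helper, not something defined in the course.

```python
import numpy as np

def pixel_distance(a, b):
    """Mean absolute per-pixel difference between two same-shaped images."""
    return float(np.mean(np.abs(a.astype(float) - b.astype(float))))

# Hypothetical test image: vertical black/white stripes, each one pixel wide.
image = np.tile(np.array([0.0, 1.0]), (8, 4))   # shape (8, 8)
shifted = np.roll(image, shift=1, axis=1)       # same stripes, moved one pixel right

same = pixel_distance(image, image)      # identical images
moved = pixel_distance(image, shifted)   # visually the same pattern, maximal distance
```

Here `moved` reaches the largest possible value for images in the range 0 to 1, although a human would call the two images nearly the same.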

Feature Distance

Less sensitive to small shifts

Extract information about the real and fake images and then compare the features (high-level semantic information)


But how do we extract features? Use a classifier pre-trained on the ImageNet dataset and take the activations of an intermediate layer (most commonly the last pooling layer); each node in that layer acts as a feature detector.

The classifier is usually an Inception network (Sharon mentioned Inception-v3, which is 42 layers deep yet computationally efficient; I think the newest version is now Inception-v4)

Frechet Distance and Frechet Inception Distance

The Frechet distance was developed to measure the distance between curves, but it can be adapted to measure the distance between distributions as well.

We will use the classic dog walker scenario to illustrate the concept

Both the dog walker and the dog are walking, each along their own curve

they can move at different speeds

they can move in different directions

but neither of them can move backwards

The Frechet distance is the minimum leash length needed to walk both curves from beginning to end, i.e. the shortest leash that never has to be given more slack during the walk.
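
For curves given as lists of points, the dog walker idea can be sketched with the discrete Frechet distance, a standard dynamic-programming approximation in which both walkers hop from point to point. This is an illustrative sketch only; the two paths below are made-up examples.

```python
import math

def discrete_frechet(p, q):
    """Discrete Frechet distance between two polylines given as lists of (x, y) points."""
    n, m = len(p), len(q)
    # leash[i][j]: shortest leash that lets the walkers reach p[i] and q[j]
    # when neither of them is allowed to step backwards.
    leash = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                leash[i][j] = d
            elif i == 0:
                leash[i][j] = max(leash[0][j - 1], d)
            elif j == 0:
                leash[i][j] = max(leash[i - 1][0], d)
            else:
                best_previous = min(leash[i - 1][j], leash[i][j - 1], leash[i - 1][j - 1])
                leash[i][j] = max(best_previous, d)
    return leash[n - 1][m - 1]

walker_path = [(0, 0), (1, 0), (2, 0), (3, 0)]   # hypothetical path of the walker
dog_path = [(0, 1), (1, 2), (2, 1), (3, 1)]      # hypothetical path of the dog
leash_length = discrete_frechet(walker_path, dog_path)
```

The dog has to pass through the point (1, 2), which is never closer than 2 units to the walker's path, so no leash shorter than 2 can work for these two paths.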

We can also determine the Frechet Distance between univariate normal distributions using a similar idea.

d(X,Y)=(\mu_X-\mu_Y)^2+(\sigma_X-\sigma_Y)^2

On a multivariate Normal distribution, the Frechet Distance looks like:

d(X,Y)=||\mu_X-\mu_Y||^2+Tr(\Sigma_X+\Sigma_Y-2\sqrt{\Sigma_X \Sigma_Y})

Where $Tr(\cdot)$ is the trace of a matrix (sum of diagonal terms)

And the square root is the matrix square root, not an element-wise square root

Applied to the embeddings of real and fake images (each modeled as a multivariate normal distribution), the above formula is called the Frechet Inception Distance (FID), one of the most commonly used metrics for evaluating GANs. The lower the FID, the closer the fake image distribution is to the real image distribution.
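
The multivariate formula can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions: the feature arrays stand in for embeddings that a pre-trained classifier would produce, the function names are illustrative, and the trace of the matrix square root is computed from eigenvalues rather than by forming the square root explicitly.

```python
import numpy as np

def frechet_distance(mu_x, sigma_x, mu_y, sigma_y):
    """Frechet distance between N(mu_x, sigma_x) and N(mu_y, sigma_y)."""
    mean_term = float(np.sum((mu_x - mu_y) ** 2))
    # Tr(sqrt(A)) equals the sum of the square roots of A's eigenvalues.
    # For a product of two covariance matrices these eigenvalues are real and
    # non-negative up to rounding error, so tiny imaginary or negative parts are dropped.
    eigenvalues = np.linalg.eigvals(sigma_x @ sigma_y)
    trace_sqrt = float(np.sum(np.sqrt(np.clip(eigenvalues.real, 0.0, None))))
    return mean_term + float(np.trace(sigma_x) + np.trace(sigma_y)) - 2.0 * trace_sqrt

def fid_from_features(real_features, fake_features):
    """FID from two (num_samples, num_features) arrays of classifier embeddings."""
    mu_r, mu_f = real_features.mean(axis=0), fake_features.mean(axis=0)
    sigma_r = np.cov(real_features, rowvar=False)
    sigma_f = np.cov(fake_features, rowvar=False)
    return frechet_distance(mu_r, sigma_r, mu_f, sigma_f)

# Univariate sanity check against d = (mu_X - mu_Y)^2 + (sigma_X - sigma_Y)^2
# with mu_X = 0, sigma_X = 1, mu_Y = 3, sigma_Y = 2 (variances 1 and 4).
d_1d = frechet_distance(np.array([0.0]), np.array([[1.0]]),
                        np.array([3.0]), np.array([[4.0]]))
```

In one dimension the multivariate expression reduces to the univariate formula, which is what the last call checks: (0 - 3)^2 + (1 - 2)^2 = 10.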

Shortcomings of FID

Uses pre-trained Inception model which may not capture all features

Needs a large sample size

Slow to run

Limited statistics used: only the mean and covariance (the first two moments).

Inception Score

Now replaced by FID

But reported in many papers so good to know

Keeps the classifier intact and doesn't use any intermediate values
1. Looks directly at the output of the Inception network
1. If $P(y|X)$ is high for one particular class, the image is arguably high fidelity, since it is easy to recognize and its features clearly resemble one class.
1. Looks across many samples to check whether the generator produces many different classes, i.e. whether $P(y)$ is spread out

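Both are combined into a single score using the KL divergence

D_{KL}(P(y|X)||P(y)) = \sum_{y} P(y|X) \log(\frac{P(y|X)}{P(y)})

It measures how much the per-image label distribution $P(y|X)$ differs from the marginal label distribution $P(y)$, i.e. how much information you gain from seeing the image compared to knowing only $P(y)$.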

So the Inception Score is

IS = \exp(\mathbb{E}_{X \sim p_g} D_{KL}(P(y|X)||P(y)))
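
The score can be sketched directly from classifier outputs. This is a minimal sketch, assuming NumPy and an array whose rows are the class probabilities $P(y|X)$ for each fake image; the two toy arrays below are made up to show the extreme cases.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (num_images, num_classes) array; each row is P(y|X) for one fake image."""
    marginal = probs.mean(axis=0)   # P(y), estimated by averaging over the samples
    kl_per_image = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    return float(np.exp(kl_per_image.mean()))

# Each image is confidently a different class: high fidelity and high diversity.
confident_and_diverse = np.eye(4)
# Every image is confidently class 0: high fidelity, no diversity.
confident_but_collapsed = np.tile([1.0, 0.0, 0.0, 0.0], (4, 1))

score_diverse = inception_score(confident_and_diverse)
score_collapsed = inception_score(confident_but_collapsed)
```

With 4 classes the score ranges from 1 (the collapsed case) up to 4 (the diverse case), so the upper bound is the number of classes.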

But

Can be exploited or gamed
1. Generate one realistic image of each class

Only looks at fake images
1. No comparison to real images

Can miss useful features
1. ImageNet isn’t everything

Ways to sample images for comparison

For real, we usually just sample uniformly from the data

For fake, since the generator was trained on noise vectors $\vec{z}$ drawn from a normal distribution, we can draw from the same distribution to generate fakes for comparison, but with some tricks.
1. There’s a tradeoff between fidelity and diversity
1. If we sample closer to the middle of the distribution, we improve the fidelity of the sampled fakes, since those values occurred more often in training, but we lose diversity, since we give up certain regions of $\vec{z}$.
1. Truncation trick: sample $\vec{z}$ from a normal distribution with its tails cut off; how much of the tails is cut controls the tradeoff
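
One way to sketch the truncation trick is to resample any noise entry that falls outside a chosen bound. This is an assumption made for illustration (a library truncated-normal sampler would do the same job), and `sample_truncated_noise` and its parameters are hypothetical names.

```python
import numpy as np

def sample_truncated_noise(num_samples, z_dim, truncation, seed=0):
    """Sample z ~ N(0, I), resampling every entry whose magnitude exceeds `truncation`."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((num_samples, z_dim))
    out_of_range = np.abs(z) > truncation
    while out_of_range.any():
        z[out_of_range] = rng.standard_normal(int(out_of_range.sum()))
        out_of_range = np.abs(z) > truncation
    return z

z_wide = sample_truncated_noise(1000, 64, truncation=2.0)     # keeps more diversity
z_narrow = sample_truncated_noise(1000, 64, truncation=0.5)   # favors fidelity
```

A smaller `truncation` keeps the samples near the middle of the distribution, trading diversity for fidelity.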

Precision and Recall

In a GAN network, we want the generated image distribution (fake distribution) $P_g$ to match exactly with the distribution of $P_r$ (not a subset, not a super-set, but exactly equal)

Precision: area of the fake distribution overlapping the real distribution, divided by the area of the fake distribution. Percentage of fakes that look real ⇒ quality of generated fakes

Recall: area of the real distribution overlapping the fake distribution, divided by the area of the real distribution. Percentage of reals that can actually be faked ⇒ diversity of generated fakes

Models tend to be better at recall, but we can improve precision by using truncation methods.
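
As a toy illustration of the two ratios, suppose the real and fake distributions each cover a single 1-D interval. This is a deliberate simplification (real estimators work in a feature space), and the intervals below are made up.

```python
def interval_overlap(a, b):
    """Length of the overlap between two 1-D intervals given as (low, high)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def precision_recall(real, fake):
    overlap = interval_overlap(real, fake)
    precision = overlap / (fake[1] - fake[0])   # share of the fake region that looks real
    recall = overlap / (real[1] - real[0])      # share of the real region the fakes cover
    return precision, recall

# Hypothetical supports: real data covers [0, 10], the generator covers [5, 12].
precision, recall = precision_recall(real=(0.0, 10.0), fake=(5.0, 12.0))
```

Here the overlap has length 5, so precision is 5/7 (some fakes fall outside the real region) and recall is 5/10 (half of the real region is never generated).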

Shortcomings of GANs

Lack of intrinsic evaluation metrics
1. Even FID is not ALWAYS reliable

Unstable training
1. Mode collapse
1. Pretty much solved today
  1. by enforcing 1-Lipschitz continuity through Wasserstein loss + gradient penalty, or spectral normalization
  1. by using, instead of the generator weights at a single iteration $\theta_i$ , an average of the weights at different iterations $\hat{\theta} = \frac{1}{n} \sum_{j=1}^n \theta_j$ , which helps to smooth out generation
  1. by progressive growing
    1. see StyleGAN

No density estimation for generated images
1. Cannot do anomaly detection

Inverting is not straightforward
1. Inverting means feeding in an image and finding the associated noise vector
1. Important for image editing
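
The weight-averaging formula $\hat{\theta} = \frac{1}{n} \sum_{j=1}^n \theta_j$ from the list above can be sketched as a uniform average over saved checkpoints. This is a minimal sketch, assuming NumPy and weights stored as dictionaries of arrays; the checkpoint values are made up.

```python
import numpy as np

def average_weights(checkpoints):
    """Uniform average of a list of weight dicts, one dict per saved iteration."""
    names = checkpoints[0].keys()
    return {name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0) for name in names}

# Hypothetical generator weights saved at three different iterations.
checkpoints = [
    {"w": np.array([1.0, 2.0]), "b": np.array([0.0])},
    {"w": np.array([3.0, 2.0]), "b": np.array([1.0])},
    {"w": np.array([2.0, 5.0]), "b": np.array([2.0])},
]
theta_hat = average_weights(checkpoints)   # weights to use for generation
```

In practice an exponential moving average is often used in place of the uniform average so that old checkpoints do not need to be stored; that variant is not shown here.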

Because of those shortcomings, we will look at some alternatives to GANs (each with its own tradeoffs)

Alternatives to GANs

Variational Autoencoders (VAE)

Has two parts: an encoder and a decoder

The encoder takes in an image and translates it into a vector in latent space

The decoder takes in a latent space vector and outputs an image

Training minimizes the divergence between the input and the output; only the decoder is used in production.

Has a density estimation

Operation is invertible

Has more stable training, since the optimization problem of a VAE is easier

but produces lower quality results
1. A lot of work has been put into making the results better

Autoregressive Models

Looks at previous pixels to generate new pixels from a latent representation. It is supervised because it requires some pixels to be present before it can generate the next ones.

FLOW models (Example: Glow)


A newer idea: uses an invertible mapping between the latent space and the generated images

Hybrid Models

e.g. VQ-VAE combines Autoregressive Model with VAE model.

Machine Bias

Machine Bias affects lives and we should be aware of those issues.

Defining Fairness

How to concretely measure how “fair” your model is?

Demographic parity
1. Overall predicted distributions should be the same for different classes

Outcome that is a representation of actual demographics

Equality of odds
1. False positive and false negative rates should be EQUAL across classes
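
The two criteria can be checked per group from labels and predictions. This is a minimal sketch with made-up binary data and hypothetical group names; it assumes every group contains at least one positive and one negative example.

```python
def rate(flags):
    """Fraction of True values in a non-empty list of booleans."""
    return sum(flags) / len(flags)

def group_rates(y_true, y_pred, groups, group):
    """Return (positive rate, false positive rate, false negative rate) for one group."""
    rows = [(t, p) for t, p, g in zip(y_true, y_pred, groups) if g == group]
    positive_rate = rate([p == 1 for _, p in rows])
    false_positive_rate = rate([p == 1 for t, p in rows if t == 0])
    false_negative_rate = rate([p == 0 for t, p in rows if t == 1])
    return positive_rate, false_positive_rate, false_negative_rate

# Hypothetical labels, predictions, and group membership.
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates_a = group_rates(y_true, y_pred, groups, "a")
rates_b = group_rates(y_true, y_pred, groups, "b")
```

In this toy data both groups receive positive predictions at the same rate, so demographic parity holds, yet group "a" has error rates of 0.5 while group "b" has none, so equality of odds fails. The two definitions can disagree.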

Ways bias enters models

training data
1. no variation in data
1. how data is collected
1. diversity of the labellers
1. “correctness” in culture
  1. side of the driver in the car

coders who designed the architecture

StyleGAN


Ability to generate realistic human faces

Goals:

Greater fidelity on high-resolution images

Increased diversity of outputs

More control over image features

The authors tried both W-Loss and the original GAN loss and found that each worked well on different datasets

Style can be any variation in the image, from high level to low level.

Basic Introduction

The intermediate noise vector $w$ is injected multiple times into the StyleGAN generator through a technique called AdaIN (Adaptive Instance Normalization)

Random noise is also added to increase the variation of the generated images

Progressive Growing

Idea: Easier to generate blurry faces and keep growing


Gradually trains the generator from low-res images to high-res images

Progressive Downsampling for Discriminator

Say we want to grow the output of the generator from $4 \times 4$ to $8 \times 8$ . We add an upsampling step after the previous $4 \times 4$ output, and the $8 \times 8$ output starts as 99% plain nearest neighbour (NN) upsampling and 1% the output of a new learned convolution layer, which lets us start learning that layer's parameters. As training goes on, the share given to the learned layer is manually scheduled to grow bigger and bigger, until the plain NN path is thrown away completely.

The share given to the learned (transposed convolution) layer is described by $\alpha$ , and the share given to NN upsampling is $1 - \alpha$
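
The $\alpha$ blend can be sketched as a weighted sum of the two paths. This is a minimal sketch, assuming NumPy and single-channel feature maps; `learned_layer` is a hypothetical stand-in for the new trainable layer, not an actual convolution.

```python
import numpy as np

def upsample_nearest(x):
    """Double the height and width of an (H, W) feature map by repeating each pixel."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def progressive_output(low_res, learned_layer, alpha):
    """Blend plain NN upsampling (weight 1 - alpha) with the learned path (weight alpha)."""
    upsampled = upsample_nearest(low_res)
    return (1.0 - alpha) * upsampled + alpha * learned_layer(upsampled)

low_res = np.arange(16, dtype=float).reshape(4, 4)

def learned_layer(x):
    # Hypothetical stand-in for the learned convolution on the 8 x 8 map.
    return x + 1.0

start = progressive_output(low_res, learned_layer, alpha=0.01)   # mostly plain upsampling
end = progressive_output(low_res, learned_layer, alpha=1.0)      # only the learned layer
```

Raising `alpha` from near 0 to 1 over training fades the new layer in smoothly instead of switching to it abruptly.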

(Pretty much same idea for discriminator)

So Progressive Growing in StyleGAN looks like this in the architecture

Noise Mapping Network

8-layer MLP ⇒ 512 inputs and 512 outputs

Motivation: mapping $z$ to $w$ helps disentangle the noise representation

The features of real data follow their own probability density, while $z$ has a normal prior, so it is hard to map $z$ directly onto the space of features.

The $w$ vector goes into every one of the progressive growing blocks, but its influence differs depending on where it is injected
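
The shape of the mapping network can be sketched as a plain 8-layer MLP. This is an untrained sketch, assuming NumPy; the random initialization and the leaky ReLU activation with slope 0.2 are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def make_mapping_network(dim=512, num_layers=8, seed=0):
    """Create (weight, bias) pairs for an MLP that maps dim inputs to dim outputs."""
    rng = np.random.default_rng(seed)
    # Hypothetical random initialization; a trained network would learn these values.
    return [(rng.standard_normal((dim, dim)) / np.sqrt(dim), np.zeros(dim))
            for _ in range(num_layers)]

def map_z_to_w(z, layers, negative_slope=0.2):
    """Pass a batch of noise vectors z through every layer to get w."""
    h = z
    for weight, bias in layers:
        h = h @ weight + bias
        h = np.where(h > 0, h, negative_slope * h)   # leaky ReLU (assumed activation)
    return h

layers = make_mapping_network()
z = np.random.default_rng(1).standard_normal((4, 512))   # batch of 4 noise vectors
w = map_z_to_w(z, layers)
```

The output `w` has the same size as `z`, but it is no longer forced to follow a normal distribution, which is what gives the network room to disentangle features.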

AdaIN (Adaptive Instance Normalization)

Instance Normalization

output = \frac{x-\mu(x)}{\sigma(x)}


Unlike BatchNorm, which looks at one channel across an entire mini-batch at a time, instance norm looks at one channel of a single sample at a time

AdaIN

Apply adaptive styles using the intermediate noise vector $w$

$w$ is first fed into two fully-connected layers (adaptive, since the weights of those FC layers are learned) to produce $y_s$ and $y_b$ , which stand for scale and bias, and then

AdaIN(x_i,y)=y_{s,i}\frac{x_i-\mu(x_i)}{\sigma(x_i)}+y_{b,i}

The fraction part is basically just instance norm
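
Both formulas can be sketched with NumPy on a batch of feature maps. This is a minimal sketch: the small `eps` for numerical stability, the array shapes, and the random matrices standing in for the two fully-connected layers are all assumptions made for illustration.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """x: (batch, channels, height, width); normalize each channel of each sample."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    std = x.std(axis=(2, 3), keepdims=True)
    return (x - mean) / (std + eps)

def adain(x, y_scale, y_bias):
    """y_scale, y_bias: (batch, channels) style parameters computed from w."""
    return y_scale[:, :, None, None] * instance_norm(x) + y_bias[:, :, None, None]

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 8, 8))   # feature maps: 2 samples, 3 channels
w = rng.standard_normal((2, 16))        # intermediate noise vectors (toy size)
# Hypothetical fully-connected layers mapping w to a per-channel scale and bias.
scale_weights = rng.standard_normal((16, 3))
bias_weights = rng.standard_normal((16, 3))
y_s, y_b = w @ scale_weights, w @ bias_weights

styled = adain(x, y_s, y_b)
```

After AdaIN, each channel of each sample has mean $y_b$ and standard deviation close to $|y_s|$, so the style vector fully determines these per-channel statistics.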

Style Mixing

$w$ is normally injected into all of the progressive growing blocks, but we can also sample different $z$ , and therefore different $w$ , and inject them into different parts of the network

The model sees more diversity during training

We gain control over the granularity of details
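
Style mixing amounts to choosing, per layer, which $w$ to inject. This is a minimal sketch; the strings stand in for actual $w$ vectors, and the `crossover` parameter name is hypothetical.

```python
def mix_styles(w_a, w_b, num_layers, crossover):
    """Per-layer style list: w_a for layers before `crossover`, w_b from there on."""
    return [w_a if layer < crossover else w_b for layer in range(num_layers)]

# Placeholder strings stand in for two different intermediate noise vectors.
styles = mix_styles("w_a", "w_b", num_layers=6, crossover=2)
```

Moving the crossover point earlier or later changes how many layers are controlled by the second vector.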

Stochastic Variation

Adds variation without the mess of controlling two or more $w$ vectors.

Can change very small details, like the direction of a small strand of hair
1. To affect details this fine, the noise has to be injected in a deeper (later) layer
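
Noise injection can be sketched as adding a random map to the feature maps, with a separate scale per channel. This is a minimal sketch, assuming NumPy; the shapes and the scale values are made up, and in a real model the scales would be learned.

```python
import numpy as np

def inject_noise(x, per_channel_scale, rng):
    """Add one random noise map per sample, scaled separately for each channel."""
    batch, channels, height, width = x.shape
    noise = rng.standard_normal((batch, 1, height, width))
    return x + per_channel_scale[None, :, None, None] * noise

rng = np.random.default_rng(0)
features = np.zeros((1, 3, 4, 4))     # toy feature maps: 1 sample, 3 channels
scale = np.array([0.0, 0.1, 1.0])     # hypothetical scales, one per channel
noisy = inject_noise(features, scale, rng)
```

A channel with scale 0 is left untouched, while larger scales let the same noise map perturb a channel more strongly.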