DLS 5 : Sequence Models

Some notes on the DL Specialization course, by Yunhao Cao (Github @ToiletCommander)

👉
Note: This note covers weeks 1-4 (version as of 2022/3/25)

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Acknowledgements: Some of the resources (images, animations) are taken from:

  1. Andrew Ng and DeepLearning.AI’s Deep Learning Specialization Course Track
  1. articles on towardsdatascience.com
  1. images from Google Image Search (from other online sources)

Since this article is not for commercial use, please let me know if you would like your resource taken down by creating an issue on my GitHub repository.

Last updated: 2022/3/25 13:25 PST

Why Sequence Models

So now that we’ve learned enough about fixed-size input, fixed-size output neural nets, it’s time to think about what happens when the input and output sizes don’t match.

Standard neural architectures only accept fixed-size inputs and produce fixed-size outputs; their structure is built around fixed-shape data.

But consider transcribing an audio clip into a sequence of text, where both the input and the output can have a different length each time.

So what now?

The solution is to use sequence models.

Notation

Note that we are feeding fixed-length structured data into the same model recurrently.

x^{<t>} denotes the t-th sequence input, t starting from 1

y^{<t>} and \hat{y}^{<t>} denote the t-th sequence label and output, respectively, t starting from 1

T_x denotes the length of the input sequence, and T_y denotes the length of the output sequence

x^{(i)<t>} denotes the t-th element in the input sequence of the i-th training sample

y^{(i)<t>} denotes the t-th element in the output sequence of the i-th training sample

T_x^{(i)} denotes the length of the input sequence of the i-th training sample, and T_y^{(i)} denotes the length of the output sequence of the i-th training sample

Recurrent Neural Network (RNN)

The recurrent neural network differs from a normal NN structure in that it also accepts a w_{aa} \cdot a^{<t-1>} input aside from x^{<t>} ⇒ so that information from previous inputs is passed into the network.

However, there are some shortcomings of this structure:

  1. The influence of inputs from very early in the sequence gradually decays as we progress through the sequence.
  1. The NN doesn’t have any information about what happens later in the sequence.
    1. The Bidirectional Recurrent Neural Network (BRNN), introduced later, addresses this.

Forward Propagation

a^{<0>} = \vec{0}

a^{<t>}=g(w_{aa}a^{<t-1>}+w_{ax}x^{<t>}+b_a)
\hat{y}^{<t>} = g(w_{ya}a^{<t>}+b_{y})

To simplify this activation calculation,

a^{<t>}=g(w_{a}\begin{bmatrix} \vec{a}^{<t-1>} \\ \vec{x}^{<t>} \end{bmatrix} +b_a)

where...

w_a = \begin{bmatrix} w_{aa} & w_{ax} \end{bmatrix}
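
As a minimal numpy sketch of this forward step (the parameter names Waa, Wax, Wya, ba, by mirror the formulas above; the tanh/softmax choices and shapes are assumptions, not the course’s starter code):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=0, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def rnn_step(a_prev, x_t, Waa, Wax, Wya, ba, by):
    """One RNN timestep: returns the new hidden state and the output."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)   # a<t> = g(Waa a<t-1> + Wax x<t> + ba)
    y_hat_t = softmax(Wya @ a_t + by)              # y_hat<t> = g(Wya a<t> + by)
    return a_t, y_hat_t

def rnn_step_stacked(a_prev, x_t, Wa, ba):
    """Same update using the stacked form Wa = [Waa | Wax] applied to [a<t-1>; x<t>]."""
    return np.tanh(Wa @ np.concatenate([a_prev, x_t]) + ba)
```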

Different Architectures (or Forms) of RNN

Music generation with a one-to-many architecture
Machine translation with a many-to-many architecture (where T_x and T_y are not equal)

Sequence Generation & NLP with RNN

We assign each vocabulary word an index, so we would have a max index of \text{vocab\_num} - 1. Each time we pass in a word, we pass in a one-hot vector \vec{v} representing the word, with the i-th element being 1 and all other elements of the vector being zero, where i is the index of the word.

Note that we will introduce other ways to represent words later.

The output \hat{y}^{<t>} comes from a softmax output layer and is also a \text{vocab\_num}-long vector. Each index of the vector represents the likelihood that the corresponding word comes next.

Training

👉
Note that the first x^{<1>} and a^{<0>} would be set to 0 if we are training to generate sequences of text. Why? During training we only have the y^{<t>} labels, and x^{<t>} would be y^{<t-1>} in this case. See below.
Sequence Generation training with RNN model.

Sequence Generation Loss Function:

L^{<t>}(\hat{\vec{y}}^{<t>},\vec{y}^{<t>})=-\sum_{i}\vec{y}_i^{<t>}\log(\hat{\vec{y}}_i^{<t>}) \\ L=\sum_t L^{<t>}(\hat{\vec{y}}^{<t>},\vec{y}^{<t>})

Generating Sequences

As we did before, pass in a^{<0>} and x^{<1>} as zero vectors.

Each \hat y^{<t>} outputs the probability of each word (a vector representing the probability of every word in the vocabulary) given the inputs x^{<1>}, \dots, x^{<t>}. We sample a word index from the probability distribution given by \hat y^{<t>} and feed that word back in as x^{<t+1>}. We stop when we hit the index of <EOS> or reach a length limit.
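
A rough sketch of this sampling loop, assuming a trained single-layer RNN language model with the stacked parameters Wa, ba, Wya, by from above (names, shapes, and hidden size n_a are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample(Wa, ba, Wya, by, vocab_size, n_a, eos_idx, max_len=50, rng=None):
    """Generate a sequence of word indices from a trained RNN language model."""
    rng = rng or np.random.default_rng()
    a = np.zeros(n_a)                    # a<0> = 0
    x = np.zeros(vocab_size)             # x<1> = 0
    indices = []
    while len(indices) < max_len:
        a = np.tanh(Wa @ np.concatenate([a, x]) + ba)   # recurrent update
        y_hat = softmax(Wya @ a + by)                   # distribution over the next word
        idx = rng.choice(vocab_size, p=y_hat)           # sample from y_hat<t>
        indices.append(int(idx))
        if idx == eos_idx:                              # stop at <EOS>
            break
        x = np.zeros(vocab_size)                        # feed the sampled word back in
        x[idx] = 1.0                                    # as the one-hot x<t+1>
    return indices
```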

Vanishing Gradient with RNN

Think about the following two sentences:

  1. The cat, which already ate ....., was full.
  1. The cats, which already ate ....., were full.

The problem is that the original RNN is not good at capturing long-term dependencies. See below: it is hard for the gradient to propagate through so many connections.

👉
It seems like we don’t have a solution here: an RNN is mainly influenced by local, nearby inputs. We’ll introduce other models later to address this problem.
🤧
Also know that exploding gradients (parameters becoming NaN) can happen in RNNs as well. But vanishing gradients are generally the bigger problem, since we can use gradient clipping to solve exploding gradients.

Bidirectional Sequence Model (Bidirectional RNN/BRNN)

Note: Works on any sequence unit like GRU, LSTM, RNN, etc.

📢
BRNN with LSTM units is commonly used.
A BRNN needs the entire input sequence before it can produce output, so if used for speech recognition, we need the person to stop talking before we can start processing the data.

Deep RNN

Note that we can stack RNN, GRU, or LSTM units layer by layer (from bottom to top) to form bigger neural networks. We can also add dense (FC) layers after those sequence-model layers.

Gated Recurrent Unit (GRU)

Structure of the original RNN
Structure of the GRU; note that the purple block represents the update logic of c^{<t>}

We add a new memory cell c^{<t>} here (although c^{<t>} = a^{<t>} in the GRU, they will be different later in the LSTM).

\text{Relevance Gate: } \Gamma_r^{<t>} \in [0,1],\ \Gamma_r^{<t>}=\sigma( W_r \begin{bmatrix} c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_r )
\text{Candidate for } c^{<t>}\text{: } \tilde{c}^{<t>} = g( W_c \begin{bmatrix} \Gamma_r^{<t>} * c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_c ) \\ \text{Note: Andrew Ng used } g(x)=\tanh(x)
\text{Update Gate: } \Gamma_u^{<t>} \in [0,1],\ \Gamma_u^{<t>}=\sigma( W_u \begin{bmatrix} c^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_u )
\text{Memory Cell: } c^{<t>} = a^{<t>} =\Gamma_u^{<t>}*\tilde{c}^{<t>}+(1-\Gamma_u^{<t>})*c^{<t-1>}
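
A minimal sketch of one GRU step following these formulas (tanh and sigmoid as in the lecture; parameter names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(c_prev, x_t, Wr, br, Wu, bu, Wc, bc):
    """One GRU timestep; returns c<t> (which also serves as a<t> in the GRU)."""
    concat = np.concatenate([c_prev, x_t])
    gamma_r = sigmoid(Wr @ concat + br)                                   # relevance gate
    gamma_u = sigmoid(Wu @ concat + bu)                                   # update gate
    c_tilde = np.tanh(Wc @ np.concatenate([gamma_r * c_prev, x_t]) + bc)  # candidate for c<t>
    c_t = gamma_u * c_tilde + (1.0 - gamma_u) * c_prev                    # memory cell update
    return c_t
```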

Long Short Term Memory(LSTM)

Note that c^{<t>} and a^{<t>} now represent different things.
\text{Update Gate: } \Gamma_u^{<t>} \in [0,1],\ \Gamma_u^{<t>}=\sigma( W_u \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_u )
\text{Forget Gate: } \Gamma_f^{<t>} \in [0,1],\ \Gamma_f^{<t>}=\sigma( W_f \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_f )
\text{Output Gate: } \Gamma_o^{<t>} \in [0,1],\ \Gamma_o^{<t>}=\sigma( W_o \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_o )
\text{Candidate for } c^{<t>}\text{: } \tilde{c}^{<t>} = g_c( W_c \begin{bmatrix} a^{<t-1>} \\ x^{<t>} \end{bmatrix} + b_c )
\text{Memory Cell: } c^{<t>} = \Gamma_u^{<t>}*\tilde{c}^{<t>}+\Gamma_f^{<t>}*c^{<t-1>}
\text{Activation: } a^{<t>}=\Gamma_o^{<t>}*g_a(c^{<t>})

Note: Andrew Ng used g_a(x)=g_c(x)=\tanh(x)
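
And the corresponding LSTM step, again a sketch with illustrative parameter names and g_a = g_c = tanh:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, Wu, bu, Wf, bf, Wo, bo, Wc, bc):
    """One LSTM timestep; returns the new activation a<t> and memory cell c<t>."""
    concat = np.concatenate([a_prev, x_t])
    gamma_u = sigmoid(Wu @ concat + bu)          # update gate
    gamma_f = sigmoid(Wf @ concat + bf)          # forget gate
    gamma_o = sigmoid(Wo @ concat + bo)          # output gate
    c_tilde = np.tanh(Wc @ concat + bc)          # candidate memory
    c_t = gamma_u * c_tilde + gamma_f * c_prev   # memory cell
    a_t = gamma_o * np.tanh(c_t)                 # activation
    return a_t, c_t
```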

Word Embeddings

Remember early in the Sequence Generation RNN models we used one-hot vectors to represent our input words? We can instead use word embeddings to represent our words.

Our input is still a vector, but this vector now represents learned features of the word rather than a one-hot index.

In this example, gender, royal, age, and food each represent one element of the word-embedding vector of a particular word

Note that we no longer feed the one-hot index of the word as input. We’ll use word-embedding vectors instead.

📢
One motivation: if I replace a word in a sentence (and the sentence is still valid), word embeddings help the model quickly find analogies of a word without having been trained on that specific variant of the sentence. Also, word-embedding vectors can be learned from other (larger) datasets, so transfer learning is possible.
⚠️
Note that the “features” here are just examples. Usually when we learn word embeddings the feature space won’t necessarily be interpretable, since we can choose any orthonormal vectors as our basis and we might end up with a “rotated” version of these features as our basis.

Representation

We will use an embedding matrix E in which each column \vec e_i represents the embedding of one word. This way, when we take the product of E and \vec o_i, with \vec o_i representing the one-hot vector of word i, E\vec o_i gives \vec e_i, the word embedding for that specific word.

However, in practice we would never use this matrix multiplication to find the embedding of a word (to avoid the computational overhead). Instead, embeddings are stored as rows for easy index retrieval.
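
A small illustration of why row storage makes the lookup cheap (the matrix E_rows and the sizes here are made up for the example):

```python
import numpy as np

vocab_size, emb_dim = 10_000, 50
E_rows = np.random.randn(vocab_size, emb_dim)   # embeddings stored as rows (illustrative)

word_index = 1234
e_word = E_rows[word_index]                     # O(1) row lookup...

o = np.zeros(vocab_size)
o[word_index] = 1.0
assert np.allclose(e_word, E_rows.T @ o)        # ...equivalent to E @ o with E = E_rows.T
```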

Calculating Analogies

Let \vec{v}_{man} be the vector representing the word “man”. We want to ask the following question:

Man to King is as Woman to ?

The answer to the question mark is of course queen.

To find the answer, we can use embeddings to find the closest word, represented by v?\vec{v}_{?}, such that

dist(\vec{v}_?,\ \vec{v}_{woman}+(\vec{v}_{king}-\vec{v}_{man}))

is minimized.

There are lots of ways to measure the distance (or similarity); we will introduce two.

\text{cosine similarity: } sim(\vec u,\vec v)=\frac{\vec{u}^{\top}\vec{v}}{||\vec u||_{2}\times||\vec v||_{2}}
\text{squared distance: } dist(\vec u, \vec v) = ||\vec u - \vec v||^2
Note that the first formula is a similarity rather than a distance: with cosine similarity we maximize it, whereas with the squared distance we minimize it.
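
A sketch of the analogy search using cosine similarity (the embeddings dict and function names are hypothetical, not a specific library API):

```python
import numpy as np

def cosine_similarity(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def find_analogy(word_a, word_b, word_c, embeddings):
    """word_a is to word_b as word_c is to ? — embeddings is a {word: vector} dict."""
    target = embeddings[word_b] - embeddings[word_a] + embeddings[word_c]
    best_word, best_sim = None, -np.inf
    for word, vec in embeddings.items():
        if word in (word_a, word_b, word_c):   # skip the query words themselves
            continue
        sim = cosine_similarity(vec, target)
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word
```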

Methods for Learning Word Embeddings

To learn word embeddings, instead of feeding pre-computed embeddings as input, we actually add an embedding layer to the model so that the algorithm can optimize it. The input reverts back to one-hot vectors (stacked together), which we multiply by the embedding matrix E inside the embedding layer. Notice that E itself is a learned parameter.

Dumb Method

Learning Goal: To provide great embeddings for words.

Learning Task: Predict the probability of a word given some context words (like the 5 words before, the 5 words after, the 10 words before and after, the last single word, a single nearby word, etc.)

⚠️
Usually a single nearby word works best for learning word embeddings; this is called skip-gram.

Model: FC layers connected with a softmax layer. No RNNs are used.

Loss: L(y,\hat{y})=-\sum_{i=1}^{vocab\_size} y_i \log(\hat{y_i})

To sample the context c,

If we do this uniformly at random, we will see words like “the, of, and, to” appearing much more frequently than other words.

😉
Use heuristics to modify the probability that a particular word is chosen as the context word, so that very common words are sampled less often and rarer words are sampled more often.

Word2Vec

Use a context word and a target word to train a softmax output neural network.

Negative Sampling

For softmax classification, we used the following formula.

\text{probability of target } t \text{ given context } c\text{: } P(t|c) = \frac{e^{\theta_t^{\top}e_c}}{\sum_{j=1}^{vocab\_size}e^{\theta_j^{\top}e_c}}

But this will be very computationally heavy when vocab\_size gets big.

We can use a hierarchical softmax layer or use negative sampling.

With negative sampling, instead of outputting a softmax vector, we use a sigmoid activation function to predict the likelihood of two words being a context–target pair.

Generate positive samples by sampling word pairs from real sentences, and generate negative samples by randomly choosing words and placing them together.

P(y=1|c,t)=\sigma(\theta_t^{\top}e_c)
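
One possible way to write the per-example negative-sampling loss, assuming k negative words are sampled for each positive (context, target) pair (vector names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def negative_sampling_loss(theta_t, e_c, theta_negs):
    """Binary loss for one positive (context, target) pair plus k sampled negatives.
    theta_t: parameter vector of the true target word; e_c: context embedding;
    theta_negs: list of parameter vectors of the k negative words."""
    loss = -np.log(sigmoid(theta_t @ e_c))        # positive pair: push P(y=1|c,t) up
    for theta_j in theta_negs:
        loss -= np.log(sigmoid(-theta_j @ e_c))   # negative pairs: push P(y=1|c,j) down
    return loss
```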

Glove Method

Global Vectors for Word Representation

Let’s say that X_{ij} denotes the number of times in the corpus that word j appears in the context of word i.

Depending on how the context is defined, X_{ij} might be equal to X_{ji}

The GloVe method minimizes the following objective...

L(\theta) = \sum_{i,j} f(i,j)\cdot(\theta_i^{\top}e_j+b_i+b_j^{'}-\ln(X_{ij}))^2

Here f(i,j) is a weighting function such that

f(i,j)=\begin{cases} 0 &\text{if } X_{ij}=0 \\ \text{weighting constant} &\text{otherwise} \end{cases}
😉
Since we have a weighting function here, the weights of common words such as “the”, “of”, “for” can be reduced, while the weights of less common words can be increased, so that we get more appropriate word embeddings.

Notice here that \theta_i and e_i play the same role mathematically; what we usually do when we need our actual “trained embeddings” is take e_i^*=\frac{e_i+\theta_i}{2}.
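
A naive sketch of the GloVe objective above (looping over the co-occurrence matrix; the weighting function f shown is a common choice and an assumption, not something specified in these notes):

```python
import numpy as np

def glove_loss(theta, e, b, b_prime, X, f):
    """theta, e: (vocab, d) matrices; b, b_prime: (vocab,) bias vectors;
    X: co-occurrence counts; f: weighting function of X_ij."""
    loss = 0.0
    vocab = X.shape[0]
    for i in range(vocab):
        for j in range(vocab):
            if X[i, j] == 0:          # f is 0 when X_ij = 0, so skip (and avoid log(0))
                continue
            diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
            loss += f(X[i, j]) * diff ** 2
    return loss

def f(x, x_max=100.0, alpha=0.75):
    """A commonly used weighting function (assumed here for illustration)."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```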

Neutralizing(Debiasing) Word Embeddings

The texts used to train word embeddings might contain stereotypical assumptions (biases), so we want to neutralize them.

How?

Say for example that we want to neutralize words that involve gender bias.

  1. Identify the bias direction: first we need to calculate the “bias vector” v_{bias}, the direction (basis vector) representing the bias.
    1. We take a few word pairs representing gender and average their differences, for example averaging over e_{boy}-e_{girl}, e_{man}-e_{woman}, e_{king}-e_{queen}, etc.
  1. Neutralize: then, for every word that is not definitional, for example e_{k}, we project e_k onto v_{bias} and subtract that projection from e_k (a quick sketch follows after this list).
    1. An example would be occupations; there’s no gender inherently involved.
e_k^* = e_k - proj_{v_{bias}}e_k = e_k - \frac{v_{bias} \cdot e_k}{||v_{bias}||^2}v_{bias}
  1. Equalize: for words that come in gendered pairs, we want to make sure that their only difference lies along the gender (bias) direction.
Taken from Andrew Ng.
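
A minimal sketch of the bias-direction and neutralize steps (the word pairs and the {word: vector} dict layout are illustrative):

```python
import numpy as np

def bias_direction(pairs, embeddings):
    """Average the differences of a few gendered pairs, e.g. [("man", "woman"), ...]."""
    diffs = [embeddings[a] - embeddings[b] for a, b in pairs]
    return np.mean(diffs, axis=0)

def neutralize(e_k, v_bias):
    """Remove the component of word vector e_k that lies along the bias direction v_bias."""
    projection = (v_bias @ e_k) / (v_bias @ v_bias) * v_bias   # proj_{v_bias}(e_k)
    return e_k - projection
```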

Translation and Voice Recognition

🤔
Think about translation or voice recognition, where the length of the input and the length of the output are not the same... what do we do?

Basic Encoder Decoder Model

This is the basic encoder-decoder model, where sequences are first fed into the model, processed, and then passed as an intermediate activation state to the second part of the model. This model works for small or short translation tasks, but due to information loss it will not perform well on long inputs.
🤔
Note that the encoder structure and the decoder structure don’t have to be the same.

Picking the most likely word or sentence (Beam Search)

In the above model, during the decoder phase, element i of the output vector \hat{y}^{<t>} gives the probability \mathbb{P}(y^{<t>}=i \mid x^{<1>},x^{<2>},\dots,x^{<n>},y^{<1>},y^{<2>},\dots,y^{<t-1>})

Since we want the most likely sentence, we want to maximize the likelihood of the whole output sequence. If our choice of \hat{y} at each step k is the i_k-th element of the output, then we want to maximize \prod_{k=1}^{t}\hat{y}_{i_k}^{<k>}

We could find the exact maximum by searching over every possible output sentence, but that is far too computationally expensive, and greedy search (picking the single most likely word at each step) does not generally find the most likely sentence. So instead we use a heuristic search algorithm called beam search.

Beam search has a parameter, the beam width \beta, that limits how many candidates it can keep. It is basically a BFS that holds only a limited number of candidate prefixes each time we propagate through the sequence.

Beam Search with Beta = 2
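
A toy beam-search sketch, assuming a step_fn(state, last_token) that returns log-probabilities over the next token plus a new decoder state (this interface is an assumption for illustration, not the course’s code):

```python
import numpy as np

def beam_search(step_fn, start_state, beam_width, max_len, eos_idx):
    """Keep the beam_width best partial sentences by total log probability."""
    # each beam entry: (cumulative log prob, token list, decoder state)
    beams = [(0.0, [], start_state)]
    for _ in range(max_len):
        candidates = []
        for logp, tokens, state in beams:
            if tokens and tokens[-1] == eos_idx:             # finished hypothesis, keep as is
                candidates.append((logp, tokens, state))
                continue
            last = tokens[-1] if tokens else None            # None on the first step
            log_probs, new_state = step_fn(state, last)
            for tok in np.argsort(log_probs)[-beam_width:]:  # expand only the top candidates
                candidates.append((logp + log_probs[tok], tokens + [int(tok)], new_state))
        beams = sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
    return max(beams, key=lambda c: c[0])[1]                 # best token sequence found
```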

Error Analysis in Beam Search & NN

Note that when we employ beam search, it might be hard to tell whether an error comes from the neural net or from the search algorithm. If we selected \hat{y} while the optimal solution should be y^*, it is usually useful to inspect the output of the neural net and compute P(\hat{y}) and P(y^*) accordingly.

If we compare them and find that the neural net outputs P(y^*) > P(\hat{y}), then we know beam search didn’t find the better solution even though the neural net scored it higher, so we might need to increase the beam width \beta.

Otherwise, P(y^*) \le P(\hat{y}), and we need to improve the NN, since the NN is not giving the optimal solution a higher probability.

Bleu Score

🤔
That’s a metric used to evaluate the quality of a translation or speech-recognition result.

BLEU stands for Bilingual Evaluation Understudy.

Basic idea: given an input x and some reference outputs y, we can compare our NN output with the reference y’s and measure how many words (n-grams) overlap with the references; this gives us a rough measure of the quality of our output.

Note that here we have multiple y’s because different people might have different translations of the same text.

Here we will first define some terms:

n-gram: a single word (n=1) or a sequence of n consecutive words (n>1)

Suppose a sentence has m words; with a stride of 1 we can form a total of m-n+1 n-grams.

count: how many times a unique n-gram appears in our output.
count-clip: the count of that n-gram, clipped at the maximum number of times it appears in any single reference.
\text{Bleu score on } k\text{-grams: } P_k=\frac{\sum_{k\text{-grams}} \text{Count}_{clip}(k\text{-gram})}{\sum_{k\text{-grams}}\text{Count}(k\text{-gram})}
\text{Combined Bleu score (with hyperparameter } n \text{): } P=BP\cdot \exp(\frac{1}{n}\sum_{k=1}^{n}\log P_k)
\text{Bleu (brevity) penalty: } BP=\begin{cases} 1 &\text{if output length > reference output length} \\ \exp(1-\frac{\text{reference output length}}{\text{output length}}) &\text{otherwise} \end{cases}
📎
Since the n-gram precision tends to be higher for shorter outputs, we want to penalize outputs that are too short; that is what the brevity penalty BP does.
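
A toy BLEU computation following the formulas above (real implementations, e.g. nltk’s corpus_bleu, handle reference-length matching and edge cases more carefully; using the shortest reference length here is a simplification):

```python
from collections import Counter
import math

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def modified_precision(candidate, references, n):
    """P_n with clipped counts, following the formula above."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, cnt in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], cnt)
    clipped = {g: min(c, max_ref_counts[g]) for g, c in cand_counts.items()}
    return sum(clipped.values()) / max(sum(cand_counts.values()), 1)

def bleu(candidate, references, n=4):
    ref_len = min(len(r) for r in references)        # simplification: shortest reference
    bp = 1.0 if len(candidate) > ref_len else math.exp(1 - ref_len / max(len(candidate), 1))
    precisions = [modified_precision(candidate, references, k) for k in range(1, n + 1)]
    if min(precisions) == 0:                         # avoid log(0) in this toy version
        return 0.0
    return bp * math.exp(sum(math.log(p) for p in precisions) / n)

# bleu("the cat is on the mat".split(), ["there is a cat on the mat".split()])
```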

Attention Model

We talked about how the encoder-decoder model cannot handle long sentences well; what do we do?

A lot of the time a translation only requires attention to specific areas of the source text, so we will introduce this intuition into our model.

A full attention model example

Attention model is actually split into three major parts.

The first major part is a pre-attention BRNN (you can use GRU or LSTM units instead of a traditional RNN here). Notice that it does not output \hat{y}^{<t>}; we name its outputs a^{<t>}, in which

a^{<t>}=\begin{bmatrix} \overrightarrow{a}^{<t>} \\ \overleftarrow{a}^{<t>} \end{bmatrix}

such that a^{<t>} is the concatenation of the forward-direction and backward-direction activations.

Then, using the activations a^{<t'>}, we can compute a context c^{<t>} for each output word.

How we compute the context c^{<t>} from a^{<t'>} and other variables can vary, but here is one example below.

Formulas:

\text{normalization: } \forall t,\ \sum_{t'=1}^{T_x}\alpha^{<t,t'>} = 1
c^{<t>}=\sum_{t'}\alpha^{<t,t'>}a^{<t'>}

Here \alpha^{<t,t'>} determines how much attention input word x^{<t'>} gets when computing the output y^{<t>}

e is calculated through a small dense layer
\alpha^{<t,t'>}=\frac{\exp(e^{<t,t'>})}{\sum_{k=1}^{T_x}\exp(e^{<t,k>})}, i.e. the softmax of e^{<t,t'>} over t'
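
A sketch of computing one context vector with these attention weights (the dense_fn energy function stands in for the small learned dense layer and is an assumption):

```python
import numpy as np

def attention_context(a, s_prev, dense_fn):
    """Compute one context vector c<t> from all pre-attention activations.
    a: array of shape (Tx, n_a) holding the a<t'>; s_prev: previous post-attention state;
    dense_fn(s_prev, a_tprime) -> scalar energy e<t,t'>."""
    e = np.array([dense_fn(s_prev, a_tp) for a_tp in a])   # energies e<t,t'>
    alpha = np.exp(e - e.max())
    alpha = alpha / alpha.sum()                            # softmax -> attention weights
    context = (alpha[:, None] * a).sum(axis=0)             # c<t> = sum_t' alpha<t,t'> a<t'>
    return context, alpha
```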

Note that in the example context calculation below, the context is calculated slightly differently.

Example context NN, notice that the structure differs a bit from the formula.

A context model example; here s^{<t-1>} is an activation output of our post-attention model, with s^{<0>} = 0

Once the context is computed, we can use the context c^{<t>} combined with a post-attention RNN to generate the output.

We can see through this visualization that the attention weights \alpha tend to be high at the input words that are more important to the output word.

Speech Recognition

In speech recognition we feed a lot of data into the model, and the number of output words is almost always much smaller than the number of audio input frames. Also, we usually only need nearby sounds to know which letter or word the person is saying. So it seems like we can just use a BRNN without the attention model.

At each timestep we can output a letter or a “_”; here “_” doesn’t mean a space, it just marks a separation (a blank, as in CTC), and we have another symbol to represent an actual space.

So we can turn the output “wwwww_______oo_rrrrrrrrrrrrrrrrr_____d” into “word” by collapsing repeated characters that are not separated by a blank and then removing the blanks.
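
A tiny helper that performs this collapsing (the blank symbol “_” follows the notes; this mirrors the CTC-style collapse rule):

```python
def collapse_output(chars, blank="_"):
    """Collapse repeated characters, then remove blanks: "ww__oo_rr__d" -> "word"."""
    collapsed = []
    prev = None
    for ch in chars:
        if ch != prev:                 # drop consecutive repeats of the same symbol
            collapsed.append(ch)
        prev = ch
    return "".join(c for c in collapsed if c != blank)

assert collapse_output("wwwww_______oo_rrrrrrrrrrrrrrrrr_____d") == "word"
```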

Transformer Network

Idea: Unlike the RNN, borrow ideas from CNNs (processing the whole sequence in parallel) + attention.

Self-Attention (“one-head”) Value

A^{<t>}(\vec{q}^{<t>},K,V)=\sum_{i} \frac{\exp(\vec{q}^{<t>}\cdot \vec{k}^{<i>})}{\sum_{j}\exp(\vec{q}^{<t>} \cdot \vec{k}^{<j>})}\vec{v}^{<i>}

Where q, K, V stand for query, key, and value, respectively ⇒ sounds like a database search, huh? We search through the keys of the inputs with a single query, and whichever key best matches the query (has the biggest inner product) has the most influence, i.e. its value contributes most to our search result.

For each input x^{<t>}, there is a corresponding q^{<t>}, k^{<t>}, v^{<t>}, each computed as \vec{q}^{<t>}=W^q\vec{x}^{<t>}, \vec{k}^{<t>}=W^k\vec{x}^{<t>}, \vec{v}^{<t>}=W^v\vec{x}^{<t>}, where W^q, W^k, W^v are learned parameter matrices.

A(Q,K,V)=softmax(\frac{QK^{\top}}{\sqrt{d_k}})V

Here d_k is just a scalar (the dimension of the key vectors); dividing by \sqrt{d_k} makes sure the inner products don’t explode.

Note: Q = \begin{bmatrix} \vec{q}^{<1>} & \cdots & \vec{q}^{<T_x>} \end{bmatrix}, and the same logic applies to K and V.
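
A numpy sketch of this scaled dot-product attention; note that here Q, K, V store the q^{<t>}, k^{<t>}, v^{<t>} vectors as rows (a transposed layout relative to the column convention above):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v); rows are the per-position vectors."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # (seq_len, seq_len) inner products
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                                        # weighted sum of the values
```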

Multi-head Attention

It turns out that after computing Q, K, V, instead of directly computing a single one-head attention value, we can apply different learned parameter matrices to Q, K, V to get several different sets of queries, keys, and values.

There is a hyperparameter h that determines how many heads are computed

The i-th head is computed as A(W_i^Q Q, W_i^K K, W_i^V V), and the overall attention output is the concatenation of all the heads’ attention outputs.
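
Continuing the sketch above (reusing scaled_dot_product_attention), multi-head attention can be written by projecting Q, K, V per head and concatenating; the optional output projection WO is an assumption:

```python
import numpy as np

def multi_head_attention(Q, K, V, WQ, WK, WV, WO=None):
    """WQ, WK, WV are lists of h per-head projection matrices (illustrative shapes).
    Each head runs the scaled dot-product attention above; heads are then concatenated."""
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(WQ, WK, WV):
        heads.append(scaled_dot_product_attention(Q @ Wq_i, K @ Wk_i, V @ Wv_i))
    concat = np.concatenate(heads, axis=-1)          # concatenation of all head outputs
    return concat if WO is None else concat @ WO     # optional final linear projection
```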

Transformer Model

Note: The first multi-head attention in the decoder is actually a masked multi-head attention during training: the mask hides future positions, because we want to pretend that all of our “predicted words” so far are correct while training to predict the next word.
Note: The input to the decoder is the sequence \hat{y}^{<0>},\hat{y}^{<1>}, \dots, \hat{y}^{<t-1>} predicted so far, where \hat{y}^{<0>} = \text{<SOS>}. Like any other input to the transformer network, this input is limited in length; if we have n+1 predicted words while the limit is n, we leave out the first predicted word.
Note: The Add & Norm layer works like a skip (residual) connection plus normalization: Add means summing the two vectors (so that vanishing gradients are less of a problem during back-prop), and after the summation the result is layer-normalized.
⚠️
The add sign at the start of both the encoder and the decoder means that we add a position encoding PE to the input, in order to provide position information to the model. The position encoding PE has the same dimension as the input vectors and is added element-wise to the input vectors; the result is our new x. The formula for PE is given below.
PE_{(t,2i)}=\sin(\frac{t}{10000^{\frac{2i}{d}}})
PE_{(t,2i+1)}=\cos(\frac{t}{10000^{\frac{2i}{d}}})

Here t denotes the position of the word in the sentence (starting at 1), and i denotes the index of the element in the input vector (of dimension d).
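
A small sketch that builds the PE matrix from these formulas (assumes an even dimension d; t starts at 1 as in the notes):

```python
import numpy as np

def positional_encoding(max_len, d):
    """PE[t, 2i] = sin(t / 10000^(2i/d)), PE[t, 2i+1] = cos(t / 10000^(2i/d))."""
    PE = np.zeros((max_len, d))
    positions = np.arange(1, max_len + 1)[:, None]   # t = 1, 2, ..., max_len
    div = 10000 ** (np.arange(0, d, 2) / d)          # 10000^(2i/d) for i = 0, 1, ...
    PE[:, 0::2] = np.sin(positions / div)            # even indices get sin
    PE[:, 1::2] = np.cos(positions / div)            # odd indices get cos (d assumed even)
    return PE
```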

🙄
Note that unlike an RNN, the transformer takes inputs of limited, fixed length. If your input exceeds that length, you can split it into multiple input arrays. During training we iterate through these input arrays and feed one at a time, but the predicted words \hat{y} are continuously fed into the decoder.