Transformer (Attention Is All You Need)
Autoregressive LM (GPT) vs Autoencoding LM (BERT)
Autoregressive LM: causal language model; predicts each token from the tokens to its left (see the sketch below)
Autoencoding LM: masked language model; predicts masked-out tokens from context on both sides
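A minimal NumPy sketch of the two objectives, not actual GPT/BERT training code: the causal LM uses a lower-triangular mask so position i only sees positions 0..i, while the masked LM corrupts random tokens (the MASK_ID and 15% rate here are illustrative) and predicts the originals.

```python
import numpy as np

rng = np.random.default_rng(0)
token_ids = np.array([5, 12, 7, 3, 9])           # toy token id sequence
seq_len = len(token_ids)

# Causal LM (GPT-style): lower-triangular mask, position i attends to 0..i only
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(causal_mask.astype(int))

# Masked LM (BERT-style): replace ~15% of tokens with a [MASK] id,
# then train the model to predict the original tokens at those positions
MASK_ID = 0                                      # hypothetical [MASK] token id
mlm_positions = rng.random(seq_len) < 0.15
corrupted = np.where(mlm_positions, MASK_ID, token_ids)
print(corrupted, mlm_positions)
```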
Transformer Architecture
Tokenizing vs Embedding vs Encoding [1]
- Tokenizing: converts raw text into token indices (IDs)
- Embedding: converts token indices into dense vectors via a lookup table
- Encoding: converts the embedded vectors into a contextualized sentence matrix (see the toy sketch after this list)
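A toy NumPy sketch of the three steps; the vocabulary, random embedding table, and the mean-mixing "encoder" are illustrative stand-ins, not a real tokenizer or a real Transformer encoder.

```python
import numpy as np

# Tokenizing: text -> token indices (hypothetical toy vocabulary)
vocab = {"attention": 0, "is": 1, "all": 2, "you": 3, "need": 4}
text = "attention is all you need"
token_ids = [vocab[w] for w in text.split()]      # [0, 1, 2, 3, 4]

# Embedding: token indices -> vectors via a lookup table (random here)
d_model = 8
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), d_model))
embedded = embedding_table[token_ids]             # shape (5, d_model)

# Encoding: embedded vectors -> contextualized sentence matrix.
# A real Transformer encoder does this with self-attention + FFN layers;
# here each vector is simply mixed with the sequence mean as a stand-in.
encoded = 0.5 * embedded + 0.5 * embedded.mean(axis=0, keepdims=True)
print(encoded.shape)                              # (5, 8): one vector per token
```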
Positional Encoding
Positional encoding describes the location or position of a token in a sequence so that each position is assigned a unique representation. It gives the model access to word-order information, which is needed because self-attention by itself is order-agnostic.
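A sketch of the sinusoidal positional encoding from the paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)); the max_len and d_model values below are arbitrary.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    positions = np.arange(max_len)[:, None]                    # (max_len, 1)
    div_term = 10000 ** (np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_term)
    pe[:, 1::2] = np.cos(positions / div_term)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16): each position gets a unique vector added to its embedding
```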
Full Architecture of Transformer
Connection between encoder and decoder: the encoder's output is fed to every decoder layer's cross-attention, where it provides the keys and values while the decoder's own states provide the queries (see the sketch below)
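A minimal sketch of this connection using PyTorch's torch.nn.Transformer; the sequence lengths, batch size, and random inputs are arbitrary placeholders.

```python
import torch
import torch.nn as nn

model = nn.Transformer(d_model=512, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6)

src = torch.rand(10, 2, 512)   # (source_len, batch, d_model): encoder input
tgt = torch.rand(7, 2, 512)    # (target_len, batch, d_model): decoder input

# Inside forward(), the encoder output ("memory") is passed to every decoder
# layer's cross-attention block, serving there as keys and values.
out = model(src, tgt)
print(out.shape)               # torch.Size([7, 2, 512])
```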
Self-Attention
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence.
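A NumPy sketch of scaled dot-product self-attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V, where Q, K, and V are all projections of the same sequence; the random projection matrices and dimensions here are placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 16
x = rng.normal(size=(seq_len, d_model))            # embedded input sequence

# In self-attention, Q, K, and V all come from the same sequence x
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_k)                    # (seq_len, seq_len)
weights = softmax(scores, axis=-1)                 # each row sums to 1
output = weights @ V                               # contextualized token representations
print(output.shape)                                # (5, 16)
```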
Attention mask: tells the model which tokens it should pay attention to. Masked-out tokens are ignored and cannot be used to compute the model output; the mask discriminates between real tokens and padding tokens (see the sketch below).
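A toy sketch of how a padding mask is applied: masked positions receive a large negative score before the softmax, so they get approximately zero attention weight and never influence the output. The 1 = real token, 0 = padding convention follows the Hugging Face attention_mask style; everything else is illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 6, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))

# 1 = real token, 0 = padding token
attention_mask = np.array([1, 1, 1, 1, 0, 0])

scores = Q @ K.T / np.sqrt(d_k)
# Padding columns get a very negative score -> ~0 weight after softmax,
# so padding tokens never contribute to any token's output.
scores = np.where(attention_mask[None, :] == 1, scores, -1e9)
weights = softmax(scores, axis=-1)
print(weights[:, 4:].sum())   # ~0: no attention mass on the padding positions
```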
References:
[1] https://beausty23.tistory.com/223 (Embedding vs Encoding)