Transformers and NLP

Transformers are machine learning models built from two sub-models, an encoder and a decoder, and they excel at pattern recognition. The encoder takes in a sequence of tokens and encodes it into a vector that captures the meaning and information of the entire sequence. This vector, known as a hidden state, is then passed to the decoder, which turns the hidden state into a sequence of tokens or into whatever output the model was trained to produce. The transformer architecture improves on previous encoder-decoder models by removing the need for sequential processing, which required recurrence and prevented parallelization. The authors of the paper “Attention Is All You Need” introduced a mechanism called multi-head attention, which made this improvement possible. Multi-head attention lets the model focus on different patterns in the sequence at the same time, giving it a better understanding of the input. The architecture also improves the model’s handling of long-range dependencies, i.e. relationships between distant tokens.
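
To make the attention step concrete, here is a minimal sketch using PyTorch’s `nn.MultiheadAttention`; the tensor sizes are arbitrary and random vectors stand in for real token embeddings (in a real model the self- and cross-attention would also be separate blocks).

```python
import torch
import torch.nn as nn

d_model, num_heads = 64, 8
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

# One "encoder" sequence of 10 token vectors and one "decoder" sequence of 5
# (random placeholders standing in for learned embeddings).
enc_tokens = torch.randn(1, 10, d_model)
dec_tokens = torch.randn(1, 5, d_model)

# Self-attention: every encoder position attends to every other encoder position.
enc_out, enc_weights = attn(enc_tokens, enc_tokens, enc_tokens)

# Cross-attention (encoder-decoder attention): decoder positions attend to the
# encoder output, which plays the role of the hidden state passed across.
dec_out, dec_weights = attn(dec_tokens, enc_out, enc_out)

print(enc_out.shape, dec_out.shape)   # torch.Size([1, 10, 64]) torch.Size([1, 5, 64])
print(enc_weights.shape)              # attention weights over encoder positions: (1, 10, 10)
```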

Transformers are often used for natural language processing (NLP), the field of AI concerned with understanding and responding to human prompts in natural language. For example, ChatGPT and Stable Diffusion both take human-written prompts and generate their output from that input. Transformers excel at NLP because of their strength at pattern recognition and at capturing long-distance dependencies. Outside of NLP, transformers are used for tasks such as text classification, speech recognition, computer vision, and even chess.

Notes-

NLP:

process data encoded in natural language

speech recognition, text classification, natural language understanding

Transformer: based on the attention mechanism

text -> tokens -> vectors. At each layer, tokens are contextualized with multi-head attention, allowing the signal for key tokens to be amplified and less important tokens to be diminished.
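
A toy NumPy sketch of that pipeline (a hypothetical three-word vocabulary and a random embedding table), showing how attention weights amplify some tokens and diminish others:

```python
import numpy as np

# Toy vocabulary and tokenization (hypothetical; real models use learned subword tokenizers).
vocab = {"the": 0, "cat": 1, "sat": 2}
tokens = [vocab[w] for w in "the cat sat".split()]

rng = np.random.default_rng(0)
d = 8
embedding_table = rng.normal(size=(len(vocab), d))
x = embedding_table[tokens]                      # (3, d) token vectors

# One attention "head": similarity scores between every pair of tokens...
scores = x @ x.T / np.sqrt(d)                    # (3, 3)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # row-wise softmax

# ...then each token becomes a weighted mix of all tokens: important tokens get
# large weights (amplified), less relevant ones get small weights (diminished).
contextualized = weights @ x                     # (3, d)
print(weights.round(2))
```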

No recurrent units, so less training time than RNNs like the LSTM

used in NLP, computer vision, chess, GPT

Originally used for machine translation

LSTMs used for sequential processing (token by token)

Attention mechanism introduced: it multiplicatively weights the outputs of other units so the relevant ones can be emphasized

Transformers work by encoder-decoder transduction

Encoder = an LSTM (in earlier seq2seq models) that takes a sequence of tokens and turns it into a vector

Decoder = another LSTM that takes that vector and turns it into tokens/outputs
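
A minimal PyTorch sketch of that LSTM encoder-decoder setup, with dummy token ids in place of real data:

```python
import torch
import torch.nn as nn

vocab_size, d = 1000, 32

# Encoder: an LSTM that compresses the whole input sequence into its final
# hidden/cell state ("the vector").
enc_embed = nn.Embedding(vocab_size, d)
encoder = nn.LSTM(d, d, batch_first=True)

# Decoder: another LSTM that starts from that state and emits output tokens.
dec_embed = nn.Embedding(vocab_size, d)
decoder = nn.LSTM(d, d, batch_first=True)
to_vocab = nn.Linear(d, vocab_size)

src = torch.randint(0, vocab_size, (1, 7))    # dummy source token ids
tgt = torch.randint(0, vocab_size, (1, 5))    # dummy target token ids (teacher forcing)

_, (h, c) = encoder(enc_embed(src))           # h, c summarize the source sequence
dec_out, _ = decoder(dec_embed(tgt), (h, c))  # decoder conditioned on that summary
logits = to_vocab(dec_out)                    # (1, 5, vocab_size) next-token scores
```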

GRUs can be used instead of LSTMs (similar in efficiency)

Vanishing gradient problem for long sequences; attention allows the model to better process long-distance dependencies

(2016: Google Translate revamped with a seq2seq model)

Attention Is All You Need: attention without recurrence is sufficient for translation. Removed recurrence, processed tokens in parallel, and kept the dot-product attention; led to multi-head attention, “easier to parallelize due to the use of independent heads and the lack of recurrence. Its parallelizability was an important factor to its widespread use in large neural networks.”
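
A NumPy sketch of the dot-product attention that was kept, softmax(QKᵀ/√d_k)V, computed for all positions and all (independent) heads at once with no recurrence; the shapes are arbitrary:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for all heads and positions at once."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)            # (heads, seq, seq)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)               # row-wise softmax
    return weights @ V                                           # (heads, seq, d_k)

rng = np.random.default_rng(0)
heads, seq_len, d_k = 4, 6, 16
# Each head gets its own projected queries/keys/values; heads are independent,
# so all heads and all positions are processed in parallel -- no recurrence.
Q = rng.normal(size=(heads, seq_len, d_k))
K = rng.normal(size=(heads, seq_len, d_k))
V = rng.normal(size=(heads, seq_len, d_k))
print(scaled_dot_product_attention(Q, K, V).shape)               # (4, 6, 16)
```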

typically self-supervised pretraining on a large generic dataset, then supervised fine-tuning on a task-specific set
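
A minimal sketch of that recipe, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint: a model pretrained with self-supervision on generic text gets a fresh task head, which supervised fine-tuning would then train on labeled examples.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Pretrained (self-supervised) encoder plus a new 2-label classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
logits = model(**batch).logits   # (2, 2); the head is untrained until fine-tuning
```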

components (sketched in code after this list):

  • Tokenizers, which convert text into tokens.
  • Embedding layer, which converts tokens and positions of the tokens into vector representations.
  • Transformer layers, which carry out repeated transformations on the vector representations, extracting more and more linguistic information. These consist of alternating attention and feedforward layers. There are two major types of transformer layers: encoder layers and decoder layers, with further variants.
  • Un-embedding layer, which converts the final vector representations back to a probability distribution over the tokens.
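
A minimal PyTorch sketch wiring those four components together; the sizes, layer count, encoder-style layers, and learned positional embeddings are illustrative choices, not the only variant:

```python
import torch
import torch.nn as nn

vocab_size, d_model, n_layers = 1000, 64, 2

# Embedding layer: token ids + positions -> vector representations.
tok_embed = nn.Embedding(vocab_size, d_model)
pos_embed = nn.Embedding(512, d_model)            # learned positional embeddings (one variant)

# Transformer layers: alternating attention and feedforward sublayers.
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True) for _ in range(n_layers)]
)

# Un-embedding layer: final vectors -> probability distribution over the vocabulary.
unembed = nn.Linear(d_model, vocab_size)

ids = torch.randint(0, vocab_size, (1, 12))       # stand-in for a tokenizer's output
positions = torch.arange(12).unsqueeze(0)
x = tok_embed(ids) + pos_embed(positions)
for layer in layers:
    x = layer(x)                                  # repeated transformations
probs = unembed(x).softmax(dim=-1)                # (1, 12, vocab_size)
```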

Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.

Self-attention helps us create similar connections, but within the same sentence.

  1. Encoder-decoder attention: attention between the input sequence and the output sequence.
  2. Self-attention in the input sequence: attends to all the words in the input sequence.
  3. Self-attention in the output sequence: here the scope of self-attention is limited to the words that occur before a given word. This prevents any information leaks during training of the model, and is done by masking the words that occur after it at each step. So for step 1, only the first word of the output sequence is NOT masked; for step 2, the first two words are NOT masked; and so on (see the masking sketch after this list).
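
A NumPy sketch of that causal masking (sequence length and scores are arbitrary):

```python
import numpy as np

seq_len = 4
# Causal (look-ahead) mask: row i marks which output positions step i may attend to.
# True = visible, False = masked (a future word, hidden to prevent information leaks).
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
print(mask)
# [[ True False False False]    step 1: only word 1 is NOT masked
#  [ True  True False False]    step 2: words 1-2 are NOT masked
#  [ True  True  True False]    ...
#  [ True  True  True  True]]

# Applied to attention scores by setting masked positions to -inf before the softmax:
scores = np.random.default_rng(0).normal(size=(seq_len, seq_len))
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # masked positions get weight 0
```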