But what is an LLM?

From a high-level understanding of language modelling to its basic mathematical structure.


Classically, a language model (LM) is defined as a probability distribution over sequences of tokens. But what does that even mean?

Let's go through this one step at a time. Language is a tricky thing.


Language is just a set of words and letters that people have agreed upon and given meaning to, depending on the use case. Now, what do these characters, words, and sentences represent?

Information. Right? Information that is meaningless when the pieces stand alone but, in context, can make beautiful poetry. Machines, on the other hand, do not know how to make sense of these words, their context, or the semantics of human language. What machines are good at is numbers: performing mathematical calculations and finding patterns across different scenarios.

A large language model tries to do precisely that: find patterns in the given text data and, here's the important part, given some input, predict what the next word might be. It's a next-word prediction machine at a very large scale, with a huge knowledge base to learn and match patterns from.

Let's look at this mathematically now, shall we?

Suppose we have a vocabulary V, a set of tokens. A language model p assigns each sequence of tokens x1, x2, …, xL ∈ V a probability (a number between 0 and 1): p(x1, …, xL).

Note: Tokens are smaller chunks of a sentence. For "Never give up", the tokens might be ['Never', 'give', 'up']. There are many other ways to tokenize a sentence, but for now we'll stick with this one.
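
As a minimal sketch, here's what that naive whitespace tokenization looks like in Python (real LLM tokenizers use subword schemes such as BPE, but the idea is the same):

```python
# Naive whitespace tokenization, just to make the idea concrete.
# Real LLM tokenizers split text into subword tokens (e.g. BPE) instead.
def tokenize(sentence: str) -> list[str]:
    return sentence.split()

print(tokenize("Never give up"))  # ['Never', 'give', 'up']
```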

The probability intuitively tells us how "good" a sequence of tokens is. For example, if the vocabulary is V = {the, answer, is, 42}, the language model might assign:

p(the, answer, is, 42) = 0.03

p(the, is, 42, answer) = 0.002

p(42, answer, is, the) = 0.00005

Note how the sentence 'the answer is 42' has the highest probability, which makes sense because it's the most plausible sentence that can be formed with the given tokens in the vocabulary.
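
To make the "probability distribution over sequences" idea concrete, here's a toy sketch in Python. The numbers simply mirror the made-up example above; a real language model computes such probabilities from learned parameters rather than storing them in a table.

```python
# A toy "language model": a lookup table from token sequences to probabilities.
# The numbers mirror the made-up example above; a real LM computes them
# from learned parameters instead of storing them explicitly.
toy_lm = {
    ("the", "answer", "is", "42"): 0.03,
    ("the", "is", "42", "answer"): 0.002,
    ("42", "answer", "is", "the"): 0.00005,
}

def p(sequence: tuple[str, ...]) -> float:
    return toy_lm.get(sequence, 0.0)

best = max(toy_lm, key=p)
print(best, p(best))  # ('the', 'answer', 'is', '42') 0.03
```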

Mathematically, a language model is a very simple and beautiful object. But the simplicity is deceiving: the ability to assign (meaningful) probabilities to all sequences requires extraordinary (but implicit) linguistic abilities and world knowledge.

Example: causal language modeling, in which the next word of a sentence is predicted.

It's time now to see how these large language models are built. Then we'll look at the type of LLM that has taken the attention of the AI community by storm (no pun intended 😬): the Transformer.

As said before, these mathematical models do not understand human English; they need numbers to work with. So we need to define a way to convert the input text into vectors of numbers that represent the text in its context. For example:

$$x_{1:L} = [x_1, \dots, x_L] \;\Rightarrow\; \phi(x_{1:L})$$

[the, answer, is, 42] => [[1, 0.1], [0, 1], [1, 1], [1, -0.1]]; these vectors can be referred to as contextual embeddings.

  • As the name suggests, the contextual embedding of a token depends on its context (surrounding words)

  • Notation: ϕ : V^L → ℝ^(d×L) is the embedding function (analogous to a feature map for sequences).

  • For token sequence x1:L=[x1,…,xL], ϕ produces contextual embeddings ϕ(x1:L)
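
Here's a toy sketch of such an embedding function ϕ with d = 2, matching the made-up vectors above. In a real model these vectors are computed by the network and depend on the surrounding words, not looked up from a fixed table.

```python
import numpy as np

# Toy embedding function phi: token sequence -> d x L matrix (d = 2 here).
# The vectors are the made-up ones from the example above; a real model
# computes them from learned parameters and the token's context.
TOY_VECTORS = {
    "the": [1, 0.1],
    "answer": [0, 1],
    "is": [1, 1],
    "42": [1, -0.1],
}

def phi(tokens: list[str]) -> np.ndarray:
    return np.array([TOY_VECTORS[t] for t in tokens]).T  # shape (d, L)

print(phi(["the", "answer", "is", "42"]).shape)  # (2, 4)
```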

Types of Language Models

There are basically three types of language models:


Encoder Only (BERT, RoBERTa, etc)

These LMs produce contextual embeddings but cannot be used directly to generate text.

$$x_{1:L} \Rightarrow \phi(x_{1:L})$$

These contextual embeddings are generally used for classification tasks.

  • Example: Sentiment Classification

    [machine, learning, is, so, cool] -> positive
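
As a hedged sketch of what this looks like in practice, here's the sentiment example run through the Hugging Face transformers pipeline, which loads a fine-tuned encoder-only model (DistilBERT, by default). The post doesn't depend on this library; it's just a convenient illustration.

```python
from transformers import pipeline

# Encoder-only model used for classification: contextual embeddings in, label out.
classifier = pipeline("sentiment-analysis")
print(classifier("machine learning is so cool"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```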

Decoder Only (GPT-2, GPT-3, etc)

These are our standard autoregressive language models: given a prompt x1:i, they produce both contextual embeddings and a distribution over the next token xi+1 (and, applied recursively, over the entire completion xi+1:L):

$$x_{1:i} \Rightarrow \phi(x_{1:i}),\ p(x_{i+1} \mid x_{1:i})$$

  • Example: Text Autocomplete

    [chat, gpt, is, a] -> large language model
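
Here's a similar hedged sketch using GPT-2, a small, freely available decoder-only model, via the transformers library. The exact continuation depends on the model and decoding settings, so don't expect it to literally say "large language model".

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Decoder-only (autoregressive) generation: predict the next tokens one at a time.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("chat gpt is a", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```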

Encoder-Decoder (BART, T5, etc)

These models in some ways offer the best of both worlds: they can use bidirectional contextual embeddings for the input x1:L and can generate the output y1:L.

$$x_{1:L} \Rightarrow \phi(x_{1:L}),\ p(y_{1:L} \mid \phi(x_{1:L}))$$

  • Example: good for generative tasks that require an input, such as translation or summarization.
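
And a hedged sketch of an encoder-decoder model at work, here using t5-small through the transformers pipeline for English-to-French translation (the exact output wording may vary):

```python
from transformers import pipeline

# Encoder-decoder: the encoder reads the English input,
# the decoder generates the French output.
translator = pipeline("translation_en_to_fr", model="t5-small")
print(translator("machine learning is so cool"))
```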

Setting all the jargon above aside, what exactly are an encoder and a decoder?

  • Encoder (left): The encoder receives input and builds a representation of it (its features). This means that the model is optimized to acquire understanding from the input.

  • Decoder (right): The decoder uses the encoder's representation (features) along with other inputs to generate a target sequence. This means that the model is optimized for generating outputs.

Architecture of a Transformer model

Let's jump to the main thing now... drum roll... the Transformer.

Transformer

The original architecture, first proposed in the Attention Is All You Need paper by Vaswani et al., looks something like the image below. Scary, right? 😵‍💫

It is a neural network model that uses self-attention mechanisms to capture long-range dependencies in the input sequence, making it highly effective for natural language processing tasks such as language translation and language modeling.

Architecture of a Transformer model

But since then it has been the basis of everything we see today, from OpenAI's GPT-3 to ChatGPT, the conversational LLM that is all over the place these days.

Let's see how a Transformer works (say, for language translation):

The left block denotes the Encoder, while the right block denotes the Decoder. During training, the Encoder receives input sentences in a certain language, and the Decoder receives the same sentences in the desired target language. The attention layers in the Encoder can use all the words in a sentence, whereas the Decoder can only pay attention to the words in the sentence that it has already translated.

For example, when predicting the fourth word, the attention layer in the Decoder will only have access to the words in positions 1 to 3. To speed up training, the Decoder is fed the whole target but is not allowed to use future words. The second attention layer in the Decoder uses the output of the Encoder and can access the entire input sentence to predict the current word, which is useful for languages with different grammatical rules or when later context can aid in determining the best translation of a word.

The attention mask can also prevent the model from paying attention to special words, such as padding words used to batch sentences of equal length.
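
Here's a minimal numpy sketch of the two kinds of masks described above, a causal mask (no peeking at future words) and a padding mask (ignore padding tokens). The shapes and variable names are illustrative only, not taken from any particular library.

```python
import numpy as np

L = 5  # sequence length

# Causal mask: position i may attend only to positions j <= i.
causal_mask = np.tril(np.ones((L, L), dtype=bool))

# Padding mask: pretend the last two positions are padding tokens.
is_real_token = np.array([True, True, True, False, False])
padding_mask = is_real_token[None, :]      # broadcast over query positions

allowed = causal_mask & padding_mask       # combined attention mask
scores = np.random.randn(L, L)
# Masked-out positions get -inf so the softmax gives them ~zero weight.
masked_scores = np.where(allowed, scores, -np.inf)
```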

Attention mechanism

While reading all this, did you pay attention to the word attention being used multiple times? There's your answer: that's exactly what the attention mechanism does. It generates feature vectors such that each position in the input has paid attention to the most important context in the sentence. The example below, which I found in a great blog on Transformers by Jay Alammar, explains why it's important.

Say the following sentence is an input sentence we want to translate:

The animal didn't cross the street because it was too tired

What does “it” in this sentence refer to? Is it referring to the street or to the animal? It’s a simple question to a human, but not as simple to an algorithm.

When the model is processing the word “it”, self-attention allows it to associate “it” with “animal”.

As the model processes each word (each position in the input sequence), self attention allows it to look at other positions in the input sequence for clues that can help lead to a better encoding for this word.

Here's a visual way to understand self-attention mentioned in the original paper.

Let's take a translation example:

The French sentence:

Jane s'est rendue en Afrique en septembre dernier, a apprécié la culture et a rencontré beaucoup de gens merveilleux; elle est revenue en parlant comment son voyage était merveilleux, et elle me tente d'y aller aussi.

The English translation:

Jane went to Africa last September, and enjoyed the culture and met many wonderful people; she came back raving about how wonderful her trip was, and is tempting me to go too.

  • The way a human translator would translate this sentence is not to first read the whole French sentence and then memorize the whole thing and then regurgitate an English sentence from scratch. Instead, what the human translator would do is read the first part of it, maybe generate part of the translation, look at the second part, generate a few more words, look at a few more words, generate a few more words, and so on. This is what attention does.

The crux of the Transformer is the attention mechanism, which was developed earlier for machine translation (Bahdanau et al., 2015).

One can think of attention as a "soft" lookup table, where we have a query y that we want to match against each element of a sequence x1:L = [x1, …, xL].

We can think of each xi as representing a key-value pair via linear transformations:

$$\big(W_{\text{key}}\, x_i : W_{\text{value}}\, x_i\big)$$

and forming the query via another linear transformation:

$$W_{\text{query}}\, y$$

The key and the query can be compared to give a score for each position:

$$\text{score}_i = \frac{x_i^{\top} W_{\text{key}}^{\top} W_{\text{query}}\, y}{\sqrt{d}}$$

These scores can be exponentiated and normalized to form a probability distribution over the token positions {1, …, L}:

$$[\alpha_1, \dots, \alpha_L] = \mathrm{softmax}\big([\text{score}_1, \dots, \text{score}_L]\big)$$

Then the final output is a weighted combination over the values:

$$\sum_{i=1}^{L} \alpha_i \,\big(W_{\text{value}}\, x_i\big)$$

We can write this all succinctly in matrix form:

def Attention(x1:L : ℝ^(d×L), y : ℝ^d) → ℝ^d:

  • Process y by comparing it to each xi.

  • Return the weighted combination of values:

$$W_{\text{value}}\, x_{1:L}\,\mathrm{softmax}\!\left(x_{1:L}^{\top} W_{\text{key}}^{\top} W_{\text{query}}\, y / \sqrt{d}\right)$$

The softmax activation function transforms the raw outputs of the neural network into a vector of probabilities.

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function which was used to represent a probability distribution over a binary variable.

— Page 184, Deep Learning, 2016.
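
A minimal numpy sketch of that softmax (with the usual subtract-the-max trick for numerical stability):

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    # Exponentiate and normalize; subtracting the max avoids overflow in exp().
    exp = np.exp(scores - scores.max())
    return exp / exp.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # all positive, sums to 1.0
```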

We can think of there as being multiple aspects (e.g., syntax, semantics) that we would want to match on. To accommodate this, we can simultaneously have multiple attention heads and simply combine their outputs.

Self-attention layer: now we substitute each xi in place of y as the query argument to produce:

def SelfAttention(x1:L : ℝ^(d×L)) → ℝ^(d×L):

  • Compare each element xi to each other element.

  • Return [Attention(x1:L, x1), …, Attention(x1:L, xL)]
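
Here's a minimal numpy sketch of these two definitions. The dimensions and random weight matrices are purely illustrative (a real implementation batches everything into matrix multiplications and uses multiple heads), but the logic follows the equations above.

```python
import numpy as np

d, L = 4, 3
rng = np.random.default_rng(0)
W_key, W_value, W_query = (rng.normal(size=(d, d)) for _ in range(3))

def softmax(scores):
    exp = np.exp(scores - scores.max())          # same trick as above
    return exp / exp.sum()

def attention(x, y):
    # x: (d, L) sequence of embeddings, y: (d,) query. Returns a (d,) vector.
    scores = (W_key @ x).T @ (W_query @ y) / np.sqrt(d)   # one score per position
    alphas = softmax(scores)                              # attention weights
    return (W_value @ x) @ alphas                         # weighted combination of values

def self_attention(x):
    # Each token x_i takes a turn as the query.
    return np.stack([attention(x, x[:, i]) for i in range(x.shape[1])], axis=1)

x = rng.normal(size=(d, L))
print(self_attention(x).shape)  # (d, L)
```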

Feedforward layer- Self-attention allows all the tokens to “talk” to each other, whereas feedforward connections provide:

def FeedForward(x1:L : ℝ^(d×L)) → ℝ^(d×L):

  • Process each token independently.

  • For i = 1, …, L, compute:

$$y_i = W_2 \max(W_1 x_i + b_1, 0) + b_2$$

  • Return [y1, …, yL]
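
And a matching numpy sketch of the per-token feedforward layer; the hidden width d_ff and the random weights are assumptions just to show the shapes.

```python
import numpy as np

d, d_ff, L = 4, 16, 3
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(d_ff, d)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d, d_ff)), np.zeros(d)

def feed_forward(x):
    # x: (d, L). Each column (token) goes through the same two-layer MLP independently.
    hidden = np.maximum(W1 @ x + b1[:, None], 0)   # ReLU, shape (d_ff, L)
    return W2 @ hidden + b2[:, None]               # shape (d, L)

x = rng.normal(size=(d, L))
print(feed_forward(x).shape)  # (4, 3)
```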

At the very end of the decoder, a linear layer projects the output vector into a much larger vector of scores, one per word in the vocabulary (the logits). The softmax layer then turns those scores into probabilities (all positive, all adding up to 1.0). The cell with the highest probability is chosen, and the word associated with it is produced as the output for this time step.

Here's a good visualization of this from Alammar's blog.

Yup, we're almost done. That's how autoregressive large language models end up doing the cool stuff they do. Speaking of how these models are able to generate high-quality text and mimic human behavior, I can't help but quote the famous Turing Test before I end this.

"A computer would deserve to be called intelligent if it could deceive a human into believing that it was human," Turing wrote in 1950, defining his now-famous Turing Test.

See you in the next one. Peace out! - Kaus

References-