As generative AI systems—especially large language models (LLMs) like ChatGPT—increasingly shape our professional, social, and educational lives, understanding their inner workings is becoming essential to digital literacy. Language models embody complex assumptions, biases, and design choices that profoundly influence their outputs; gaining insight into the mechanisms behind LLMs empowers users to critically evaluate their reliability, ethical implications, and societal impacts.
Understanding how LLMs function requires, above all else, a grasp of transformers.
Introduced in 2017 by Google researchers, transformers are a type of neural network architecture that has revolutionised artificial intelligence, particularly in natural language processing tasks like translating languages, understanding text, and generating human-like responses.
The key innovation of transformers is a mechanism called attention, specifically self-attention. Unlike older neural networks, which processed words sequentially (one after another), transformers can examine all words in a sentence simultaneously, helping the network better understand context, meaning, and the relationships between words.
Transformers are based on three essential ideas: embeddings, self-attention, and the overall transformer structure.
Embeddings
or, Representing words as numbers
When an AI language model receives text, it cannot directly process the human-readable words. Instead, each word is immediately translated into numbers that the model can understand and manipulate mathematically. This numerical representation is called an embedding.
To visualise embeddings clearly, think of each word as a point placed somewhere in a vast mathematical space with many dimensions—often hundreds or even thousands of dimensions. Each dimension captures some specific aspect of the word’s meaning or usage. The exact position of each word in this space is represented by a vector—a series of numbers (coordinates) like (0.45, -0.23, 0.88, ...).
These numerical coordinates are not random. Embeddings are designed so that words which are similar in meaning or usage are placed close together, while words with very different meanings are far apart. For example, the words ‘happy’, ‘joyful’, and ‘content’ are represented by embeddings that are numerically similar—this means their points sit closely together in the embedding space. By contrast, the embedding for the word ‘sad’ will be placed significantly further away, reflecting its difference in meaning.
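To make this concrete, here is a minimal sketch in Python (using NumPy) of how 'closeness' in embedding space can be measured. The four-dimensional vectors are invented purely for illustration (real models use hundreds or thousands of dimensions learned from data), but the pattern is the same: similar words score near 1 on cosine similarity, while dissimilar words score much lower.

```python
import numpy as np

# Toy embeddings: the values are made up for illustration only.
embeddings = {
    'happy':  np.array([0.82, 0.10, -0.31, 0.45]),
    'joyful': np.array([0.79, 0.15, -0.28, 0.50]),
    'sad':    np.array([-0.75, 0.05, 0.40, -0.38]),
}

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; -1.0 means opposite.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(embeddings['happy'], embeddings['joyful']))  # close to 1
print(cosine_similarity(embeddings['happy'], embeddings['sad']))     # strongly negative
```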
How are these embeddings actually created? They start out randomly initialised, but the neural network learns appropriate values by looking at vast amounts of text data. As it encounters millions or billions of words, it gradually learns patterns about how words appear together or relate to each other. For example, because ‘happy’ and ‘joyful’ frequently appear in similar contexts (‘She was joyful’, ‘She was happy’), their embeddings end up closer together after training. Each time the model sees words together in a sentence, it slightly adjusts their embedding coordinates to bring similar words closer and push different words further apart.
This process results in embeddings that reflect genuine semantic relationships between words. Embeddings don't just capture simple synonymy ('happy' ≈ 'joyful'); they also reflect more subtle relationships, like analogies ('king' is to 'queen' as 'man' is to 'woman'), because these relationships produce consistent numeric patterns in embedding space.
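The analogy pattern can be sketched as simple vector arithmetic. The three-dimensional vectors below are invented so that the relationship holds exactly; in trained embeddings it holds only approximately, but the idea is the same: the difference between 'king' and 'man' points in roughly the same direction as the difference between 'queen' and 'woman'.

```python
import numpy as np

# Toy vectors, chosen so the analogy works out exactly.
king  = np.array([0.80, 0.65, 0.10])
queen = np.array([0.78, 0.20, 0.12])
man   = np.array([0.30, 0.70, 0.05])
woman = np.array([0.28, 0.25, 0.07])

analogy = king - man + woman
print(analogy)  # [0.78 0.20 0.12], i.e. the vector for 'queen'
```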
Once embeddings are created, transformers can efficiently perform mathematical operations on them. For example, to understand sentences or paragraphs, the model uses these numeric embeddings to precisely calculate how words relate to each other through operations such as self-attention. Without embeddings, the model would have no structured way to perform these calculations—it would have no numerical handle on meaning.
In short, embeddings convert qualitative, human-readable language into quantitative, structured representations. They are the critical first step that allows transformers to mathematically analyse language, enabling the sophisticated language-processing capabilities found in LLMs.
Self-Attention
or, Understanding context
In simple terms, self-attention figures out the meaning of a word by directly measuring how closely related it is to every other word in the same sentence. Take, for example, the opening sentence from Orwell’s 1984:
It was a bright cold day in April, and the clocks were striking thirteen.
Imagine we want the model to fully understand what the word ‘thirteen’ means in this context. Self-attention achieves this understanding through the following process:
Step 1. Convert words into embeddings
As just explained, each word (or token) in the sentence is first represented numerically as a vector called an embedding. These embeddings contain information about the meaning and usage of the words.
So, the word ‘thirteen’ starts as an embedding—a long list of numbers capturing its meaning.
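As a rough sketch of this first step, the code below tokenises the sentence (simply by splitting a lower-cased version on spaces, which glosses over real tokenisation) and looks each word up in an embedding matrix. The embedding values are random stand-ins; in a trained model they would have been learned.

```python
import numpy as np

sentence = 'it was a bright cold day in april and the clocks were striking thirteen'.split()
vocab = {word: idx for idx, word in enumerate(sorted(set(sentence)))}

d_model = 8                                  # embedding size (a toy value)
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), d_model))  # random stand-in values

# One embedding per word in the sentence: shape (14, 8)
X = np.stack([embedding_matrix[vocab[w]] for w in sentence])
print(X.shape)
```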
Step 2. Create Queries, Keys, and Values
Next, the embedding for each word is transformed into three new vectors using learned weight matrices. These new vectors are known as:
Query (Q): What information this word is 'looking for';
Key (K): What information this word 'offers' to others;
Value (V): The actual information this word can provide once a connection is made.
In other words, every word generates three distinct representations of itself (Q, K, V), each serving a different role in the attention calculation.
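A minimal sketch of this step, continuing with random stand-in numbers: each word's embedding is multiplied by three learned weight matrices (random here, since nothing is being trained) to produce its Query, Key, and Value vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_model = 14, 8                     # 14 words in the sentence, toy embedding size
X = rng.normal(size=(n_words, d_model))      # embeddings from Step 1 (random stand-ins)

W_q = rng.normal(size=(d_model, d_model))    # learned in a real model
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = X @ W_q   # what each word is 'looking for'
K = X @ W_k   # what each word 'offers'
V = X @ W_v   # the information each word can pass on
print(Q.shape, K.shape, V.shape)             # (14, 8) each
```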
Step 3. Calculate attention scores
Now, the transformer uses the Queries and Keys to calculate attention scores. Specifically, it calculates how well each word’s Query matches up with every other word’s Key. If two words have Queries and Keys that match closely, this indicates that these words are strongly related or contextually relevant.
Mathematically, this involves taking the dot product (a type of similarity measure) of each word’s Query with each other word’s Key. For instance, when focusing on the word ‘thirteen’, the transformer might find a strong match between the Query for ‘thirteen’ and the Key for ‘clocks’, indicating a close relationship. Conversely, the match between the Query for ‘thirteen’ and the Keys of less related words (like ‘bright’) may be weaker, meaning less attention will be given.
The scores calculated from these matches determine precisely how much each word will pay attention to every other word. Higher scores mean greater importance or relevance.
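In code, this is a single matrix multiplication: the score between word i and word j is the dot product of i's Query with j's Key. The sketch below uses random stand-in Queries and Keys; note that the standard transformer also divides these scores by the square root of the Key dimension to keep them numerically stable, a detail omitted from the prose above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_k = 14, 8
Q = rng.normal(size=(n_words, d_k))      # Queries from Step 2 (random stand-ins)
K = rng.normal(size=(n_words, d_k))      # Keys from Step 2 (random stand-ins)

scores = (Q @ K.T) / np.sqrt(d_k)        # shape (14, 14): one score for every pair of words
print(scores.shape)
print(scores[13])                        # row 13: how well 'thirteen' (the last word) matches each Key
```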
Step 4. Convert scores to weights
Next, the transformer converts these raw attention scores into weights—numbers between 0 and 1 that sum to 1 across the sentence—using a mathematical function called softmax. This makes the scores easy to interpret as proportions: high attention scores become large weights, meaning the model should strongly consider that word’s information, while low scores result in smaller weights, meaning that word matters less.
In our example, after applying softmax, the word ‘thirteen’ might assign a high weight (close to 1) to ‘clocks’, indicating clearly that understanding ‘thirteen’ strongly depends on ‘clocks’. Other words like ‘bright’ or ‘April’ would receive lower weights because they are less relevant.
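A small sketch of softmax itself, applied to one invented row of scores: whatever the raw numbers are, the resulting weights all lie between 0 and 1 and sum to 1, and the largest score takes by far the largest share.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtracting the max keeps the exponentials numerically stable
    return e / e.sum()

scores_for_thirteen = np.array([0.1, -0.3, 0.2, 4.0, 0.5])  # invented scores
weights = softmax(scores_for_thirteen)
print(weights)        # the 4.0 score takes most of the weight
print(weights.sum())  # 1.0
```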
Step 5. Combine values using the attention weights
Finally, the transformer uses these weights to combine the Values (V) from all the words in the sentence into a new, updated embedding. Each word’s Value representation is multiplied by its corresponding attention weight, and these weighted Values are added together, producing a final embedding that reflects precisely how the word relates to its surrounding context.
In our sentence, this means the embedding of the word ‘thirteen’ is heavily influenced by the embedding of the word ‘clocks’. As a result, the transformer understands clearly that ‘thirteen’ relates specifically to clocks striking the hour—not to anything else unrelated in the sentence.
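The final step is again just matrix arithmetic: the new embedding for a word is the weighted sum of every word's Value vector, using the attention weights from Step 4. The numbers below are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
n_words, d_v = 14, 8
V = rng.normal(size=(n_words, d_v))     # Values from Step 2 (random stand-ins)

weights = rng.random(n_words)
weights = weights / weights.sum()       # attention weights for one word; they sum to 1

updated_embedding = weights @ V         # shape (8,): a context-aware vector for that word
print(updated_embedding.shape)
```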
Why does this matter?
This step-by-step process allows the transformer to directly measure the strength of relationships between words without needing to read through words sequentially or guess based purely on proximity. It can directly pinpoint connections—even between words located far apart in a sentence—and handle complex relationships with clarity and precision.
In short, the transformer doesn’t just vaguely ‘capture nuance’; it explicitly calculates numerical relationships, providing precise and interpretable connections between words like ‘clocks’ and ‘thirteen’, which helps the model understand exactly what each word means in context.
Transformer Architecture
or, Putting it all together
In their original design, transformers are composed of two main components that work closely together: the encoder and the decoder. Each plays a distinct but complementary role, processing input text and producing meaningful outputs through a series of structured steps.
The encoder is responsible for taking input text—such as an English sentence—and converting it into rich, numerical representations that capture the meaning and structure of the sentence. It does this through an embedding stage followed by multiple layers of self-attention:
Embedding stage
The encoder first converts input words into numerical embeddings. As we’ve discussed, these embeddings place similar words near each other in mathematical space.
Self-Attention layer
Once embeddings are created, multiple layers of self-attention come into play. Each layer recalculates the embeddings based on the self-attention process described earlier: every word measures how closely related it is to every other word in the sentence, updating its own embedding accordingly. After multiple layers, these embeddings become highly refined—they not only reflect individual words, but precisely represent each word’s relationship to the rest of the sentence.
By repeating this self-attention process several times, the encoder progressively develops detailed, contextually aware representations of the input sentence. After these layers of refinement, each embedding encodes sophisticated information—such as how each word relates semantically and structurally to all others in the input.
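The whole encoder idea can be condensed into a short sketch: the same self-attention computation is applied several times, each pass producing more context-aware embeddings. Real encoder layers also include feed-forward sublayers, residual connections, and normalisation, all omitted here, and every value below is a random stand-in.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # how much each word attends to each other word
    return weights @ V                                  # updated, context-aware embeddings

rng = np.random.default_rng(0)
n_words, d_model, n_layers = 14, 8, 4
X = rng.normal(size=(n_words, d_model))                 # embeddings of the input sentence

for _ in range(n_layers):                               # each layer has its own learned weights
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    X = self_attention(X, W_q, W_k, W_v)

encoder_output = X
print(encoder_output.shape)                             # (14, 8): refined embeddings
```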
The decoder takes the encoder’s output—these refined embeddings—and translates them into useful outputs like translated sentences, answers to questions, or summaries. Like the encoder, the decoder contains multiple layers of attention, but it adds an essential step: cross-attention.
Cross-attention
While the encoder’s self-attention focuses entirely within the input text, the decoder’s cross-attention allows it to look back to the encoder’s output. Cross-attention helps the decoder select precisely which parts of the input embeddings to focus on while generating each new word in the output. For instance, if the decoder is translating an English sentence into French, cross-attention explicitly guides it to the relevant English words each time it generates a corresponding French word.
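Cross-attention reuses the same Query/Key/Value machinery, with one change: the Queries come from the decoder's current state, while the Keys and Values come from the encoder's output. The sketch below uses random stand-ins throughout and assumes five output words have been generated so far.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model = 8
encoder_output = rng.normal(size=(14, d_model))   # refined embeddings of the 14 input words
decoder_state  = rng.normal(size=(5, d_model))    # embeddings of the 5 output words so far

W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
Q = decoder_state  @ W_q     # what the decoder is looking for
K = encoder_output @ W_k     # what each input word offers
V = encoder_output @ W_v

weights = softmax(Q @ K.T / np.sqrt(d_model))     # shape (5, 14): output words attending to input words
context = weights @ V                             # input information gathered for each output word
print(context.shape)                              # (5, 8)
```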
Decoder’s self-attention
In addition to cross-attention, the decoder also applies its own internal self-attention, allowing each word it produces to consider the previously generated words. This ensures the output sentence remains coherent, grammatically correct, and contextually consistent.
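The decoder's self-attention differs from the encoder's in one detail: a 'causal' mask prevents each position from attending to words that have not been generated yet, so the i-th output word can only look at words 1 to i. A sketch with random stand-in values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_out, d_model = 5, 8
Q = rng.normal(size=(n_out, d_model))
K = rng.normal(size=(n_out, d_model))
V = rng.normal(size=(n_out, d_model))

scores = Q @ K.T / np.sqrt(d_model)
mask = np.triu(np.ones((n_out, n_out), dtype=bool), k=1)   # True for positions in the 'future'
scores[mask] = -np.inf                                     # future positions get zero weight
weights = softmax(scores)
print(np.round(weights, 2))                                # upper triangle is all zeros
```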
Together, cross-attention and self-attention within the decoder ensure that the generated outputs precisely reflect the meanings and structures originally captured by the encoder. The decoder generates outputs word-by-word, carefully choosing each subsequent word based on what it has previously produced and what it learns from the encoder’s refined embeddings.
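At a very high level, the generation loop looks like the sketch below. The helper decoder_step is a hypothetical placeholder standing in for the decoder's full stack of masked self-attention and cross-attention; here it just returns random scores over a toy vocabulary, so the printed output is meaningless, but the loop structure (score the vocabulary, pick a word, feed it back in) is the real pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['<end>', 'il', 'faisait', 'froid', 'et', 'clair', 'en', 'avril']  # toy vocabulary

def decoder_step(encoder_output, generated):
    # Placeholder: a real decoder attends over the generated words (self-attention)
    # and the encoder output (cross-attention) to score every word in the vocabulary.
    return rng.normal(size=len(vocab))

encoder_output = rng.normal(size=(14, 8))        # from the encoder
generated = []
for _ in range(10):                              # generate at most 10 words
    scores = decoder_step(encoder_output, generated)
    next_word = vocab[int(np.argmax(scores))]    # greedily pick the highest-scoring word
    if next_word == '<end>':                     # a special token signals the end of the output
        break
    generated.append(next_word)

print(' '.join(generated))
```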
Putting this all together, transformers follow these steps:
Encoder Input: Input words are converted into numerical embeddings;
Encoder Self-Attention: Multiple self-attention layers refine embeddings to deeply represent meaning and context;
Encoder Output: The encoder produces refined, context-aware embeddings representing the input sentence;
Decoder Cross-Attention: The decoder examines these encoder outputs, selectively focusing on relevant parts for generating each output word;
Decoder Self-Attention and Output: The decoder generates each word of the output sentence by attending to both the encoder’s information (cross-attention) and its own previously generated words (decoder self-attention).
This structured, layered combination of embeddings, self-attention, and cross-attention allows transformers to effectively manage context, handle relationships across long sentences, and produce highly accurate, context-sensitive outputs.
Why does this architecture matter?
By allowing each word to directly ‘see’ every other word at each step, transformers avoid the weaknesses of earlier methods, which processed sentences sequentially and often struggled with long-distance connections. Transformers handle complexity elegantly, accurately linking ideas across sentences and paragraphs. This ability to simultaneously examine all words in context is the reason transformers have become foundational to recent breakthroughs in language modelling.