What is an LLM?


You've probably heard the acronym GPT before, as in ChatGPT. GPT stands for Generative Pre-trained Transformer, and that's a clue to understanding how large language models actually work.

In this article, we'll break down what each part of this acronym means and explore the core concepts that power the AI revolution we're living through.

Decoding GPT: Generative Pre-trained Transformer

Let's start by looking at each word in the acronym to build a foundational understanding.

Generative means it generates something. In our case, because it's a large language model, it generates text.

Pre-trained, the P, means that all the heavy lifting has been done already. The model has already been trained on enormous datasets of books, websites, forum posts, and that sort of thing. All of the weights inside the neural network are already set. You don't have to train it yourself; you can just use it straight out of the box.
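
If you want to see what "straight out of the box" means in practice, here's a minimal sketch using the Hugging Face transformers library (my choice of tooling, not something the article prescribes). GPT-2 is a small, freely available pre-trained GPT-style model:

```python
# A minimal sketch: use a pre-trained GPT-style model as-is.
# Assumes: pip install transformers torch
from transformers import pipeline

# All the heavy lifting (training) is already done; this just downloads the weights.
generator = pipeline("text-generation", model="gpt2")

print(generator("Once upon a time", max_new_tokens=20)[0]["generated_text"])
```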

The last letter, T, is for Transformer. This is where the real magic happens and it's what's powering the AI revolution.

The Magic Ingredient: The Transformer

A transformer is a specific kind of neural network architecture. To understand why it's so special, we first need a quick refresher on what a neural network is.

A Quick Refresher on Neural Networks

As you might already know, neural networks are made up of nodes in layers. You can think of each node as a little maths function that holds numbers in it. These numbers are called weights and biases.

When you train a neural network, you're basically just feeding a load of labeled data into it so that it can update all of those weights and learn to spot patterns. And neural networks are excellent at spotting patterns. They've been used for things like image recognition for a very long time.

For example, if you were training a neural network to predict the stock market, you could feed in a load of historical stock prices. The network would set all of those weights based on the patterns that it found. Then, when you run a new stock through it, it will produce a result. It might say something like, "this stock has a twenty percent chance of going up and an eighty percent chance of going down."

A diagram illustrating a neural network predicting stock market performance.

People have been doing this since the nineties. If you don't believe me, watch Terminator 2. The key thing to understand is that the output of a neural network is a probability distribution. A neural network doesn't say something will definitely happen; it will say, "here's how likely each different option is based on the data we've been trained on."
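
In code, that usually means a softmax function turning the network's raw output scores into probabilities. Here's a tiny sketch, with the numbers made up to match the stock example above:

```python
import math

def softmax(logits):
    # Turn raw network outputs ("logits") into probabilities that sum to 1.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for two outcomes: "up" and "down".
up, down = softmax([-0.2, 1.2])
print(f"up: {up:.0%}, down: {down:.0%}")  # roughly up: 20%, down: 80%
```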

From Predicting Stocks to Predicting Words

With language models, we're doing a similar sort of thing, except the task is predicting the next word in a sentence.

So if I give a neural network the phrase "Once upon a time," the model might say there's a 45% chance that the next word is going to be "there," a 20% chance that it's the word "in," or a 10% chance it's the word "was," and so on.

A diagram showing a language model predicting the next word after "once upon a time".
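
You can peek at a real distribution like this with a small open model such as GPT-2. A hedged sketch follows (the exact words and percentages will differ from the illustration above, and note that models actually predict tokens, which are often fragments of words):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Once upon a time", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # raw scores for the next token
probs = torch.softmax(logits, dim=-1)

# Show the five most likely next tokens and their probabilities.
top = torch.topk(probs, 5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r}: {p.item():.1%}")
```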

Then, it would choose the most likely one, add that word to the sentence to make "Once upon a time in," and repeat the process. It asks, "What's the next most likely word after that?" and then the next one, and the next one.
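
That loop is only a few lines of code. Continuing with the model and tokenizer from the previous sketch:

```python
# Greedy generation: repeatedly append the single most likely next token.
text = "Once upon a time"
for _ in range(10):
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    next_id = int(torch.argmax(logits))  # pick the most likely next token
    text += tokenizer.decode(next_id)
print(text)
```

(In practice, models usually sample from the distribution rather than always taking the single top choice, which is where some of the variety comes from.)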

But if you just do that with a basic neural network, it really falls apart very quickly. The sentences will start well, but they'll just drift into nonsense. Randomness will creep in, and it'll forget what it was talking about. This is where the transformer layers come in.

The Power of Attention

Transformers solve this problem by using something called attention. Attention is a mechanism that lets the model focus on the important parts of the input. Rather than treating every word the same, it can weigh up which of the earlier words it has seen are the most relevant when choosing the next one.

A slide explaining that the attention mechanism lets a model focus on important parts of the input.

So, if you give it a sentence like "The lion approached the bank," that word "bank" can have different meanings. It could be the bank of a river, or it could be the bank that holds money. With a transformer model, the model can actually pay attention to the word "lion." This helps it decide whether "bank" means a riverbank or an HSBC bank.

A diagram showing an LLM processing the phrase "the lion approached the bank" and paying attention to the word 'lion' for context.

In this case, the transformer layer is going to help the model be more likely to continue that sentence with something about a lion being next to a river.

Transformers don't just look at the last word, though. You get layers and layers of attention blocks stacked on top of each other, and they allow the model to track context across the entire input. That's how LLMs can write really long, coherent pieces of text that stay consistent with what they've already said.
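
To make that concrete, here's a toy sketch of the core computation, called scaled dot-product attention, using NumPy. The vectors are random stand-ins for word representations, and a real transformer also learns separate projection matrices for the queries, keys, and values:

```python
import numpy as np

def attention(Q, K, V):
    # Score how relevant every word is to every other word...
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # ...turn the scores into weights that sum to 1 (a softmax)...
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # ...and blend the word vectors together according to those weights.
    return weights @ V

# One 4-dimensional vector per word in "the lion approached the bank".
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
print(attention(X, X, X).shape)  # (5, 4): each word now carries context from the others
```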

When people talk about models with "seven billion parameters," those parameters are the weights. They're all the numbers inside the layers of the neural network. That's the memory of the model, basically. It's what it's learned during training.
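
A rough, back-of-the-envelope sketch shows how those numbers add up. Every figure here is hypothetical, but the orders of magnitude are typical:

```python
hidden = 4096                 # width of one transformer layer (hypothetical)
one_matrix = hidden * hidden  # one weight per input-output pair: ~16.8 million

layers = 32                   # number of stacked transformer blocks (hypothetical)
matrices_per_layer = 10       # attention + feed-forward matrices, roughly

print(f"{one_matrix * matrices_per_layer * layers:,}")  # 5,368,709,120 -- billions
```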

Building a Story, One Word at a Time

Let's look at an example. If you give ChatGPT the prompt "Once upon a time"—just those four words—from that tiny seed, it's able to write an entire story. Here's how it does it.

The model will look at those words and it will say, "Right, based on everything I've seen during training, what's the most likely next word?" and it'll pick "in."

Diagram showing an LLM predicting 'in' as the next word.

Then it thinks again. Given "Once upon a time in," it might pick "a," then "far," and then "away." You get the idea. This isn't just about finding the most common phrases. It's looking at what's already been said, feeding it back through the model, and adjusting the probabilities dynamically as it writes the next words.

So if it does something like introduce a dragon in the first paragraph, it can bring that dragon back in a later paragraph because it has remembered. It's paid attention to the entire story so far.

Let's say your LLM generates this story: "Once upon a time in a faraway kingdom, there lived a small dragon who didn't want to breathe fire."

The beginning of a story about a dragon on a black background.

Now the model has to decide what happens next. It's not just choosing random words; it's picking the next most likely coherent word based on everything that's come before. So maybe the next sentence would be, "Instead, he dreamed of becoming a pastry chef."

The story about a dragon continues with 'instead he dreamed of becoming a pastry chef'.

And now, that's an actual story. It's not just a plausible or statistically most likely sentence; it's actually a story that holds together, where things in it make sense. There's a theme, maybe even a character arc, and it can keep going. It can write thousands of tokens, and all the while it can remember where it started, what's already been said, and what might go next. That is the power of the transformer.

The Foundation of Prompt Engineering

Let's look at a really cool thing you can do with a GPT-style LLM. If I pass in the text "Once upon a time," the model produces a probability distribution for the next word.

A diagram showing the probability distribution for the next word after "once upon a time", with 'there', 'a', and 'in' as the top choices.

As you can see, the word "fish" doesn't even make it onto that list. There's basically a zero percent chance that the next word is going to be "fish" because, why would it be?

But then, check this out. What happens if I pass in this string?

You are a wildly inventive story writer. You start your stories with once upon a time, but then the next word after that is always the word "fish".

once upon a time

If you pass that long string into an LLM, then because of the transformer architecture, the network will look at that entire block of text and will work out some context it can use in the prediction of the next word.

A diagram showing that with contextual instructions, the LLM predicts 'fish' with a high probability.

If you pass that into an LLM (you can try this yourself on some of the older LLMs like GPT-3, if you can still access it), you'll most likely get the word "fish" back. It's essentially understanding that it was specifically asked to say the word "fish," and it got that context from earlier on in the text that it saw.

You've managed to control how it predicts the next word by giving it some explicit instructions earlier in the text. This gives the illusion that you've instructed your large language model to do something. You haven't really; you've just given it a bunch of context that the transformer layers can use to work out what the most likely next word is going to be. You've made it so that "fish" is now the most likely next word.
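
If you'd like to try this programmatically rather than in a chat window, here's a hedged sketch using the OpenAI Python SDK. It assumes the openai package is installed, an OPENAI_API_KEY is set in your environment, and it uses gpt-4o-mini purely as an example of an available model:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    'You are a wildly inventive story writer. You start your stories with '
    'once upon a time, but then the next word after that is always the '
    'word "fish".\n\n'
    "once upon a time"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # example model; any instruction-following model will do
    messages=[{"role": "user", "content": prompt}],
    max_tokens=5,
)
print(response.choices[0].message.content)  # most likely: "fish"
```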

Conclusion

So, to recap, here’s the essence of how these models work:

  • GPT stands for Generative Pre-trained Transformer.
  • Neural networks can take a string of words and predict the next word in a sentence based on probability.
  • Transformers enhance neural networks by using an "attention" mechanism, which allows them to take into account the context of everything that has come before.
  • This ability to maintain context is what allows LLMs to generate long, coherent narratives and is the fundamental principle behind prompt engineering.

This is the essence of what people call prompt engineering, and that's what we're going to be talking about in the next section. Thanks for reading!

What's Next

Now that you have a grasp of the mechanics behind LLMs, the next logical step is to explore how to effectively communicate with them. Our upcoming articles will dive deeper into the art and science of prompt engineering.
