Journey through AI: Weekly Lessons from the Undergraduate Classroom
Building the Tower: From Tokens to Transformers in the Classroom
October 24, 2025
This fall I launched something new at George Mason University: UNIV 182 – AI4All: Understanding & Building Artificial Intelligence, the first campus-wide course in AI literacy, open to every undergraduate, regardless of major. It satisfies the Mason Core requirement in Information Technology & Computing, and, more importantly, it’s meant to lower the barrier of entry into AI for every student on campus. This is not an appreciation course. We understand, we apply, we critique, we build. This course has a rhythm. Join us!
Warning: Substack has determined this post is too long for email, but, trust me, you want the long version.
Setting the goal
More than a week ago, we began our unit on large language models. We had already explored how neural networks recognize patterns and how recurrent neural networks (RNNs) handle sequential data. We also understood their limits: how memory decays over time, how information gets lost across long sequences, and how such models struggle to capture meaning in language that stretches far beyond a few words.
I told the class that the next three lectures would probably be our most demanding, conceptually, mathematically, and intellectually. Our goal was ambitious: to understand the technology that drives ChatGPT, Grok, Claude, Perplexity, Gemini, and every system like them. This would be the culmination of their technical journey in the course.
I projected the transformer architecture. It looked dense, intricate, and almost forbidding. Multi-head attention modules were stacked in dozens of layers. Feed-forward networks. Encoder and decoder towers. Arrows everywhere. It looked like something only experts could decipher.
Then I pulled up the original paper, Attention Is All You Need (Vaswani et al., 2017).
I told them, “This is the blueprint for every generative model you hear about today.”
For a moment, the room was quiet. They saw the prize that awaited them. From that point on, they were willing to build from the ground up. I promised that the journey would be worth it, that by the end, we would be able to take apart every layer, every head, every block, and say: transformer, explained.
Text into Numbers: Embedding Is All You Need
We started not with architecture, but with a bit of philosophy.
I reminded students that every machine learning system we had studied, whether recognizing images, predicting stock movements, or translating sentences, did the same essential thing. It mapped complex, structured input into vectors of numbers. Those numbers, those embeddings, were where the model’s understanding lived.
An embedding, I told them, was not just a technical detail. It was the currency of meaning for the model.
It compressed high-dimensional, messy input, such as pixels, words, or audio, into a space where relationships could be measured, compared, and learned.
And where was it hiding? In that last hidden layer before the model’s output. The network was constantly working to improve its internal representation of the data, to turn a poor initial embedding into a good, final one.
That, I said, was the entire project of learning. Everything else, the loss, the weights, and the gradients, was bookkeeping for how to make that embedding better.
I wanted students to see this idea clearly because it reframed everything that followed. A transformer, at its core, is simply a highly structured way to refine embeddings, to move from an initial, context-free representation to a rich, context-aware one.
So we started our journey where others had started before us: from bad initial embeddings to good, hopefully better, final embeddings.
Where to Begin? At the Beginning, with Initial Embeddings
But it was important to start at the very beginning: how do we determine that initial input? We cannot feed words directly to the machine. So we dove into tokens and tokenization.
I explained that tokens are not words but fragments of text, sometimes a whole word, sometimes a prefix, sometimes a single letter, or even punctuation. This decision, to tokenize text before embedding it, was not just an engineering convenience. It determined what the model could see, and therefore what it could count, compare, and understand. We realized that there are different ways to tokenize, different algorithms, and therefore real consequences.
It was time for something abstract to meet something concrete, and funny. When we looked closely at tokenization, we saw how deep its effects run.
For example, students finally understood why GPT-4 famously failed to count the number of “r”s in “raspberry”, or why early versions of GPT-5 struggled with “b”s in “blueberry.”
The reason was not arithmetic but representation. These models do not “see” individual letters the way humans do. They see tokens, and the word raspberry may exist as one or two opaque tokens in their vocabulary, not as a sequence of characters to inspect. The model can reason beautifully over meaning but stumble on form because its understanding begins one layer too high.
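For readers who want to poke at this themselves, here is the kind of snippet that makes the point. It is a minimal sketch, assuming the open-source tiktoken library is installed; the exact splits depend on which vocabulary the encoding uses.

```python
# Minimal sketch: how a subword tokenizer carves text into opaque pieces.
# Assumes the open-source tiktoken library; exact splits depend on the vocabulary.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["raspberry", "blueberry"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]
    print(word, "->", pieces)   # the model sees these chunks, never individual letters
```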
That realization changed how students thought about modeling.
Every design choice, what to treat as an atomic unit, what to embed, sets the boundaries of what a system can know and manipulate. Tokenization is not trivial preprocessing; it is the first act of interpretation.
From there, the next step made sense. How do we go from tokens to those initial numbers, those bad embeddings a transformer’s job is to fix? Time to reach for the dictionary, the model’s vocabulary of tokens, and map each one to a numerical vector. That mapping gives us the initial embeddings. They are lookup results, pre-trained coordinates that place each token somewhere in the model’s internal semantic space.
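In code, that lookup is almost embarrassingly simple. A minimal sketch with a toy vocabulary, where random vectors stand in for the pre-trained coordinates:

```python
# Initial embeddings are a lookup table: token id -> vector.
# Toy vocabulary; random vectors stand in for pre-trained coordinates.
import numpy as np

vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
d_model = 8
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), d_model))

ids = [vocab[t] for t in ["the", "cat", "sat"]]
x = embedding_table[ids]        # shape (3, 8): the "bad" initial embeddings
print(x.shape)
```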
Those vectors are where we begin, but they are shallow. So how do we make them better? How do we get from those first crude numbers to the kind of representations that capture similarity, analogy, and eventually meaning?
Great question.
From Bad Initial Embeddings to Good, Final Embeddings: The Journey
That question took us on a detour through the history of embeddings themselves, a detour that was anything but boring.
We revisited the era before transformers, when the field was just beginning to realize that relationships could be geometric.
Enter Word2Vec and GloVe, two ideas that forever changed how language is represented. Both learned that if two words appear in similar contexts, they should live near each other in vector space. Suddenly, the geometry of meaning became visible:
king – man + woman ≈ queen.
When I wrote that on the board, you could almost hear the room lean forward. There were “huh” moments. Yes, that was when we saw how powerful embeddings were. We could do arithmetic on words, add, subtract, and uncover this strange, previously unprobed semantic space that admitted math.
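For readers who want to try the arithmetic themselves, a small sketch, assuming the gensim library and its downloadable "glove-wiki-gigaword-50" vectors are available:

```python
# Word arithmetic on pretrained GloVe vectors.
# Assumes gensim and its downloadable "glove-wiki-gigaword-50" vectors.
import gensim.downloader as api

vecs = api.load("glove-wiki-gigaword-50")
print(vecs.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# "queen" should land at or near the top of the list.
```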
And then came the bad news. It was not enough. A single vector for ‘bank’ cannot serve both the river and the loan. The arithmetic of embeddings can hint at relationships but cannot disambiguate meanings that depend on context.
So we realized the need to capture polysemy. I had the students say it aloud, “poly-SEH-mee.” We laughed that the Greeks always get the good words. But then we used it seriously: polysemy is precisely why static embeddings fall short.
That realization brought us back to our central problem: how can a model adapt a word’s meaning to its surroundings? How can ‘bank’ know which world it is in?
Contextual Embeddings: Attending to Shared Meaning
So we returned to Attention Is All You Need.
What an audacious title, and it turned out the authors were right.
The paper began with a radical idea: instead of forcing a model to read tokens in sequence, like RNNs or LSTMs did, what if we allowed every token to look at every other token to decide which ones matter most for its meaning?
This was the most difficult part of the lecture. I have seen graduate students struggle to connect the math to the intuition. So I started high, conceptually, on a mountain.
“How do you know what a word means?” I asked. “How do words get their meaning?”
I could not resist bringing in a bit of sci-fi philosophy (but they expect it now from me):
“The gift of words is the gift of deception and illusion... Words can carry any burden we wish. All that’s required is agreement and a tradition upon which to build.”
— God Emperor of Dune
Yes, we understood the need for context, but even this is not enough. You cannot jump from there straight to Query, Key, Value. No, you need to make it tangible for students.
So, we did a fun experiment. I asked my students: What is my meaning? Here I am standing in front of you. How would you assign me a meaning? I could see a few lost faces, but I could also see those that were actively seeking. So, the next leading question. Who am I most similar to? Am I similar to all of you? Are you all contributing equally to my meaning? No, no, they said. That’s right. I think I am most similar to Ben here, or to Anne there. But there could be a bit of Joel in me. How do I get to this?
Ah, yes, I can show you my calling card. This is my query. I am sending this out to each of you. I want us to compare. But when you show me your cards, those are your keys. So, here I am comparing my query to Ben’s key, and then to Anne’s key, and Joel’s. You know what, I want to compare my query to each of your keys, and then based on how much similarity we have, you tell me what percentage of me comes from you.
So, how do I build my meaning? Let's add a new construct: each of us carries a value. So, Ben comes and gives me a percentage of his value, and Anne a percentage of hers, and Joel a percentage of his, but all those percentages are different, based on how my query compared with each of those keys. You just shared your meaning with me. You were my context, and I gained meaning from you. You attended to my meaning, with this math behind it.
So, here we go. Time to go from us to the tokens. For every token, we compute three things:
a Query (Q): what it's looking for,
a Key (K): what it has to offer,
and a Value (V): the content it carries.
The model compares each Query to all Keys, scores their similarity, scales the results, normalizes them with a softmax, and uses those weights to mix the Values. Why normalize? Because we want those weights to be percentages, to add up to 100%. The outcome? A new vector zi for each token i: a contextual embedding, no longer static, no longer alone, the better embedding we were seeking all along, grown out of the bad initial embedding xi.
Now we saw it, the algorithm that brought it all together:
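Attention(Q, K, V) = softmax(Q Kᵀ / √dₖ) V

Here dₖ is the dimension of the keys; dividing by √dₖ is the paper's scaling trick, keeping the similarity scores well behaved before the softmax turns them into percentages.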
“This is it,” I told them. “This is how ‘bank’ knows if it’s about finance or a river.” Let us repeat how simple and brilliant this idea is: every token asks, “Who around me helps define me?” This is what I asked you. And so every token rewrites itself based on the answers. Meaning is now collective, shared. Context isn’t added after the fact but creates the meaning itself.
But how do we know what these Qs and Ks and Vs are? We don't. And what do we do when we don't? We initialize them to something, and then we learn them. That is what transformers learn and refine, the projection matrices that turn each token's embedding into its query, key, and value: Qi = xi WQ, Ki = xi WK, Vi = xi WV.
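A minimal numpy sketch of one pass through that algorithm, with toy sizes and random matrices standing in for the learned WQ, WK, WV:

```python
# Single-head self-attention, toy sizes; random weights stand in for learned ones.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 8

x = rng.normal(size=(n_tokens, d_model))          # bad initial embeddings x_i
WQ, WK, WV = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = x @ WQ, x @ WK, x @ WV                  # each token's query, key, value
scores = Q @ K.T / np.sqrt(d_model)               # compare every query to every key
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)    # softmax: each row sums to 100%
z = weights @ V                                   # contextual embeddings z_i
print(z.shape)                                    # (5, 8): one refined vector per token
```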
And we ran the algorithm forward together. We saw it in action. But we were not done yet.
How Rich Is My Language: Multiple Embeddings
And yet, and yet, natural language is so rich. We had made sense of attention: how one token can look at all the others, decide which matter most, and rewrite its meaning accordingly. So I turned to the class and asked, “Is one attention mechanism enough?”
Do we remember the channels in the convolutional neural networks? Why did we need them? Here that concept of channels becomes heads.
A single attention head captures only one view of relationships. But language, and really, any sequential data, carries many patterns at once.
On the board, I start writing examples:
Syntax: subject → verb
Semantics: object → adjective
Position: word order, distance, rhythm
Each one of these patterns expresses a different kind of relationship that matters for meaning. So what if we gave the model multiple ways to look—multiple “lenses” of attention, each searching for a different type of structure in the same sentence?
That’s the idea behind multi-head attention. I draw two columns on the board and say:
“Imagine this:
Head 1 focuses on subject–verb links: cat ↔ sat
Head 2 tracks noun–location links: cat ↔ mat or the ↔ mat
Each head lives in its own relational subspace.”
So, we want to allow for the opportunity to capture different relationships. We realize that we want several goes at this, several potential embeddings for a token, learned in parallel, how efficient! But how do we combine them? We want one embedding vector per token. What do we do when we do not know what to do? Yeah, just concatenate them. And so the equation followed on the board:
MultiHead(Q, K, V) = Concat(head₁, …, headₕ) WO
where each headᵢ uses its own projection matrices WQ, WK, and WV, and WO is the output projection that mixes the concatenated heads back into a single vector.
That’s the math. That’s what multi-head attention does. Each head captures a distinct relational pattern, and their outputs are concatenated and transformed together.
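A sketch of the same idea in numpy, hedged the same way as before (toy sizes, random weights), just to show the split, attend, concatenate, project rhythm:

```python
# Multi-head attention: several heads in parallel, concatenated, then projected by WO.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 8, 2
d_head = d_model // n_heads
x = rng.normal(size=(n_tokens, d_model))

def one_head(x):
    WQ, WK, WV = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    Q, K, V = x @ WQ, x @ WK, x @ WV
    s = Q @ K.T / np.sqrt(d_head)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                   # this head's view of each token

heads = [one_head(x) for _ in range(n_heads)]      # each head: its own relational subspace
WO = rng.normal(size=(d_model, d_model))
out = np.concatenate(heads, axis=-1) @ WO          # concatenate, then mix into one vector per token
print(out.shape)                                   # (5, 8)
```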
And now comes the moment we can finally return to the architecture. I point back to that transformer diagram we saw on day one, the one that looked impossibly dense, and to those little color-coded rectangles labeled Head 1, Head 2, Head 3. “Now,” I say, “this is design.” Each head has a job. Each head tells part of the story. Together, they let the model build a representation rich enough to handle the true complexity of language.
From Layers to Blocks to the Thing Itself
Now, enough math. Now, architecture. We are ready to peel back the layers. One transformer layer = what we have done so far: one round of attention plus refinement. One transformer block = stacked layers. Why stack? To give the model more chances to enrich and refine its representations. These days? Mostly, a block = one layer. So, what is a transformer?
A Transformer is one or more blocks (usually MANY blocks)
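To make "blocks stacked on blocks" concrete, here is a schematic sketch; real models also add residual connections and normalization, which this toy version skips:

```python
# A transformer = blocks stacked on blocks; each block = attention + refinement.
# Schematic only: residual connections and normalization are omitted.
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

def attention(x):
    WQ, WK, WV = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    s = (x @ WQ) @ (x @ WK).T / np.sqrt(d_model)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ (x @ WV)

def feed_forward(x):
    W1 = rng.normal(size=(d_model, 4 * d_model))
    W2 = rng.normal(size=(4 * d_model, d_model))
    return np.maximum(x @ W1, 0) @ W2              # a small per-token network that refines each vector

def transformer(x, n_blocks):
    for _ in range(n_blocks):                      # usually MANY blocks
        x = feed_forward(attention(x))             # one block: one round of attention + refinement
    return x

x = rng.normal(size=(5, d_model))                  # five toy tokens
print(transformer(x, n_blocks=4).shape)            # still (5, 8), just richer embeddings
```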
So, we go back to the diagram that started it all. And I ask them, “What do we not understand yet?” Some say, “Encoder, decoder, those we have not mentioned yet, right?” Right. What are these things?
So we turn back to the diagram on the board.
Encoder, Decoder: Both, One?
The encoder, the part of the transformer that understands. It reads the entire input sequence at once, allowing every token to attend to every other token. Complete visibility. I tell them: “If you’re translating English to French, the encoder’s job is to understand the English.” It takes the input text and turns it into a set of contextual embeddings, those meaning-rich vectors that capture the essence of the sentence. The encoder’s job is not to generate words but to build those embeddings, to build that understanding.
The decoder is the part that speaks, but it can’t see the entire future. It must write word by word, left to right. So we introduce a mask: each token can only attend to the tokens before it. That’s why we call it causal attention; the model can only look back, never forward.
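A tiny sketch of what that mask looks like in practice, with toy scores (all zeros) just to show the pattern:

```python
# Causal attention mask: each token may attend to itself and the past, never the future.
import numpy as np

n_tokens = 5
scores = np.zeros((n_tokens, n_tokens))            # stand-in for query-key scores
future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[future] = -np.inf                           # future positions get minus infinity...
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # ...so softmax assigns them 0% weight
print(np.round(weights, 2))                        # lower-triangular: only the past counts
```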
Now, here’s where things get interesting. In an encoder–decoder model, like the original Transformer introduced in that famous Attention paper, or later models such as T5, the decoder also gets to look across to the outputs of the encoder. This is called cross-attention. It lets the decoder align what it’s generating with what the encoder understands. You can think of it as a conversation between the two: one side comprehends, the other side communicates.
The encoder reads The cat sat on the mat. The decoder says Le chat s’est assis sur le tapis. And with each new French word it writes, it double-checks with the encoder: am I still faithful to the meaning?
Then I fast-forward to today’s models. GPT, Claude, Gemini, these are decoder-only transformers. Why? Because they’re not translating; they’re predicting the next token in a sequence. Every new word becomes part of the past for the next prediction. So, instead of an encoder–decoder dialogue, we have a single self-talking decoder that reads and writes in one continuous loop.
The architecture that looked so forbidding now feels like a story: one side that understands, one side that speaks, and in modern models one that does both. They nod.
Now there’s only one question left to answer: How does the model actually produce language? How do those embeddings, now so rich and refined, turn into words we can read? Great question.
Predictive, Generative? How to Generate.
We start with what we already know: by the time we reach the final layer, each token’s vector is no longer static. It is a distilled, context-rich embedding, meaning encoded as numbers. Each of those vectors then goes through one last linear transformation: it is multiplied by a weight matrix (often tied to the embedding matrix we started with) to produce a vector of scores, one per word in the model’s vocabulary.
Those scores are called logits. Why logit? To many students, a strange word. Well, I tell them, the name comes from “log-odds”: the logarithm of a probability ratio. A higher logit means higher confidence, but the values themselves don’t yet sum to one. Only after applying softmax do those logits become true probabilities, a distribution the model can sample from to decide what word comes next. But how do we sample? Great question.
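A minimal sketch of that last step, with a toy vocabulary and made-up numbers: from final embedding to logits to a probability distribution.

```python
# From the final contextual embedding to logits to probabilities.
# Toy vocabulary and made-up numbers, purely for illustration.
import numpy as np

vocab = ["the", "cat", "sat", "mat", "dog"]
h = np.array([0.2, -1.0, 0.5, 1.3])                           # final embedding for the current position
W_unembed = np.random.default_rng(0).normal(size=(4, len(vocab)))

logits = h @ W_unembed                                        # one raw score per vocabulary entry
probs = np.exp(logits - logits.max())
probs /= probs.sum()                                          # softmax: now they sum to one
for token, p in zip(vocab, probs):
    print(f"{token:>4}: {p:.2f}")
```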
Each number now represents the model’s belief about which token is most likely to come next. And here comes the choice. “This is where language happens.”
The model can take the highest-probability token, that’s the argmax strategy, or it can sample from the distribution, introducing variability and creativity. We can control that creativity with temperature (a lower temperature makes the model conservative, a higher one makes it adventurous), or we can limit the sampling pool with top-k or nucleus (top-p) sampling. A student asks: what is top-k, and what is top-p? Oh, I tell them. Computer scientists love to do this. You know, when you have scores for things, and you rank them from high to low. Do you simply keep the top 5 candidates and sample among them? That is top-k, with k being 5. Or do you say, no, I will keep adding candidates from the top until their probabilities add up to, say, 0.9, and sample only from that set? That is top-p, with p being 0.9.
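A sketch of those knobs in one small function, with toy logits: temperature rescales the scores, top-k keeps only the k highest-ranked tokens, and top-p keeps the smallest set that covers p of the probability mass.

```python
# Sampling knobs: temperature, top-k, and nucleus (top-p). Toy logits only.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "mat", "dog"]
logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    scaled = logits / temperature                  # low temperature: conservative; high: adventurous
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                # rank tokens from high to low
    if top_k is not None:
        keep = order[:top_k]                       # top-k: keep only the k highest-ranked tokens
    elif top_p is not None:
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]   # top-p: smallest set covering p of the mass
    else:
        keep = order
    p = probs[keep] / probs[keep].sum()            # renormalize, then sample from the pool
    return vocab[rng.choice(keep, p=p)]

print(sample(logits, temperature=0.7, top_k=3))
print(sample(logits, temperature=1.2, top_p=0.9))
```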
So, back to the loop: whatever token is chosen becomes the next input. And so the model repeats the process: predict, choose, append, predict again. Token by token, it generates a sentence, a paragraph, a poem, a story. The prompt grows into a sentence, then into paragraphs. Those are ingested together, combined with your next prompt, and give birth to new paragraphs. Yes, they got it. We stitched it together, little by little, patiently, but with the prize in mind. We built the tower.
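Schematically, the whole loop fits in a few lines; here a stand-in function replaces the real transformer forward pass, just to show the rhythm.

```python
# Predict, choose, append, predict again. A stand-in replaces the real decoder.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]

def next_token_probs(tokens):
    p = rng.random(len(vocab))      # a real model would run the full decoder here
    return p / p.sum()

tokens = ["the", "cat"]             # the prompt
for _ in range(4):
    probs = next_token_probs(tokens)                        # predict
    tokens.append(vocab[rng.choice(len(vocab), p=probs)])   # choose and append
print(" ".join(tokens))
```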
So, now the anchor for what comes next. First, I tell them I will disappoint them. We did all this work, but, no, when we interact with ChatGPT, we are not interacting with the trained decoder. That is hidden from us, hiding in its own tower. We will talk about rewards and policies. The prize? We will see reinforcement learning in action. The bonus? We will understand sycophancy and all kinds of idiosyncrasies and seeming personalities.
And then the next big hook. Hallucinations. It was all probabilities, no? There were choices. Some say that is creativity. Others say it is hallucination. It is not lying. It is not malice. It is just statistics. How do we bound it, take it captive for a purpose? Well, we are not done yet. Several weeks are left in this course. Hopefully, they will be as exciting for the students as they are for me.
The architecture that looked like a mystery at the start is now a system of logic, transparent, layered, elegant. That’s the rhythm of AI4All: start with wonder, stay with rigor, end with understanding.
What are the students doing now? Well, they just presented on their “field” experiments on whether GPT-5, Grok, Claude, and Perplexity can reason. They designed reasoning problems and collected interactions and statistics. I am so looking forward to sharing some of their findings with you.
Missed our other posts tracking the course? You can find them here: