The Road to LLMs

From a Single Neuron to a Global Brain

Singapore Technology Week 2025

About the Talk

I'll be talking about the evolution of language models

  • The Age of Counting: We'll start with the simple but powerful idea of statistical prediction (N-grams).

  • The Age of Meaning: We'll witness the breakthrough of giving words meaning and memory (Embeddings & RNNs).

  • The Age of Attention: We'll explore the paradigm shift that allowed models to see the whole picture at once (Transformers).

  • The Age of Scale: Finally, we'll see how pushing these ideas to their limits created the LLMs we know today.
"I flew from London to Singapore last week. 
I'm really enjoying my stay in ______."

How did we go from a simple autocomplete...

Bad autocomplete

...to machines that seem to understand context, meaning, and even reasoning?

ChatGPT Completion

I. The Foundation Era: Statistical NLP

1950s-2000s

N-gram Models: The Counting Approach

Core principle based on frequency counting:

P(word | previous n-1 words)
# Simple N-gram (bigram) counting
from collections import Counter
data = ["the cat sat on the mat", "the dog sat on the rug"]
bigrams = Counter(pair for s in data for pair in zip(s.split(), s.split()[1:]))
# bigrams.most_common() -> [(('sat', 'on'), 2), (('on', 'the'), 2), (('the', 'cat'), 1), ...]
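
To make P(word | previous words) concrete, here is a minimal sketch that builds on the counts above (the toy corpus is repeated so it runs on its own; names like `predict` are illustrative): each bigram count is normalized by how often its context word appears, and the most probable continuation wins.

from collections import Counter, defaultdict

data = ["the cat sat on the mat", "the dog sat on the rug"]

# Count bigrams and how often each context word appears
bigrams, contexts = Counter(), Counter()
for sentence in data:
    words = sentence.split()
    contexts.update(words[:-1])
    bigrams.update(zip(words, words[1:]))

# P(next | prev) = count(prev, next) / count(prev)
probs = defaultdict(dict)
for (prev, nxt), count in bigrams.items():
    probs[prev][nxt] = count / contexts[prev]

def predict(prev):
    """Most likely next word after `prev` under the bigram model."""
    return max(probs[prev], key=probs[prev].get)

print(probs["sat"])     # {'on': 1.0}
print(predict("on"))    # 'the'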

N-gram Strengths & Fatal Flaws

  • Strengths: Simple, interpretable, computationally efficient.
  • Limitations: Captures only surface statistical patterns; no semantic understanding.

The 'Singapore' Problem

"I flew from London to Singapore last week. 
I'm really enjoying my stay in ______."

-> Paris
-> London
-> New York

What if words could have meaning, not just frequency?

II. The Neural Revolution: Learning Representations

The shift from counting to learning.

Word Embeddings: The Semantic Breakthrough

  • Words as vectors in semantic space.

  • Each word becomes a dense vector of real numbers.

  • Key insight: Words with similar meanings have similar vector positions.

What Does an N-Dimensional Vector Look Like?

# Example: 300-dimensional word vector for "queen"
queen_vector = [
    0.2341,  -0.1829,   0.4521,  -0.7234,   0.1892,  # dims 1-5
   -0.3456,   0.6789,  -0.2134,   0.8901,  -0.4567,  # dims 6-10
    0.1234,  -0.5678,   0.9012,  -0.3456,   0.7890,  # dims 11-15
    ...                                              # ...
   -0.2109,   0.5432,  -0.8765,   0.1098,  -0.6543   # dims 296-300
]
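
A quick way to see "similar meanings, similar positions" in code is cosine similarity between vectors. The four-dimensional vectors below are made up for illustration, not real embeddings.

import numpy as np

# Toy 4-dimensional "embeddings" (illustrative values, not trained vectors)
queen = np.array([0.2, 0.8, 0.3, 0.5])
king  = np.array([0.3, 0.7, 0.2, 0.6])
mat   = np.array([-0.6, 0.1, 0.9, -0.4])

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0 = unrelated, -1 = opposite."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(queen, king))  # high: close together in the semantic space
print(cosine(queen, mat))   # much lower: unrelated words sit far apart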

Visualizing the "Queen" Vector



queen = [0.1, 0.8, 0.3, ..., 0.5, 0.4]

(Individual dimensions loosely correspond to concepts such as "royalty" and "woman".)

Word Vectors Capture Semantic Relationships

The famous Word2Vec example:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")

The Context Problem

Word2Vec limitation: Each word has ONE fixed vector.

  • "The bank is next to the river."

  • "I need to go to the bank to deposit a check."

The word "bank" has the same vector in both sentences!

The realization: Meaning depends on context.

If every word has a fixed vector regardless of context,
how can we build a representation that adapts to the words around it?

RNNs Sequential Context Modeling

Core innovation: Process sentences word-by-word, building a "hidden state" or running summary of everything seen so far.

Breakthrough: The same word gets different representations in different contexts!


h₁ = RNN(x₁, h₀)    → “I”  
h₂ = RNN(x₂, h₁)    → “flew”  
h₃ = RNN(x₃, h₂)    → “from”  
...
h₁₅ = RNN(x₁₅, h₁₄) → “here”                
                    

At each time step t, the RNN receives:

  • The current word vector xₜ
  • The hidden state from the previous step hₜ₋₁

    And it outputs:
  • A new hidden state hₜ that captures all context up to and including this word.


Word:       I → flew → from → London → ... → here
Embedding:  v1   v2     v3     v4     ...    v15
Hidden:     h1   h2     h3     h4     ...    h15
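
A minimal numpy sketch of this update rule (weights are randomly initialized here purely to show the shape of the computation; a real model learns them): the new hidden state mixes the current word vector with the previous hidden state.

import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 16          # illustrative sizes

# Randomly initialized parameters
W_x = rng.normal(size=(hidden_dim, embed_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b   = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """h_t = tanh(W_x @ x_t + W_h @ h_prev + b)"""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Process a sentence word by word, carrying the hidden state forward
sentence = [rng.normal(size=embed_dim) for _ in range(5)]  # stand-in word vectors
h = np.zeros(hidden_dim)                                   # h0
for x in sentence:
    h = rnn_step(x, h)   # h now summarizes everything seen so far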

                

Controlling Creativity

Models don't always pick the single most likely next word; instead, they can sample from a probability distribution over possible next words.
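
A small sketch of what sampling looks like in practice. The temperature knob below is the usual way to trade off between safe, predictable completions and more varied ones; the candidate words and scores are made up for illustration.

import numpy as np

rng = np.random.default_rng(42)

# Illustrative model scores (logits) for candidate next words
words  = ["Singapore", "London", "hotel", "Paris"]
logits = np.array([4.0, 1.5, 1.0, 0.5])

def sample(logits, temperature=1.0):
    """Softmax over logits / temperature, then draw one word at random."""
    z = logits / temperature
    probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    return rng.choice(words, p=probs)

print(sample(logits, temperature=0.5))   # sharper distribution: almost always "Singapore"
print(sample(logits, temperature=1.5))   # flatter distribution: other words appear more often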

With context captured in the hidden state, the model starts to understand the sentence, at least a little:

"I flew from London to Singapore last week. 
I'm really enjoying my stay in ______."
    Likely top predictions:
  • ✅ “Singapore” — based on prior sentence.
  • ❌ “London” — possible if model misunderstood context.
  • ❌ “hotel” — grammatically correct, but less likely.

RNN Limitations

  • Sequential bottleneck: Must process left-to-right. No parallelization.
  • Information decay: Earlier context gets "forgotten" in long sequences.

Can we capture long-distance context efficiently, without sequential processing?

"Instead of relying only on the hidden state, let's allow the model to 'look back' and pay direct 'attention' to the most relevant input words at each step of the output."

The Paradigm Shift (2017)

Attention

III. The Transformer Revolution

(Self)Attention Is All You Need

Self-Attention

    For each word position i, compute:
  • Query (Q): "What am I looking for?"
  • Key (K): "What do I represent?"
  • Value (V): "What information do I carry?"
    Self-attention enables:
  • Parallel processing: All positions computed simultaneously.
  • No fixed context window: Attention to the entire sequence.
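
A compact numpy sketch of scaled dot-product self-attention: each position's query is compared against every key, softmax turns the scores into weights, and the output is a weighted mix of the values. The projection matrices are random stand-ins, purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16                   # 6 tokens, illustrative model size

X = rng.normal(size=(seq_len, d_model))    # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)        # how much each token attends to every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row

output = weights @ V                       # context-aware representation for every position
print(weights.shape, output.shape)         # (6, 6) (6, 16)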

If transformers process in parallel, how do they know the order of words?


Add special "position vectors" to each word


| Token  | Word vector | Positional vector | Combined vector |
| ------ | ----------- | ----------------- | --------------- |
| "I"    | v_I         | p_1               | v_I + p_1       |
| "flew" | v_flew      | p_2               | v_flew + p_2    |
| "from" | v_from      | p_3               | v_from + p_3    |

Transformer Analysis: Context Understood!


                "I flew from London to Singapore last week. 
                I'm really enjoying my stay in ______."
                

word "stay" pays strong attention to Singapore:


Self-Attention Heatmap for "stay" (illustrative weights):
  "I":         0.01   # Subject (low relevance)
  "flew":      0.03   # Travel context
  "from":      0.01   # Direction indicator
  "London":    0.02   # Origin (not the current location)
  "to":        0.04   # Direction toward the destination
  "Singapore": 0.85   # 🎯 The destination = current location!
  "last":      0.02   # Time context
  "week":      0.02   # Time context

Result: ✅ "Singapore" (with high confidence!)

The Attention Era

✅ Direct, dynamic connections between *any* words, processed in parallel.

✅ We can now model true long-range dependencies and contextual meaning.

Key Question

Can we simplify the Transformer architecture to generate text efficiently by focusing only on the generation task?

Transformer Architecture Challenge for Text Generation

  • Original Transformer (2017): Built for translation
  • Encoder: Understands input sentence
  • Decoder: Generates output sentence

  • What if we just used the decoder part and made it really, really good?

GPT’s Decoder-Only Approach

The Insight: For text generation, decoder-only is enough!

  • Simpler architecture = easier to scale
  • All compute focused on generation task
  • Masked self-attention prevents "cheating" (seeing future words)
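
A sketch of the "no cheating" mask, building on the self-attention sketch above (matrices are random stand-ins): before the softmax, positions that lie in the future get a score of -inf, so each token can only attend to itself and to earlier tokens.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d_model)

# Causal mask: position i may only look at positions <= i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))   # upper triangle is all zeros: no attention to future tokens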

IV. The Scaling Revolution

The Scaling Laws Discovery

More parameters + more data + more compute = better capabilities.
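
The relationship is surprisingly regular: empirically, test loss falls off roughly as a power law in parameter count. A toy sketch is below; the constants loosely follow the published scaling-law fits (Kaplan et al., 2020) and are for illustration only.

def loss(n_params, n_c=8.8e13, alpha=0.076):
    """Toy power-law loss curve, L(N) ~ (N_c / N)^alpha (illustrative constants)."""
    return (n_c / n_params) ** alpha

for n in [117e6, 1.5e9, 175e9]:   # GPT-1, GPT-2, GPT-3 parameter counts
    print(f"{n:>15,.0f} params -> loss ≈ {loss(n):.2f}")   # loss shrinks smoothly as N grows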

This led to Emergent Abilities—capabilities that appear suddenly at scale.

  • In-context learning
  • Chain-of-thought reasoning
  • Code generation

Model Scale Over Time

| Model | Parameters  | Year |
| ----- | ----------- | ---- |
| GPT-1 | 117 million | 2018 |
| GPT-2 | 1.5 billion | 2019 |
| GPT-3 | 175 billion | 2020 |

Final Takeaways

  • Vector Embeddings: Capture semantic meaning of words and concepts.

  • Attention Mechanisms: Enable models to focus on relevant parts of input for building contextual understanding.

  • Scale: With increased data and parameters, models started to generalize the structure of human knowledge, becoming a "global brain" that can reason, code, and create.

Thank You


If you want to reach out or get a link to the slides, you can scan this QR code.