The Road to LLMs

From a Single Neuron to a Global Brain

Singapore Technology Week 2025

About the Talk

I'll be talking about the evolution of language models

  • The Age of Counting: We'll start with the simple but powerful idea of statistical prediction (N-grams).

  • The Age of Meaning: We'll witness the breakthrough of giving words meaning and memory (Embeddings & RNNs).

  • The Age of Attention: We'll explore the paradigm shift that allowed models to see the whole picture at once (Transformers).

  • The Age of Scale: Finally, we'll see how pushing these ideas to their limits created the LLMs we know today.
"I flew from London to Singapore last week. 
I'm really enjoying my stay in ______."

How did we go from a simple autocomplete...

Bad autocomplete

...to machines that seem to understand context, meaning, and even reasoning?

ChatGPT Completion

I. The Foundation Era: Statistical NLP

1950s-2000s

N-gram Models: The Counting Approach

Core principle based on frequency counting:

P(word | previous n-1 words)
# Simple N-gram (bigram) counting
from collections import Counter
data = ["the cat sat on the mat", "the dog sat on the rug"]
bigrams = Counter(pair for s in data for pair in zip(s.split(), s.split()[1:]))
# bigrams.most_common() -> [(('sat', 'on'), 2), (('on', 'the'), 2), (('the', 'cat'), 1), ...]
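
To make P(word | previous words) concrete, here is a minimal sketch that builds on the counts above (the toy corpus is repeated so it runs on its own; names like `predict` are illustrative): each bigram count is normalized by how often its context word appears, and the most probable continuation wins.

from collections import Counter, defaultdict

data = ["the cat sat on the mat", "the dog sat on the rug"]

# Count bigrams and how often each context word appears
bigrams, contexts = Counter(), Counter()
for sentence in data:
    words = sentence.split()
    contexts.update(words[:-1])
    bigrams.update(zip(words, words[1:]))

# P(next | prev) = count(prev, next) / count(prev)
probs = defaultdict(dict)
for (prev, nxt), count in bigrams.items():
    probs[prev][nxt] = count / contexts[prev]

def predict(prev):
    """Most likely next word after `prev` under the bigram model."""
    return max(probs[prev], key=probs[prev].get)

print(probs["sat"])     # {'on': 1.0}
print(predict("on"))    # 'the'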

N-gram Strengths & Fatal Flaws

  • Strengths: Simple, interpretable, computationally efficient.
  • Limitations: Captures only surface statistical patterns; no semantic understanding.

The 'Singapore' Problem

"I flew from London to Singapore last week. 
I'm really enjoying my stay in ______."

-> Paris
-> London
-> New York

What if words could have meaning, not just frequency?

II. The Neural Revolution: Learning Representations

The shift from counting to learning.

Word Embeddings: The Semantic Breakthrough

  • Words as vectors in semantic space.

  • Each word becomes a dense vector of real numbers.

  • Key insight: Words with similar meanings have similar vector positions.

What Does an N-Dimensional Vector Look Like?

# Example: 300-dimensional word vector for "queen"
queen_vector = [
    0.2341,  -0.1829,   0.4521,  -0.7234,   0.1892,  # dims 1-5
   -0.3456,   0.6789,  -0.2134,   0.8901,  -0.4567,  # dims 6-10
    0.1234,  -0.5678,   0.9012,  -0.3456,   0.7890,  # dims 11-15
    ...                                              # ...
   -0.2109,   0.5432,  -0.8765,   0.1098,  -0.6543   # dims 296-300
]
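
A quick way to see "similar meanings, similar positions" in code is cosine similarity between vectors. The four-dimensional vectors below are made up for illustration, not real embeddings.

import numpy as np

# Toy 4-dimensional "embeddings" (illustrative values, not trained vectors)
queen = np.array([0.2, 0.8, 0.3, 0.5])
king  = np.array([0.3, 0.7, 0.2, 0.6])
mat   = np.array([-0.6, 0.1, 0.9, -0.4])

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction, 0 = unrelated, -1 = opposite."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(queen, king))  # high: close together in the semantic space
print(cosine(queen, mat))   # much lower: unrelated words sit far apart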

Visualizing the "Queen" Vector



queen = [0.1, 0.8, 0.3, ..., 0.5, 0.4]

(Individual dimensions loosely correspond to concepts such as "royalty" and "woman".)

Word Vectors Capture Semantic Relationships

The famous Word2Vec example:
vector("king") - vector("man") + vector("woman") ≈ vector("queen")

The Context Problem

Word2Vec limitation: Each word has ONE fixed vector.

  • "The bank is next to the river."

  • "I need to go to the bank to deposit a check."

The word "bank" has the same vector in both sentences!

The realization: Meaning depends on context.

If every word has a fixed vector regardless of context,
how can we build a representation that adapts to the words around it?

RNNs Sequential Context Modeling

Core innovation: Process sentences word-by-word, building a "hidden state" or running summary of everything seen so far.

Breakthrough: The same word gets different representations in different contexts!


h₁ = RNN(x₁, h₀)    → “I”  
h₂ = RNN(x₂, h₁)    → “flew”  
h₃ = RNN(x₃, h₂)    → “from”  
...
h₁₅ = RNN(x₁₅, h₁₄) → “here”                
                    

At each time step t, the RNN receives:

  • The current word vector xₜ
  • The hidden state from the previous step hₜ₋₁

    And it outputs:
  • A new hidden state hₜ that captures all context up to and including this word.


Word:       I → flew → from → London → ... → here
Embedding:  v1   v2     v3     v4     ...    v15
Hidden:     h1   h2     h3     h4     ...    h15
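
A minimal numpy sketch of this update rule (weights are randomly initialized here purely to show the shape of the computation; a real model learns them): the new hidden state mixes the current word vector with the previous hidden state.

import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 8, 16          # illustrative sizes

# Randomly initialized parameters
W_x = rng.normal(size=(hidden_dim, embed_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b   = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    """h_t = tanh(W_x @ x_t + W_h @ h_prev + b)"""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Process a sentence word by word, carrying the hidden state forward
sentence = [rng.normal(size=embed_dim) for _ in range(5)]  # stand-in word vectors
h = np.zeros(hidden_dim)                                   # h0
for x in sentence:
    h = rnn_step(x, h)   # h now summarizes everything seen so far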

                

Controlling Creativity

Models don't always pick the single most likely next word; instead, they can sample from a probability distribution over possible next words.
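
A small sketch of what sampling looks like in practice. The temperature knob below is the usual way to trade off between safe, predictable completions and more varied ones; the candidate words and scores are made up for illustration.

import numpy as np

rng = np.random.default_rng(42)

# Illustrative model scores (logits) for candidate next words
words  = ["Singapore", "London", "hotel", "Paris"]
logits = np.array([4.0, 1.5, 1.0, 0.5])

def sample(logits, temperature=1.0):
    """Softmax over logits / temperature, then draw one word at random."""
    z = logits / temperature
    probs = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    return rng.choice(words, p=probs)

print(sample(logits, temperature=0.5))   # sharper distribution: almost always "Singapore"
print(sample(logits, temperature=1.5))   # flatter distribution: other words appear more often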

With context captured in the hidden state, the model starts to understand the sentence, at least a little:

"I flew from London to Singapore last week. 
I'm really enjoying my stay in ______."
    Likely top predictions:
  • ✅ “Singapore” — based on prior sentence.
  • ❌ “London” — possible if model misunderstood context.
  • ❌ “hotel” — grammatically correct, but less likely.

RNN Limitations

  • Sequential bottleneck: Must process left-to-right. No parallelization.
  • Information decay: Earlier context gets "forgotten" in long sequences.

Can we capture long-distance context efficiently, without sequential processing?

"Instead of relying only on the hidden state, let's allow the model to 'look back' and pay direct 'attention' to the most relevant input words at each step of the output."

The Paradigm Shift (2017)

Attention

III. The Transformer Revolution

(Self)Attention Is All You Need

Self-Attention

    For each word position i, compute:
  • Query (Q): "What am I looking for?"
  • Key (K): "What do I represent?"
  • Value (V): "What information do I carry?"
    Self-attention enables:
  • Parallel processing: All positions computed simultaneously.
  • No fixed context window: Attention to the entire sequence.
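
A compact numpy sketch of scaled dot-product self-attention: each position's query is compared against every key, softmax turns the scores into weights, and the output is a weighted mix of the values. The projection matrices are random stand-ins, purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 6, 16                   # 6 tokens, illustrative model size

X = rng.normal(size=(seq_len, d_model))    # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d_model)        # how much each token attends to every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row

output = weights @ V                       # context-aware representation for every position
print(weights.shape, output.shape)         # (6, 6) (6, 16)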

If transformers process in parallel, how do they know the order of words?


Add special "position vectors" to each word


| Token  | Word vector | Positional vector | Combined vector |
| ------ | ----------- | ----------------- | --------------- |
| "I"    | v_I         | p_1               | v_I + p_1       |
| "flew" | v_flew      | p_2               | v_flew + p_2    |
| "from" | v_from      | p_3               | v_from + p_3    |

Transformer Analysis: Context Understood!


                "I flew from London to Singapore last week. 
                I'm really enjoying my stay in ______."
                

word "stay" pays strong attention to Singapore:


Self-Attention Heatmap for "stay" (illustrative weights):
  "I":         0.01   # Subject (low relevance)
  "flew":      0.03   # Travel context
  "from":      0.01   # Direction indicator
  "London":    0.02   # Origin (not the current location)
  "to":        0.04   # Direction toward the destination
  "Singapore": 0.85   # 🎯 The destination = current location!
  "last":      0.02   # Time context
  "week":      0.02   # Time context

Result: ✅ "Singapore" (with high confidence!)

The Attention Era

✅ Direct, dynamic connections between *any* words, processed in parallel.

✅ We can now model true long-range dependencies and contextual meaning.

Key Question

Can we simplify the Transformer architecture to generate text efficiently by focusing only on the generation task?

Transformer Architecture Challenge for Text Generation

  • Original Transformer (2017): Built for translation
  • Encoder: Understands input sentence
  • Decoder: Generates output sentence

  • What if we just used the decoder part and made it really, really good?

GPT’s Decoder-Only Approach

The Insight: For text generation, decoder-only is enough!

  • Simpler architecture = easier to scale
  • All compute focused on generation task
  • Masked self-attention prevents "cheating" (seeing future words)
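
A sketch of the "no cheating" mask, building on the self-attention sketch above (matrices are random stand-ins): before the softmax, positions that lie in the future get a score of -inf, so each token can only attend to itself and to earlier tokens.

import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
Q, K, V = (rng.normal(size=(seq_len, d_model)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d_model)

# Causal mask: position i may only look at positions <= i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

print(np.round(weights, 2))   # upper triangle is all zeros: no attention to future tokens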

IV. The Scaling Revolution

The Scaling Laws Discovery

More parameters + more data + more compute = better capabilities.
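
The relationship is surprisingly regular: empirically, test loss falls off roughly as a power law in parameter count. A toy sketch is below; the constants loosely follow the published scaling-law fits (Kaplan et al., 2020) and are for illustration only.

def loss(n_params, n_c=8.8e13, alpha=0.076):
    """Toy power-law loss curve, L(N) ~ (N_c / N)^alpha (illustrative constants)."""
    return (n_c / n_params) ** alpha

for n in [117e6, 1.5e9, 175e9]:   # GPT-1, GPT-2, GPT-3 parameter counts
    print(f"{n:>15,.0f} params -> loss ≈ {loss(n):.2f}")   # loss shrinks smoothly as N grows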

This led to Emergent Abilities—capabilities that appear suddenly at scale.

  • In-context learning
  • Chain-of-thought reasoning
  • Code generation

Model Scale Over Time

| Model | Parameters  | Year |
| ----- | ----------- | ---- |
| GPT-1 | 117 million | 2018 |
| GPT-2 | 1.5 billion | 2019 |
| GPT-3 | 175 billion | 2020 |

Final Takeaways

  • Vector Embeddings: Capture semantic meaning of words and concepts.

  • Attention Mechanisms: Enable models to focus on relevant parts of input for building contextual understanding.

  • Scale: With increased data and parameters, models started to generalize the structure of human knowledge, becoming a "global brain" that can reason, code, and create.

Thank You


If you want to reach out or get a link to the slides, you can scan this QR code.