Singapore Technology Week 2025
About the Talk
I'll be talking about the evolution of language models, using one running example throughout:

"I flew from London to Singapore last week.
I'm really enjoying my stay in ______."
N-gram models: 1950s-2000s

Core principle: predict the next word purely from frequency counts:
P(word | previous n-1 words)
# Simple N-gram (bigram) counting
from collections import Counter
data = ["the cat sat on the mat", "the dog sat on the rug"]
counts = Counter(pair for s in data for pair in zip(s.split(), s.split()[1:]))
# Counts: {('the', 'cat'): 1, ('cat', 'sat'): 1, ('sat', 'on'): 1, ...}
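To turn those counts into the conditional probability above, here is a minimal sketch; `bigram_prob` is a hypothetical helper name, and it assumes the `counts` Counter from the snippet above.

```python
from collections import Counter

def bigram_prob(counts: Counter, prev_word: str, next_word: str) -> float:
    # P(next_word | prev_word) = count(prev_word, next_word) / count(prev_word, *)
    total = sum(c for (w1, _), c in counts.items() if w1 == prev_word)
    return counts[(prev_word, next_word)] / total if total else 0.0

# e.g. bigram_prob(counts, "the", "cat") -> 0.25
#      ("the" is followed once each by "cat", "mat", "dog", "rug")
```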
"I flew from London to Singapore last week.
I'm really enjoying my stay in ______."
An n-gram model only looks at the last few words ("my stay in ___"), so it suggests the most frequent continuations from its training data:

→ Paris
→ London
→ New York

It cannot use "Singapore" from earlier in the passage.
What if words could have meaning, not just frequency?
The shift from counting to learning.
Words as vectors in semantic space.
Each word becomes a dense vector of real numbers.
Key insight: Words with similar meanings have similar vector positions.
# Example: 300-dimensional word vector for "queen"
queen_vector = [
    0.2341, -0.1829, 0.4521, -0.7234, 0.1892,   # dims 1-5
    -0.3456, 0.6789, -0.2134, 0.8901, -0.4567,  # dims 6-10
    0.1234, -0.5678, 0.9012, -0.3456, 0.7890,   # dims 11-15
    # ... dims 16-295 omitted ...
    -0.2109, 0.5432, -0.8765, 0.1098, -0.6543,  # dims 296-300
]
vector("king") - vector("man") + vector("woman") ≈ vector("queen")
Word2Vec limitation: Each word has ONE fixed vector.
"The bank is next to the river."
"I need to go to the bank to deposit a check."
The word "bank" has the same vector in both sentences !
Core innovation: Process sentences word-by-word, building a "hidden state" or running summary of everything seen so far.
Breakthrough: The same word gets different representations in different contexts!
h₁ = RNN(x₁, h₀) → “I”
h₂ = RNN(x₂, h₁) → “flew”
h₃ = RNN(x₃, h₂) → “from”
...
h₁₅ = RNN(x₁₅, h₁₄) → “here”
At each time step t, the RNN receives the current word's embedding xₜ and the previous hidden state hₜ₋₁, and produces a new hidden state hₜ that captures all context up to and including this word (sketched in code after the table below).
| Step      | 1  | 2    | 3    | 4      | ... | 15   |
| --------- | -- | ---- | ---- | ------ | --- | ---- |
| Word      | I  | flew | from | London | ... | here |
| Embedding | v₁ | v₂   | v₃   | v₄     | ... | v₁₅  |
| Hidden    | h₁ | h₂   | h₃   | h₄     | ... | h₁₅  |
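A minimal numpy sketch of a single vanilla RNN step; the weights are random placeholders (a trained model would learn them), and the update rule shown is the classic hₜ = tanh(W_xh·xₜ + W_hh·hₜ₋₁ + b):

```python
import numpy as np

emb_dim, hidden_dim = 8, 16
rng = np.random.default_rng(0)

# Placeholder parameters; a trained RNN would learn these
W_xh = rng.normal(size=(hidden_dim, emb_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b_h = np.zeros(hidden_dim)

def rnn_step(x_t, h_prev):
    # New hidden state mixes the current word with the running summary
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_dim)                    # h₀
for x_t in rng.normal(size=(15, emb_dim)):  # stand-ins for v₁ ... v₁₅
    h = rnn_step(x_t, h)                    # h now summarizes everything seen so far
```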
Models don't always pick the single most likely next word—they can sample from a distribution of possible words.
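As a small illustration of that sampling step, here is a sketch that assumes the model has produced raw scores (logits) for a few candidate words; `temperature` is a common knob that controls how adventurous the sampling is:

```python
import numpy as np

def sample_next_word(words, logits, temperature=1.0, seed=None):
    # Softmax over logits; lower temperature -> closer to always picking the top word
    scaled = np.array(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return np.random.default_rng(seed).choice(words, p=probs)

# Hypothetical scores for the blank in the running example
print(sample_next_word(["Singapore", "Paris", "London"], [3.0, 1.0, 0.5]))
```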
With context captured, the model starts to understand the sentence, at least a little:
"I flew from London to Singapore last week.
I'm really enjoying my stay in ______."
If transformers process in parallel, how do they know the order of words?
Add a special "position vector" to each word's embedding (a sketch follows the table below):
| Token  | Word vector | Positional vector | Combined vector |
| ------ | ----------- | ----------------- | --------------- |
| "I"    | v_I         | p_1               | v_I + p_1       |
| "flew" | v_flew      | p_2               | v_flew + p_2    |
| "from" | v_from      | p_3               | v_from + p_3    |
"I flew from London to Singapore last week.
I'm really enjoying my stay in ______."
word "stay" pays strong attention to Singapore:
Self-Attention Heatmap for "stay":
"I": 0.05 # Subject (low relevance)
"flew": 0.15 # Travel context
"from": 0.10 # Direction indicator
"London": 0.05 # Origin (not current location)
"to": 0.20 # Direction to destination
"Singapore": 0.85 # 🎯 The Destination = Current Location!
"last": 0.10 # Time context
"week": 0.08 # Time context
✅ Direct, dynamic connections between *any* words, processed in parallel.
✅ We can now model true long-range dependencies and contextual meaning.
Can we simplify the Transformer architecture to generate text efficiently by focusing only on the generation task?
What if we just used the decoder part and made it really, really good?
The Insight: For text generation, decoder-only is enough!
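"Decoder-only" means each position may only attend to earlier positions, so the model can generate text left to right. A minimal sketch of the causal mask that enforces this (reusing the attention idea above):

```python
import numpy as np

def causal_mask(seq_len):
    # Position i may attend to positions 0..i, never to future tokens
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Applied to attention scores before the softmax: future positions get -inf weight
scores = np.random.default_rng(0).normal(size=(5, 5))
masked_scores = np.where(causal_mask(5), scores, -np.inf)
```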
More parameters + more data + more compute = better capabilities.
This led to Emergent Abilities—capabilities that appear suddenly at scale.
| Model | Parameters | Year |
|---|---|---|
| GPT-1 | 117 million | 2018 |
| GPT-2 | 1.5 billion | 2019 |
| GPT-3 | 175 billion | 2020 |
If you want to reach out or get a link to the slides, you can scan this QR code.