
Continual Learning: A Milestone Toward AGI

In this post, we will talk about what continual learning is, why it is essential for building Artificial General Intelligence, and why it is a non-trivial problem.

The goal of machine learning is to build systems that can "learn" to perform tasks without being explicitly instructed on how to perform them. We can formulate the problem as follows: suppose we have an input space X and an output space Y, and for every X there exists a corresponding Y such that the mapping between them is not completely uniform. We can then assume that there exists a function F* that represents the mapping from every X to its corresponding Y.

Think of it like this: the most fundamental model of computation is that for every state there is a corresponding action you should take, and once you reach a terminal state you have a final result. This is what an algorithm is; an algorithm is basically a function. Usually, algorithms are represented as programs written by humans that run on computer hardware, but under this formulation an algorithm is simply a function.

Machine learning is the process of finding said function. If we know this function exists, how can we find it? One of the first problems that needs to be solved is function representation. A computer program is one representation of a function, but it is not the only one; functions can be represented in all sorts of ways: finite state machines, counting machines, Turing machines, neural networks, decision trees, etc. Remember that a function is fundamentally an algorithm.

To build on the formulation above, we can assume that there is a space of functions F, just as there is an input space and an output space. A function that represents the mapping from the input space to the output space should exist in this function space; in fact, there could be multiple functions in the function space that represent the mapping from X to Y.
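
To make this setup concrete, here is one way to write it down (the notation below is mine, not the post's): the unknown target function, the chosen function space, and what a "solution" inside that space means.

```latex
% Setup (my notation): target function, function space, and the goal of learning.
\[
  F^{*} : X \to Y,
  \qquad
  \mathcal{F} \subseteq \{\, f \mid f : X \to Y \,\},
  \qquad
  \text{find } f \in \mathcal{F} \text{ such that } f(x) = F^{*}(x) \ \text{for all } x \in X .
\]
```

Written this way, the worry in the next paragraph is simply that a poorly chosen F may contain no such f at all.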

It is important that we use a good representation of the function space, because if our representation is limited, we could end up with a space that does not contain a single solution to our X→Y mapping. There are other factors to consider when choosing a representation for the function space, such as the ease of search.

Learning is the process of searching the function space to find functions that fit a specific X→Y mapping. The problem is that in most cases we don't know F*; we only have samples from it, and this collection of samples is what we usually call a dataset. In most cases it is almost impossible to collect every possible sample of F*, so we are almost always working with a subset of the data. One way we try to find F* in machine learning is by using gradient descent with neural networks as the function representation.
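
As a rough illustration of learning as search, here is a minimal sketch; the true function F*(x) = sin(x), the network size, and the hyperparameters are all toy assumptions of mine, not anything from the post. We draw a finite dataset from the assumed F* and use gradient descent to move through a space of small neural networks toward one that fits the samples.

```python
# A minimal sketch of "learning as search" (toy assumptions: F*(x) = sin(x),
# a one-hidden-layer network, hand-tuned hyperparameters).
import numpy as np

rng = np.random.default_rng(0)

# A finite dataset: samples from the (normally unknown) true function F*
X = rng.uniform(-3, 3, size=(256, 1))
Y = np.sin(X)

# One-hidden-layer tanh network: our chosen representation of the function space
W1 = rng.normal(0, 0.5, size=(1, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, size=(32, 1)); b2 = np.zeros(1)

lr = 0.05
for step in range(2000):
    # Forward pass: evaluate the current candidate function on the dataset
    H = np.tanh(X @ W1 + b1)           # hidden activations, shape (256, 32)
    pred = H @ W2 + b2                 # network output, shape (256, 1)
    err = pred - Y
    loss = np.mean(err ** 2)           # mean squared error on the dataset

    # Backward pass: gradients of the loss w.r.t. each parameter
    n = X.shape[0]
    dpred = 2 * err / n
    dW2 = H.T @ dpred
    db2 = dpred.sum(axis=0)
    dH = dpred @ W2.T * (1 - H ** 2)   # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dH
    db1 = dH.sum(axis=0)

    # Gradient descent step: move to a nearby function that fits the data better
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(f"final training loss: {loss:.4f}")
```

The network architecture fixes the function space being searched, and each gradient step is one small move through that space.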

Artificial Intelligence (AI) is when a man-made system exhibits intelligence, and one way we can tell that it does is by testing it on tasks. One way we can build AI is via machine learning: we assume that there is a function that works well for the tasks we plan to use to test the machine's intelligence. Artificial General Intelligence (AGI) is basically a man-made system that exhibits intelligence not just on a specific set of tasks but on a wide variety of them; the wider the task set, the more general it is. This is an AI that is not "narrow", confined to just one domain, but one that can thrive in any domain.

AGI is basically a way to highlight this kind of AI, because for a long time the kinds we have been developing seem to thrive mostly in narrow domains. When thinking about AGI, it is better not to think of its intelligence in terms of tasks alone but in terms of learning ability. Instead of asking "how many tasks can the system perform?", we ask "how well can it learn a task once it has been presented with it?". Another way to say this is that AGI is a system that is expected to perform well on tasks it has yet to see. The main problem here is that, because we don't know what tasks the machine will be presented with, we need the machine to learn on the go; this is where continual learning comes in.

Continual learning is when a system can learn gradually as it sees more data. The main issue is that the data can be presented in any order, meaning the distribution of the data at time t1 and at time t2 can be drastically different and may not even reflect the underlying distribution of the data from the true function F*. This is a big problem for offline algorithms, which require a "dataset" that has the same distribution as samples from F*; in continual learning, for various reasons, you won't have such a dataset. Trying to adapt offline methods like deep learning with stochastic gradient descent to achieve continual learning is almost impossible.

Given that real-world distributions are non-stationary, an AI system without the ability to learn continually would be unable to adapt to them quickly. Even the simple case of learning one task after another is difficult for such systems, because they are not built to integrate new information into their models. For example, if I train a neural network to recognize cats and it performs well, then train it to recognize dogs, its performance at recognizing cats degrades; this is a phenomenon known as catastrophic forgetting. Learning a new friend's name, a shortcut to work, or a new way to perform a task is all continual learning, and it is essential that AGI can do this.
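
To make catastrophic forgetting tangible, here is a minimal sketch in the same spirit as the earlier one; the split of sin(x) into two intervals, the network size, and the training schedule are all toy assumptions of mine. The network is first fit to samples of sin(x) drawn only from [-3, 0] (task A), then fit to samples drawn only from [0, 3] (task B) without ever revisiting task A's data; measuring the error on task A after each phase typically shows it jumping once training on task B finishes.

```python
# A minimal sketch of catastrophic forgetting (toy assumptions: the same kind
# of small tanh network as before, F*(x) = sin(x), and two training phases
# that each see only one interval of inputs).
import numpy as np

rng = np.random.default_rng(0)

# Network parameters: one hidden layer with tanh activations
W1 = rng.normal(0, 0.5, size=(1, 32)); b1 = np.zeros(32)
W2 = rng.normal(0, 0.5, size=(32, 1)); b2 = np.zeros(1)

def forward(X):
    H = np.tanh(X @ W1 + b1)
    return H, H @ W2 + b2

def mse(X, Y):
    _, pred = forward(X)
    return float(np.mean((pred - Y) ** 2))

def train(X, Y, steps=3000, lr=0.05):
    """Plain full-batch gradient descent on mean squared error."""
    global W1, b1, W2, b2
    n = X.shape[0]
    for _ in range(steps):
        H, pred = forward(X)
        dpred = 2 * (pred - Y) / n
        dW2 = H.T @ dpred
        db2 = dpred.sum(axis=0)
        dH = dpred @ W2.T * (1 - H ** 2)   # tanh derivative
        dW1 = X.T @ dH
        db1 = dH.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2

# Task A: samples of F*(x) = sin(x), but only from the interval [-3, 0]
XA = rng.uniform(-3, 0, size=(200, 1)); YA = np.sin(XA)
# Task B: samples of the same F*, but only from the interval [0, 3]
XB = rng.uniform(0, 3, size=(200, 1));  YB = np.sin(XB)

train(XA, YA)
print("task A error after training on A:", round(mse(XA, YA), 4))

train(XB, YB)   # sequential phase: task A's samples are never revisited
print("task A error after training on B:", round(mse(XA, YA), 4))
```

Nothing in plain gradient descent asks the network to keep behaving well on inputs it is no longer being shown, which is exactly the gap continual-learning methods try to close.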


A conversation between me and ChatGPT shows how non-trivial continual learning can be.
link: https://chatgpt.com/share/68a432fb-50fc-8004-8f80-1a0e52629535

**Me**:  
Assuming I have a perfect algorithm that can model any dataset as a function: when I fit a function for a subset of data S1 and fit another for a larger set S2, where S1 is a subset of S2, how trivial is it that I could go from the first function to the next?

**ChatGPT**:  
Let’s slow-walk this carefully.

You’re positing a perfect algorithm — meaning, given any dataset, it can fit 
a function that exactly matches the data (no error, no overfitting
caveats, just a magical exact oracle). Now you ask:

- Suppose we train it on subset S1, get function f₁.
- Then we train it on a larger superset S2 (where S1 ⊂ S2), get function f₂.
- How “trivial” is it to go from f₁ to f₂?

⸻

The trivial case

If “perfect algorithm” means memorization (like a lookup table that returns 
exactly the outputs for the inputs you’ve seen, undefined elsewhere), then
going from f₁ to f₂ is trivial: you just add new entries. No structural change, 
just an extension.

But if by “perfect algorithm” you mean some universal function approximator 
that produces the simplest consistent function with the data (something
like Solomonoff induction, or minimum description length reasoning), then 
things get hairier.

⸻

Why it may not be trivial

- Discontinuities in the learning rule: When new data arrives, the “simplest” function that explains S2 might be very different from the one that explained S1. You might need to rewrite the whole hypothesis from scratch.
- No guaranteed nesting: f₁ is guaranteed to agree with f₂ on S1 (since both were trained on S1), but the form of f₁ may not be easily extendable to f₂. Example: with two points you might infer a straight line; with three you might infer a parabola. They agree on the first two, but you can’t “tweak” the line into a parabola without a qualitative jump.
- Occam’s razor instability: if the algorithm is biased toward simpler explanations, adding new data can cause radical jumps in the chosen function. The “simplest” model for S1 might be completely different from the “simplest” model for S2.

⸻

When it is trivial

It becomes trivial if your “perfect algorithm” is incremental by design:

- Example: If it’s kernel regression with a delta kernel (memorization), extending is trivial.
- If it’s a hierarchical Bayesian prior where posteriors update smoothly with more data, then f₂ is just the updated posterior predictive distribution, which is a clean extension of f₁.
- If you allow storage of all candidate hypotheses and simply reweight them with new evidence (like Solomonoff induction), then the update rule is also trivial—but computationally intractable.

⸻

So: the triviality hinges on the inductive bias of your “perfect algorithm”. 
If “perfect” means exact and incremental, then yes, trivial. If “perfect”
means always finds the globally simplest consistent rule, then it’s not 
trivial at all—tiny changes in data can cause dramatic changes in the
function.

⸻

This gets into an almost philosophical crack: are you imagining the “perfect 
function finder” as a lookup-table God or as a Platonic law-discoverer?
The former makes extension trivial, the latter makes it a nightmare of model 
instability.

Would you like me to illustrate both cases with a concrete toy example 
(say, going from two points → three points)?
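
Taking ChatGPT up on that offer myself, here is a quick sketch (mine, not part of the conversation) of both cases with the two-points-to-three-points example: a lookup-table "learner" extends trivially, while a learner that returns the lowest-degree polynomial fitting the data exactly (my stand-in for an Occam-style simplicity bias) jumps from a line to a parabola.

```python
# A toy comparison (mine, not ChatGPT's) of the two kinds of "perfect
# algorithm", going from two points (S1) to three points (S2).
import numpy as np

# S1 is a subset of S2: S2 adds one extra sample.
S1 = {0.0: 0.0, 1.0: 1.0}
S2 = {**S1, 2.0: 4.0}

# Case 1: the "perfect algorithm" is memorization (a lookup table).
# Going from f1 to f2 is trivial: just add the new entry.
f1_table = dict(S1)
f2_table = {**f1_table, 2.0: 4.0}       # a pure extension of f1

# Case 2: the "perfect algorithm" returns the lowest-degree polynomial that
# fits the data exactly (a stand-in for an Occam-style simplicity bias).
x1, y1 = np.array(list(S1)), np.array(list(S1.values()))
x2, y2 = np.array(list(S2)), np.array(list(S2.values()))
f1_poly = np.polyfit(x1, y1, deg=len(x1) - 1)   # degree 1: a line
f2_poly = np.polyfit(x2, y2, deg=len(x2) - 1)   # degree 2: a parabola

print("new lookup entries:", set(f2_table) - set(f1_table))   # just the key 2.0
print("f1 coefficients:", np.round(f1_poly, 3))   # ~[1, 0]: the line y = x
print("f2 coefficients:", np.round(f2_poly, 3))   # ~[1, 0, 0]: the parabola y = x^2

# Both polynomials agree on S1, yet f2 is not a small tweak of f1's form:
print("f1 on S1:", np.round(np.polyval(f1_poly, x1), 3))
print("f2 on S1:", np.round(np.polyval(f2_poly, x1), 3))
```

The lookup table grows by one entry, but the polynomial learner swaps its entire hypothesis class from lines to parabolas, which is the instability the conversation is pointing at.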

Additional Reading

