Sensorimotor Inference Algorithm Discussion
October 23, 2014
This document is a discussion of thoughts so far on the sensory-motor inference pooling algorithm. It starts with a proposal for an algorithm that best matches all the desired criteria (of all the algorithms considered), followed by a set of possible modifications to the proposal and their impact, and a list of advantages and disadvantages of the proposal.
Every layer of cells will implement the same algorithm, but will differ in their proximal and distal inputs. For instance, Layer 4 will have sensory-motor input on the distal dendrites, while Layer 3 will have lateral input from other cells on the distal dendrites. Each layer will be able to do spatial pooling, temporal pooling, and temporal memory.
Within the layer, we can think of the columns as representing the input to the layer, and the cells within a particular column representing which context that input was seen in. For Layer 4, that context would be the sensory-motor context ("How did I get to this input? Oh, from input A with movement 1, or from input B with movement 2."). For Layer 3, the context would be the high order context ("How did I get to this input? Oh, from this long sequence of inputs."). Therefore, bursting a set of columns (activating all cells in the columns) could be seen as saying, "I see this input now, but all contexts for how I got here are valid."
While the input is unpredicted, the set of active columns should be changing. But while the input is correctly predicted, the set of active columns should be fixed (representing an active "pooled" representation of the input). Finally, if an input was unpredicted, but previously pooled over (flash inference), it should burst the correct set of columns in higher layers, as if the higher layers are saying, "I've seen this input before as part of this pooled representation, but I don't know in what context I'm now seeing this pooled representation."
- Columns have proximal dendrites, and each dendrite has many segments. Each segment connects to one set of cells in the lower layer. The dendrite represents a pool of input patterns.
- Cells have distal dendrites, and each dendrite has many segments. In Layer 4, these segments will connect to both sensory and motor bits. In Layer 3, these segments will connect to a set of other cells in the same layer.
- A layer will know whether its input from the lower layer was correctly predicted by that layer. This signal will simply be whether the input is sparse (predicted) or dense (unpredicted).
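To make these structural assumptions concrete, here is a minimal Python sketch. All of the names (ProximalSegment, DistalSegment, Cell, Column, input_is_predicted) and the numeric defaults are hypothetical, chosen for illustration rather than taken from NuPIC; they only capture the relationships described in the three points above.

```python
# Hypothetical data structures for the description above (not NuPIC code).

class ProximalSegment:
    """One segment on a column's proximal dendrite, connecting to one set of
    cells in the lower layer. The dendrite's segments together represent the
    pool of input patterns the column has pooled over."""

    def __init__(self, presynaptic_cells):
        self.presynaptic_cells = set(presynaptic_cells)

    def overlap(self, active_input_cells):
        return len(self.presynaptic_cells & set(active_input_cells))


class DistalSegment:
    """One segment on a cell's distal dendrite. In Layer 4 the presynaptic
    bits are sensory and motor bits; in Layer 3 they are other cells in the
    same layer (lateral connections)."""

    def __init__(self, presynaptic_bits):
        self.presynaptic_bits = set(presynaptic_bits)


class Cell:
    def __init__(self):
        self.distal_segments = []


class Column:
    def __init__(self, cells_per_column=32):
        self.cells = [Cell() for _ in range(cells_per_column)]
        self.proximal_segments = []   # the column's pool of input patterns


def input_is_predicted(active_input_cells, num_input_cells, density_threshold=0.05):
    """The lower layer's output is sparse when that layer correctly predicted
    its own input, and dense (bursting) when it did not. The 5% threshold is
    an arbitrary illustrative value."""
    return len(active_input_cells) / float(num_input_cells) < density_threshold
```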
Basic sensorimotor inference is a simple extension of Temporal Memory. [Please see this page for a description of sensorimotor inference](https://github.com/numenta/nupic.research/wiki/Sensorimotor-Inference-Algorithm).
If the input was predicted:

- Pool the input into the active columns (by adding segments to their proximal dendrites connecting to this input).

If the input was not predicted:

- If columns have segments that match the input:
  - If there are predicted cells in these columns, activate these cells.
  - If there are no predicted cells in these columns, activate all the cells in the column (burst), and pick cells in the column to learn on.
- If columns don't have segments that match the input:
  - Activate columns using the Spatial Pooler inhibition rule, and pick cells in the column to learn on.
Note: When picking cells in the column to learn on, if we had recently activated this column and picked cells, pick the same cells as before. See Q4 below for details.
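Here is a rough Python sketch of this rule set, building on the hypothetical structures sketched earlier. The helpers (matching_columns, spatial_pooler_inhibition, pick_learning_cell) are crude illustrative stand-ins, not the real Spatial Pooler or learning rules, and the sketch makes one guess (noted in the docstring) about cell activity while the input stays predicted.

```python
import random

def matching_columns(columns, active_input_cells, match_threshold=10):
    """Columns with at least one proximal segment that matches the input.
    The threshold is an arbitrary illustrative value."""
    return [col for col in columns
            if any(seg.overlap(active_input_cells) >= match_threshold
                   for seg in col.proximal_segments)]


def spatial_pooler_inhibition(columns, active_input_cells, num_active=40):
    """Crude stand-in for the Spatial Pooler inhibition rule: keep the columns
    that respond most strongly, breaking ties randomly. The real Spatial
    Pooler uses potential synapses, permanences, and boosting."""
    def strength(col):
        return max((seg.overlap(active_input_cells)
                    for seg in col.proximal_segments), default=0)
    return sorted(columns, key=lambda c: (strength(c), random.random()),
                  reverse=True)[:num_active]


def pick_learning_cell(column):
    """Pick a cell in a bursting column to learn on (see the note above and
    Q4 below about reusing recently picked cells)."""
    return random.choice(column.cells)


def pooling_layer_step(columns, active_input_cells, input_predicted,
                       predicted_cells, prev_active_columns, prev_active_cells):
    """One timestep of the rule set above. Returns (active_columns,
    active_cells). What happens to cell activity while the input stays
    predicted is not spelled out above; keeping the previous active cells
    is just one possible reading."""
    if input_predicted:
        # Input was predicted: keep the current pooled representation and
        # pool this input into the active columns by adding proximal
        # segments that connect to it.
        for column in prev_active_columns:
            column.proximal_segments.append(ProximalSegment(active_input_cells))
        return prev_active_columns, prev_active_cells

    # Input was not predicted: look for columns whose proximal segments
    # already match this input (a previously pooled pattern).
    active_columns = matching_columns(columns, active_input_cells)
    if not active_columns:
        # No columns have pooled this input before: select active columns
        # with the Spatial Pooler inhibition rule instead.
        active_columns = spatial_pooler_inhibition(columns, active_input_cells)

    active_cells = set()
    for column in active_columns:
        column_predicted = [c for c in column.cells if c in predicted_cells]
        if column_predicted:
            # There are predicted cells: activate only those (the context
            # for this input is known).
            active_cells.update(column_predicted)
        else:
            # No predicted cells: burst the column and pick a learning cell.
            active_cells.update(column.cells)
            pick_learning_cell(column)
    return active_columns, active_cells
```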
Let's run through the training and inference for a simple sensory-motor example using two Layer 4s.
Say the world looks like this: ABC DEF GHI
Let's say each letter can be seen from 3 different angles (eg. A = A1, A2, A3).
We have two layers in our hierarchy. We want the lower layer (L1) to pool over the different angles for a letter (A1, A2, A3 => A). And we want the higher layer (L2) to pool over the different letters (ABC => P1).
We start our training (arbitrarily) at B2.
- L1 gets B2 as (dense / unpredicted) input. We activate columns in L1 using the Spatial Pooler inhibition rule, and pick cells in the column to learn on.
- We move right to B3. We activate a new set of columns in L1 using SP inhibition, and pick cells in the column to learn on.
- We repeat this within B, until the transitions between B1 / B2 / B3 are predicted by the input layer.
- At some point, there will be one set of columns activated in L1 by an unpredicted input (let's say B1), but the transition to B2 was predicted by the lower layer. Now, the columns and cells in L1 stay active. They form connections on their proximal dendrites to B2.
- This repeats until B1, B2, and B3 are all pooled into the active columns in L1 (we'll call these the B columns).
- Then we move to C2. This movement wouldn't have been predicted by the input layer, so it bursts C2.
- The B columns become inactive due to the unpredicted input, and a new set of columns becomes active using the SP inhibition rule.
- We repeat the above process until there is a set of C columns in L1 that have pooled over C1, C2, and C3.
- We then move back to B1. This movement was unpredicted by the input, so it bursts B1.
- The B columns in L1 each have one segment that is driven by B1, so they become active. Since there are no predicted cells in those columns, the columns burst, and we pick cells in the column to learn on. These cells form distal connections to sensory "C" and motor movement "left".
- Then we move to C3 and repeat the previous two steps. L1 has now learned the transitions B => C and C => B.
- From C3, we move to B1. At the moment the movement command comes in, L1 will predict the cells in the B columns that formed distal connections when the B columns last burst.
- When B1 comes in, it will drive the B columns to become active. Now there are predicted cells in the B columns, so those cells become active. They reinforce their distal connections to "C" and movement "left".
- Repeating this process while moving around in the ABC section of the world will allow L1 to pool over A, B and C, and learn to move between them. Once these movements become predicted, L2 will be able to pool ABC into a set of columns (call it P1).
- Now, we move to E (specifically, E3). This is a bigger movement than before, and neither L1's input layer nor L1 is able to predict it. So both burst.
- Eventually, the process repeats within DEF so that L1 pools over D, E and F, and L2 pools DEF into P2. Now, we start jumping between points in the DEF world and the ABC world, so L2 can learn the transitions between P1 and P2.
- After more training, we have learned the whole world (including GHI => P3).
- Let's try inference. First, we flash B3. L1 will burst B columns, and L2 will burst P1 columns.
- Then, we move to F1. At the point of movement, L2 will predict cells in P2 columns. When we see F1, L1 will burst the F columns. Some of these cells will drive the P2 columns in L2. Since some cells in P2 columns were predicted, only those will be activated.
- Then, we move to H2. At the point of movement, L2 will predict cells in P3 columns. When we see H2, L1 will burst the H columns. Some of these cells will drive the P3 columns in L2. Since some cells in P3 columns were predicted, only those will be activated.
- At this point, we can move from any angle of any letter to any other letter, and see that L2 knows exactly which of P1, P2, or P3 it is in, and which sensory-motor context those pooled inputs are seen in. Moving within a letter causes stability in L1, and moving within a letter group (like ABC) causes stability in L2.
If you work it out, you'll see that if we were to train another world with elements shared with this one, the system would not get confused.
For example, if we trained in the same way on a new world: MBN OEP QHR
(Note that B, E and H are shared.)
And we started by flashing M2, and then moved to E3 and then to H1, L2 would still activate a different set of cells than it would have for the previous world. This is remarkable, considering that E3 and H1 are ambiguous, and have the same relative positions in the two worlds. This is made possible by the fact that we started from M2, and L2 has learned the sensory-motor transitions in the two separate worlds.
Let's run through the training and inference for two simple high-order sequences using two Layer 3s. Instead of sensory-motor context on the distal dendrites of cells, we will allow lateral connections to other cells to form.
Say we have the following sequences:
ABC
DEF
GHI
Say each letter is actually a sequence of three components (eg. A = A1, A2, A3).
Finally, say we often see them together in the following (larger) sequences:
ABC DEF GHI
GHI DEF ABC
We have two layers in our hierarchy. We want the lower layer (L1) to pool over the component sequence for a letter (A1, A2, A3 => A). And we want the higher layer (L2) to pool over the smaller sequence of letters (ABC => P1).
So we start by training on the smaller sequences first, beginning with ABC.
- L1 gets A1 as (dense / unpredicted) input. We activate columns in L1 using the Spatial Pooler inhibition rule, and pick cells in the column to learn on.
- We then see A2. We activate a new set of columns in L1 using SP inhibition, and pick cells in the column to learn on.
- We repeat this with A3. During this whole process, the input layer is learning the transitions between A1, A2, and A3.
- After some time and seeing other inputs, we start seeing ABC again.
- L1 gets A1. We activate a set of columns in L1 using SP inhibition and pick cells in the column to learn on.
- L1 gets A2, but this time it was predicted after A1 by the input layer. L1 now starts pooling, connecting the active columns to A2.
- L1 gets A3 (predicted by input layer), and pools over it as well. Now there are a set of columns in L1 that represent A.
- We repeat this with B and C, until L1 has made separate pools for all 3.
- Now, we start seeing the smaller sequences of letters. We start with ABC.
- We first see A1, which activates the A columns in L1. (See Q8 below for how this can happen.)
- Then, we see A2 (predicted by input layer), and then A3. Throughout this time, L1 has A columns active, and cells in those columns active.
- Now, we see B1. This was unpredicted by the input layer (since we started training by showing A, B, and C separately). It will cause B columns to burst, and cells in the columns to be selected as active. These cells will form lateral distal connections to the previously active cells in the A columns.
- Then we see B2. Since this will have been predicted by the input layer, it will keep the B columns and cells in L1 active.
- This process will repeat until ABC has been learned, with particular cells in the L1 B columns having lateral connections to A cells, and cells in the C columns having lateral connections to B cells.
- We repeat this with DEF and GHI. Now, L1 has learned each of these smaller sequences.
- As soon as these sequences are predictable by L1, L2 will start pooling over them. In the same manner (but at a higher level), L2 will form pools ABC => P1, DEF => P2, and GHI => P3.
- We finally come to the longer sequences, starting with ABC DEF GHI.
- We start by showing A1. This will burst A columns in L1, which will burst P1 columns in L2. In L1, B cells will be predicted.
- While we move to A2 and then A3, L1 will be stable, and L2 will be stable.
- We then move to B1. This will have been unpredicted by the input layer, so L1 will activate the B columns. However, since there were predicted cells in the B columns, only these will become active. These cells would have been pooled into P1 by L2, so L2 will remain stable and locked in P1.
- Eventually, we cross B and C, and start on the DEF sequence.
- When we see D1, L1 will not have predicted anything (it has learned ABC, DEF, and GHI separately, but not the transitions between them). Thus, it will burst the D columns, which will drive the P2 columns to burst as well. Cells in these columns will then form lateral distal connections to cells in the P1 columns that were previously active.
- In this way, P2 will learn the transitions between P1, P2 and P3.
- Now, let's see what will happen if we jump across sequences. We will show the system the following sequence: A2 B2 C2 D3 E3 F3.
- First, we show A2. This will burst the A columns in L1, and the P1 columns in L2. L1 will predict cells in the B columns, and L2 will predict cells in the P2 columns.
- Next, we show B2. This will activate the B columns in L1, but just the cells that were predicted will become active. These cells will predict cells in the C columns. At this point, L2 will remain stable.
- We repeat the process for C2, with L2 remaining stable.
- When we show D3, L1 will burst the D columns, since it will not have made a prediction. This burst will activate the P2 columns in L2, but just the cells in the P2 columns that were predicted.
- Through E3 and F3, we will see L2 remaining stable in sparse P2 cells.
This is great, because it means that once we've learned to pool hierarchically over a high-order sequence, we can jump between points in the sequence and still see good predictions in the higher layers. Also, all this is done with the same algorithm that powers sensory-motor inference!
Q1: If we flash an input that is part of multiple learned pools, which columns should become active?

For example, let's say a layer has learned the following pools:
BAC => P1
XAY => P2
When we flash A, both P1 and P2 are valid. Should both sets of columns be activated? Should we pick the stronger one?
It might make more sense to pick the stronger one, rather than the union of both, so the higher layers don't get confused. We tend to see this in biology too (people see only one of many possible representations, even when the input is ambiguous, and flip between the possible representations with priming).
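As a rough sketch of what "pick the stronger one" could mean in code (reusing the hypothetical column/segment structures from above), we could score each candidate pool by how strongly its columns' proximal segments match the flashed input, and activate only the best-scoring set. The helper name is hypothetical.

```python
def pick_strongest_pool(candidate_pools, active_input_cells):
    """candidate_pools: a list of column sets (e.g. the P1 columns and the
    P2 columns) whose proximal segments all match the flashed input. Rather
    than activating the union, activate only the pool whose columns respond
    most strongly to the input."""
    def pool_strength(columns):
        return sum(max((seg.overlap(active_input_cells)
                        for seg in col.proximal_segments), default=0)
                   for col in columns)
    return max(candidate_pools, key=pool_strength)
```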
Q2: Will the algorithm converge in the case of partial learning?

From initial thought experiments, it does seem like it should be able to converge even in the case of partial learning.
Note that the partial learning issue may not actually be a major problem, and could be mitigated by a proper training approach (make small movements first, to learn all the detail, before making bigger ones).
Q3: How do we avoid pooling a predicted input into the wrong (currently active) pool?

Consider this example:
Let's say a layer (L2) has learned the following pool:
BAC => P1
And a lower layer (L1) has learned to predict the following transition:
A => Y
We want L2 to learn this new pool:
XAY => P2
First, we burst A in L1. This will drive P1 columns in L2.
Next, we move right. L1 will predict Y.
Then we see Y. L2 will see that its input was correctly predicted.
Problem: At this point, we don't want to pool Y into P1 columns.
Idea: Delay pooling by 1 step? Don't start pooling when the input was predicted, but rather wait until the next input was predicted.
The problem with this is that the first transition (A => Y) won't be pooled over. But maybe this is not a big issue.
Or maybe this whole problem isn't a big issue, since it might be very unlikely to actually occur.
Edit (12/9/14): We definitely want to do this. We want the pooled representations to reflect the particular world being represented, not just a single element in the world. This is achieved by selecting pooling cells that are driven by the first predicted (sparse) input cells, rather than by the (dense) input in the previous timestep. TODO: Update learning rule above to reflect this decision.
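Here is a minimal sketch of one possible reading of that decision, reusing the hypothetical matching_columns and spatial_pooler_inhibition helpers from the rule sketch above. It is only meant to show where the pooling columns would be selected, not a settled learning rule.

```python
def choose_pooling_columns(columns, prev_input_predicted, input_predicted,
                           active_input_cells, current_pooling_columns):
    """One reading of the 12/9/14 decision: pooling only starts once the
    input becomes sparse (predicted), and the pooling columns are chosen
    from that first sparse input rather than carried over from the dense,
    unpredicted input of the previous timestep."""
    if not input_predicted:
        # Dense / unpredicted input: do not pool on this timestep.
        return set()
    if not prev_input_predicted:
        # First predicted input after an unpredicted one: (re)select the
        # pooling columns from this sparse input.
        selected = matching_columns(columns, active_input_cells)
        if not selected:
            selected = spatial_pooler_inhibition(columns, active_input_cells)
        return set(selected)
    # Input has stayed predicted: keep pooling into the same columns.
    return set(current_pooling_columns)
```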
Q4: Do we still need "learn on one cell" mode?

We currently call this "learn on one cell" mode in Layer 4. When we see the same input in a short period of time, we pick the same cells in the columns to be active as before, to reduce the number of patterns a higher layer would have to pool over.
A way to think about it is that the cells in the columns represent sensory-motor context, but the context is the same no matter from what other input you got here, as long as it's all close by in space (and therefore in training time).
The question is, do we still need this for the algorithm to work at all? Do we need it for the algorithm to work practically and speed up training?
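For illustration, "learn on one cell" mode could be implemented with a small amount of bookkeeping like the hypothetical helper below. This is not the actual Layer 4 code; the window length and the use of a timestep counter are assumptions made for the sketch.

```python
import random

class LearningCellPicker:
    """Hypothetical sketch of "learn on one cell" mode: if a column bursts
    again within a short window of timesteps, reuse the cell picked last
    time instead of choosing a new one, so the higher layer sees fewer
    distinct patterns to pool over. The window length is arbitrary."""

    def __init__(self, window=20):
        self.window = window
        self.last_pick = {}   # column index -> (cell index, timestep of pick)

    def pick(self, column_index, num_cells, timestep):
        cell, picked_at = self.last_pick.get(column_index, (None, None))
        if cell is None or timestep - picked_at >= self.window:
            # Not picked recently: choose a new learning cell at random.
            cell = random.randrange(num_cells)
        self.last_pick[column_index] = (cell, timestep)
        return cell
```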
Q5: Can a single layer do both sensory-motor inference and high-order sequence memory?

It seems that the same algorithm should work for both sensory-motor inference and high-order sequences, by just varying the inputs on the dendrites. But one layer might not be able to do both simultaneously, simply due to the overwhelming complexity of trying to do both at once. This could be an argument for separating these functions into separate layers.
But how are these layers connected to each other? We would want Layer 4 to catch all changes due to the model's own movements, and pass along to Layer 3 all the changes due to the environment changing.
Q6: Do we need to use the Spatial Pooler inhibition rule to select active columns? Why can't we just select columns randomly?
It's not like the pooling columns are actually meaningfully representing the input anyway; they're just representing the first unpredicted input that was followed by predicted inputs.
This question gets at the heart of: Is there any meaning to overlap between active pooling columns? Does that somehow represent semantic similarity in the inputs that they are pooling over?
One answer to this question is: consider what happens in the case of always unpredicted input. Then no pooling will occur, but we would still want robustness against spatial noise when selecting columns to become active. That's one reason we need the Spatial Pooler inhibition rule when selecting active columns in the pooling layer.
Q7: Is the capacity for pooled representations in a layer sufficient?

If there is one proximal dendrite per column, and 100 columns with 2% sparsity, that means they can form 50 unique pools. If there are two dendrites per column, they can form 100 unique pools. Is this enough?
Maybe the hierarchy will help exponentially increase the capacity, along with help from topology. How does this work exactly?
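To make the arithmetic above explicit, here is a tiny illustrative calculation. It assumes, as above, that each proximal dendrite represents exactly one pool and that pools do not share columns; the helper name is hypothetical.

```python
def pool_capacity(num_columns, sparsity, dendrites_per_column):
    """Upper bound on distinct pools if each proximal dendrite represents
    exactly one pool and pools do not share columns."""
    columns_per_pool = int(num_columns * sparsity)
    return (num_columns * dendrites_per_column) // columns_per_pool

print(pool_capacity(100, 0.02, 1))  # 50
print(pool_capacity(100, 0.02, 2))  # 100
```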
Q8: Should the first (unpredicted) input also be pooled into the active columns?

Especially in the case of pooling over high-order sequences, we may need to pool the first input into the active columns as well, even if it was not predicted. That way it too can be incorporated into the pool. (This would be a small modification to the rule set.)
This would mean that the pooling layer would form extraneous pooling connections while the input was unpredicted. Is this an issue? Probably not.
Note that this conflicts with Q3.
Q9: Will the pooled representations have SDR properties?

The representations at the top of the hierarchy should have SDR properties. Similar representations should represent semantically similar pools of input. How does this arise from the algorithm?
We may need the following additional rule:
If the input was predicted:

- If columns have segments that match the input, activate these columns and deactivate any pooling columns.
We can see why this is necessary in the following example:
Let's say there are two layers (L1, L2), and L2 has learned the following pools:
BAC => P1
XAY => P2
If we flash A:
- L1 bursts A columns
- L2 picks P1 over P2 to be active, because P1 is stronger (see Q1 above)
Then we move right:
- L1 predicts C and Y
We see Y:
- L1 activates Y, signaling that the input was predicted
What does L2 do? It must activate P2 somehow, instead of pooling Y into the currently active P1 columns. Hence, this additional rule.
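A minimal sketch of this additional rule follows, reusing the hypothetical matching_columns helper from the rule sketch above. It is one possible reading of "deactivate any pooling columns", not a settled design.

```python
def apply_pool_switch_rule(columns, input_predicted, active_input_cells,
                           current_pooling_columns):
    """Sketch of the additional rule: even when the input was predicted, if
    other columns have proximal segments matching it (the sparse input is
    recognized as part of a different learned pool, e.g. Y in P2), activate
    those columns and deactivate the currently pooling columns (e.g. P1)
    instead of pooling the input into them."""
    if not input_predicted:
        return set(current_pooling_columns)
    matched = set(matching_columns(columns, active_input_cells))
    if matched:
        return matched
    # No other pool matches: keep pooling into the currently active columns.
    return set(current_pooling_columns)
```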
M1: Cells pool on their proximal dendrites, instead of columns

One major disadvantage of pooling on columns is that it doesn't seem very biologically accurate. It might require the inhibitory cells associated with the columns to be learning and growing connections.
It may be more biologically plausible to pool on the cells within the column instead. The changes to the rules above would be:
- If input was predicted, pool into active cells.
- If input was not predicted but it drives any pooling cells, activate these cells.
One advantage of this modification is that it would increase the pooling capacity of the layer, by a factor of the number of cells per column.
The major disadvantage with this modification is that during flash inference, individual pooling cells in columns will be activated, rather than bursting the column, as if to say "I do know which context I'm seeing this input in!" (even though there was in fact no prediction, and therefore no way to know the context).
Here is an example of where this breaks:
Let's say a sensory-motor pooling layer has learned all transitions in the following world:
ABC
We want it to also learn the following world:
XBY
Specifically, we want B in ABC (call it B') and B in XBY (call it B'') to be represented with different cells in the same columns. This will allow higher layers to pool separately over the different Bs, and not get confused.
So we start by showing Y from the lower layer. Then, we move left to see B. However, this will activate the B' cells (rather than bursting the B columns), and it is the B' cells that will learn the sensory-motor transition Y => B.
Thus, we need to pool over columns, so that when we see an unpredicted B, it will burst the column rather than drive a sparse set of pooling cells.
M2: Cells pool on their proximal dendrites, instead of columns, but all cells in a column pool identically
This is similar to M1 (above), but has all cells in a column active during pooling, forming identical connections on their proximal dendrites.
The advantage of this modification is that it may be more biologically accurate than having the columns pool, while achieving almost the same functionality.
There are two disadvantages:
- All cells have to be active during pooling, but only one cell per column learns on its distal dendrites during transitions. This seems weird.
- Pooling and learning transitions cannot occur simultaneously. This is because pooling can only happen when all the cells in the column are active (since they have to pool identically), but learning on distal dendrites requires only one cell per column to be active and learning.
These disadvantages might not be so bad though. Really, does it even make sense to be pooling while inferring an input in a particular context? It's possible when pooling on columns (one cell per column can be active, inferring a particular context, while the column is active and pooling), but it might not be necessary.
Advantages

- Clear meaning of columns and cells within columns
- Symmetry across layers / regions in the hierarchy: the behavior exhibited is identical at every recursive level
- The same algorithm powers both sensory-motor inference and high-order sequence memory
- Supports flash inference (at the column level)
- After training, larger jumps can still be predicted by higher layers even if they can't be predicted by lower layers
Disadvantages

- Not biologically plausible to pool on columns
- Capacity for multiple pooled representations in a given layer is limited by number of columns (no exponential boost within the layer)
Other notes

- It might be easier to tackle the task of pooling over high-order sequences with a hierarchy of Layer 3s rather than pooling over sensory-motor sequences with a hierarchy of Layer 4s.
- We may need to limit the number of simultaneous predictions a given layer can make, to encourage forming distinct pooled representations in the higher layer.