When neural network representations converge
Architectural adjudication versus computational constraint
In this blog post I explore possible theoretical implications of the finding that trained neural networks tend to develop similar internal representations. I riff off of two recent papers, Chen & Bonner (2024) and Huh et al. (2024), and focus mostly on the former.
Consider the following idealized approach. Take all neural networks that could possibly capture some aspect of how the brain realizes a target cognitive function F, and call this set of networks A. Next, after training each network on a task objective that operationalizes F, consider every latent dimension that is causally relevant to task performance. To what extent are such latent dimensions shared across the networks in our set, and to what extent do they resemble the dimensions the brain employs when it realizes F? Chen & Bonner (2024) refer to the former question as one of universality, and the latter as one of brain similarity.
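To make this concrete, here is a minimal sketch of the two questions, assuming random matrices as stand-ins for network activations and brain responses; the correlation-based scoring functions are crude proxies of my own, not the measures Chen & Bonner (2024) actually use.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the idealized setup: a handful of networks from "set A", each
# represented by responses to the same stimuli (rows = stimuli, columns =
# latent dimensions assumed to have passed a causal-relevance test), plus
# brain responses to those stimuli. Real analyses would use real activations.
n_stimuli = 200
networks = [rng.standard_normal((n_stimuli, 20)) for _ in range(6)]
brain = rng.standard_normal((n_stimuli, 50))

def universality(dim, other_networks):
    """Crude proxy: how well is this dimension mirrored in the other networks?
    Best absolute correlation with any dimension of each other network,
    averaged across networks."""
    return np.mean([np.max(np.abs(np.corrcoef(dim, other.T)[0, 1:]))
                    for other in other_networks])

def brain_similarity(dim, brain_responses):
    """Crude proxy: best absolute correlation with any measured brain unit."""
    return np.max(np.abs(np.corrcoef(dim, brain_responses.T)[0, 1:]))

# Score every dimension of one network on both axes.
target = networks[0]
scores = [(universality(target[:, j], networks[1:]),
           brain_similarity(target[:, j], brain))
          for j in range(target.shape[1])]
```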
Now, any endeavor to resolve these questions in practice will need to make countless assumptions, many of which cut to the core of both well-studied and neglected issues in computational cognitive neuroscience:
Today, I sidestep all of these topics and consider what answers we might find to the universality and brain similarity questions if we could reach straight for nature's joints. One reason to entertain "pre-empirical" trains of thought like these is that by mapping out possible answers, we may find better ways of deriving any one of them.
As a start, any neural network dimension that meets our causal-relevance condition lands somewhere along two axes, brain similarity and universality, which combine into four quadrants.
Each cell is informative in its own right. The top left corner covers dimensions that only a few networks "get right". The top right concerns explanatorily useful dimensions that are easily discovered across network architectures. The bottom left cell involves dimensions that don't explain the brain and which neural networks with more brain-like architectures avoid during training. The bottom right covers unpredictive dimensions that networks tend to fall into: a collective blind spot. This normative language is adopted relative to the goal of explaining the brain.
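As a toy illustration of the taxonomy, here is a small helper (my own, purely expository) that maps a dimension's two scores, assumed to lie in [0, 1], onto one of the four quadrants using an arbitrary 0.5 cutoff.

```python
def quadrant(universality, brain_similarity, threshold=0.5):
    """Assign a latent dimension to a quadrant from its two scores."""
    row = "high brain similarity" if brain_similarity >= threshold else "low brain similarity"
    col = "high universality" if universality >= threshold else "low universality"
    return f"{row}, {col}"

print(quadrant(0.1, 0.9))  # top left: brain-like, but only a few networks find it
print(quadrant(0.9, 0.9))  # top right: brain-like and an easy discovery
print(quadrant(0.1, 0.1))  # bottom left: unpredictive and rarely settled into
print(quadrant(0.9, 0.1))  # bottom right: the collective blind spot
```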
Now, let’s further suppose that there are systematic relationships between universality and brain similarity. On the one hand, it may be that the dimensions that are widely shared across our network set are similar to the brain’s dimensions, while dimensions present in only a few networks are unlike the brain’s dimensions. This possibility corresponds to the monotonically increasing relationships visualized below in purple. On the other hand, it may be that only a few neural networks have dimensions similar to the brain’s, with the vast majority of dimensions that networks settle into lacking similarity to the brain. These relations are colored orange. I’ve also highlighted a third, more nuanced class in black.
A trusted methodology of the neuroconnectionist research program is to pit neural networks with diverging architectures against each other to find the subset of networks whose inner workings are more likely to capture how the brain computes (Doerig et al., 2023). This is exemplified by the Brain-Score and controversial-stimuli approaches (Golan et al., 2020). In the former, the predictive power of different networks towards brain activity is compared and contrasted. In the latter, stimuli for which different networks produce distinct classifications are adjudicated using human data, with the winner being the network whose classifications match human choices.
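To illustrate the controversial-stimuli logic (not Golan et al.'s actual pipeline, which synthesizes stimuli to maximize disagreement rather than merely selecting them), here is a sketch with random class probabilities standing in for model outputs and random labels standing in for human choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: two models' class probabilities and human choices for the
# same stimuli. In reality these come from trained networks and experiments.
n_stimuli, n_classes = 1000, 10
probs_a = rng.dirichlet(np.ones(n_classes), size=n_stimuli)
probs_b = rng.dirichlet(np.ones(n_classes), size=n_stimuli)
human = rng.integers(0, n_classes, size=n_stimuli)

pred_a, pred_b = probs_a.argmax(axis=1), probs_b.argmax(axis=1)

# "Controversial" stimuli are those on which the two models disagree, so that
# human choices can adjudicate between them.
controversial = pred_a != pred_b
score_a = np.mean(pred_a[controversial] == human[controversial])
score_b = np.mean(pred_b[controversial] == human[controversial])
print(f"model A matches humans on {score_a:.2f} of controversial stimuli")
print(f"model B matches humans on {score_b:.2f} of controversial stimuli")
```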
It seems to me that this adjudication or pitting approach implicitly assumes the orange, or at least the black, class of relationships. After all, the representations emerging from different networks need to differ enough in how brain-like they are to justify scavenging those differences for mechanistic hypotheses to test on the brain. But if the converse relation holds, and universality and brain similarity correlate positively, then the dimensions that correspond most closely to the brain's latent dimensions are precisely the ones that one architecture and the next agree on, undermining our rationale for network selection.
Interestingly, Chen & Bonner (2024) find such a positive relation (shown in purple above). They show this by extracting latent dimensions from a large number of neural networks and predicting each dimension from the dimensions extracted from the other networks (using cross-validated ridge regression). Similarly, they test which network dimensions are brain-like by predicting each of them from fMRI activity (again using ridge regression). The result is a positive relation, with most dimensions falling in the lower left quadrant.
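Here is a minimal sketch of that logic, assuming random matrices in place of the real network dimensions and fMRI data; the dimension-extraction procedure, cross-validation scheme, and metric details in the paper differ.

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.metrics import r2_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(2)

# Random stand-ins: latent dimensions of a target network, pooled dimensions
# of many other networks, and fMRI responses, all over the same stimuli.
n_stimuli = 300
target_dims = rng.standard_normal((n_stimuli, 15))
other_dims = rng.standard_normal((n_stimuli, 200))  # dimensions from other networks
fmri = rng.standard_normal((n_stimuli, 100))        # voxel responses

ridge = RidgeCV(alphas=np.logspace(-3, 3, 13))

def predictivity(X, y, cv=5):
    """Cross-validated R^2 for predicting one latent dimension y from X."""
    y_hat = cross_val_predict(ridge, X, y, cv=cv)
    return r2_score(y, y_hat)

# Universality: how well the other networks' dimensions predict each target
# dimension. Brain similarity: how well fMRI activity predicts it. The paper
# then relates the two scores across dimensions.
universality = [predictivity(other_dims, target_dims[:, j]) for j in range(15)]
brain_similarity = [predictivity(fmri, target_dims[:, j]) for j in range(15)]
```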
I want to leave a critical evaluation of the paper for another day; certainly, many of the issues haunting the field are applicable here too. Instead, I want to explore what this pattern means conditional on it being correct.
First, in my opinion, this result would not bode well for the endeavor of architectural adjudication. To be sure, the orange points in the plot reflect binned dimensions, and the shadings reflect data densities. There might well be idiosyncratic dimensions that point to the explanatory superiority of specific models. But the general trend is fascinating, and contrary to what I had expected when entering this field and reading the basic literature. I would like to learn more from experts as to why my intuitions might be wrong here.
An upward trend is hard to reconcile with the project of pitting competing networks against each other, but it does raise an interesting question: why the convergence? On the one hand, it could partially reflect dataset statistics. Perhaps any network trained on a particular dataset will wire its internals so as to produce a sort of echo of which visual features, or sounds, tend to co-occur in the data it is processing. However, it might be more than that. After all, the key finding in Chen & Bonner is not just that neural network representations converge, but that they converge in a way that makes them brain-like; and, unlike these networks, the brain is not trained on controlled datasets like ImageNet.
This invites a more exciting possibility. The existence of widely shared yet neurally predictive dimensions could suggest that both the biological and the artificial systems are brushing up against inherent computational properties associated with efficiently solving a task. This is interesting because, insofar as our cognitive function F is indeed adequately assessed by our proposed task objective, it means we have a data-driven way to discover constraints that the human brain also has to reckon with. From this angle, it is the associationist properties shared by brain and network that are scientifically fruitful, not any architectural differences between networks that are harvested for predictions about how the brain works. I was surprised to discover that some thinkers, such as Nancy Kanwisher, have considered this the primary goal of neural network modeling all along (Kanwisher, 2023; Dobs et al., 2022). I think these competing views should be contrasted more explicitly (for example, in an adversarial collaboration).
Finally, let's scrutinize the contribution of network training. What happens when we let networks with random weights perform the task? Interestingly, the authors find that the dimensions shared across untrained networks also tend to be ones that predict brain activity well (untrained effect; red), though not as well as the universal dimensions discovered after training (training gain; purple). Interpreted under our lens, it seems that the core operations common to neural network architectures (the hard-coded steps by which most untrained networks process data, regardless of their weights) may be enough to encroach on these curious computational bedrocks to some extent. Still, training further bootstraps the discovery of these universal dimensions, which, on this hypothesis, are also what the brain converges on to handle cognition's demands.
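To make the untrained-versus-trained contrast concrete, here is a sketch (my own, not the paper's setup) using one torchvision architecture with random versus ImageNet-trained weights; the resulting features would then be fed into the same universality and brain-similarity analyses.

```python
import torch
from torchvision.models import resnet18, ResNet18_Weights

# Same architecture, two weight settings: any universal, brain-predictive
# dimensions in the untrained model can only come from the architecture's
# hard-coded operations plus the statistics of the inputs.
untrained = resnet18(weights=None).eval()                    # random initialization
trained = resnet18(weights=ResNet18_Weights.DEFAULT).eval()  # ImageNet-trained

images = torch.randn(8, 3, 224, 224)  # stand-in for a real stimulus set

def penultimate_features(model, x):
    """Capture activations just before the classification head via a hook."""
    feats = {}
    handle = model.avgpool.register_forward_hook(
        lambda module, inputs, output: feats.setdefault("z", output.flatten(1)))
    with torch.no_grad():
        model(x)
    handle.remove()
    return feats["z"]

z_untrained = penultimate_features(untrained, images)
z_trained = penultimate_features(trained, images)
```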
Perhaps all this just emphasizes the importance of construct and ecological validity: only if our tasks and datasets are similar enough to what humans do when they hear, see, and think can we learn anything useful.
I want to end by foreshadowing a paper by Huh et al. (2024). This work also finds evidence for converging neural network representations, but it differs from Chen & Bonner (2024) in a crucial way: it does not test network-to-brain similarity, only inter-network convergence. In my opinion, this puts it more squarely in the field of neuroAI or machine learning than in neuroconnectionism or computational cognitive neuroscience, which takes up the goal of explaining cognition in the brain. Still, its implications are relevant: it offers explanations for why networks converge that go beyond what I've outlined here, and as such it is a companion piece to the paper highlighted today. I will re-read it soon and might write a follow-up to this post.