https://orcid.org/0000-0002-9831-8812

Google Scholar

Selected publications

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2019). A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23), 11537–11546. https://doi.org/10.1073/pnas.1820226116
Abstract | arXiv | DOI

An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: What are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.
Advani*, M., & Saxe*, A. M. (2017). High-dimensional dynamics of generalization error in neural networks. ArXiv.
Abstract | pdf | arXiv

We perform an average case analysis of the generalization dynamics of large neural networks trained using gradient descent. We study the practically-relevant "high-dimensional" regime where the number of free parameters in the network is on the order of or even larger than the number of examples in the dataset. Using random matrix theory and exact solutions in linear models, we derive the generalization error and training error dynamics of learning and analyze how they depend on the dimensionality of data and signal to noise ratio of the learning problem. We find that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks. Overtraining is worst at intermediate network sizes, when the effective number of free parameters equals the number of samples, and thus can be reduced by making a network smaller or larger. Additionally, in the high-dimensional regime, low generalization error requires starting with small initial weights. We then turn to non-linear neural networks, and show that making networks very large does not harm their generalization performance. On the contrary, it can in fact reduce overtraining, even without early stopping or regularization of any sort. We identify two novel phenomena underlying this behavior in overcomplete models: first, there is a frozen subspace of the weights in which no learning occurs under gradient descent; and second, the statistical properties of the high-dimensional regime yield better-conditioned input correlations which protect against overtraining. We demonstrate that naive application of worst-case theories such as Rademacher complexity are inaccurate in predicting the generalization performance of deep neural networks, and derive an alternative bound which incorporates the frozen subspace and conditioning effects and qualitatively matches the behavior observed in simulation.
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Y. Bengio & Y. LeCun (Eds.), International Conference on Learning Representations.
Abstract | pdf | arXiv

Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.

All publications

Njaradi, V., Dominé, C. C. J., Swanson, R., Mondelli, M., & Saxe, A. (2026). Optimal Representation Size: High-Dimensional Analysis of Pretraining and Linear Probing. https://arxiv.org/abs/2605.20105
Abstract | arXiv

Learning to generalise from limited data is a fundamental challenge for both artificial and biological systems. A common strategy is to extract reusable structure from abundant unlabelled data, enabling efficient adaptation to new tasks from limited labelled data. This two-stage paradigm is now standard in modern training pipelines, where pretraining is followed by fine-tuning or linear probing. We provide an analytical model of this process: structure extraction is formalized as principal component analysis on unlabelled data, and downstream learning as linear regression on a separate labelled dataset. In the high-dimensional regime, we derive exact expressions for training and generalisation error showcasing their dependence on representation dimensionality, unlabelled and labelled sample sizes, and task alignment. Our results show that pretrained representations strongly influence downstream generalisation, and we characterize the optimal representation size as a function of task parameters: with abundant pretraining data but scarce downstream data, maximally compressed representations are optimal, whereas with limited pretraining data, higher-dimensional representations generalise better. Furthermore, we establish an exact trade-off between pretraining and supervision, quantifying how much unlabelled data is required to replace a single labelled sample. Beyond our idealised model, we observe similar phenomenology in autoencoders and pretrained LLMs. Altogether, we highlight that optimising representation size is critical, giving conditions for when compression during pretraining improves generalisation.
Strittmatter, Y., Skye, R., Iglesias, S., Liebana, S., Saxe, A., Ruiz-Garcia, M., Teich, E., & Spitzer, M. (2026). When Collaboration Beats Ability: Mixed-Ability Teams Can Outperform High-Ability Teams Under Coordination Demands. OSF. https://doi.org/10.31234/osf.io/6yvb2_v1
Abstract | pdf | DOI

Collective intelligence describes the capacity of groups to achieve levels of performance that cannot be explained by the abilities of their individuals. The existence of collective intelligence implies that groups composed of mixed-ability members may outperform groups of high ability. Here, we examine when such benefits emerge in humans and whether it is driven by collaboration. We designed a collaborative multiplayer online game and manipulated team composition (mixed-ability vs. high-ability) and coordination demands by exposing teams to two different task environments that varied in how much they encouraged collaboration. We collected data from 280 teams of two human players (70 per condition), totaling 560 participants. We found that mixed-ability teams outperformed high-ability teams in environments designed to encourage collaboration. This performance advantage was due to more collaborative actions and more frequent division of labor. These results show that collaboration emerges selectively as a function of group composition and coordination demands.
Confavreux, B., Singh, A. K., Lee, J. H., Sabran, A., & Saxe, A. M. (2026). Comparing the learning dynamics of in-context learning and fine-tuning in language models. The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=cJAtzOcAnd
Abstract | pdf

Pretrained language models can acquire novel tasks either through in-context learning (ICL)—adapting behavior via activations without weight updates—or through supervised fine-tuning (SFT), where parameters are explicitly updated. Prior work has reported differences in their generalization performance and inductive biases, but the origins of these differences remain poorly understood. In this work, we treat ICL and SFT as distinct learning algorithms and directly compare the learning dynamics they induce across medium-sized models, analyzing both the evolution of their inductive biases and the underlying internal representations. We find that ICL preserves rich input representations but imposes stronger priors inherited from pretraining, whereas SFT suppresses task-irrelevant features—potentially explaining its weaker generalization in few-shot regimes. These results highlight a mechanistic distinction between context-driven and weight-driven learning.
Zhang, Y., Saxe, A. M., & Latham, P. E. (2026). Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures. The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=Vit5M0G5Gb
Abstract | pdf

Neural networks trained with gradient descent often learn solutions of increasing complexity over time, a phenomenon known as simplicity bias. Despite being widely observed across architectures, existing theoretical treatments lack a unifying framework. We present a theoretical framework that explains a simplicity bias arising from saddle-to-saddle learning dynamics for a general class of neural networks, incorporating fully-connected, convolutional, and attention-based architectures. Here, simple means expressible with few hidden units, i.e., hidden neurons, convolutional kernels, or attention heads. Specifically, we show that linear networks learn solutions of increasing rank, ReLU networks learn solutions with an increasing number of kinks, convolutional networks learn solutions with an increasing number of convolutional kernels, and self-attention models learn solutions with an increasing number of attention heads. By analyzing fixed points, invariant manifolds, and dynamics of gradient descent learning, we show that saddle-to-saddle dynamics operates by iteratively evolving near an invariant manifold, approaching a saddle, and switching to another invariant manifold. Our analysis also disentangles data-induced and initialization-induced saddle-to-saddle dynamics. In particular, the former leads to low-rank weights while the latter to sparse weights. Equipped with the theory, we predict the effects of data distribution and weight initialization on the duration and number of plateaus in learning. Overall, our theory offers a framework for understanding when and why gradient descent progressively learns increasingly complex solutions.
Jarvis, D., Klein, R., Rosman, B., & Saxe, A. M. (2026). Compositionality and systematicity emerge from iterated learning in deep linear networks. Proceedings of the National Academy of Sciences, 123(19), e2509739123. https://doi.org/10.1073/pnas.2509739123
Abstract | DOI

Humans have a remarkable ability to systematically generalize-reasoning about new situations by combining aspects of previous experiences. Language provides one of the primary examples of this ability and modern machine learning has drawn much inspiration from linguistics. A recent example is iterated learning, a procedure where generations of networks learn from the output of earlier learners. The result is a refinement of the network’s "language" or output labels for given inputs toward compositional structure. Here we theoretically study the emergence of compositional language, and the ability of simple neural networks to leverage this compositionality to systematically generalize. We build on prior theoretical work on linear networks, which mathematically define systematic generalization, by a) applying the analysis of shallow and deep linear network to the iterated learning procedure by deriving exact dynamics of learning over generations; b) refining the definition of systematicity to understand the benefits and limitations of iterated learning. We find that iterated learning does facilitate systematic generalization over standard training paradigms by uncovering compositional substructure in the output labels. Our results confirm a long standing conjecture: that multiple generations of iterated learning are required for compositional structure to emerge, which can outperform a single generation network trained with optimal early-stopping. However, for the network to treat the input systematically and ignore features which do not generalize, the network must be trained on an extremely large dataset. Hence, we define "weak systematic generalization" to explain this emergent systematicity from scale.
Benjamin, A. S., Beyer, A.-L., Kloots, M. D. H., Hwang, J., Karoui, H., Ostrow, M., Rubruck, J., Sandbrink, K. J., Grant, S., Saxe, A. M., & McClelland, J. L. (2026). An Introduction to Connectionist Theories of Semantic Cognition. In S. Sarao Mannelli, F. Mignacco, C.-N. Chou, S. Y. Chung, & A. Saxe (Eds.), Proceedings of the Analytical Connectionism Schools 2023–2024 (Vol. 320, pp. 42–67). PMLR. https://proceedings.mlr.press/v320/benjamin26a.html
Abstract | pdf

Jay McClelland’s lectures spotlighted foundational insights and contemporary advances in neural modelling of cognition. Beginning with the premise that mental concepts correspond to patterns of activity in networked neurons, the connectionist paradigm provides mathematical models that predict and explain a plethora of cognitive phenomena. For instance, in semantic development, connectionist models that learn through gradual error-driven updates capture the progressive differentiation of concepts from broad to fine categories. This observation, and others, were captured in the early Rumelhart model and persist in today’s language models. However, there are shortcomings of simple error-based learning in neural networks, most notably the problem of catastrophic interference, wherein learning new information disrupts previously acquired knowledge. Biological solutions to this problem may reveal additional structures in our brains. For example, in the complementary learning systems framework, the hippocampus rapidly stores episodic experiences while the neocortex integrates them over time, thus mitigating interference and enabling flexible knowledge consolidation. Furthermore, existing schemas facilitate faster acquisition of related concepts, reflecting how prior knowledge shapes learning efficiency. Returning to the phenomena observed in semantic development, theoretical work by Saxe, McClelland and Ganguli provides exact analytical solutions, showing how, for instance, stage-like learning trajectories and transient "illusory correlations" arise from the interaction between the statistical regularities of the environment and nonlinear learning dynamics in a deep neural network. Taken together, these lectures underscored the enduring value of connectionism in bridging psychology, neuroscience, and machine learning.
Anguita, N., Locatello, F., Saxe, A. M., Mondelli, M., Mancini, F., Lippl, S., & Domine, C. (2026). A Theory of How Pretraining Shapes Inductive Bias in Fine-Tuning. https://arxiv.org/abs/2602.20062
Abstract | arXiv

Pretraining and fine-tuning are central stages in modern machine learning systems. In practice, feature learning plays an important role across both stages: deep neural networks learn a broad range of useful features during pretraining and further refine those features during fine-tuning. However, an end-to-end theoretical understanding of how choices of initialization impact the ability to reuse and refine features during fine-tuning has remained elusive. Here we develop an analytical theory of the pretraining-fine-tuning pipeline in diagonal linear networks, deriving exact expressions for the generalization error as a function of initialization parameters and task statistics. We find that different initialization choices place the network into four distinct fine-tuning regimes that are distinguished by their ability to support feature learning and reuse, and therefore by the task statistics for which they are beneficial. In particular, a smaller initialization scale in earlier layers enables the network to both reuse and refine its features, leading to superior generalization on fine-tuning tasks that rely on a subset of pretraining features. We demonstrate empirically that the same initialization parameters impact generalization in nonlinear networks trained on CIFAR-100. Overall, our results demonstrate analytically how data and network initialization interact to shape fine-tuning generalization, highlighting an important role for the relative scale of initialization across different layers in enabling continued feature learning during fine-tuning.
Njaradi, V., Carrasco-Davis, R., Latham, P. E., & Saxe, A. (2026). Optimal Learning Rate Schedule for Balancing Effort and Performance. https://arxiv.org/abs/2601.07830
Abstract | arXiv

Learning how to learn efficiently is a fundamental challenge for biological agents and a growing concern for artificial ones. To learn effectively, an agent must regulate its learning speed, balancing the benefits of rapid improvement against the costs of effort, instability, or resource use. We introduce a normative framework that formalizes this problem as an optimal control process in which the agent maximizes cumulative performance while incurring a cost of learning. From this objective, we derive a closed-form solution for the optimal learning rate, which has the form of a closed-loop controller that depends only on the agent’s current and expected future performance. Under mild assumptions, this solution generalizes across tasks and architectures and reproduces numerically optimized schedules in simulations. In simple learning models, we can mathematically analyze how agent and task parameters shape learning-rate scheduling as an open-loop control solution. Because the optimal policy depends on expectations of future performance, the framework predicts how overconfidence or underconfidence influence engagement and persistence, linking the control of learning speed to theories of self-regulated learning. We further show how a simple episodic memory mechanism can approximate the required performance expectations by recalling similar past learning experiences, providing a biologically plausible route to near-optimal behaviour. Together, these results provide a normative and biologically plausible account of learning speed control, linking self-regulated learning, effort allocation, and episodic memory estimation within a unified and tractable mathematical framework.
Jarvis, D., Lee, S., Dominé, C. C. J., Saxe, A. M., & Sarao Mannelli, S. (2025). A theory of initialisation’s impact on specialisation. Journal of Statistical Mechanics: Theory and Experiment, 2025(11), 114001. https://doi.org/10.1088/1742-5468/ae1214
Abstract | DOI

Prior work has demonstrated a consistent tendency in neural networks engaged in continual learning tasks, wherein intermediate task similarity results in the highest levels of catastrophic interference. This phenomenon is attributed to the network’s tendency to reuse learned features across tasks. However, this explanation heavily relies on the premise that neuron specialisation occurs, i.e. the emergence of localised representations. Our investigation challenges the validity of this assumption. Using theoretical frameworks for the analysis of neural networks, we show a strong dependence of specialisation on the initial condition. More precisely, we show that weight imbalance and high weight entropy can favour specialised solutions. We then apply these insights in the context of continual learning, first showing the emergence of a monotonic relation between task-similarity and forgetting in non-specialised networks. Finally, we show that specialisation by weight imbalance is beneficial on the commonly employed elastic weight consolidation regularisation technique.
Dragutinović, S., Saxe, A. M., & Singh, A. K. (2025). Softmax ≥ Linear: Transformers may learn to classify in-context by kernel gradient descent. https://arxiv.org/abs/2510.10425
Abstract | arXiv

The remarkable ability of transformers to learn new concepts solely by reading examples within the input prompt, termed in-context learning (ICL), is a crucial aspect of intelligent behavior. Here, we focus on understanding the learning algorithm transformers use to learn from context. Existing theoretical work, often based on simplifying assumptions, has primarily focused on linear self-attention and continuous regression tasks, finding transformers can learn in-context by gradient descent. Given that transformers are typically trained on discrete and complex tasks, we bridge the gap from this existing work to the setting of classification, with non-linear (importantly, softmax) activation. We find that transformers still learn to do gradient descent in-context, though on functionals in the kernel feature space and with a context-adaptive learning rate in the case of softmax transformer. These theoretical findings suggest a greater adaptability to context for softmax attention, which we empirically verify and study through ablations. Overall, we hope this enhances theoretical understanding of in-context learning algorithms in more realistic settings, pushes forward our intuitions and enables further theory bridging to larger models.
Grossman, S., Hedrich, N., Saxe, A., & Schuck, N. W. (2025). Dynamics and structure of generalization during reinforcement learning in human brains and artificial networks. CCN 2025: 8th Annual Conference on Cognitive Computational Neuroscience. https://hdl.handle.net/21.11116/0000-0011-8C28-3
Abstract | pdf

Goal-directed decision making amidst an overwhelming stream of sensory input requires learning internal representations that capture a task’s underlying structure. Importantly, such internal abstractions enable generalization. Representing an object’s shape but ignoring its color, for instance, means that anything learned about a green triangle will generalize to red triangles. Here, we investigate this dynamic interaction of task representation learning and generalization. Human participants and artificial neural networks were trained with the same contextual reinforcement learning task. Analyses of human data reveal that participants learned an abstract task structure, which becomes detectable in the orbitofrontal cortex (OFC) after learning. Recurrent neural networks trained on the same learning curriculum exhibit similar abstractions of task representations over time. Notably, we find that the similarity structure of the networks’ internal task representations affects how weight updates after a single example alter network behavior and representations on other trials. The network’s progressing context differentiation in its internal layers hence leads to generalization of single experiences to other events within the same context. Ongoing work aims to gain a mechanistic understanding of model observations and contrast them with learning dynamics in the human brain.
Lee, J. H., Lampinen, A. K., Singh, A. K., & Saxe, A. M. (2025). Distinct Computations Emerge From Compositional Curricula in In-Context Learning. https://arxiv.org/abs/2506.13253
Abstract | arXiv

In-context learning (ICL) research often considers learning a function in-context through a uniform sample of input-output pairs. Here, we investigate how presenting a compositional subtask curriculum in context may alter the computations a transformer learns. We design a compositional algorithmic task based on the modular exponential-a double exponential task composed of two single exponential subtasks and train transformer models to learn the task in-context. We compare (a) models trained using an in-context curriculum consisting of single exponential subtasks and, (b) models trained directly on the double exponential task without such a curriculum. We show that models trained with a subtask curriculum can perform zero-shot inference on unseen compositional tasks and are more robust given the same context length. We study how the task and subtasks are represented across the two training regimes. We find that the models employ diverse strategies modulated by the specific curriculum design.
Confavreux, B., Harrington, Z. P. M., Kania, M., Ramesh, P., Krouglova, A. N., Bozelos, P. A., Macke, J. H., Saxe, A. M., Gonçalves, P. J., & Vogels, T. P. (2025). Memory by a thousand rules: Automated discovery of multi-type plasticity rules reveals variety & degeneracy at the heart of learning. BioRxiv. https://doi.org/10.1101/2025.05.28.656584
Abstract | DOI

Synaptic plasticity is the basis of learning and memory, but the link between synaptic changes and neural function remains elusive. Here, we used automated search algorithms to obtain thousands of strikingly diverse quadruplets of excitatory(E)-to-E, E-to-inhibitory(I), IE, and II plasticity rules, cooperating to stabilize recurrent spiking networks. Despite the fact that quadruplets were selected for homeostasis, more than 90% of them performed well in simple and more difficult memory tasks such as novelty detection, contextual novelty and sequence replay. Co-activity was crucial, i.e., most rules failed in isolation. Our purely local, unsupervised plasticity rules could also help solve computer games such as pong. Our work showcases automated discovery augmenting human intuition to find en masse solutions for high dimensional problems.
Jarvis, D., Klein, R., Rosman, B., & Saxe, A. M. (2025). Make Haste Slowly: A Theory of Emergent Structured Mixed Selectivity in Feature Learning ReLU Networks. International Conference on Learning Representations. https://openreview.net/forum?id=27SSnLl85x
Abstract | pdf | arXiv

In spite of finite dimension ReLU neural networks being a consistent factor behind recent deep learning successes, a theory of feature learning in these models remains elusive. Currently, insightful theories still rely on assumptions including the linearity of the network computations, unstructured input data and architectural constraints such as infinite width or a single hidden layer. To begin to address this gap we establish an equivalence between ReLU networks and Gated Deep Linear Networks, and use their greater tractability to derive dynamics of learning. We then consider multiple variants of a core task reminiscent of multi-task learning or contextual control which requires both feature learning and nonlinearity. We make explicit that, for these tasks, the ReLU networks possess an inductive bias towards latent representations which are not strictly modular or disentangled but are still highly structured and reusable between contexts. This effect is amplified with the addition of more contexts and hidden layers. Thus, we take a step towards a theory of feature learning in finite ReLU networks and shed light on how structured mixed-selective latent representations can emerge due to a bias for node-reuse and learning speed.
Jarvis, D., Klar, V., Klein, R., Rosman, B., & Saxe, A. (2025). Revisiting the Role of Relearning in Semantic Dementia. https://arxiv.org/abs/2503.03545
Abstract | arXiv

Patients with semantic dementia (SD) present with remarkably consistent atrophy of neurons in the anterior temporal lobe and behavioural impairments, such as graded loss of category knowledge. While relearning of lost knowledge has been shown in acute brain injuries such as stroke, it has not been widely supported in chronic cognitive diseases such as SD. Previous research has shown that deep linear artificial neural networks exhibit stages of semantic learning akin to humans. Here, we use a deep linear network to test the hypothesis that relearning during disease progression rather than particular atrophy cause the specific behavioural patterns associated with SD. After training the network to generate the common semantic features of various hierarchically organised objects, neurons are successively deleted to mimic atrophy while retraining the model. The model with relearning and deleted neurons reproduced errors specific to SD, including prototyping errors and cross-category confusions. This suggests that relearning is necessary for artificial neural networks to reproduce the behavioural patterns associated with SD in the absence of output non-linearities. Our results support a theory of SD progression that results from continuous relearning of lost information. Future research should revisit the role of relearning as a contributing factor to cognitive diseases.
Boyd-Meredith, J. T., Holobetz, C., & Saxe, A. M. (2025). Stage-like Emergence of Task Strategies in Animals and in Neural Networks Trained by Gradient Descent. CCN 2025: 8th Annual Conference on Cognitive Computational Neuroscience. https://2025.ccneuro.org/abstract_pdf/Boyd-Meredith_2025_Stage-like_Emergence_Task_Strategies_Animals_Neural.pdf
Abstract | pdf

Humans and animals learning a task often appear to adopt a series of distinct strategies before reaching expert performance. This progression could result from deliberately testing distinct hypotheses about task contingencies. However, stage-like strategy changes can also be produced by artificial neural networks (ANN) learning by gradient descent (GD) without any explicit notion of task strategy. In this setting, apparent strategies correspond to saddle points in the loss dynamics around which learning slows before accelerating toward the next fixed point. We trained mice to perform a previously developed discrimination task, which they acquired in a series of stage-like behavioral transitions. We then developed an ANN model that recapitulated these transitions. By measuring the magnitude of the gradients during learning, we determined when the network approached (decreasing norm) and escaped saddle points (increasing norm) before reaching expert performance. Our modeling results show that even simple connectionist models without explicit hypotheses can be tailored to produce stages of learning that match what we observe in animals. We propose to develop and apply a method to identify saddle points of the loss and the likely transitions between them by performing gradient descent, not on the loss function, but on the magnitude of its gradient. For this abstract, we show how this tool identifies saddle points and their connections in a toy example.
Confavreux, B., Dorrell, W., Patel, N., & Saxe, A. (2025). Memory by accident: a theory of learning as a byproduct of network stabilization. In D. Belgrave, C. Zhang, H. Lin, R. Pascanu, P. Koniusz, M. Ghassemi, & N. Chen (Eds.), Advances in Neural Information Processing Systems (Vol. 38, pp. 81885–81912). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2025/file/75ab7573bf85c4baa82c8cd2016567b8-Paper-Conference.pdf
Abstract | pdf

Synaptic plasticity is widely considered to be crucial to the brain’s ability to learn throughout life. Decades of theoretical work have therefore been invested in deriving and designing biologically plausible learning rules capable of granting various memory abilities to neural networks. Most of these theoretical approaches optimize directly for a desired memory function; but this procedure can lead to complex, finely-tuned rules, rendering them brittle to perturbations and difficult to implement in practice. Instead, we build on recent work that automatically discovers large numbers of candidate plasticity rules operating in recurrent spiking neural networks. Surprisingly, despite the fact that these rules are selected solely to achieve network stabilization, we observe across a range of network models - feedforward, recurrent; rate and spikingthat almost all these rules endow the network with simple forms of memory such as familiarity detection - seemingly by accident. To understand this phenomenon, we study an analytic toy model. We observe that memory arises from the degeneracy of weight matrices that stabilize a network: where the network lands in this space of stable weights depends on its past inputs—that is, memory. Even simple Hebbian plasticity rules can utilize this degeneracy, creating a zoo of memory abilities with various lifetimes. In practice, the larger the network and the more co-active plasticity rules in the system, the stronger the memory-by-accident phenomenon becomes. Overall, our findings suggest that activity-silent memory is a near-unavoidable consequence of stabilization. Simple forms of memory, such as familiarity or novelty detection, appear to be widely available resources for plastic brain networks, suggesting that they could form the raw materials that were later sculpted into higher-order cognitive abilities.
Liebana, S., Laffere, A., Toschi, C., Schilling, L., Moretti, J., Podlaski, J., Fritsche, M., Zatka-Haas, P., Li, Y., Bogacz, R., & others. (2025). Dopamine encodes deep network teaching signals for individual learning trajectories. Cell. https://doi.org/10.1016/j.cell.2025.05.025
Abstract | DOI

Striatal dopamine plays fundamental roles in fine-tuning learned decisions. However, when learning from naive to expert, individuals often exhibit diverse learning trajectories, defying understanding of its underlying dopaminergic mechanisms. Here, we longitudinally measure and manipulate dorsal striatal dopamine signals in mice learning a decision task from naive to expert. Mice learning trajectories transitioned through sequences of strategies, showing substantial individual diversity. Remarkably, the transitions were systematic; each mouse?s early strategy determined its strategy weeks later. Dopamine signals reflected strategies each animal transitioned through, encoding a subset of stimulus-choice associations. Optogenetic manipulations selectively updated these associations, leading to learning effects distinct from that of reward. A deep neural network using heterogeneous teaching signals, each updating a subset of network association weights, captured our results. Analyzing the model?s fixed points explained learning diversity and systematicity. Altogether, this work provides insights into the biological and mathematical principles underlying individual long-term learning trajectories.
Van Rossem, L., & Saxe, A. M. (2025). Algorithm Development in Neural Networks: Insights from the Streaming Parity Task. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, & J. Zhu (Eds.), Proceedings of the 42nd International Conference on Machine Learning (Vol. 267, pp. 60764–60791). PMLR. https://proceedings.mlr.press/v267/van-rossem25a.html
Abstract | pdf

Even when massively overparameterized, deep neural networks show a remarkable ability to generalize. Research on this phenomenon has focused on generalization within distribution, via smooth interpolation. Yet in some settings neural networks also learn to extrapolate to data far beyond the bounds of the original training set, sometimes even allowing for infinite generalization, implying that an algorithm capable of solving the task has been learned. Here we undertake a case study of the learning dynamics of recurrent neural networks trained on the streaming parity task in order to develop an effective theory of algorithm development. The streaming parity task is a simple but nonlinear task defined on sequences up to arbitrary length. We show that, with sufficient finite training experience, RNNs exhibit a phase transition to perfect infinite generalization. Using an effective theory for the representational dynamics, we find an implicit representational merger effect which can be interpreted as the construction of a finite automaton that reproduces the task. Overall, our results disclose one mechanism by which neural networks can generalize infinitely from finite training experience.
Singh, A. K., Moskovitz, T., Dragutinović, S., Hill, F., Chan, S. C. Y., & Saxe, A. M. (2025). Strategy Coopetition Explains the Emergence and Transience of In-Context Learning. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, & J. Zhu (Eds.), Proceedings of the 42nd International Conference on Machine Learning (Vol. 267, pp. 55720–55739). PMLR. https://proceedings.mlr.press/v267/singh25c.html
Abstract | pdf

In-context learning (ICL) is a powerful ability that emerges in transformer models, enabling them to learn from context without weight updates. Recent work has established emergent ICL as a transient phenomenon that can sometimes disappear after long training times. In this work, we sought a mechanistic understanding of these transient dynamics. Firstly, we find that—after the disappearance of ICL—the asymptotic strategy is a remarkable hybrid between in-weights and in-context learning, which we term “context-constrained in-weights learning” (CIWL). CIWL is in competition with ICL, and eventually replaces it as the dominant strategy of the model (thus leading to ICL transience). However, we also find that the two competing strategies actually share sub-circuits, which gives rise to cooperative dynamics as well. For example, in our setup, ICL is unable to emerge quickly on its own, and can only be enabled through the simultaneous slow development of asymptotic CIWL. CIWL thus both cooperates and competes with ICL, a phenomenon we term “strategy coopetition”. We propose a minimal mathematical model that reproduces these key dynamics and interactions. Informed by this model, we were able to identify a setup where ICL is truly emergent and persistent.
Zhang, Y., Singh, A. K., Latham, P. E., & Saxe, A. M. (2025). Training Dynamics of In-Context Learning in Linear Attention. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, & J. Zhu (Eds.), Proceedings of the 42nd International Conference on Machine Learning (Vol. 267, pp. 76047–76087). PMLR. https://proceedings.mlr.press/v267/zhang25br.html
Abstract | pdf

While attention-based models have demonstrated the remarkable ability of in-context learning (ICL), the theoretical understanding of how these models acquired this ability through gradient descent training is still preliminary. Towards answering this question, we study the gradient descent dynamics of multi-head linear self-attention trained for in-context linear regression. We examine two parametrizations of linear self-attention: one with the key and query weights merged as a single matrix (common in theoretical studies), and one with separate key and query matrices (closer to practical settings). For the merged parametrization, we show that the training dynamics has two fixed points and the loss trajectory exhibits a single, abrupt drop. We derive an analytical time-course solution for a certain class of datasets and initialization. For the separate parametrization, we show that the training dynamics has exponentially many fixed points and the loss exhibits saddle-to-saddle dynamics, which we reduce to scalar ordinary differential equations. During training, the model implements principal component regression in context with the number of principal components increasing over training time. Overall, we provide a theoretical description of how ICL abilities evolve during gradient descent training of linear attention, revealing abrupt acquisition or progressive improvements depending on how the key and query are parametrized.
Braun, L., Grant, E., & Saxe, A. M. (2025). Not all solutions are created equal: An analytical dissociation of functional and representational similarity in deep linear neural networks. In A. Singh, M. Fazel, D. Hsu, S. Lacoste-Julien, F. Berkenkamp, T. Maharaj, K. Wagstaff, & J. Zhu (Eds.), Proceedings of the 42nd International Conference on Machine Learning (Vol. 267, pp. 5355–5382). PMLR. https://proceedings.mlr.press/v267/braun25a.html
Abstract | pdf

A foundational principle of connectionism is that perception, action, and cognition emerge from parallel computations among simple, interconnected units that generate and rely on neural representations. Accordingly, researchers employ multivariate pattern analysis to decode and compare the neural codes of artificial and biological networks, aiming to uncover their functions. However, there is limited analytical understanding of how a network’s representation and function relate, despite this being essential to any quantitative notion of underlying function or functional similarity. We address this question using analysable two-layer linear networks and numerical simulations in nonlinear networks. We find that function and representation are dissociated, allowing representational similarity without functional similarity and vice versa. Further, we show that neither robustness to input noise nor the level of generalisation error constrain representations to the task. In contrast, networks robust to parameter noise have limited representational flexibility and must employ task-specific representations. Our findings suggest that representational alignment reflects computational advantages beyond functional alignment alone, with significant implications for interpreting and comparing the representations of connectionist systems
Patel, N., Lee, S., Sarao Mannelli, S., Goldt, S., & Saxe, A. (2025). RL Perceptron: Generalization Dynamics of Policy Learning in High Dimensions. Phys. Rev. X, 15(2), 021051. https://doi.org/10.1103/PhysRevX.15.021051
Abstract | DOI

Reinforcement learning (RL) algorithms have transformed many domains of machine learning. To tackle real-world problems, RL often relies on neural networks to learn policies directly from pixels or other high-dimensional sensory input. By contrast, many theories of RL have focused on discrete state spaces or worst-case analysis, and fundamental questions remain about the dynamics of policy learning in high-dimensional settings. Here, we propose a solvable high-dimensional RL model that can capture a variety of learning protocols, and we derive its typical policy learning dynamics as a set of closed-form ordinary differential equations. We obtain optimal schedules for the learning rates and task difficulty—analogous to annealing schemes and curricula during training in RL—and show that the model exhibits rich behavior, including delayed learning under sparse rewards, a variety of learning regimes depending on reward baselines, and a speed-accuracy trade-off driven by reward stringency. Experiments on variants of the Procgen game “Bossfight” and Arcade Learning Environment game “Pong” also show such a speed-accuracy trade-off in practice. Together, these results take a step toward closing the gap between theory and practice in high-dimensional RL.
Zhang, Y., Saxe, A. M., & Latham, P. E. (2025). When Are Bias-Free ReLU Networks Effectively Linear Networks? Transactions on Machine Learning Research. https://openreview.net/forum?id=Ucpfdn66k2
Abstract | pdf

We investigate the implications of removing bias in ReLU networks regarding their expressivity and learning dynamics. We first show that two-layer bias-free ReLU networks have limited expressivity: the only odd function two-layer bias-free ReLU networks can express is a linear one. We then show that, under symmetry conditions on the data, these networks have the same learning dynamics as linear networks. This enables us to give analytical time-course solutions to certain two-layer bias-free (leaky) ReLU networks outside the lazy learning regime. While deep bias-free ReLU networks are more expressive than their two-layer counterparts, they still share a number of similarities with deep linear networks. These similarities enable us to leverage insights from linear networks to understand certain ReLU networks. Overall, our results show that some properties previously established for bias-free ReLU networks arise due to equivalence to linear networks.
Dominé, C. C. J., Anguita, N., Proca, A. M., Braun, L., Kunin, D., Mediano, P. A. M., & Saxe, A. M. (2025). From lazy to rich: Exact learning dynamics in deep linear networks. International Conference on Learning Representations. https://openreview.net/forum?id=ZXaocmXc6d
Abstract | pdf

Biological and artificial neural networks develop internal representations that enable them to perform complex tasks. In artificial networks, the effectiveness of these models relies on their ability to build task specific representation, a process influenced by interactions among datasets, architectures, initialization strategies, and optimization algorithms. Prior studies highlight that different initializations can place networks in either a lazy regime, where representations remain static, or a rich/feature learning regime, where representations evolve dynamically. Here, we examine how initialization influences learning dynamics in deep linear neural networks, deriving exact solutions for lambda-balanced initializations-defined by the relative scale of weights across layers. These solutions capture the evolution of representations and the Neural Tangent Kernel across the spectrum from the rich to the lazy regimes. Our findings deepen the theoretical understanding of the impact of weight initialization on learning regimes, with implications for continual learning, reversal learning, and transfer learning, relevant to both neuroscience and practical applications.
Kunin, D., Raventós, A., Dominé, C., Chen, F., Klindt, D., Saxe, A., & Ganguli, S. (2024). Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, & C. Zhang (Eds.), Advances in Neural Information Processing Systems (Vol. 37, pp. 81157–81203). Curran Associates, Inc. https://doi.org/10.52202/079017-2580
Abstract | pdf | DOI

While the impressive performance of modern neural networks is often attributed to their capacity to efficiently extract task-relevant features from data, the mechanisms underlying this rich feature learning regime remain elusive, with much of our theoretical understanding stemming from the opposing lazy regime. In this work, we derive exact solutions to a minimal model that transitions between lazy and rich learning, precisely elucidating how unbalanced layer-specific initialization variances and learning rates determine the degree of feature learning. Our analysis reveals that they conspire to influence the learning regime through a set of conserved quantities that constrain and modify the geometry of learning trajectories in parameter and function space. We extend our analysis to more complex linear models with multiple neurons, outputs, and layers and to shallow nonlinear networks with piecewise linear activation functions. In linear networks, rapid feature learning only occurs from balanced initializations, where all layers learn at similar speeds. While in nonlinear networks, unbalanced initializations that promote faster learning in earlier layers can accelerate rich learning. Through a series of experiments, we provide evidence that this unbalanced rich regime drives feature learning in deep finite-width networks, promotes interpretability of early layers in CNNs, reduces the sample complexity of learning hierarchical data, and decreases the time to grokking in modular arithmetic. Our theory motivates further exploration of unbalanced initializations to enhance efficient feature learning.
Sandbrink, K., Bauer, J. P., Proca, A. M., Saxe, A. M., Summerfield, C., & Hummos, A. (2024). Flexible task abstractions emerge in linear networks with fast and bounded units. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, & C. Zhang (Eds.), Advances in Neural Information Processing Systems (Vol. 37, pp. 6938–6978). Curran Associates, Inc. https://doi.org/10.52202/079017-0223
Abstract | pdf | DOI

Animals survive in dynamic environments changing at arbitrary timescales, but such data distribution shifts are a challenge to neural networks. To adapt to change, neural systems may change a large number of parameters, which is a slow process involving forgetting past information. In contrast, animals leverage distribution changes to segment their stream of experience into tasks and associate them with internal task abstracts. Animals can then respond flexibly by selecting the appropriate task abstraction. However, how such flexible task abstractions may arise in neural systems remains unknown. Here, we analyze a linear gated network where the weights and gates are jointly optimized via gradient descent, but with neuron-like constraints on the gates including a faster timescale, non-negativity, and bounded activity. We observe that the weights self-organize into modules specialized for tasks or sub-tasks encountered, while the gates layer forms unique representations that switch the appropriate weight modules (task abstractions). We analytically reduce the learning dynamics to an effective eigenspace, revealing a virtuous cycle: fast adapting gates drive weight specialization by protecting previous knowledge, while weight specialization in turn increases the update rate of the gating layer. Task switching in the gating layer accelerates as a function of curriculum block size and task training, mirroring key findings in cognitive neuroscience. We show that the discovered task abstractions support generalization through both task and subtask composition, and we extend our findings to a non-linear network switching between two tasks. Overall, our work offers a theory of cognitive flexibility in animals as arising from joint gradient descent on synaptic and neural gating in a neural network architecture.
Dominé, C. C. J., Carrasco-Davis, R., Hollingsworth, L., Sirmpilatze, N., Tyson, A. L., Jarvis, D., Barry, C., & Saxe, A. M. (2024). NeuralPlayground: A Standardised Environment for Evaluating Models of Hippocampus and Entorhinal Cortex. BioRxiv, 2024.03.06.583699. https://doi.org/10.1101/2024.03.06.583699
Abstract | DOI

Neural processes in the hippocampus and entorhinal cortex are thought to be crucial for spatial cognition. A growing variety of theoretical models have been proposed to capture the rich neural and behavioral phenomena associated with these circuits. However, systematic comparison of these theories, both against each other and against empirical data, remains challenging. To address this gap, we present NeuralPlayground, an open-source standardised software framework for comparisons between theory and experiment in the domain of spatial cognition. This Python software package offers a reproducible way to compare models against a centralised library of published experimental results, including neural recordings and animal behavior. The framework implements three Agents embodying different computational models; three Experiments comprising publicly available neural and behavioral datasets; a customisable 2-dimensional Arena (continuous and discrete) able to generate common and novel spatial layouts; and a Comparison tool that facilitates systematic comparisons between models and data. Each module can also be used separately, allowing standardised and flexible access to influential models and data sets. We hope NeuralPlayground, available on GitHub 3, provides a starting point for a shared, standardized, open, and reproducible computational understanding of the role of the hippocampus and entorhinal cortex in spatial cognition.
Lufkin, L., Saxe, A., & Grant, E. (2024). Nonlinear dynamics of localization in neural receptive fields. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, & C. Zhang (Eds.), Advances in Neural Information Processing Systems (Vol. 37, pp. 25938–25960). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2024/file/2dab2f94544f9297d01a46a5453b93cd-Paper-Conference.pdf
Abstract | pdf

Localized receptive fields—neurons that are selective for certain contiguous spatiotemporal features of their input—populate early sensory regions of the mammalian brain. Unsupervised learning algorithms that optimize explicit sparsity or independence criteria replicate features of these localized receptive fields, but fail to explain directly how localization arises through learning without efficient coding, as occurs in early layers of deep neural networks and might occur in early sensory regions of biological systems. We consider an alternative model in which localized receptive fields emerge without explicit top-down efficiency constraints—a feed-forward neural network trained on a data model inspired by the structure of natural images. Previous work identified the importance of non-Gaussian statistics to localization in this setting but left open questions about the mechanisms driving dynamical emergence. We address these questions by deriving the effective learning dynamics for a single nonlinear neuron, making precise how higher-order statistical properties of the input data drive emergent localization, and we demonstrate that the predictions of these effective dynamics extend to the many-neuron setting. Our analysis provides an alternative explanation for the ubiquity of localization as resulting from the nonlinear dynamics of learning in neural circuits.
Löwe, A. T., Touzo, L., Muhle-Karbe, P. S., Saxe, A. M., Summerfield, C., & Schuck, N. W. (2024). Abrupt and spontaneous strategy switches emerge in simple regularised neural networks. PLOS Computational Biology, 20(10), 1–29. https://doi.org/10.1371/journal.pcbi.1012505
Abstract | DOI

Humans sometimes have an insight that leads to a sudden and drastic performance improvement on the task they are working on. Sudden strategy adaptations are often linked to insights, considered to be a unique aspect of human cognition tied to complex processes such as creativity or meta-cognitive reasoning. Here, we take a learning perspective and ask whether insight-like behaviour can occur in simple artificial neural networks, even when the models only learn to form input-output associations through gradual gradient descent. We compared learning dynamics in humans and regularised neural networks in a perceptual decision task that included a hidden regularity to solve the task more efficiently. Our results show that only some humans discover this regularity, and that behaviour is marked by a sudden and abrupt strategy switch that reflects an aha-moment. Notably, we find that simple neural networks with a gradual learning rule and a constant learning rate closely mimicked behavioural characteristics of human insight-like switches, exhibiting delay of insight, suddenness and selective occurrence in only some networks. Analyses of network architectures and learning dynamics revealed that insight-like behaviour crucially depended on a regularised gating mechanism and noise added to gradient updates, which allowed the networks to accumulate “silent knowledge” that is initially suppressed by regularised gating. This suggests that insight-like behaviour can arise from gradual learning in simple neural networks, where it reflects the combined influences of noise, gating and regularisation. These results have potential implications for more complex systems, such as the brain, and guide the way for future insight research.
Rubruck, J., Bauer, J. P., Saxe, A., & Summerfield, C. (2024). Early learning of the optimal constant solution in neural networks and humans. https://arxiv.org/abs/2406.17467
Abstract | arXiv

Deep neural networks learn increasingly complex functions over the course of training. Here, we show both empirically and theoretically that learning of the target function is preceded by an early phase in which networks learn the optimal constant solution (OCS) - that is, initial model responses mirror the distribution of target labels, while entirely ignoring information provided in the input. Using a hierarchical category learning task, we derive exact solutions for learning dynamics in deep linear networks trained with bias terms. Even when initialized to zero, this simple architectural feature induces substantial changes in early dynamics. We identify hallmarks of this early OCS phase and illustrate how these signatures are observed in deep linear networks and larger, more complex (and nonlinear) convolutional neural networks solving a hierarchical learning task based on MNIST and CIFAR10. We explain these observations by proving that deep linear networks necessarily learn the OCS during early learning. To further probe the generality of our results, we train human learners over the course of three days on the category learning task. We then identify qualitative signatures of this early OCS phase in terms of the dynamics of true negative (correct-rejection) rates. Surprisingly, we find the same early reliance on the OCS in the behaviour of human learners. Finally, we show that learning of the OCS can emerge even in the absence of bias terms and is equivalently driven by generic correlations in the input data. Overall, our work suggests the OCS as a universal learning principle in supervised, error-corrective learning, and the mechanistic reasons for its prevalence.
Mannelli, S. S., Ivashynka, Y., Saxe, A. M., & Saglietti, L. (2024). Tilting the Odds at the Lottery: the Interplay of Overparameterisation and Curricula in Neural Networks. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, & F. Berkenkamp (Eds.), Proceedings of the 41st International Conference on Machine Learning (Vol. 235, pp. 34586–34602). PMLR. https://proceedings.mlr.press/v235/mannelli24a.html
Abstract | pdf

A wide range of empirical and theoretical works have shown that overparameterisation can amplify the performance of neural networks. According to the lottery ticket hypothesis, overparameterised networks have an increased chance of containing a sub-network that is well-initialised to solve the task at hand. A more parsimonious approach, inspired by animal learning, consists in guiding the learner towards solving the task by curating the order of the examples, ie. providing a curriculum. However, this learning strategy seems to be hardly beneficial in deep learning applications. In this work, we propose a theoretical analysis that connects curriculum learning and overparameterisation. In particular, we investigate their interplay in the online learning setting for a 2-layer network in the XOR-like Gaussian Mixture problem. Our results show that a high degree of overparameterisation—while simplifying the problem—can limit the benefit from curricula, providing a theoretical account of the ineffectiveness of curricula in deep learning.
Singh, A. K., Moskovitz, T., Hill, F., Chan, S. C. Y., & Saxe, A. M. (2024). What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, & F. Berkenkamp (Eds.), Proceedings of the 41st International Conference on Machine Learning (Vol. 235, pp. 45637–45662). PMLR. https://proceedings.mlr.press/v235/singh24c.html
Abstract | pdf

In-context learning is a powerful emergent ability in transformer models. Prior work in mechanistic interpretability has identified a circuit element that may be critical for in-context learning – the induction head (IH), which performs a match-and-copy operation. During training of large transformers on natural language data, IHs emerge around the same time as a notable phase change in the loss. Despite the robust evidence for IHs and this interesting coincidence with the phase change, relatively little is known about the diversity and emergence dynamics of IHs. Why is there more than one IH, and how are they dependent on each other? Why do IHs appear all of a sudden, and what are the subcircuits that enable them to emerge? We answer these questions by studying IH emergence dynamics in a controlled setting by training on synthetic data. In doing so, we develop and share a novel optogenetics-inspired causal framework for modifying activations throughout training. Using this framework, we delineate the diverse and additive nature of IHs. By "clamping" subsets of activations throughout training, we then identify three underlying subcircuits that interact to drive IH formation, yielding the phase change. Furthermore, these subcircuits shed light on data-dependent properties of formation, such as phase change timing, already showing the promise of this more in-depth understanding of subcircuits that need to "go right" for an induction head.
Lee, J. H., Mannelli, S. S., & Saxe, A. M. (2024). Why Do Animals Need Shaping? A Theory of Task Composition and Curriculum Learning. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, & F. Berkenkamp (Eds.), Proceedings of the 41st International Conference on Machine Learning (Vol. 235, pp. 26837–26855). PMLR. https://proceedings.mlr.press/v235/lee24r.html
Abstract | pdf

Diverse studies in systems neuroscience begin with extended periods of curriculum training known as ‘shaping’ procedures. These involve progressively studying component parts of more complex tasks, and can make the difference between learning a task quickly, slowly or not at all. Despite the importance of shaping to the acquisition of complex tasks, there is as yet no theory that can help guide the design of shaping procedures, or more fundamentally, provide insight into its key role in learning. Modern deep reinforcement learning systems might implicitly learn compositional primitives within their multilayer policy networks. Inspired by these models, we propose and analyse a model of deep policy gradient learning of simple compositional reinforcement learning tasks. Using the tools of statistical physics, we solve for exact learning dynamics and characterise different learning strategies including primitives pre-training, in which task primitives are studied individually before learning compositional tasks. We find a complex interplay between task complexity and the efficacy of shaping strategies. Overall, our theory provides an analytical understanding of the benefits of shaping in a class of compositional tasks and a quantitative account of how training protocols can disclose useful task primitives, ultimately yielding faster and more robust learning.
Van Rossem, L., & Saxe, A. M. (2024). When Representations Align: Universality in Representation Learning Dynamics. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, & F. Berkenkamp (Eds.), Proceedings of the 41st International Conference on Machine Learning (Vol. 235, pp. 49098–49121). PMLR. https://proceedings.mlr.press/v235/van-rossem24a.html
Abstract | pdf

Deep neural networks come in many sizes and architectures. The choice of architecture, in conjunction with the dataset and learning algorithm, is commonly understood to affect the learned neural representations. Yet, recent results have shown that different architectures learn representations with striking qualitative similarities. Here we derive an effective theory of representation learning under the assumption that the encoding map from input to hidden representation and the decoding map from representation to output are arbitrary smooth functions. This theory schematizes representation learning dynamics in the regime of complex, large architectures, where hidden representations are not strongly constrained by the parametrization. We show through experiments that the effective theory describes aspects of representation learning dynamics across a range of deep networks with different activation functions and architectures, and exhibits phenomena similar to the “rich” and “lazy” regime. While many network behaviors depend quantitatively on architecture, our findings point to certain behaviors that are widely conserved once models are sufficiently flexible.
Zhang, Y., Latham, P. E., & Saxe, A. M. (2024). Understanding Unimodal Bias in Multimodal Deep Linear Networks. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, & F. Berkenkamp (Eds.), Proceedings of the 41st International Conference on Machine Learning (Vol. 235, pp. 59100–59125). PMLR. https://proceedings.mlr.press/v235/zhang24aa.html
Abstract | pdf

Using multiple input streams simultaneously to train multimodal neural networks is intuitively advantageous but practically challenging. A key challenge is unimodal bias, where a network overly relies on one modality and ignores others during joint training. We develop a theory of unimodal bias with multimodal deep linear networks to understand how architecture and data statistics influence this bias. This is the first work to calculate the duration of the unimodal phase in learning as a function of the depth at which modalities are fused within the network, dataset statistics, and initialization. We show that the deeper the layer at which fusion occurs, the longer the unimodal phase. A long unimodal phase can lead to a generalization deficit and permanent unimodal bias in the overparametrized regime. Our results, derived for multimodal linear networks, extend to nonlinear networks in certain settings. Taken together, this work illuminates pathologies of multimodal learning under joint training, showing that late and intermediate fusion architectures can give rise to long unimodal phases and permanent unimodal bias. Our code is available at: https://yedizhang.github.io/unimodal-bias.html.
Carrasco-Davis, R., Masís, J., & Saxe, A. M. (2024). Meta-Learning Strategies through Value Maximization in Neural Networks. arXiv. http://arxiv.org/abs/2310.19919
Abstract | arXiv

Biological and artificial learning agents face numerous choices about how to learn, ranging from hyperparameter selection to aspects of task distributions like curricula. Understanding how to make these meta-learning choices could offer normative accounts of cognitive control functions in biological learners and improve engineered systems. Yet optimal strategies remain challenging to compute in modern deep networks due to the complexity of optimizing through the entire learning process. Here we theoretically investigate optimal strategies in a tractable setting. We present a learning effort framework capable of efficiently optimizing control signals on a fully normative objective: discounted cumulative performance throughout learning. We obtain computational tractability by using average dynamical equations for gradient descent, available for simple neural network architectures. Our framework accommodates a range of meta-learning and automatic curriculum learning methods in a unified normative setting. We apply this framework to investigate the effect of approximations in common meta-learning algorithms; infer aspects of optimal curricula; and compute optimal neuronal resource allocation in a continual learning setting. Across settings, we find that control effort is most beneficial when applied to easier aspects of a task early in learning; followed by sustained effort on harder aspects. Overall, the learning effort framework provides a tractable theoretical test bed to study normative benefits of interventions in a variety of learning systems, as well as a formal account of optimal cognitive control strategies over learning trajectories posited by established theories in cognitive neuroscience.
Jarvis, D., Klein, R., Rosman, B., & Saxe, A. M. (2024). On The Specialization of Neural Modules. arXiv. http://arxiv.org/abs/2409.14981
Abstract | arXiv

A number of machine learning models have been proposed with the goal of achieving systematic generalization: the ability to reason about new situations by combining aspects of previous experiences. These models leverage compositional architectures which aim to learn specialized modules dedicated to structures in a task that can be composed to solve novel problems with similar structures. While the compositionality of these architectures is guaranteed by design, the modules specializing is not. Here we theoretically study the ability of network modules to specialize to useful structures in a dataset and achieve systematic generalization. To this end we introduce a minimal space of datasets motivated by practical systematic generalization benchmarks. From this space of datasets we present a mathematical definition of systematicity and study the learning dynamics of linear neural modules when solving components of the task. Our results shed light on the difficulty of module specialization, what is required for modules to successfully specialize, and the necessity of modular architectures to achieve systematicity. Finally, we confirm that the theoretical results in our tractable setting generalize to more complex datasets and non-linear architectures.
Sun, W., Advani, M., Spruston, N., Saxe, A., & Fitzgerald, J. E. (2023). Organizing memories for generalization in complementary learning systems. Nature Neuroscience, 26(8), 1438–1448. https://doi.org/10.1038/s41593-023-01382-9
Abstract | DOI

Memorization and generalization are complementary cognitive processes that jointly promote adaptive behavior. For example, animals should memorize safe routes to specific water sources and generalize from these memories to discover environmental features that predict new ones. These functions depend on systems consolidation mechanisms that construct neocortical memory traces from hippocampal precursors, but why systems consolidation only applies to a subset of hippocampal memories is unclear. Here we introduce a new neural network formalization of systems consolidation that reveals an overlooked tension-unregulated neocortical memory transfer can cause overfitting and harm generalization in an unpredictable world. We resolve this tension by postulating that memories only consolidate when it aids generalization. This framework accounts for partial hippocampal-cortical memory transfer and provides a normative principle for reconceptualizing numerous observations in the field. Generalization-optimized systems consolidation thus provides new insight into how adaptive behavior benefits from complementary learning systems specialized for memorization and generalization.
Singh, A., Chan, S., Moskovitz, T., Grant, E., Saxe, A., & Hill, F. (2023). The Transient Nature of Emergent In-Context Learning in Transformers. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, & S. Levine (Eds.), Advances in Neural Information Processing Systems (Vol. 36, pp. 27801–27819). Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2023/file/58692a1701314e09cbd7a5f5f3871cc9-Paper-Conference.pdf
Abstract | pdf

Transformer neural networks can exhibit a surprising capacity for in-context learning (ICL) despite not being explicitly trained for it. Prior work has provided a deeper understanding of how ICL emerges in transformers, e.g. through the lens of mechanistic interpretability, Bayesian inference, or by examining the distributional properties of training data. However, in each of these cases, ICL is treated largely as a persistent phenomenon; namely, once ICL emerges, it is assumed to persist asymptotically. Here, we show that the emergence of ICL during transformer training is, in fact, often transient. We train transformers on synthetic data designed so that both ICL and in-weights learning (IWL) strategies can lead to correct predictions. We find that ICL first emerges, then disappears and gives way to IWL, all while the training loss decreases, indicating an asymptotic preference for IWL. The transient nature of ICL is observed in transformers across a range of model sizes and datasets, raising the question of how much to "overtrain" transformers when seeking compact, cheaper-to-run models. We find that L2 regularization may offer a path to more persistent ICL that removes the need for early stopping based on ICL-style validation tasks. Finally, we present initial evidence that ICL transience may be caused by competition between ICL and IWL circuits.
Flesch, T., Nagy, D. G., Saxe, A., & Summerfield, C. (2023). Modelling continual learning in humans with Hebbian context gating and exponentially decaying task signals. PLoS Computational Biology, 19(1), e1010808. https://doi.org/10.1371/journal.pcbi.1010808
Abstract | DOI

Humans can learn several tasks in succession with minimal mutual interference but perform more poorly when trained on multiple tasks at once. The opposite is true for standard deep neural networks. Here, we propose novel computational constraints for artificial neural networks, inspired by earlier work on gating in the primate prefrontal cortex, that capture the cost of interleaved training and allow the network to learn two tasks in sequence without forgetting. We augment standard stochastic gradient descent with two algorithmic motifs, so-called "sluggish" task units and a Hebbian training step that strengthens connections between task units and hidden units that encode task-relevant information. We found that the "sluggish" units introduce a switch-cost during training, which biases representations under interleaved training towards a joint representation that ignores the contextual cue, while the Hebbian step promotes the formation of a gating scheme from task units to the hidden layer that produces orthogonal representations which are perfectly guarded against interference. Validating the model on previously published human behavioural data revealed that it matches performance of participants who had been trained on blocked or interleaved curricula, and that these performance differences were driven by misestimation of the true category boundary.
Nelli, S., Braun, L., Dumbalska, T., Saxe, A., & Summerfield, C. (2023). Neural knowledge assembly in humans and neural networks. Neuron, 111(9), 1504–1516.e9. https://doi.org/10.1016/j.neuron.2023.02.014
Abstract | DOI

Human understanding of the world can change rapidly when new information comes to light, such as when a plot twist occurs in a work of fiction. This flexible "knowledge assembly" requires few-shot reorganization of neural codes for relations among objects and events. However, existing computational theories are largely silent about how this could occur. Here, participants learned a transitive ordering among novel objects within two distinct contexts before exposure to new knowledge that revealed how they were linked. Blood-oxygen-level-dependent (BOLD) signals in dorsal frontoparietal cortical areas revealed that objects were rapidly and dramatically rearranged on the neural manifold after minimal exposure to linking information. We then adapt online stochastic gradient descent to permit similar rapid knowledge assembly in a neural network model.
Flesch, T., Saxe, A., & Summerfield, C. (2023). Continual task learning in natural and artificial agents. Trends in Neurosciences, 46(3), 199–210. https://doi.org/10.1016/j.tins.2022.12.006
Abstract | DOI

How do humans and other animals learn new tasks? A wave of brain recording studies has investigated how neural representations change during task learning, with a focus on how tasks can be acquired and coded in ways that minimise mutual interference. We review recent work that has explored the geometry and dimensionality of neural task representations in neocortex, and computational models that have exploited these findings to understand how the brain may partition knowledge between tasks. We discuss how ideas from machine learning, including those that combine supervised and unsupervised learning, are helping neuroscientists understand how natural tasks are learned and coded in biological brains.
Masís, J., Chapman, T., Rhee, J. Y., Cox, D. D., & Saxe, A. M. (2023). Strategically managing learning during perceptual decision making. ELife, 12, e64978. https://doi.org/10.7554/eLife.64978
Abstract | DOI

Making optimal decisions in the face of noise requires balancing short-term speed and accuracy. But a theory of optimality should account for the fact that short-term speed can influence long-term accuracy through learning. Here, we demonstrate that long-term learning is an important dynamical dimension of the speed-accuracy trade-off. We study learning trajectories in rats and formally characterize these dynamics in a theory expressed as both a recurrent neural network and an analytical extension of the drift-diffusion model that learns over time. The model reveals that choosing suboptimal response times to learn faster sacrifices immediate reward, but can lead to greater total reward. We empirically verify predictions of the theory, including a relationship between stimulus exposure and learning speed, and a modulation of reaction time by future learning prospects. We find that rats’ strategies approximately maximize total reward over the full learning epoch, suggesting cognitive control over the learning process.
Shamash, P., Lee, S., Saxe, A. M., & Branco, T. (2023). Mice identify subgoal locations through an action-driven mapping process. Neuron, 111(12), 1966–1978.e8. https://doi.org/10.1016/j.neuron.2023.03.034
Abstract | DOI

Mammals form mental maps of the environments by exploring their surroundings. Here, we investigate which elements of exploration are important for this process. We studied mouse escape behavior, in which mice are known to memorize subgoal locations-obstacle edges-to execute efficient escape routes to shelter. To test the role of exploratory actions, we developed closed-loop neural-stimulation protocols for interrupting various actions while mice explored. We found that blocking running movements directed at obstacle edges prevented subgoal learning; however, blocking several control movements had no effect. Reinforcement learning simulations and analysis of spatial data show that artificial agents can match these results if they have a region-level spatial representation and explore with object-directed movements. We conclude that mice employ an action-driven process for integrating subgoals into a hierarchical cognitive map. These findings broaden our understanding of the cognitive toolkit that mammals use to acquire spatial knowledge.
Flesch, T., Mante, V., Newsome, W., Saxe, A., Summerfield, C., & Sussillo, D. (2023). Are task representations gated in macaque prefrontal cortex? arXiv. http://arxiv.org/abs/2306.16733
Abstract | arXiv

A recent paper (Flesch et al, 2022) describes behavioural and neural data suggesting that task representations are gated in the prefrontal cortex in both humans and macaques. This short note proposes an alternative explanation for the reported results from the macaque data.
Saglietti, L., Mannelli, S., & Saxe, A. (2022). An analytical theory of curriculum learning in teacher-student networks. Advances in Neural Information Processing Systems, 35, 21113–21127. https://proceedings.neurips.cc/paper_files/paper/2022/hash/84bad835faaf48f24d990072bb5b80ee-Abstract-Conference.html
Abstract | pdf

In animals and humans, curriculum learning—presenting data in a curated order—is critical to rapid learning and effective pedagogy. A long history of experiments has demonstrated the impact of curricula in a variety of animals but, despite its ubiquitous presence, a theoretical understanding of the phenomenon is still lacking. Surprisingly, in contrast to animal learning, curricula strategies are not widely used in machine learning and recent simulation studies reach the conclusion that curricula are moderately effective or ineffective in most cases. This stark difference in the importance of curriculum raises a fundamental theoretical question: when and why does curriculum learning help? In this work, we analyse a prototypical neural network model of curriculum learning in the high-dimensional limit, employing statistical physics methods. We study a task in which a sparse set of informative features are embedded amidst a large set of noisy features. We analytically derive average learning trajectories for simple neural networks on this task, which establish a clear speed benefit for curriculum learning in the online setting. However, when training experiences can be stored and replayed (for instance, during sleep), the advantage of curriculum in standard neural networks disappears, in line with observations from the deep learning literature. Inspired by synaptic consolidation techniques developed to combat catastrophic forgetting, we investigate whether consolidating synapses at curriculum change points can boost the benefits of curricula. We derive generalisation performance as a function of consolidation strength (implemented as a Gaussian prior connecting learning phases), and show that this consolidation mechanism can yield a large improvement in test performance. Our reduced analytical descriptions help reconcile apparently conflicting empirical results, trace regimes where curriculum learning yields the largest gains, and provide experimentally-accessible predictions for the impact of task parameters on curriculum benefits. More broadly, our results suggest that fully exploiting a curriculum may require explicit consolidation at curriculum boundaries.
Saxe, A., Sodhani, S., & Lewallen, S. J. (2022). The Neural Race Reduction: Dynamics of Abstraction in Gated Networks. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), Proceedings of the 39th International Conference on Machine Learning (Vol. 162, pp. 19287–19309). PMLR. https://proceedings.mlr.press/v162/saxe22a.html
Abstract | pdf

Our theoretical understanding of deep learning has not kept pace with its empirical success. While network architecture is known to be critical, we do not yet understand its effect on learned representations and network behavior, or how this architecture should reflect task structure.In this work, we begin to address this gap by introducing the Gated Deep Linear Network framework that schematizes how pathways of information flow impact learning dynamics within an architecture. Crucially, because of the gating, these networks can compute nonlinear functions of their input. We derive an exact reduction and, for certain cases, exact solutions to the dynamics of learning. Our analysis demonstrates that the learning dynamics in structured networks can be conceptualized as a neural race with an implicit bias towards shared representations, which then govern the model’s ability to systematically generalize, multi-task, and transfer. We validate our key insights on naturalistic datasets and with relaxed assumptions. Taken together, our work gives rise to general hypotheses relating neural architecture to learning and provides a mathematical approach towards understanding the design of more complex architectures and the role of modularity and compositionality in solving real-world problems. The code and results are available at https://www.saxelab.org/gated-dln.
Lee, S., Mannelli, S. S., Clopath, C., Goldt, S., & Saxe, A. (2022). Maslow’s Hammer in Catastrophic Forgetting: Node Re-Use vs. Node Activation. In K. Chaudhuri, S. Jegelka, L. Song, C. Szepesvari, G. Niu, & S. Sabato (Eds.), Proceedings of the 39th International Conference on Machine Learning (Vol. 162, pp. 12455–12477). PMLR. https://proceedings.mlr.press/v162/lee22g.html
Abstract | pdf

Continual learning—learning new tasks in sequence while maintaining performance on old tasks—remains particularly challenging for artificial neural networks. Surprisingly, the amount of forgetting does not increase with the dissimilarity between the learned tasks, but appears to be worst in an intermediate similarity regime. In this paper we theoretically analyse both a synthetic teacher-student framework and a real data setup to provide an explanation of this phenomenon that we name Maslow’s Hammer hypothesis. Our analysis reveals the presence of a trade-off between node activation and node re-use that results in worst forgetting in the intermediate regime. Using this understanding we reinterpret popular algorithmic interventions for catastrophic interference in terms of this trade-off, and identify the regimes in which they are most effective.
Singh, A. K., Ding, D., Saxe, A., Hill, F., & Lampinen, A. (2023). Know your audience: specializing grounded language models with listener subtraction. In A. Vlachos & I. Augenstein (Eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (pp. 3884–3911). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.eacl-main.279
Abstract | DOI

Effective communication requires adapting to the idiosyncrasies of each communicative context—such as the common ground shared with each partner. Humans demonstrate this ability to specialize to their audience in many contexts, such as the popular game Dixit. We take inspiration from Dixit to formulate a multi-agent image reference game where a (trained) speaker model is rewarded for describing a target image such that one (pretrained) listener model can correctly identify it among distractors, but another listener cannot. To adapt, the speaker must exploit differences in the knowledge it shares with the different listeners. We show that finetuning an attention-based adapter between a CLIP vision encoder and a large language model in this contrastive, multi-agent setting gives rise to context-dependent natural language specialization from rewards only, without direct supervision. Through controlled experiments, we show that training a speaker with two listeners that perceive differently, using our method, allows the speaker to adapt to the idiosyncracies of the listeners. Furthermore, we show zero-shot transfer of the specialization to real-world data. Our experiments demonstrate a method for specializing grounded language models without direct supervision and highlight the interesting research challenges posed by complex multi-agent communication.
Lee, S., Goldt, S., & Saxe, A. (2021). Continual learning in the teacher-student setup: Impact of task similarity. International Conference on Machine Learning, 6109–6119. https://proceedings.mlr.press/v139/lee21e.html?ref=https://githubhelp.com
Abstract | pdf

Continual learning—the ability to learn many tasks in sequence—is critical for artificial learning systems. Yet standard training methods for deep networks often suffer from catastrophic forgetting, where learning new tasks erases knowledge of the earlier tasks. While catastrophic forgetting labels the problem, the theoretical reasons for interference between tasks remain unclear. Here, we attempt to narrow this gap between theory and practice by studying continual learning in the teacher-student setup. We extend previous analytical work on two-layer networks in the teacher-student setup to multiple teachers. Using each teacher to represent a different task, we investigate how the relationship between teachers affects the amount of forgetting and transfer exhibited by the student when the task switches. In line with recent work, we find that when tasks depend on similar features, intermediate task similarity leads to greatest forgetting. However, feature similarity is only one way in which tasks may be related. The teacher-student approach allows us to disentangle task similarity at the level of readouts (hidden-to-output weights) as well as features (input-to-hidden weights). We find a complex interplay between both types of similarity, initial transfer/forgetting rates, maximum transfer/forgetting, and the long-time (post-switch) amount of transfer/forgetting. Together, these results help illuminate the diverse factors contributing to catastrophic forgetting.
Flesch, T., Juechems, K., Dumbalska, T., Saxe, A., & Summerfield, C. (2022). Orthogonal representations for robust context-dependent task performance in brains and neural networks. Neuron, 110(7), 1258–1270. https://doi.org/10.1016/j.neuron.2022.01.005
Abstract | DOI

How do neural populations code for multiple, potentially conflicting tasks? Here we used computational simulations involving neural networks to define “lazy” and “rich” coding solutions to this context-dependent decision-making problem, which trade off learning speed for robustness. During lazy learning the input dimensionality is expanded by random projections to the network hidden layer, whereas in rich learning hidden units acquire structured representations that privilege relevant over irrelevant features. For context-dependent decision-making, one rich solution is to project task representations onto low-dimensional and orthogonal manifolds. Using behavioral testing and neuroimaging in humans and analysis of neural signals from macaque prefrontal cortex, we report evidence for neural coding patterns in biological brains whose dimensionality and neural geometry are consistent with the rich learning regime.
Saxe, A., Nelli, S., & Summerfield, C. (2021). If deep learning is the answer, what is the question? Nature Reviews Neuroscience, 22(1), 55–67. https://doi.org/10.1038/s41583-020-00395-8
Abstract | DOI

Neuroscience research is undergoing a minor revolution. Recent advances in machine learning and artificial intelligence research have opened up new ways of thinking about neural computation. Many researchers are excited by the possibility that deep neural networks may offer theories of perception, cognition and action for biological brains. This approach has the potential to radically reshape our approach to understanding neural systems, because the computations performed by deep networks are learned from experience, and not endowed by the researcher. If so, how can neuroscientists use deep networks to model and understand biological brains? What is the outlook for neuroscientists who seek to characterize computations or neural codes, or who wish to understand perception, attention, memory and executive functions? In this Perspective, our goal is to offer a road map for systems neuroscience research in the age of deep learning. We discuss the conceptual and methodological challenges of comparing behaviour, learning dynamics and neural representations in artificial and biological systems, and we highlight new research questions that have emerged for neuroscience as a direct consequence of recent advances in machine learning.
Advani, M. S., Saxe, A. M., & Sompolinsky, H. (2020). High-dimensional dynamics of generalization error in neural networks. Neural Networks, 132, 428–446. https://doi.org/10.1016/j.neunet.2020.08.022
Abstract | DOI

We perform an analysis of the average generalization dynamics of large neural networks trained using gradient descent. We study the practically-relevant "high-dimensional" regime where the number of free parameters in the network is on the order of or even larger than the number of examples in the dataset. Using random matrix theory and exact solutions in linear models, we derive the generalization error and training error dynamics of learning and analyze how they depend on the dimensionality of data and signal to noise ratio of the learning problem. We find that the dynamics of gradient descent learning naturally protect against overtraining and overfitting in large networks. Overtraining is worst at intermediate network sizes, when the effective number of free parameters equals the number of samples, and thus can be reduced by making a network smaller or larger. Additionally, in the high-dimensional regime, low generalization error requires starting with small initial weights. We then turn to non-linear neural networks, and show that making networks very large does not harm their generalization performance. On the contrary, it can in fact reduce overtraining, even without early stopping or regularization of any sort. We identify two novel phenomena underlying this behavior in overcomplete models: first, there is a frozen subspace of the weights in which no learning occurs under gradient descent; and second, the statistical properties of the high-dimensional regime yield better-conditioned input correlations which protect against overtraining. We demonstrate that standard application of theories such as Rademacher complexity are inaccurate in predicting the generalization performance of deep neural networks, and derive an alternative bound which incorporates the frozen subspace and conditioning effects and qualitatively matches the behavior observed in simulation.
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2019). A mathematical theory of semantic development in deep neural networks. Proceedings of the National Academy of Sciences, 116(23), 11537–11546. https://doi.org/10.1073/pnas.1820226116
Abstract | DOI

An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: What are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.
Goldt, S., Advani, M. S., Saxe, A. M., Krzakala, F., & Zdeborová, L. (2019). Generalisation dynamics of online learning in over-parameterised neural networks. arXiv. http://arxiv.org/abs/1901.09085
Abstract | arXiv

Deep neural networks achieve stellar generalisation on a variety of problems, despite often being large enough to easily fit all their training data. Here we study the generalisation dynamics of two-layer neural networks in a teacher-student setup, where one network, the student, is trained using stochastic gradient descent (SGD) on data generated by another network, called the teacher. We show how for this problem, the dynamics of SGD are captured by a set of differential equations. In particular, we demonstrate analytically that the generalisation error of the student increases linearly with the network size, with other relevant parameters held constant. Our results indicate that achieving good generalisation in neural networks depends on the interplay of at least the algorithm, its learning rate, the model architecture, and the data set.
Goldt, S., Advani, M., Saxe, A. M., Krzakala, F., & Zdeborová, L. (2019). Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. Advances in Neural Information Processing Systems, 32. https://proceedings.neurips.cc/paper_files/paper/2019/hash/cab070d53bd0d200746fb852a922064a-Abstract.html
Abstract | pdf

Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher-student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. Using this framework, we calculate the final generalisation error of student networks that have more parameters than their teachers. We find that the final generalisation error of the student increases with network size when training only the first layer, but stays constant or even decreases with size when training both layers. We show that these different behaviours have their root in the different solutions SGD finds for different activation functions. Our results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.
Zhang, Y., Saxe, A. M., Advani, M. S., & Lee, A. A. (2018). Energy–entropy competition and the effectiveness of stochastic gradient descent in machine learning. Molecular Physics, 116(21-22), 3214–3223. https://doi.org/10.1080/00268976.2018.1483535
Abstract | DOI

Finding parameters that minimise a loss function is at the core of many machine learning methods. The Stochastic Gradient Descent (SGD) algorithm is widely used and delivers state-of-the-art results for many problems. Nonetheless, SGD typically cannot find the global minimum, thus its empirical effectiveness is hitherto mysterious. We derive a correspondence between parameter inference and free energy minimisation in statistical physics. The degree of undersampling plays the role of temperature. Analogous to the energy–entropy competition in statistical physics, wide but shallow minima can be optimal if the system is undersampled, as is typical in many applications. Moreover, we show that the stochasticity in the algorithm has a non-trivial correlation structure which systematically biases it towards wide minima. We illustrate our argument with two prototypical models: image classification using deep learning and a linear neural network where we can analytically reveal the relationship between entropy and out-of-sample error.
Nye, M., & Saxe, A. (2018). Are Efficient Deep Representations Learnable? arXiv. http://arxiv.org/abs/1807.06399
Abstract | arXiv

Many theories of deep learning have shown that a deep network can require dramatically fewer resources to represent a given function compared to a shallow network. But a question remains: can these efficient representations be learned using current deep learning techniques? In this work, we test whether standard deep learning methods can in fact find the efficient representations posited by several theories of deep representation. Specifically, we train deep neural networks to learn two simple functions with known efficient solutions: the parity function and the fast Fourier transform. We find that using gradient-based optimization, a deep network does not learn the parity function, unless initialized very close to a hand-coded exact solution. We also find that a deep linear neural network does not learn the fast Fourier transform, even in the best-case scenario of infinite training data, unless the weights are initialized very close to the exact hand-coded solution. Our results suggest that not every element of the class of compositional functions can be learned efficiently by a deep network, and further restrictions are necessary to understand what functions are both efficiently representable and learnable.
Earle, A. C., Saxe, A. M., & Rosman, B. (2017). Hierarchical Subtask Discovery With Non-Negative Matrix Factorization. arXiv. http://arxiv.org/abs/1708.00463
Abstract | arXiv

Hierarchical reinforcement learning methods offer a powerful means of planning flexible behavior in complicated domains. However, learning an appropriate hierarchical decomposition of a domain into subtasks remains a substantial challenge. We present a novel algorithm for subtask discovery, based on the recently introduced multitask linearly-solvable Markov decision process (MLMDP) framework. The MLMDP can perform never-before-seen tasks by representing them as a linear combination of a previously learned basis set of tasks. In this setting, the subtask discovery problem can naturally be posed as finding an optimal low-rank approximation of the set of tasks the agent will face in a domain. We use non-negative matrix factorization to discover this minimal basis set of tasks, and show that the technique learns intuitive decompositions in a variety of domains. Our method has several qualitatively desirable features: it is not limited to learning subtasks with single goal states, instead learning distributed patterns of preferred states; it learns qualitatively different hierarchical decompositions in the same domain depending on the ensemble of tasks the agent will face; and it may be straightforwardly iterated to obtain deeper hierarchical decompositions.
McClelland, J. L., Sadeghi, Z., & Saxe, A. M. (2016). A Critique of Pure Hierarchy: Uncovering Cross-Cutting Structure in a Natural Dataset. Neurocomputational Models of Cognitive Development and Processing, 51–68. https://doi.org/10.1142/9789814699341_0004
Abstract | DOI

How best can we understand – and visualize – the structure in multi-dimensional data? One common approach is to rely on hierarchical cluster analysis, either for theoretical or for more descriptive reasons. Here, we point out that an apparently revealing hierarchical clustering solution may well be compatible with structure that is not well characterized as a hierarchy. In particular, a hierarchical description can be equally consistent with crosscutting rather than strictly hierarchical, or nested, structure. We offer an alternative approach, based on inspection of the feature vectors provided by a singular value decomposition (SVD) which allows a flexible mixture of hierarchical and cross-cutting dimensions and which can reveal whether dimensions are cross-cutting or nested. The SVD offers a more flexible representation than a hierarchy in that it can capture either hierarchical or cross-cutting structure or blends of these two structure types, or, indeed, many other structure types. We then introduce a refinement of the SVD approach based on sparse principal component analysis that leads to more easily interpretable dimensions. In our dataset, these dimensions correspond to aquatic vs. land animals, large vs. small animals, predators vs prey animals, and primates vs. other mammals.
Saxe, A. M., Earle, A., & Rosman, B. (2016). Hierarchy through Composition with Linearly Solvable Markov Decision Processes. arXiv. http://arxiv.org/abs/1612.02757
Abstract | arXiv

Hierarchical architectures are critical to the scalability of reinforcement learning methods. Current hierarchical frameworks execute actions serially, with macro-actions comprising sequences of primitive actions. We propose a novel alternative to these control hierarchies based on concurrent execution of many actions in parallel. Our scheme uses the concurrent compositionality provided by the linearly solvable Markov decision process (LMDP) framework, which naturally enables a learning agent to draw on several macro-actions simultaneously to solve new tasks. We introduce the Multitask LMDP module, which maintains a parallel distributed representation of tasks and may be stacked to form deep hierarchies abstracted in space and time.
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv. http://arxiv.org/abs/1312.6120
Abstract | arXiv

Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.
Monajemi, H., Jafarpour, S., Gavish, M., Collaboration, S. 330/C. M. E. 362, Donoho, D. L., Ambikasaran, S., Bacallado, S., Bharadia, D., Chen, Y., Choi, Y., Chowdhury, M., Chowdhury, S., Damle, A., Fithian, W., Goetz, G., Grosenick, L., Gross, S., Hills, G., Hornstein, M., … Zhu, Z. (2013). Deterministic matrices matching the compressed sensing phase transitions of Gaussian random matrices. Proceedings of the National Academy of Sciences, 110(4), 1181–1186. https://doi.org/10.1073/pnas.1219540110
Abstract | DOI

In compressed sensing, one takes samples of an N-dimensional vector using an matrix A, obtaining undersampled measurements . For random matrices with independent standard Gaussian entries, it is known that, when is k-sparse, there is a precisely determined phase transition: for a certain region in the (,)-phase diagram, convex optimization typically finds the sparsest solution, whereas outside that region, it typically fails. It has been shown empirically that the same property—with the same phase transition location—holds for a wide range of non-Gaussian random matrix ensembles. We report extensive experiments showing that the Gaussian phase transition also describes numerous deterministic matrices, including Spikes and Sines, Spikes and Noiselets, Paley Frames, Delsarte-Goethals Frames, Chirp Sensing Matrices, and Grassmannian Frames. Namely, for each of these deterministic matrices in turn, for a typical k-sparse object, we observe that convex optimization is successful over a region of the phase diagram that coincides with the region known for Gaussian random matrices. Our experiments considered coefficients constrained to for four different sets , and the results establish our finding for each of the four associated phase transitions.
Furlanello, T., Zhao, J., Saxe, A. M., Itti, L., & Tjan, B. S. (2016). Active Long Term Memory Networks. arXiv. http://arxiv.org/abs/1606.02355
Abstract | arXiv

Continual Learning in artificial neural networks suffers from interference and forgetting when different tasks are learned sequentially. This paper introduces the Active Long Term Memory Networks (A-LTM), a model of sequential multi-task deep learning that is able to maintain previously learned association between sensory input and behavioral output while acquiring knew knowledge. A-LTM exploits the non-convex nature of deep neural networks and actively maintains knowledge of previously learned, inactive tasks using a distillation loss. Distortions of the learned input-output map are penalized but hidden layers are free to transverse towards new local optima that are more favorable for the multi-task objective. We re-frame the McClelland’s seminal Hippocampal theory with respect to Catastrophic Inference (CI) behavior exhibited by modern deep architectures trained with back-propagation and inhomogeneous sampling of latent factors across epochs. We present empirical results of non-trivial CI during continual learning in Deep Linear Networks trained on the same task, in Convolutional Neural Networks when the task shifts from predicting semantic to graphical factors and during domain adaptation from simple to complex environments. We present results of the A-LTM model’s ability to maintain viewpoint recognition learned in the highly controlled iLab-20M dataset with 10 object categories and 88 camera viewpoints, while adapting to the unstructured domain of Imagenet with 1,000 object categories.
Goodfellow, I. J., Vinyals, O., & Saxe, A. M. (2015). Qualitatively characterizing neural network optimization problems. arXiv. http://arxiv.org/abs/1412.6544
Abstract | arXiv

Training neural networks involves solving large-scale non-convex optimization problems. This task has long been believed to be extremely difficult, with fear of local minima and other obstacles motivating a variety of schemes to improve optimization, such as unsupervised pretraining. However, modern neural networks are able to achieve negligible training error on complex tasks, using only direct training with stochastic gradient descent. We introduce a simple analysis technique to look for evidence that such networks are overcoming local optima. We find that, in fact, on a straight path from initialization to solution, a variety of state of the art neural networks never encounter any significant obstacles.
Saxe, A., Bhand, M., Mudur, R., Suresh, B., & Ng, A. (2011). Modeling cortical representational plasticity with unsupervised feature learning. Poster Presented at COSYNE, 24–27. http://bipinsuresh.info/papers/ModelingCorticalRepresentationalPlasticityWithUnsupervisedFeatureLearning.pdf
Abstract | pdf

The efficient coding hypothesis holds that neural receptive fields are adapted to the statistics of the environment, but is agnostic to the timescale of adaptation. In this work we consider the possibility that neural receptive fields are adapted to the statistics during an organism’s lifetime. In particular, we test whether some shared plasticity mechanism can account for normal receptive field properties across multiple primary sensory cortices. Furthermore, we test whether the same mechanism can account for altered receptive field properties when the statistics of the environment are altered experimentally. We find that unsupervised feature learning algorithms are able to capture several receptive field properties across sensory modalities, and also allow us to model receptive field plasticity experiments. The consistent correspondences and discrepancies between these algorithms and experimental data may provide insight into plasticity mechanisms and aid theoretical efforts to develop new learning algorithms.
Balci, F., Simen, P., Niyogi, R., Saxe, A., Hughes, J. A., Holmes, P., & Cohen, J. D. (2011). Acquisition of decision making criteria: reward rate ultimately beats accuracy. Attention, Perception, & Psychophysics, 73(2), 640–657. https://doi.org/10.3758/s13414-010-0049-7
Abstract | DOI

Speed–accuracy trade-offs strongly influence the rate of reward that can be earned in many decision-making tasks. Previous reports suggest that human participants often adopt suboptimal speed–accuracy trade-offs in single session, two-alternative forced-choice tasks. We investigated whether humans acquired optimal speed–accuracy trade-offs when extensively trained with multiple signal qualities. When performance was characterized in terms of decision time and accuracy, our participants eventually performed nearly optimally in the case of higher signal qualities. Rather than adopting decision criteria that were individually optimal for each signal quality, participants adopted a single threshold that was nearly optimal for most signal qualities. However, setting a single threshold for different coherence conditions resulted in only negligible decrements in the maximum possible reward rate. Finally, we tested two hypotheses regarding the possible sources of suboptimal performance: (1) favoring accuracy over reward rate and (2) misestimating the reward rate due to timing uncertainty. Our findings provide support for both hypotheses, but also for the hypothesis that participants can learn to approach optimality. We find specifically that an accuracy bias dominates early performance, but diminishes greatly with practice. The residual discrepancy between optimal and observed performance can be explained by an adaptive response to uncertainty in time estimation.
Saxe, A. M. (2013). Precis of deep linear neural networks: A theory of learning in the brain and mind. https://cognitivesciencesociety.org/wp-content/uploads/2019/01/SaxePrecis.pdf
Abstract | pdf

Abstract not available.