Selected publications

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2019). A mathematical theory of semantic development in deep neural networks.

*Proceedings of the National Academy of Sciences*, *116*(23), 11537–11546. https://doi.org/10.1073/pnas.1820226116

An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: What are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.

Advani*, M., & Saxe*, A. M. (2017). High-dimensional dynamics of generalization error in neural networks.

*ArXiv*.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Y. Bengio & Y. LeCun (Eds.),

*International Conference on Learning Representations*.

Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.

All publications

Nelli, S., Braun, L., Dumbalska, T., Saxe, A., & Summerfield, C. (2023). Neural knowledge assembly in humans and neural networks.

*Neuron*. https://doi.org/10.1016/j.neuron.2023.02.014

Human understanding of the world can change rapidly when new information comes to light, such as when a plot twist occurs in a work of fiction. This flexible “knowledge assembly” requires few-shot reorganization of neural codes for relations among objects and events. However, existing computational theories are largely silent about how this could occur. Here, participants learned a transitive ordering among novel objects within two distinct contexts before exposure to new knowledge that revealed how they were linked. Blood-oxygen-level-dependent (BOLD) signals in dorsal frontoparietal cortical areas revealed that objects were rapidly and dramatically rearranged on the neural manifold after minimal exposure to linking information. We then adapt online stochastic gradient descent to permit similar rapid knowledge assembly in a neural network model.

Löwe, A. T., Touzo, L., Muhle-Karbe, P. S., Saxe, A. M., Summerfield, C., & Schuck, N. W. (2023).

*Regularised neural networks mimic human insight*.

Masís, J., Chapman, T., Rhee, J. Y., Cox, D. D., & Saxe, A. M. (2023). Strategically managing learning during perceptual decision making.

*ELife*, *12*, e64978. https://doi.org/10.7554/eLife.64978

Making optimal decisions in the face of noise requires balancing short-term speed and accuracy. But a theory of optimality should account for the fact that short-term speed can influence long-term accuracy through learning. Here, we demonstrate that long-term learning is an important dynamical dimension of the speed-accuracy trade-off. We study learning trajectories in rats and formally characterize these dynamics in a theory expressed as both a recurrent neural network and an analytical extension of the drift-diffusion model that learns over time. The model reveals that choosing suboptimal response times to learn faster sacrifices immediate reward, but can lead to greater total reward. We empirically verify predictions of the theory, including a relationship between stimulus exposure and learning speed, and a modulation of reaction time by future learning prospects. We find that rats’ strategies approximately maximize total reward over the full learning epoch, suggesting cognitive control over the learning process.

Flesch, T., Saxe, A., & Summerfield, C. (2023). Continual task learning in natural and artificial agents.

*Trends in Neurosciences*, *46*(3), 199–210. https://doi.org/10.1016/j.tins.2022.12.006

How do humans and other animals learn new tasks? A wave of brain recording studies has investigated how neural representations change during task learning, with a focus on how tasks can be acquired and coded in ways that minimise mutual interference. We review recent work that has explored the geometry and dimensionality of neural task representations in neocortex, and computational models that have exploited these findings to understand how the brain may partition knowledge between tasks. We discuss how ideas from machine learning, including those that combine supervised and unsupervised learning, are helping neuroscientists understand how natural tasks are learned and coded in biological brains.

Flesch, T., Nagy, D. G., Saxe, A., & Summerfield, C. (2023). Modelling continual learning in humans with Hebbian context gating and exponentially decaying task signals.

*PLOS Computational Biology*, *19*(1), 1–32. https://doi.org/10.1371/journal.pcbi.1010808

Humans can learn several tasks in succession with minimal mutual interference but perform more poorly when trained on multiple tasks at once. The opposite is true for standard deep neural networks. Here, we propose novel computational constraints for artificial neural networks, inspired by earlier work on gating in the primate prefrontal cortex, that capture the cost of interleaved training and allow the network to learn two tasks in sequence without forgetting. We augment standard stochastic gradient descent with two algorithmic motifs, so-called “sluggish” task units and a Hebbian training step that strengthens connections between task units and hidden units that encode task-relevant information. We found that the “sluggish” units introduce a switch-cost during training, which biases representations under interleaved training towards a joint representation that ignores the contextual cue, while the Hebbian step promotes the formation of a gating scheme from task units to the hidden layer that produces orthogonal representations which are perfectly guarded against interference. Validating the model on previously published human behavioural data revealed that it matches performance of participants who had been trained on blocked or interleaved curricula, and that these performance differences were driven by misestimation of the true category boundary.

Braun, L., Dominé, C. C. J., Fitzgerald, J. E., & Saxe, A. M. (2022). Exact learning dynamics of deep linear networks with prior knowledge. In A. H. Oh, A. Agarwal, D. Belgrave, & K. Cho (Eds.),

*Advances in Neural Information Processing Systems*. https://openreview.net/forum?id=lJx2vng-KiC

Saxe, A. M., Sodhani, S., & Lewallen, S. (2022). The Pathway Race Reduction: Dynamics of Abstraction in Gated Networks.

*International Conference on Machine Learning*.

Lee, S., Mannelli, S. S., Clopath, C., Goldt, S., & Saxe, A. M. (2022). Maslow’s Hammer for Catastrophic Forgetting: Node Re-Use vs Node Activation.

*ICML*.

Singh, A. K., Ding, D., Saxe, A., Hill, F., & Lampinen, A. K. (2022).

*Know your audience: specializing grounded language models with the game of Dixit*.

Flesch, T., Juechems, K., Dumbalska, T., Saxe*, A., & Summerfield*, C. (2022). Orthogonal representations for robust context-dependent task performance in brains and neural networks.

*Neuron*, *110*. \*Equal contributions. https://doi.org/10.1016/j.neuron.2022.01.005

Lee, S., Goldt, S., & Saxe, A. (2021). Continual Learning in the Teacher-Student Setup: Impact of Task Similarity.

*Proceedings of the 38th International Conference on Machine Learning*. https://proceedings.mlr.press/v139/lee21e.html

Continual learning, the ability to learn many tasks in sequence, is critical for artificial learning systems. Yet standard training methods for deep networks often suffer from catastrophic forgetting, where learning new tasks erases knowledge of the earlier tasks. While catastrophic forgetting labels the problem, the theoretical reasons for interference between tasks remain unclear. Here, we attempt to narrow this gap between theory and practice by studying continual learning in the teacher-student setup. We extend previous analytical work on two-layer networks in the teacher-student setup to multiple teachers. Using each teacher to represent a different task, we investigate how the relationship between teachers affects the amount of forgetting and transfer exhibited by the student when the task switches. In line with recent work, we find that when tasks depend on similar features, intermediate task similarity leads to greatest forgetting. However, feature similarity is only one way in which tasks may be related. The teacher-student approach allows us to disentangle task similarity at the level of *readouts* (hidden-to-output weights) as well as *features* (input-to-hidden weights). We find a complex interplay between both types of similarity, initial transfer/forgetting rates, maximum transfer/forgetting, and the long-time (post-switch) amount of transfer/forgetting. Together, these results help illuminate the diverse factors contributing to catastrophic forgetting.

Saglietti, L., Mannelli, S. S., & Saxe, A. (2021). An Analytical Theory of Curriculum Learning in Teacher-Student Networks.

*ArXiv:2106.08068 [Cond-Mat, Stat]*. http://arxiv.org/abs/2106.08068

In humans and animals, curriculum learning – presenting data in a curated order – is critical to rapid learning and effective pedagogy. Yet in machine learning, curricula are not widely used and empirically often yield only moderate benefits. This stark difference in the importance of curriculum raises a fundamental theoretical question: when and why does curriculum learning help? In this work, we analyse a prototypical neural network model of curriculum learning in the high-dimensional limit, employing statistical physics methods. Curricula could in principle change both the learning speed and asymptotic performance of a model. To study the former, we provide an exact description of the online learning setting, confirming the long-standing experimental observation that curricula can modestly speed up learning. To study the latter, we derive performance in a batch learning setting, in which a network trains to convergence in successive phases of learning on dataset slices of varying difficulty. With standard training losses, curriculum does not provide generalisation benefit, in line with empirical observations. However, we show that by connecting different learning phases through simple Gaussian priors, curriculum can yield a large improvement in test performance. Taken together, our reduced analytical descriptions help reconcile apparently conflicting empirical results and trace regimes where curriculum learning yields the largest gains. More broadly, our results suggest that fully exploiting a curriculum may require explicit changes to the loss function at curriculum boundaries.

Gerace, F., Saglietti, L., Mannelli, S. S., Saxe, A., & Zdeborová, L. (2021). Probing transfer learning with a model of synthetic correlated datasets.

*ArXiv:2106.05418 [Cond-Mat]*. http://arxiv.org/abs/2106.05418

Transfer learning can significantly improve the sample efficiency of neural networks, by exploiting the relatedness between a data-scarce target task and a data-abundant source task. Despite years of successful applications, transfer learning practice often relies on ad-hoc solutions, while theoretical understanding of these procedures is still limited. In the present work, we re-think a solvable model of synthetic data as a framework for modeling correlation between data-sets. This setup allows for an analytic characterization of the generalization performance obtained when transferring the learned feature map from the source to the target task. Focusing on the problem of training two-layer networks in a binary classification setting, we show that our model can capture a range of salient features of transfer learning with real data. Moreover, by exploiting parametric control over the correlation between the two data-sets, we systematically investigate under which conditions the transfer of features is beneficial for generalization.

Nelli, S., Braun, L., Dumbalska, T., Saxe, A., & Summerfield, C. (2021). Neural knowledge assembly in humans and deep networks.

*BioRxiv*. https://www.biorxiv.org/content/10.1101/2021.10.21.465374v2

Human understanding of the world can change rapidly when new information comes to light, such as when a plot twist occurs in a work of fiction. This flexible “knowledge assembly” requires few-shot reorganisation of neural codes for relations among objects and events. However, existing computational theories are largely silent about how this could occur. Here, participants learned a transitive ordering among novel objects within two distinct contexts, before exposure to new knowledge that revealed how they were linked. BOLD signals in dorsal frontoparietal cortical areas revealed that objects were rapidly and dramatically rearranged on the neural manifold after minimal exposure to linking information. We then adapt stochastic online gradient descent to permit similar rapid knowledge assembly in a neural network model.

Sun, W., Advani, M., Spruston, N., Saxe*, A., & Fitzgerald*, J. E. (2021). Organizing memories for generalization in complementary learning systems.

*BioRxiv*. \*Equal contribution. https://www.biorxiv.org/content/10.1101/2021.10.13.463791v1

Our ability to remember the past is essential for guiding our future behavior. Psychological and neurobiological features of declarative memories are known to transform over time in a process known as systems consolidation. While many theories have sought to explain the time-varying role of hippocampal and neocortical brain areas, the computational principles that govern these transformations remain unclear. Here we propose a theory of systems consolidation in which hippocampal-cortical interactions serve to optimize generalizations that guide future adaptive behavior. We use mathematical analysis of neural network models to characterize fundamental performance tradeoffs in systems consolidation, revealing that memory components should be organized according to their predictability. The theory shows that multiple interacting memory systems can outperform just one, normatively unifying diverse experimental observations and making novel experimental predictions. Our results suggest that the psychological taxonomy and neurobiological organization of declarative memories reflect a system optimized for behaving well in an uncertain future.

Juechems, K., & Saxe, A. (2021). Inferring Actions, Intentions, and Causal Relations in a Deep Neural Network.

*Proceedings of the Annual Meeting of the Cognitive Science Society*, *43*. https://escholarship.org/uc/item/2mp5t991

From a young age, we can select actions to achieve desired goals, infer the goals of other agents, and learn causal relations in our environment through social interactions. Crucially, these abilities are productive and generative: we can impute desires to others that we have never held ourselves. These abilities are often captured by only partially overlapping models, each requiring substantial changes to fit combinations of abilities. Here, in an attempt to unify previous models, we present a neural network underpinned by the linearly solvable Markov Decision Process (LMDP) framework which permits a distributed representation of tasks. The network contains two pathways: one captures the desirability of states, and another encodes the passive dynamics of state transitions in the absence of control. Interactions between pathways are bound by a principle of rational action, enabling generative inference of actions, goals, and causal relations supported by gradient updates to parts of the network.

Saxe, A., Nelli, S., & Summerfield, C. (2021). If deep learning is the answer, what is the question?

*Nature Reviews Neuroscience*, *22*(1), 55–67. https://doi.org/10.1038/s41583-020-00395-8

Neuroscience research is undergoing a minor revolution. Recent advances in machine learning and artificial intelligence research have opened up new ways of thinking about neural computation. Many researchers are excited by the possibility that deep neural networks may offer theories of perception, cognition and action for biological brains. This approach has the potential to radically reshape our approach to understanding neural systems, because the computations performed by deep networks are learned from experience, and not endowed by the researcher. If so, how can neuroscientists use deep networks to model and understand biological brains? What is the outlook for neuroscientists who seek to characterize computations or neural codes, or who wish to understand perception, attention, memory and executive functions? In this Perspective, our goal is to offer a road map for systems neuroscience research in the age of deep learning. We discuss the conceptual and methodological challenges of comparing behaviour, learning dynamics and neural representations in artificial and biological systems, and we highlight new research questions that have emerged for neuroscience as a direct consequence of recent advances in machine learning.

Goldt, S., Advani, M. S., Saxe, A. M., Krzakala, F., & Zdeborová, L. (2020). Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup.

*Journal of Statistical Mechanics: Theory and Experiment*, *2020*(12), 124010. https://doi.org/10.1088/1742-5468/abc61e

Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher–student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. Using this framework, we calculate the final generalisation error of student networks that have more parameters than their teachers. We find that the final generalisation error of the student increases with network size when training only the first layer, but stays constant or even decreases with size when training both layers. We show that these different behaviours have their root in the different solutions SGD finds for different activation functions. Our results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.

Musslick, S., Saxe, A., Hoskin, A. N., Reichman, D., & Cohen, J. D. (2020).

*On the Rational Boundedness of Cognitive Control: Shared Versus Separated Representations*. PsyArXiv. https://doi.org/10.31234/osf.io/jkhdf

One of the most fundamental and striking limitations of human cognition appears to be a constraint in the number of control-dependent processes that can be executed at one time. This constraint motivates one of the most influential tenets of cognitive psychology: that cognitive control relies on a central, limited capacity processing mechanism that imposes a seriality constraint on processing. Here we provide a formally explicit challenge to this view. We argue that the causality is reversed: the constraints on control-dependent behavior reflect a rational bound that control mechanisms impose on processing, to prevent processing interference that arises if two or more tasks engage the same resource to be executed. We use both mathematical and numerical analyses of shared representations in neural network architectures to articulate the theory, and demonstrate its ability to explain a wide range of phenomena associated with control-dependent behavior. Furthermore, we argue that the need for control, arising from the shared use of the same resources by different tasks, reflects the optimization of a fundamental tradeoff intrinsic to network architectures: the increase in learning efficacy associated with the use of shared representations, versus the efficiency of parallel processing (i.e., multitasking) associated with task-dedicated representations. The theory helps frame a formally rigorous, normative approach to the tradeoff between control-dependent processing versus automaticity, and relates to a number of other fundamental principles and phenomena concerning cognitive function, and computation more generally.

Cao, Y., Summerfield, C., & Saxe, A. (2020). Characterizing emergent representations in a space of candidate learning rules for deep networks.

*Advances in Neural Information Processing Systems 33*. https://proceedings.neurips.cc/paper/1995/file/feab05aa91085b7a8012516bc3533958-Paper.pdf

Masis, J. A., Chapman, T., Rhee, J. Y., Cox, D. D., & Saxe, A. M. (2020). Rats strategically manage learning during perceptual decision making.

*BioRxiv*, 1–48.

Balancing the speed and accuracy of decisions is crucial for survival, but how organisms manage this trade-off during learning is largely unknown. Here, we track this trade-off during perceptual learning in rats and simulated agents. At the start of learning, rats chose long reaction times that did not optimize instantaneous reward rate, but by the end of learning chose near-optimal reaction times. To understand this behavior, we analyzed learning dynamics in a recurrent neural network model of the task. The model reveals a fundamental trade-off between instantaneous reward rate and perceptual learning speed, putting the goals of learning quickly and accruing immediate reward in tension. We find that the rats’ strategy of long initial responses can dramatically expedite learning, yielding higher total reward over task engagement. Our results demonstrate that prioritizing learning can be advantageous from a total reward perspective, and suggest that rats engage in cognitive control of learning.

Saxe, A., Nelli, S., & Summerfield, C. (2020). If deep learning is the answer, then what is the question?

*ArXiv*. http://arxiv.org/abs/2004.07580

Neuroscience research is undergoing a minor revolution. Recent advances in machine learning and artificial intelligence (AI) research have opened up new ways of thinking about neural computation. Many researchers are excited by the possibility that deep neural networks may offer theories of perception, cognition and action for biological brains. This perspective has the potential to radically reshape our approach to understanding neural systems, because the computations performed by deep networks are learned from experience, not endowed by the researcher. If so, how can neuroscientists use deep networks to model and understand biological brains? What is the outlook for neuroscientists who seek to characterise computations or neural codes, or who wish to understand perception, attention, memory, and executive functions? In this Perspective, our goal is to offer a roadmap for systems neuroscience research in the age of deep learning. We discuss the conceptual and methodological challenges of comparing behaviour, learning dynamics, and neural representation in artificial and biological systems. We highlight new research questions that have emerged for neuroscience as a direct consequence of recent advances in machine learning.

Richards, B. A., Lillicrap, T. P., Beaudoin, P., Bengio, Y., Bogacz, R., Christensen, A., Clopath, C., Costa, R. P., de Berker, A., Ganguli, S., Gillon, C. J., Hafner, D., Kepecs, A., Kriegeskorte, N., Latham, P., Lindsay, G. W., Miller, K. D., Naud, R., Pack, C. C., … Kording, K. P. (2019). A deep learning framework for neuroscience.

*Nature Neuroscience*, *22*(11), 1761–1770. https://doi.org/10.1038/s41593-019-0520-2

Systems neuroscience seeks explanations for how the brain implements a wide variety of perceptual, cognitive and motor tasks. Conversely, artificial intelligence attempts to design computational systems based on the tasks they will have to solve. In artificial neural networks, the three components specified by design are the objective functions, the learning rules and the architectures. With the growing success of deep learning, which utilizes brain-inspired architectures, these three designed components have increasingly become central to how we model, engineer and optimize complex artificial learning systems. Here we argue that a greater focus on these components would also benefit systems neuroscience. We give examples of how this optimization-based framework can drive theoretical and experimental progress in neuroscience. We contend that this principled perspective on systems neuroscience will help to generate more rapid progress.

Goldt, S., Advani, M. S., Saxe, A. M., Krzakala, F., & Zdeborová, L. (2019, June). Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup.

*NeurIPS*.

Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher-student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. Using this framework, we calculate the final generalisation error of student networks that have more parameters than their teachers. We find that the final generalisation error of the student increases with network size when training only the first layer, but stays constant or even decreases with size when training both layers. We show that these different behaviours have their root in the different solutions SGD finds for different activation functions. Our results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2019). A mathematical theory of semantic development in deep neural networks.

*Proceedings of the National Academy of Sciences*, *116*(23), 11537–11546. https://doi.org/10.1073/pnas.1820226116

An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: What are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.

Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., & Cox, D. D. (2019). On the information bottleneck theory of deep learning.

*Journal of Statistical Mechanics: Theory and Experiment*, *2019*(12), 124020. https://doi.org/10.1088/1742-5468/ab3985

The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks; and third, that the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent. Here we show that none of these claims hold true in the general case. Through a combination of analytical results and simulation, we demonstrate that the information plane trajectory is predominantly a function of the neural nonlinearity employed: double-sided saturating nonlinearities like tanh yield a compression phase as neural activations enter the saturation regime, but linear activation functions and single-sided saturating nonlinearities like the widely used ReLU in fact do not. Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent. Finally, we show that when an input domain consists of a subset of task-relevant and task-irrelevant information, hidden representations do compress the task-irrelevant information, although the overall information about the input may monotonically increase with training time, and that this compression happens concurrently with the fitting process rather than during a subsequent compression period.

Goldt, S., Advani, M. S., Saxe, A. M., Krzakala, F., & Zdeborová, L. (2019). Generalisation dynamics of online learning in over-parameterised neural networks.

*ICML Workshop on Theoretical Physics for Deep Learning Theory*. http://arxiv.org/abs/1901.09085

Abstract: Deep neural networks achieve stellar generalisation on a variety of problems, despite often being large enough to easily fit all their training data. Here we study the generalisation dynamics of two-layer neural networks in a teacher-student setup, where one network, the student, is trained using stochastic gradient descent (SGD) on data generated by another network, called the teacher. We show how for this problem, the dynamics of SGD are captured by a set of differential equations. In particular, we demonstrate analytically that the generalisation error of the student increases linearly with the network size, with other relevant parameters held constant. Our results indicate that achieving good generalisation in neural networks depends on the interplay of at least the algorithm, its learning rate, the model architecture, and the data set.

Zhang, Y., Saxe, A. M., Advani, M. S., & Lee, A. A. (2018). Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning.

*Molecular Physics*, 1–10. https://doi.org/10.1080/00268976.2018.1483535

Abstract: Finding parameters that minimise a loss function is at the core of many machine learning methods. The Stochastic Gradient Descent algorithm is widely used and delivers state of the art results for many problems. Nonetheless, Stochastic Gradient Descent typically cannot find the global minimum, thus its empirical effectiveness is hitherto mysterious. We derive a correspondence between parameter inference and free energy minimisation in statistical physics. The degree of undersampling plays the role of temperature. Analogous to the energy-entropy competition in statistical physics, wide but shallow minima can be optimal if the system is undersampled, as is typical in many applications. Moreover, we show that the stochasticity in the algorithm has a non-trivial correlation structure which systematically biases it towards wide minima. We illustrate our argument with two prototypical models: image classification using deep learning, and a linear neural network where we can analytically reveal the relationship between entropy and out-of-sample error.

Nye, M., & Saxe, A. (2018). Are Efficient Deep Representations Learnable? In Y. Bengio & Y. LeCun (Eds.),

*Workshop Track at the International Conference on Learning Representations*. https://doi.org/10.1051/0004-6361/201527329

Abstract: Many theories of deep learning have shown that a deep network can require dramatically fewer resources to represent a given function compared to a shallow network. But a question remains: can these efficient representations be learned using current deep learning techniques? In this work, we test whether standard deep learning methods can in fact find the efficient representations posited by several theories of deep representation. Specifically, we train deep neural networks to learn two simple functions with known efficient solutions: the parity function and the fast Fourier transform. We find that using gradient-based optimization, a deep network does not learn the parity function, unless initialized very close to a hand-coded exact solution. We also find that a deep linear neural network does not learn the fast Fourier transform, even in the best-case scenario of infinite training data, unless the weights are initialized very close to the exact hand-coded solution. Our results suggest that not every element of the class of compositional functions can be learned efficiently by a deep network, and further restrictions are necessary to understand what functions are both efficiently representable and learnable.

Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., & Cox, D. D. (2018). On the Information Bottleneck Theory of Deep Learning. In Y. Bengio & Y. LeCun (Eds.),

*International Conference on Learning Representations*.

Bansal, Y., Advani, M., Cox, D. D., & Saxe, A. M. (2018). Minnorm training: an algorithm for training over-parameterized deep neural networks.

*ArXiv*.

Abstract: In this work, we propose a new training method for finding minimum weight norm solutions in over-parameterized neural networks (NNs). This method seeks to improve training speed and generalization performance by framing NN training as a constrained optimization problem wherein the sum of the norm of the weights in each layer of the network is minimized, under the constraint of exactly fitting training data. It draws inspiration from support vector machines (SVMs), which are able to generalize well, despite often having an infinite number of free parameters in their primal form, and from recent theoretical generalization bounds on NNs which suggest that lower norm solutions generalize better. To solve this constrained optimization problem, our method employs Lagrange multipliers that act as integrators of error over training and identify ‘support vector’-like examples. The method can be implemented as a wrapper around gradient based methods and uses standard back-propagation of gradients from the NN for both regression and classification versions of the algorithm. We provide theoretical justifications for the effectiveness of this algorithm in comparison to early stopping and $L_2$-regularization using simple, analytically tractable settings. In particular, we show faster convergence to the max-margin hyperplane in a shallow network (compared to vanilla gradient descent); faster convergence to the minimum-norm solution in a linear chain (compared to $L_2$-regularization); and initialization-independent generalization performance in a deep linear network. Finally, using the MNIST dataset, we demonstrate that this algorithm can boost test accuracy and identify difficult examples in real-world datasets.

Saxe*, A. M., & Advani*, M. (2018). A theory of memory replay and generalization performance in neural networks.

*Computational and Systems Neuroscience Conference*.

Masís, J., Saxe, A. M., & Cox, D. D. (2018). Rats optimize reward rate and learning speed in a 2-AFC task.

*Computational and Systems Neuroscience Conference*.

Earle, A. C., Saxe, A. M., & Rosman, B. (2018). Hierarchical Subtask Discovery with Non-Negative Matrix Factorization. In Y. Bengio & Y. LeCun (Eds.),

*International Conference on Learning Representations*.

Advani*, M., & Saxe*, A. M. (2017). High-dimensional dynamics of generalization error in neural networks.

*ArXiv*.

Musslick, S., Saxe, A. M., Ozcimder, K., Dey, B., Henselman, G., & Cohen, J. D. (2017). Multitasking Capability Versus Learning Efficiency in Neural Network Architectures.

*Annual Meeting of the Cognitive Science Society*, 829–834.

Saxe, A. M., Earle, A. C., & Rosman, B. (2017). Hierarchy Through Composition with Multitask LMDPs.

*International Conference on Machine Learning*.

Earle, A. C., Saxe, A. M., & Rosman, B. (2017). Hierarchical Subtask Discovery With Non-Negative Matrix Factorization.

*Workshop on Lifelong Learning: A Reinforcement Learning Approach at ICML*.

Baldassano*, C., & Saxe*, A. M. (2016). A theory of learning dynamics in perceptual decision-making.

*Computational and Systems Neuroscience Conference*.

Saxe, A. M., & Norman, K. (2016). Optimal storage capacity associative memories exhibit retrieval-induced forgetting.

*Computational and Systems Neuroscience Conference*.

Tsai*, C. Y., Saxe*, A., & Cox, D. (2016). Tensor Switching Networks.

*Advances in Neural Information Processing Systems 29*.

Abstract: We present a novel neural network algorithm, the Tensor Switching (TS) network, which generalizes the Rectified Linear Unit (ReLU) nonlinearity to tensor-valued hidden units. The TS network copies its entire input vector to different locations in an expanded representation, with the location determined by its hidden unit activity. In this way, even a simple linear readout from the TS representation can implement a highly expressive deep-network-like function. The TS network hence avoids the vanishing gradient problem by construction, at the cost of larger representation size. We develop several methods to train the TS network, including equivalent kernels for infinitely wide and deep TS networks, a one-pass linear learning algorithm, and two backpropagation-inspired representation learning algorithms. Our experimental results demonstrate that the TS network is indeed more expressive and consistently learns faster than standard ReLU networks.

McClelland, J. L., Sadeghi, Z., & Saxe, A. M. (2016). A Critique of Pure Hierarchy: Uncovering Cross-Cutting Structure in a Natural Dataset.

*Neurocomputational Models of Cognitive Development and Processing*, 51–68.

Saxe, A. M. (2016). Inferring actions, intentions, and causal relations in a neural network.

*Annual Meeting of the Cognitive Science Society*.

Saxe, A. M. (2015). A deep learning theory of perceptual learning dynamics.

*Computational and Systems Neuroscience Conference*.

Goodfellow, I. J., Vinyals, O., & Saxe, A. M. (2015). Qualitatively Characterizing Neural Network Optimization Problems.

*International Conference on Learning Representations*.

Lee, R., & Saxe, A. M. (2015). The Effect of Pooling in a Deep Learning Model of Perceptual Learning.

*Computational and Systems Neuroscience Conference*.

Saxe, A. M. (2014). Multitask Model-free Reinforcement Learning.

*Annual Meeting of the Cognitive Science Society*.

Lee, R., Saxe, A. M., & McClelland, J. (2014). Modeling Perceptual Learning with Deep Networks. In

*Annual meeting of the Cognitive Science Society*.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Y. Bengio & Y. LeCun (Eds.),

*International Conference on Learning Representations*. Oral presentation.

Abstract: Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). Learning hierarchical category structure in deep neural networks. In M. Knauff, M. Paulen, N. Sebanz, & I. Wachsmuth (Eds.),

*Annual meeting of the Cognitive Science Society* (pp. 1271–1276). Cognitive Science Society.

Abstract: Psychological experiments have revealed remarkable regularities in the developmental time course of cognition. Infants generally acquire broad categorical distinctions (i.e., plant/animal) before finer ones (i.e., bird/fish), and periods of little change are often punctuated by stage-like transitions. This pattern of progressive differentiation has also been seen in neural network models as they learn from exposure to training data. Our work explains why the networks exhibit these phenomena. We find solutions to the dynamics of error-correcting learning in linear three layer neural networks. These solutions link the statistics of the training set and the dynamics of learning in the network, and characterize formally how learning leads to the emergence of structured representations for arbitrary training environments. We then consider training a neural network on data generated by a hierarchically structured probabilistic generative process. Our results reveal that, for a broad class of such structures, the learning dynamics must exhibit progressive, coarse-to-fine differentiation with stage-like transitions punctuating longer dormant periods.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). Dynamics of learning in deep linear neural networks.

*NIPS Workshop on Deep Learning*.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). A Mathematical Theory of Semantic Development.

*Computational and Systems Neuroscience Conference (COSYNE)*.

Saxe, A. M., Bhand, M., Mudur, R., Suresh, B., & Ng, A. Y. (2011). Modeling Cortical Representational Plasticity With Unsupervised Feature Learning.

*Computational and Systems Neuroscience Conference (COSYNE)*.

Balci, F., Simen, P., Niyogi, R., Saxe, A., Hughes, J. A., Holmes, P., & Cohen, J. D. (2011). Acquisition of decision making criteria: reward rate ultimately beats accuracy.

*Attention, Perception, & Psychophysics*, *73*(2), 640–657. https://doi.org/10.3758/s13414-010-0049-7

Saxe, A., Bhand, M., Mudur, R., Suresh, B., & Ng, A. Y. (2011). Unsupervised learning models of primary cortical receptive fields and receptive field plasticity.

*Advances in Neural Information Processing Systems 24*.

Abstract: The efficient coding hypothesis holds that neural receptive fields are adapted to the statistics of the environment, but is agnostic to the timescale of this adaptation, which occurs on both evolutionary and developmental timescales. In this work we focus on that component of adaptation which occurs during an organism’s lifetime, and show that a number of unsupervised feature learning algorithms can account for features of normal receptive field properties across multiple primary sensory cortices. Furthermore, we show that the same algorithms account for altered receptive field properties in response to experimentally altered environmental statistics. Based on these modeling results we propose these models as phenomenological models of receptive field plasticity during an organism’s lifetime. Finally, due to the success of the same models in multiple sensory areas, we suggest that these algorithms may provide a constructive realization of the theory, first proposed by Mountcastle [1], that a qualitatively similar learning algorithm acts throughout primary sensory cortices.

Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., & Ng, A. Y. (2011). On Random Weights and Unsupervised Feature Learning.

*Proceedings of the 28th International Conference on Machine Learning*.

Abstract: Recently two anomalous results in the literature have shown that certain feature learning architectures can yield useful features for object recognition tasks even with untrained, random weights. In this paper we pose the question: why do random weights sometimes do so well? Our answer is that certain convolutional pooling architectures can be inherently frequency selective and translation invariant, even with random weights. Based on this we demonstrate the viability of extremely fast architecture search by using random weights to evaluate candidate architectures, thereby sidestepping the time-consuming learning process. We then show that a surprising fraction of the performance of certain state-of-the-art methods can be attributed to the architecture alone.

Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., & Ng, A. Y. (2010). On Random Weights and Unsupervised Feature Learning.

*NIPS Workshop on Deep Learning and Unsupervised Feature Learning*.

Baldassano, C. A., Franken, G. H., Mayer, J. R., Saxe, A. M., & Yu, D. D. (2009). Kratos: Princeton University’s entry in the 2008 Intelligent Ground Vehicle Competition.

*Proceedings of SPIE*. https://doi.org/10.1117/12.810509

Goodfellow, I. J., Le, Q. V., Saxe, A. M., Lee, H., & Ng, A. Y. (2009). Measuring Invariances in Deep Networks. In Y. Bengio & D. Schuurmans (Eds.),

*Advances in Neural Information Processing Systems 22*.

Atreya, A. R., Cattle, B. C., Collins, B. M., Essenburg, B., Franken, G. H., Saxe, A. M., Schiffres, S. N., & Kornhauser, A. L. (2006). Prospect Eleven: Princeton University’s entry in the 2005 DARPA Grand Challenge.

*Journal of Field Robotics*, *23*(9), 745–753. https://doi.org/10.1002/rob.20141
