Deep Learning, Neuroscience, and Psychology

Sir Henry Dale Fellow & Associate Professor

Theory of Learning Lab

Gatsby Computational Neuroscience Unit & Sainsbury Wellcome Centre

University College London

CIFAR Azrieli Global Scholar, CIFAR Program on Learning in Machines & Brains

Visiting Scientist, Facebook AI Research

PhD in Electrical Engineering, Stanford University

Thesis: Deep linear neural networks: A theory of learning in the brain and mind

Advisers: Jay McClelland (primary), Andrew Ng, Christoph Schreiner, and Surya Ganguli

BSE in Electrical Engineering, Princeton University (summa cum laude)

The theory of deep learning and its applications to phenomena in neuroscience and psychology.

Masís, J. A., Chapman, T., Rhee, J. Y., Cox, D. D., & Saxe, A. M. (2020). Rats strategically manage learning during perceptual decision making.

*BioRxiv*, 1–48.

Balancing the speed and accuracy of decisions is crucial for survival, but how organisms manage this trade-off during learning is largely unknown. Here, we track this trade-off during perceptual learning in rats and simulated agents. At the start of learning, rats chose long reaction times that did not optimize instantaneous reward rate, but by the end of learning chose near-optimal reaction times. To understand this behavior, we analyzed learning dynamics in a recurrent neural network model of the task. The model reveals a fundamental trade-off between instantaneous reward rate and perceptual learning speed, putting the goals of learning quickly and accruing immediate reward in tension. We find that the rats’ strategy of long initial responses can dramatically expedite learning, yielding higher total reward over task engagement. Our results demonstrate that prioritizing learning can be advantageous from a total reward perspective, and suggest that rats engage in cognitive control of learning.

Saxe, A., Nelli, S., & Summerfield, C. (2020). If deep learning is the answer, then what is the question?

*ArXiv*. http://arxiv.org/abs/2004.07580

Neuroscience research is undergoing a minor revolution. Recent advances in machine learning and artificial intelligence (AI) research have opened up new ways of thinking about neural computation. Many researchers are excited by the possibility that deep neural networks may offer theories of perception, cognition and action for biological brains. This perspective has the potential to radically reshape our approach to understanding neural systems, because the computations performed by deep networks are learned from experience, not endowed by the researcher. If so, how can neuroscientists use deep networks to model and understand biological brains? What is the outlook for neuroscientists who seek to characterise computations or neural codes, or who wish to understand perception, attention, memory, and executive functions? In this Perspective, our goal is to offer a roadmap for systems neuroscience research in the age of deep learning. We discuss the conceptual and methodological challenges of comparing behaviour, learning dynamics, and neural representation in artificial and biological systems. We highlight new research questions that have emerged for neuroscience as a direct consequence of recent advances in machine learning.

Richards, B. A., Lillicrap, T. P., Beaudoin, P., Bengio, Y., Bogacz, R., Christensen, A., Clopath, C., Costa, R. P., de Berker, A., Ganguli, S., Gillon, C. J., Hafner, D., Kepecs, A., Kriegeskorte, N., Latham, P., Lindsay, G. W., Miller, K. D., Naud, R., Pack, C. C., … Kording, K. P. (2019). A deep learning framework for neuroscience.

*Nature Neuroscience*, *22*(11), 1761–1770. https://doi.org/10.1038/s41593-019-0520-2

Systems neuroscience seeks explanations for how the brain implements a wide variety of perceptual, cognitive and motor tasks. Conversely, artificial intelligence attempts to design computational systems based on the tasks they will have to solve. In artificial neural networks, the three components specified by design are the objective functions, the learning rules and the architectures. With the growing success of deep learning, which utilizes brain-inspired architectures, these three designed components have increasingly become central to how we model, engineer and optimize complex artificial learning systems. Here we argue that a greater focus on these components would also benefit systems neuroscience. We give examples of how this optimization-based framework can drive theoretical and experimental progress in neuroscience. We contend that this principled perspective on systems neuroscience will help to generate more rapid progress.

Goldt, S., Advani, M. S., Saxe, A. M., Krzakala, F., & Zdeborová, L. (2019, June). Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup.

*NeurIPS*.

Deep neural networks achieve stellar generalisation even when they have enough parameters to easily fit all their training data. We study this phenomenon by analysing the dynamics and the performance of over-parameterised two-layer neural networks in the teacher-student setup, where one network, the student, is trained on data generated by another network, called the teacher. We show how the dynamics of stochastic gradient descent (SGD) is captured by a set of differential equations and prove that this description is asymptotically exact in the limit of large inputs. Using this framework, we calculate the final generalisation error of student networks that have more parameters than their teachers. We find that the final generalisation error of the student increases with network size when training only the first layer, but stays constant or even decreases with size when training both layers. We show that these different behaviours have their root in the different solutions SGD finds for different activation functions. Our results indicate that achieving good generalisation in neural networks goes beyond the properties of SGD alone and depends on the interplay of at least the algorithm, the model architecture, and the data set.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2019). A mathematical theory of semantic development in deep neural networks.

*Proceedings of the National Academy of Sciences*, *116*(23), 11537–11546. https://doi.org/10.1073/pnas.1820226116

An extensive body of empirical research has revealed remarkable regularities in the acquisition, organization, deployment, and neural representation of human semantic knowledge, thereby raising a fundamental conceptual question: What are the theoretical principles governing the ability of neural networks to acquire, organize, and deploy abstract knowledge by integrating across many individual experiences? We address this question by mathematically analyzing the nonlinear dynamics of learning in deep linear networks. We find exact solutions to this learning dynamics that yield a conceptual explanation for the prevalence of many disparate phenomena in semantic cognition, including the hierarchical differentiation of concepts through rapid developmental transitions, the ubiquity of semantic illusions between such transitions, the emergence of item typicality and category coherence as factors controlling the speed of semantic processing, changing patterns of inductive projection over development, and the conservation of semantic similarity in neural representations across species. Thus, surprisingly, our simple neural model qualitatively recapitulates many diverse regularities underlying semantic development, while providing analytic insight into how the statistical structure of an environment can interact with nonlinear deep-learning dynamics to give rise to these regularities.

Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., & Cox, D. D. (2019). On the information bottleneck theory of deep learning.

*Journal of Statistical Mechanics: Theory and Experiment*, *2019*(12), 124020. https://doi.org/10.1088/1742-5468/ab3985

The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks; and third, that the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent. Here we show that none of these claims hold true in the general case. Through a combination of analytical results and simulation, we demonstrate that the information plane trajectory is predominantly a function of the neural nonlinearity employed: double-sided saturating nonlinearities like tanh yield a compression phase as neural activations enter the saturation regime, but linear activation functions and single-sided saturating nonlinearities like the widely used ReLU in fact do not. Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent. Finally, we show that when an input domain consists of a subset of task-relevant and task-irrelevant information, hidden representations do compress the task-irrelevant information, although the overall information about the input may monotonically increase with training time, and that this compression happens concurrently with the fitting process rather than during a subsequent compression period.

Goldt, S., Advani, M. S., Saxe, A. M., Krzakala, F., & Zdeborová, L. (2019). Generalisation dynamics of online learning in over-parameterised neural networks.

*ICML Workshop on Theoretical Physics for Deep Learning Theory*. http://arxiv.org/abs/1901.09085

Deep neural networks achieve stellar generalisation on a variety of problems, despite often being large enough to easily fit all their training data. Here we study the generalisation dynamics of two-layer neural networks in a teacher-student setup, where one network, the student, is trained using stochastic gradient descent (SGD) on data generated by another network, called the teacher. We show how for this problem, the dynamics of SGD are captured by a set of differential equations. In particular, we demonstrate analytically that the generalisation error of the student increases linearly with the network size, with other relevant parameters held constant. Our results indicate that achieving good generalisation in neural networks depends on the interplay of at least the algorithm, its learning rate, the model architecture, and the data set.

Zhang, Y., Saxe, A. M., Advani, M. S., & Lee, A. A. (2018). Energy-entropy competition and the effectiveness of stochastic gradient descent in machine learning.

*Molecular Physics*, 1–10. https://doi.org/10.1080/00268976.2018.1483535

Finding parameters that minimise a loss function is at the core of many machine learning methods. The Stochastic Gradient Descent algorithm is widely used and delivers state of the art results for many problems. Nonetheless, Stochastic Gradient Descent typically cannot find the global minimum, thus its empirical effectiveness is hitherto mysterious. We derive a correspondence between parameter inference and free energy minimisation in statistical physics. The degree of undersampling plays the role of temperature. Analogous to the energy-entropy competition in statistical physics, wide but shallow minima can be optimal if the system is undersampled, as is typical in many applications. Moreover, we show that the stochasticity in the algorithm has a non-trivial correlation structure which systematically biases it towards wide minima. We illustrate our argument with two prototypical models: image classification using deep learning, and a linear neural network where we can analytically reveal the relationship between entropy and out-of-sample error.

Nye, M., & Saxe, A. (2018). Are Efficient Deep Representations Learnable? In Y. Bengio & Y. LeCun (Eds.),

*Workshop Track at the International Conference on Learning Representations*. https://doi.org/10.1051/0004-6361/201527329

Many theories of deep learning have shown that a deep network can require dramatically fewer resources to represent a given function compared to a shallow network. But a question remains: can these efficient representations be learned using current deep learning techniques? In this work, we test whether standard deep learning methods can in fact find the efficient representations posited by several theories of deep representation. Specifically, we train deep neural networks to learn two simple functions with known efficient solutions: the parity function and the fast Fourier transform. We find that using gradient-based optimization, a deep network does not learn the parity function, unless initialized very close to a hand-coded exact solution. We also find that a deep linear neural network does not learn the fast Fourier transform, even in the best-case scenario of infinite training data, unless the weights are initialized very close to the exact hand-coded solution. Our results suggest that not every element of the class of compositional functions can be learned efficiently by a deep network, and further restrictions are necessary to understand what functions are both efficiently representable and learnable.

Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., & Cox, D. D. (2018). On the Information Bottleneck Theory of Deep Learning. In Y. Bengio & Y. LeCun (Eds.),

*International Conference on Learning Representations*.

Bansal, Y., Advani, M., Cox, D. D., & Saxe, A. M. (2018). Minnorm training: an algorithm for training over-parameterized deep neural networks.

*ArXiv*.

In this work, we propose a new training method for finding minimum weight norm solutions in over-parameterized neural networks (NNs). This method seeks to improve training speed and generalization performance by framing NN training as a constrained optimization problem wherein the sum of the norm of the weights in each layer of the network is minimized, under the constraint of exactly fitting training data. It draws inspiration from support vector machines (SVMs), which are able to generalize well, despite often having an infinite number of free parameters in their primal form, and from recent theoretical generalization bounds on NNs which suggest that lower norm solutions generalize better. To solve this constrained optimization problem, our method employs Lagrange multipliers that act as integrators of error over training and identify ‘support vector’-like examples. The method can be implemented as a wrapper around gradient based methods and uses standard back-propagation of gradients from the NN for both regression and classification versions of the algorithm. We provide theoretical justifications for the effectiveness of this algorithm in comparison to early stopping and $L_2$-regularization using simple, analytically tractable settings. In particular, we show faster convergence to the max-margin hyperplane in a shallow network (compared to vanilla gradient descent); faster convergence to the minimum-norm solution in a linear chain (compared to $L_2$-regularization); and initialization-independent generalization performance in a deep linear network. Finally, using the MNIST dataset, we demonstrate that this algorithm can boost test accuracy and identify difficult examples in real-world datasets.

Saxe*, A. M., & Advani*, M. (2018). A theory of memory replay and generalization performance in neural networks.

*Computational and Systems Neuroscience Conference*.

Masís, J., Saxe, A. M., & Cox, D. D. (2018). Rats optimize reward rate and learning speed in a 2-AFC task.

*Computational and Systems Neuroscience Conference*.

Earle, A. C., Saxe, A. M., & Rosman, B. (2018). Hierarchical Subtask Discovery with Non-Negative Matrix Factorization. In Y. Bengio & Y. LeCun (Eds.),

*International Conference on Learning Representations*.

Advani*, M., & Saxe*, A. M. (2017). High-dimensional dynamics of generalization error in neural networks.

*ArXiv*.

Musslick, S., Saxe, A. M., Ozcimder, K., Dey, B., Henselman, G., & Cohen, J. D. (2017). Multitasking Capability Versus Learning Efficiency in Neural Network Architectures.

*Annual Meeting of the Cognitive Science Society*, 829–834.

Saxe, A. M., Earle, A. C., & Rosman, B. (2017). Hierarchy Through Composition with Multitask LMDPs.

*International Conference on Machine Learning*.

Earle, A. C., Saxe, A. M., & Rosman, B. (2017). Hierarchical Subtask Discovery With Non-Negative Matrix Factorization.

*Workshop on Lifelong Learning: A Reinforcement Learning Approach at ICML*.

Baldassano*, C., & Saxe*, A. M. (2016). A theory of learning dynamics in perceptual decision-making.

*Computational and Systems Neuroscience Conference*.

Saxe, A. M., & Norman, K. (2016). Optimal storage capacity associative memories exhibit retrieval-induced forgetting.

*Computational and Systems Neuroscience Conference*.

Tsai*, C. Y., Saxe*, A., & Cox, D. (2016). Tensor Switching Networks.

*Advances in Neural Information Processing Systems 29*.

We present a novel neural network algorithm, the Tensor Switching (TS) network, which generalizes the Rectified Linear Unit (ReLU) nonlinearity to tensor-valued hidden units. The TS network copies its entire input vector to different locations in an expanded representation, with the location determined by its hidden unit activity. In this way, even a simple linear readout from the TS representation can implement a highly expressive deep-network-like function. The TS network hence avoids the vanishing gradient problem by construction, at the cost of larger representation size. We develop several methods to train the TS network, including equivalent kernels for infinitely wide and deep TS networks, a one-pass linear learning algorithm, and two backpropagation-inspired representation learning algorithms. Our experimental results demonstrate that the TS network is indeed more expressive and consistently learns faster than standard ReLU networks.

McClelland, J. L., Sadeghi, Z., & Saxe, A. M. (2016). A Critique of Pure Hierarchy: Uncovering Cross-Cutting Structure in a Natural Dataset.

*Neurocomputational Models of Cognitive Development and Processing*, 51–68.

Saxe, A. M. (2016). Inferring actions, intentions, and causal relations in a neural network.

*Annual Meeting of the Cognitive Science Society*.

Saxe, A. M. (2015). A deep learning theory of perceptual learning dynamics.

*Computational and Systems Neuroscience Conference*.

Goodfellow, I. J., Vinyals, O., & Saxe, A. M. (2015). Qualitatively Characterizing Neural Network Optimization Problems.

*International Conference on Learning Representations*.

Lee, R., & Saxe, A. M. (2015). The Effect of Pooling in a Deep Learning Model of Perceptual Learning.

*Computational and Systems Neuroscience Conference*.

Saxe, A. M. (2014). Multitask Model-free Reinforcement Learning.

*Annual Meeting of the Cognitive Science Society*.

Lee, R., Saxe, A. M., & McClelland, J. (2014). Modeling Perceptual Learning with Deep Networks. In

*Annual meeting of the Cognitive Science Society*.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In Y. Bengio & Y. LeCun (Eds.),

*International Conference on Learning Representations*. Oral presentation.

Despite the widespread practical success of deep learning methods, our theoretical understanding of the dynamics of learning in deep neural networks remains quite sparse. We attempt to bridge the gap between the theory and practice of deep learning by systematically analyzing learning dynamics for the restricted case of deep linear neural networks. Despite the linearity of their input-output map, such networks have nonlinear gradient descent dynamics on weights that change with the addition of each new hidden layer. We show that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions. We provide an analytical description of these phenomena by finding new exact solutions to the nonlinear dynamics of deep learning. Our theoretical analysis also reveals the surprising finding that as the depth of a network approaches infinity, learning speed can nevertheless remain finite: for a special class of initial conditions on the weights, very deep networks incur only a finite, depth independent, delay in learning speed relative to shallow networks. We show that, under certain conditions on the training data, unsupervised pretraining can find this special class of initial conditions, while scaled random Gaussian initializations cannot. We further exhibit a new class of random orthogonal initial conditions on weights that, like unsupervised pre-training, enjoys depth independent learning times. We further show that these initial conditions also lead to faithful propagation of gradients even in deep nonlinear networks, as long as they operate in a special regime known as the edge of chaos.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). Learning hierarchical category structure in deep neural networks. In M. Knauff, M. Paulen, N. Sebanz, & I. Wachsmuth (Eds.),

*Annual meeting of the Cognitive Science Society* (pp. 1271–1276). Cognitive Science Society.

Psychological experiments have revealed remarkable regularities in the developmental time course of cognition. Infants generally acquire broad categorical distinctions (i.e., plant/animal) before finer ones (i.e., bird/fish), and periods of little change are often punctuated by stage-like transitions. This pattern of progressive differentiation has also been seen in neural network models as they learn from exposure to training data. Our work explains why the networks exhibit these phenomena. We find solutions to the dynamics of error-correcting learning in linear three layer neural networks. These solutions link the statistics of the training set and the dynamics of learning in the network, and characterize formally how learning leads to the emergence of structured representations for arbitrary training environments. We then consider training a neural network on data generated by a hierarchically structured probabilistic generative process. Our results reveal that, for a broad class of such structures, the learning dynamics must exhibit progressive, coarse-to-fine differentiation with stage-like transitions punctuating longer dormant periods.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). Dynamics of learning in deep linear neural networks.

*NIPS Workshop on Deep Learning*.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2013). A Mathematical Theory of Semantic Development.

*Computational and Systems Neuroscience Conference (COSYNE)*.

Saxe, A. M., Bhand, M., Mudur, R., Suresh, B., & Ng, A. Y. (2011). Modeling Cortical Representational Plasticity With Unsupervised Feature Learning.

*Computational and Systems Neuroscience Conference (COSYNE)*.

Balci, F., Simen, P., Niyogi, R., Saxe, A., Hughes, J. A., Holmes, P., & Cohen, J. D. (2011). Acquisition of decision making criteria: reward rate ultimately beats accuracy.

*Attention, Perception, & Psychophysics*, *73*(2), 640–657. https://doi.org/10.3758/s13414-010-0049-7

Saxe, A., Bhand, M., Mudur, R., Suresh, B., & Ng, A. Y. (2011). Unsupervised learning models of primary cortical receptive fields and receptive field plasticity.

*Advances in Neural Information Processing Systems 25*.

The efficient coding hypothesis holds that neural receptive fields are adapted to the statistics of the environment, but is agnostic to the timescale of this adaptation, which occurs on both evolutionary and developmental timescales. In this work we focus on that component of adaptation which occurs during an organism’s lifetime, and show that a number of unsupervised feature learning algorithms can account for features of normal receptive field properties across multiple primary sensory cortices. Furthermore, we show that the same algorithms account for altered receptive field properties in response to experimentally altered environmental statistics. Based on these modeling results we propose these models as phenomenological models of receptive field plasticity during an organism’s lifetime. Finally, due to the success of the same models in multiple sensory areas, we suggest that these algorithms may provide a constructive realization of the theory, first proposed by Mountcastle [1], that a qualitatively similar learning algorithm acts throughout primary sensory cortices.

Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., & Ng, A. Y. (2011). On Random Weights and Unsupervised Feature Learning.

*Proceedings of the 28th International Conference on Machine Learning*.

Recently two anomalous results in the literature have shown that certain feature learning architectures can yield useful features for object recognition tasks even with untrained, random weights. In this paper we pose the question: why do random weights sometimes do so well? Our answer is that certain convolutional pooling architectures can be inherently frequency selective and translation invariant, even with random weights. Based on this we demonstrate the viability of extremely fast architecture search by using random weights to evaluate candidate architectures, thereby sidestepping the time-consuming learning process. We then show that a surprising fraction of the performance of certain state-of-the-art methods can be attributed to the architecture alone.

Saxe, A. M., Koh, P. W., Chen, Z., Bhand, M., Suresh, B., & Ng, A. Y. (2010). On Random Weights and Unsupervised Feature Learning.

*NIPS Workshop on Deep Learning and Unsupervised Feature Learning*.

Baldassano, C. A., Franken, G. H., Mayer, J. R., Saxe, A. M., & Yu, D. D. (2009). Kratos: Princeton University’s entry in the 2008 Intelligent Ground Vehicle Competition.

*Proceedings of SPIE*. https://doi.org/10.1117/12.810509

Goodfellow, I. J., Le, Q. V., Saxe, A. M., Lee, H., & Ng, A. Y. (2009). Measuring Invariances in Deep Networks. In Y. Bengio & D. Schuurmans (Eds.),

*Advances in Neural Information Processing Systems 24*.

Atreya, A. R., Cattle, B. C., Collins, B. M., Essenburg, B., Franken, G. H., Saxe, A. M., Schiffres, S. N., & Kornhauser, A. L. (2006). Prospect Eleven: Princeton University’s entry in the 2005 DARPA Grand Challenge.

*Journal of Field Robotics*, *23*(9), 745–753. https://doi.org/10.1002/rob.20141
