Effective theory of deep learning
Deep learning architectures have many hyperparameters. Even for the paradigmatic multi-layer perceptron (MLP) architecture, we can vary the number of layers, their width, the activation functions, and the distribution of the initial parameters. Over the last two decades, these hyperparameters have largely been tuned by experimentation. In this course, we will build a rigorous effective theory that quantitatively describes the consequences of these choices, namely the depth-to-width ratio, the initialization distribution, and the activation functions. Our theory uses perturbation theory around a Gaussian limit to describe the correlations of activations across the layers of the MLP. These rigorous results provide firm theoretical grounding for many well-established deep-learning practices. The techniques we consider extend to other architectures, e.g., transformers. Importantly, they can also be used to derive scaling laws for large models, which matters increasingly given the huge cost of training them.
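To make the Gaussian picture concrete, here is a minimal numerical sketch (illustrative only, not the course's formalism; the tanh activation, bias-free layers, the weight variance C_W / fan_in, and all parameter values are assumptions made for this example). It initializes a random MLP many times, pools the final-layer preactivations for one fixed input, and estimates their excess kurtosis: at large width the distribution is nearly Gaussian (excess kurtosis close to zero), and the deviation grows with the depth-to-width ratio L/n, the expansion parameter of the effective theory.

```python
# Minimal sketch (assumptions: tanh MLP, no biases, i.i.d. Gaussian weights
# with variance C_W / fan_in, one fixed input; statistics are taken over
# random initializations).
import numpy as np

rng = np.random.default_rng(0)

def final_preactivations(x, depth, width, c_w=1.0):
    """One random initialization: forward x through `depth` layers and
    return the last layer's preactivation vector (before the nonlinearity)."""
    h = x
    for _ in range(depth):
        fan_in = h.shape[0]
        W = rng.normal(0.0, np.sqrt(c_w / fan_in), size=(width, fan_in))
        z = W @ h        # preactivations of this layer
        h = np.tanh(z)   # activations fed to the next layer
    return z

def excess_kurtosis(samples):
    s = (samples - samples.mean()) / samples.std()
    return np.mean(s**4) - 3.0   # zero for an exactly Gaussian distribution

depth, n_inits = 8, 500
x = rng.normal(size=64)          # one fixed input
for width in (16, 64, 256):
    z = np.concatenate([final_preactivations(x, depth, width) for _ in range(n_inits)])
    print(f"width={width:4d}  L/n={depth/width:.3f}  excess kurtosis ~ {excess_kurtosis(z):+.3f}")
```

The printed estimates are noisy, but the trend is the point: as the width grows at fixed depth, the preactivation statistics approach the Gaussian limit, and the finite-width corrections that the course's perturbation theory computes shrink with L/n.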