Math 608D–Spring 2019

Probability in high dimensions

Instructor: Yaniv Plan
Office: 1219 Math Annex
Email: yaniv (at) math (dot) ubc (dot) ca

Lectures: TuTh, 11:00 am to 12:30 pm, in MATX 1102.

Office hours: Mo 1:00 pm to 2:00 pm, MATX 1219 and by request.

Prerequisites: The course will assume basic knowledge of probability, linear algebra, and functional analysis. For example, I will assume you have familiarity with stochastic processes, norms, singular values, and Lipschitz functions.

Overview: See here. Given student interest, we will delve into machine learning theory in the later part of the class (VC dimension and Rademacher complexity). We will apply this theory to deep learning, discuss why it is not explaining the success of deep learning, and pursue alternate routes.

Detailed course outline: See here. (We probably won’t cover all of those topics.)

Textbook: This course is based on a similar course from Roman Vershynin. There is no required textbook. The following references cover some of the material, and they are available online:

R. Vershynin, High-dimensional probability. This book has the most overlap with our course.
T. Tao, Topics in random matrix theory.
D. Chafai, O. Guedon, G. Lecue, A. Pajor, Interactions between compressed sensing, random matrices and high dimensional geometry, preprint.
R. Vershynin, Lectures in geometric funcitonal analysis.
R. Adler, J. Taylor, Random fields and geometry.
S. Foucart, H. Rauhut, A mathematical introduction to compressive sensing.
J. Lee, nicely presented proof of majorizing measures theorem, which is the lower bound for generic chaining.
Theory of deep learning class at UdeM.
Berkeley 2-month program on theory of deep learning, summer 2019.
Earlier version of this course, which contains a series of notes. For the beginning of the course, we will roughly follow the same notes.

Grading: 50% bi-weekly homework, 50% project. The project may be a literature review or a mini-research problem.

Homework 1. Due Tuesday, February 5. Also see Nick Harvey’s notes on a stochastic gradient descent question.

Homework 2. Due Thursday, March 14. Please feel free to come talk to me for more perspective on the open questions.

Project: First, please read over this paper. Let’s concentrate on ReLU activation functions as in Theorem 1.1. For simplicity, let’s first assume \epsilon = \eta = 0. Do you think that the requirement m > k d log(n) is tight? What would be the right bound if the activation function was the identity?

Potential related papers to present:

Compressed sensing using generative models. This is the original paper.
Fast Convex Pruning of Deep Neural Networks. This one could be useful if deciding to add sparsity as an extra structure to the neural net. If the weight matrices are sparsified, does this effect the Gaussian mean width of the range?
Global guarantees for enforcing deep generative priors by empirical risk. Assumes Gaussian weights, which we might do. Shows that optimization can be successful despite non-convexity.