Geometry and Data Science

Wednesday, 14:15, Ziskind 155. Lecturer: Ronen Eldan. TA: Hester Pieters.

The final assignment (home exam) sheet is here

Recommended readings and supplementary material:
1. This expository paper by Donoho is a bit outdated, but it is still a very good read and serves as an introduction to the themes of the course.
2. Most of the course's material is found in the following references:
A. High-Dimensional Probability: An Introduction with Applications in Data Science by Roman Vershynin.
B. Foundations of Data Science by Blum, Hopcroft and Kannan.
C. These lecture notes by Afonso Bandeira.
D. Notes on statistical learning theory by Shahar Mendelson.
E. Notes on estimation in high dimensions by Roman Vershynin.
F. Four lectures on probabilistic methods in data science, a further set of notes by Roman Vershynin.

Time and location: Tuesdays at 14:00, Room 155

Tentative Syllabus:
1. Dimension reduction and the Johnson-Lindenstrauss lemma; compressed sensing and sparse recovery (Sections 2.7 and 10.3 here).
2. PCA, the Marcenko-Pastur theorem, the spike model and separating Gaussians.
3. Approximation algorithms on graphs: spectral clustering and sparsest cut, the Max-Cut problem and the Grothendieck constant.
4. Some foundations of statistical learning theory: Glivenko-Cantelli classes, VC theory and Rademacher complexity.
5. Estimation in high dimensions.
6. Low rank matrix completion.

The following is a more detailed outline (with references to the relevant lecture notes), taken from a previous instance of the class given in 2017. It will be updated as we go, and some of the subjects may change.
Lecture 1: Dimension reduction, proof of the Johnson-Lindenstrauss lemma (Ref A, Section 5.3 / Ref B, Section 2.7 / Ref C, Lecture 5).
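As a quick numerical illustration (a NumPy sketch; the number of points, dimensions and seed are arbitrary choices, not taken from the references), one can project a point cloud with a random Gaussian matrix and check that all pairwise distances are preserved up to a small multiplicative distortion:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 10000, 500            # number of points, ambient dimension, target dimension
X = rng.normal(size=(n, d))          # an arbitrary point cloud in R^d

# Random Gaussian projection, scaled so squared norms are preserved in expectation
P = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ P

def pairwise_dists(Z):
    """All pairwise Euclidean distances, computed via the Gram matrix."""
    G = Z @ Z.T
    sq = np.diag(G)
    return np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * G, 0.0))

off_diag = ~np.eye(n, dtype=bool)
ratios = pairwise_dists(Y)[off_diag] / pairwise_dists(X)[off_diag]
print(ratios.min(), ratios.max())    # both close to 1, as the lemma predicts for k ~ log(n) / eps^2
```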
Lecture 2: Compressed sensing, the restricted isometry property and exact recovery (Ref A, Section 10.3; we are almost done proving Theorem 10.5).
Lecture 3: Compressed sensing, continued. We finished the proof of the exact recovery theorem, discussed the improvement by Candes and Tao (found in this paper) and proved that a matrix of independent Gaussians satisfies the restricted isometry property.
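For a hands-on illustration of exact recovery, here is a small sketch (assuming NumPy and SciPy are available; the sizes n, d, s are arbitrary) that recovers a sparse vector from random Gaussian measurements by solving the basis pursuit program min ||x||_1 subject to Ax = b, written as a linear program:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, d, s = 50, 200, 5                       # measurements, ambient dimension, sparsity
A = rng.normal(size=(n, d)) / np.sqrt(n)   # Gaussian measurement matrix
x_true = np.zeros(d)
x_true[rng.choice(d, s, replace=False)] = rng.normal(size=s)
b = A @ x_true

# Basis pursuit as an LP in the variables (x, t): minimize sum(t) subject to
# |x_i| <= t_i (two inequalities per coordinate) and A x = b.
c = np.concatenate([np.zeros(d), np.ones(d)])
A_ub = np.block([[np.eye(d), -np.eye(d)], [-np.eye(d), -np.eye(d)]])
b_ub = np.zeros(2 * d)
A_eq = np.hstack([A, np.zeros((n, d))])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b, bounds=(None, None))
x_hat = res.x[:d]
print(np.max(np.abs(x_hat - x_true)))      # essentially zero once n is large enough relative to s log(d/s)
```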
Lectures 4+5: The spiked covariance model and the Marcenko-Pastur theorem (the background and the application are in Sections 1.2 and 1.3 of Ref C; the proof of the theorem is here).
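Both phenomena are easy to see in simulation; the sketch below (NumPy, with arbitrary sizes and spike strength) checks that for isotropic data the sample-covariance eigenvalues stay inside the Marcenko-Pastur bulk [(1-sqrt(gamma))^2, (1+sqrt(gamma))^2], gamma = p/n, while a sufficiently strong planted spike produces an eigenvalue that separates from the bulk:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 4000, 1000                          # samples and dimension, gamma = p/n = 0.25
gamma = p / n
X = rng.normal(size=(n, p))                # isotropic data
evals = np.linalg.eigvalsh(X.T @ X / n)    # eigenvalues of the sample covariance
print(evals.min(), evals.max(), ((1 - np.sqrt(gamma))**2, (1 + np.sqrt(gamma))**2))

# Spiked model: population covariance I + theta * v v^T, with theta above the sqrt(gamma) threshold
theta = 4.0
v = np.zeros(p); v[0] = 1.0
Y = X + np.sqrt(theta) * rng.normal(size=(n, 1)) * v
top = np.linalg.eigvalsh(Y.T @ Y / n)[-1]
print(top)                                 # roughly (1 + theta)(1 + gamma / theta), well outside the bulk
```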
Lecture 6: We talked about low-rank matrix completion, Section 3.3 here.
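For intuition, here is a minimal completion sketch using alternating least squares on a low-rank factorization; this is a common heuristic, not the convex relaxation analyzed in the notes, and the sizes, rank and sampling rate are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)
n, r = 60, 3
M = rng.normal(size=(n, r)) @ rng.normal(size=(r, n))   # rank-r ground truth
mask = rng.random((n, n)) < 0.4                          # set of observed entries

# Alternating least squares on M ~ U V^T: fix one factor, solve for the other
U, V = rng.normal(size=(n, r)), rng.normal(size=(n, r))
for _ in range(50):
    for i in range(n):                                   # row i of U from its observed entries
        obs = mask[i]
        U[i] = np.linalg.lstsq(V[obs], M[i, obs], rcond=None)[0]
    for j in range(n):                                   # row j of V, symmetrically
        obs = mask[:, j]
        V[j] = np.linalg.lstsq(U[obs], M[obs, j], rcond=None)[0]

print(np.linalg.norm(U @ V.T - M) / np.linalg.norm(M))   # relative error, should be tiny at this sampling rate
```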
Lecture 7: The plan is to cover the sparsest-cut algorithm based on geometric embeddings, from this paper.
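That algorithm rounds a semidefinite/metric embedding; as a much simpler warm-up (in the spirit of the spectral clustering item in the syllabus, and not the algorithm of the paper), here is a sweep-cut sketch along the second Laplacian eigenvector on a synthetic two-block graph, with an arbitrary graph model and parameters:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p_in, p_out = 100, 0.3, 0.02
blocks = np.repeat([0, 1], n // 2)                      # planted two-community structure
P = np.where(blocks[:, None] == blocks[None, :], p_in, p_out)
A = np.triu((rng.random((n, n)) < P).astype(float), 1)
A = A + A.T                                             # symmetric adjacency, no self-loops

deg = A.sum(axis=1)
L = np.diag(deg) - A                                    # graph Laplacian
fiedler = np.linalg.eigh(L)[1][:, 1]                    # second eigenvector

# Sweep cut: order vertices by the eigenvector and keep the prefix of lowest conductance
order = np.argsort(fiedler)
vol = deg.sum()
best_phi, best_k = np.inf, None
for k in range(1, n):
    in_S = np.zeros(n, dtype=bool)
    in_S[order[:k]] = True
    cut = A[in_S][:, ~in_S].sum()
    phi = cut / min(deg[in_S].sum(), vol - deg[in_S].sum())
    if phi < best_phi:
        best_phi, best_k = phi, k
print(best_phi, best_k)   # a low-conductance cut; here it essentially recovers the planted bipartition
```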
Lecture 8: We reviewed some basic concepts in statistical learning theory, including Vapnik-Chervonenkis theory and Glivenko-Cantelli classes. The next 2-3 weeks will be based mainly on these two references: 1 and 2.
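The simplest Glivenko-Cantelli class is the class of half-lines, i.e. uniform convergence of the empirical CDF; a quick sketch (assuming SciPy for the Gaussian CDF, with arbitrary sample sizes):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
grid = np.linspace(-4, 4, 2001)
for n in [100, 1000, 10000]:
    x = rng.normal(size=n)
    F_emp = (x[:, None] <= grid).mean(axis=0)             # empirical CDF evaluated on a grid
    print(n, np.max(np.abs(F_emp - norm.cdf(grid))))      # sup-norm gap shrinks roughly like 1/sqrt(n)
```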
Lecture 9: We proved the Sauer-Shelah lemma on VC dimension. We'll continue discussing bounds on the estimation risk via the complexity of the function class; in particular, we'll talk about Rademacher complexity and discuss an application to reproducing kernel Hilbert spaces, based on Sections 2.3 and 2.4 here.
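As a tiny sanity check of the lemma, the sketch below counts the dichotomies that intervals on the line realize on n points and compares them with the Sauer-Shelah bound sum_{i <= d} C(n, i) for VC dimension d = 2 (for this class the bound happens to hold with equality); the value of n is arbitrary:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(6)
n = 8
points = np.sort(rng.random(n))

# Candidate endpoints: midpoints between consecutive points, plus one value on each side
thresholds = np.concatenate([[points[0] - 1], (points[:-1] + points[1:]) / 2, [points[-1] + 1]])

# Distinct labelings ("dichotomies") induced on the points by indicators of intervals [a, b]
labelings = {tuple((a <= points) & (points <= b)) for a in thresholds for b in thresholds}

d = 2                                               # VC dimension of intervals on the line
sauer_bound = sum(comb(n, i) for i in range(d + 1))
print(len(labelings), sauer_bound)                  # 37 and 37 for n = 8
```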
Lecture 10: We'll discuss the Gaussian-width complexity of sets and its relation to estimation from random linear samples (Sections 5 and 6 here).
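Two widths that are easy to check by simulation (a NumPy sketch with arbitrary dimension and number of trials): the Gaussian width E sup_{t in T} <g, t> of the unit Euclidean ball is E||g||_2 ~ sqrt(d), while for the unit l1 ball it is E||g||_inf ~ sqrt(2 log d), a much smaller quantity:

```python
import numpy as np

rng = np.random.default_rng(7)
d, trials = 1000, 5000
G = rng.normal(size=(trials, d))                   # Monte-Carlo samples of the Gaussian vector g

w_l2_ball = np.linalg.norm(G, axis=1).mean()       # sup over the unit l2 ball is ||g||_2
w_l1_ball = np.abs(G).max(axis=1).mean()           # sup over the unit l1 ball is ||g||_inf
print(w_l2_ball, np.sqrt(d))                       # ~ sqrt(d)
print(w_l1_ball, np.sqrt(2 * np.log(d)))           # ~ sqrt(2 log d)
```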
Lecture 11: We discussed Gaussian processes, sub-Gaussian processes and the Dudley bound, and saw how these can be applied to Monte-Carlo integration, as in Chapters 7 and 8 here.
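As a toy version of the Monte-Carlo application, the sketch below looks at the empirical process t -> (1/n) sum_i cos(t X_i) - E cos(t X) for X uniform on [0, 1], and checks that its supremum over t in (0, 1] decays at the 1/sqrt(n) rate; the function class and the grid of parameters are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(8)
ts = np.linspace(1e-6, 1, 200)        # parameters of the class {x -> cos(t x)}
true_means = np.sin(ts) / ts          # E cos(t X) = sin(t) / t for X ~ Uniform[0, 1]

for n in [100, 1000, 10000]:
    X = rng.random(n)
    emp_means = np.cos(np.outer(X, ts)).mean(axis=0)
    print(n, np.max(np.abs(emp_means - true_means)))   # uniform Monte-Carlo error, ~ 1/sqrt(n)
```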
Lecture 12: We used Gaussian processes in the context of the MAX-CUT problem. In particular, we discussed the Goemans-Williamson algorithm and its relation to the Grothendieck inequality, following N. Alon and A. Naor.
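A compact sketch of the algorithm, assuming cvxpy (with an SDP-capable solver such as SCS) is installed; the random graph and its size are arbitrary. It solves the MAX-CUT semidefinite relaxation with unit diagonal and rounds with a random hyperplane:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(9)
n = 30
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1) + np.triu(A, 1).T                  # random unweighted graph

# SDP relaxation: maximize sum_{i<j} A_ij (1 - X_ij) / 2 over PSD X with X_ii = 1
X = cp.Variable((n, n), PSD=True)
sdp = cp.Problem(cp.Maximize(cp.sum(cp.multiply(A, 1 - X)) / 4), [cp.diag(X) == 1])
sdp.solve()

# Goemans-Williamson rounding: write X = V V^T and cut with a random hyperplane
w, U = np.linalg.eigh(X.value)
V = U * np.sqrt(np.clip(w, 0, None))                 # rows of V are the vectors v_i
signs = np.sign(V @ rng.normal(size=n))
cut_value = np.sum(A * (signs[:, None] != signs[None, :])) / 2
print(sdp.value, cut_value)                          # in expectation the cut is >= 0.878... times the SDP value
```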