﻿ Geometry and Data Science

# Geometry and Data Science

Lecturer: Ronen Eldan.

Recommended readings and supplementary material:
1. This expository paper by Donoho is a bit outdated, but still it's a very good read and works as an exposition to the course.
2. Most of the course's material is found in the following references:
A. Foundations of Data Science by Blum, Hopcroft and Kannan.
B. These lecture notes by Afonso Bandeira.
C. Notes on statistical learning theory by Shahar Mendelson.
D. Notes on estimation in high dimensions by Roman Vershynin.
E. Yet more notes on Four lectures on probabilistic methods in data science by Roman Vershynin.

Time and location: Tuesdays at 14:00, Room 155

Syllabus:
1. Dimension reduction, the Johnson-Lindenstrauss lemma and Compressed sensing and sparse recovery (Sections 2.7 and 10.3 here).
2. PCA, the Marcenko-Pastur theorem, the spike model and separating Gaussians.
3. Approximation algorithms on Graphs - Spectral clustering and sparsest cut, the Max-Cut problem and the Grothendieck constant.
4. Some foundations of statistical learning theory: Glivenko-Cantelli classes, VC theory and Rademacher complexity.
5. Estimation in high dimensions.
6. Low rank matrix completion.

Homework assignment exercises: Sheet 1 Sheet 2 Sheet 3.

Final handout assignment sheet.

So what did we do so far? (+ what is planned for the next 1-2 weeks)
Lecture 1: Dimension reduction, proof of The Johnson-Lindenstrauss Lemma (Ref A, section 2.7 / Ref B, lecture 5).
Lecture 2: Compressed sensing, the restricted isometry property and exact recovery (Ref A, section 10.3; We're almost done proving Theorem 10.5).
Lecture 3: Compressed sensing, continued. We finished the proof of the exact recovery theorem, discussed the improvement by Candes and Tao (found in this paper) and proved that a matrix of independent Gaussians satisfied the restricted isometry property.
Lectures 4+5: Spiked covariance model and the Marcenko-Pastur theorem (The background and application are in Section 1.2, 1.3 in Ref B, the proof of the theorem is here).
Lecture 6: The plan is to teach the sparsest cut via geometric embedding algorithm, in this paper.
Lecture 7: We'll continue with sparsest cut, and move on to the Max-Cut problem. In particular, we'll discuss the Goemans-Williamson algorithm and the relation to the Grothendieck inequality after N. Alon and A.Naor.
Lecture 8: We'll review some basic concepts in statistical learning theory including Vapnik-Chervonenkis theory, Glivenko-Cantelli complexity classes. The next 2-3 weeks will be based mainly on these two references: 1 and 2.
Lecture 9: We'll prove the Sauer-Shelah lemma regarding VC dimension. We'll continue discussing bounds on the estimation risk via complexity of the function class. In particular, we'll talk about Rademacher complexity and discuss an application to Reproducing Kernel Hilbert Spaces, based on sections 2.3 and 2.4 here.
Lecture 10: We'll discuss mean-width complexity of sets and its relation to estimation based on random linear samples, (Sections 5 and 6 here).
Lecture 11: We continued with an application of the low M^* estimate to sparse recovery (sections 6,7 here) and started discussing low-rank matrix completion, based on Section 3.3 here.
Lecture 12: We'll continue proving Theorem 3.3.2 here, for which we'll need to prove the matrix Bernstein inequlity (same reference, Section 2).