Statistics of Big Data

Semester 1 2023-24

Thursday, 15-18, Dan-David 303

Home page on http://www.tau.ac.il/~saharon/BigData.html

Lecturer: Saharon Rosset, Schreiber 203, saharon@tauex.tau.ac.il
Office hours: By appointment, coordinated by email

Announcements and handouts

Signup link for take-home final (deadline 15 March)

HW Submission Instructions: Submission is through a link you will find on Moodle. Submit a PDF file (and any accompanying code files for practical work) with the names of the submitters (at most two) clearly printed in the PDF. If you have trouble connecting to Moodle, you can send the submission by email to Saharon instead.

(4 January 2024)
Zoom link.
Class notes.
Warmup homework is now available. Due in class on 18/1, submission in pairs is encouraged.
This homework uses the Nature paper from 2009 introducing Google Flu Trends (GFT), and the Science paper from 2014 describing the failure of GFT since 2011.
If you cannot access these links, you probably need to define your TAU (or other university) proxy to gain access through your university subscription.

(11 January 2024)
Zoom link.
Class notes.
Code demonstrating a privacy violation in releasing GWAS summaries.

(18 January 2024)
Zoom link.
Class notes.
Homework 1 is now available, due before class on 8/2. Submission in pairs is encouraged. This 2009 paper by Jacobs et al. may be used as reference for problem 1.

(25 January 2024)
Zoom link.
In class we will first discuss HW0 solutions, then follow the Class notes.
The first part is a review of the previous two weeks, then we will discuss the paper A Statistical Framework for Differential Privacy by Wasserman and Zhou.
This paper will not be part of the material for the homework or the final. However, it is very interesting and gives a new statistical perspective on differential privacy, one that can also serve as a “review” of the topic and help in understanding it better.

(1 February)
Zoom link.
The next classes will deal with high-dimensional modeling (large p, p >> n). We will discuss the statistical and computational challenges that are unique to this setting and some of the most popular solutions. Relevant reading materials include chapters 2-3 of ESL, this review I wrote on sparse modeling, and the papers on LARS by Efron et al. and its generalization by Rosset and Zhu. We will also discuss compressed sensing, with the most relevant references being Candes et al. (2005) and Meinshausen and Yu (2009).
Class 5 notes.

(7 February)
Homework 2 is now available. Due on 25 February (Sunday).
Problem 1 uses train.csv and test.csv datasets, and there is also sample code in sparse.r.
Problem 2 extra credit uses this paper.

(8 February)
Zoom link.
Class 6 notes.

(15 February)
Class today is Zoom only!!!
Zoom link.
Slides to be used today (thanks to Boaz Nadler who prepared them originally!): Introduction, Parts 1&2, Part 3

(22 February)
Zoom link.
Class 8 notes.
Code for fitting models to the Sampson monks and E-Coli networks

(29 February)
Zoom link.
Class 9 notes.
Code for fitting models to the Sampson monks and E-Coli networks

(6 March)
Full Homework 3 is now available. Due on 14 March before class.
Problem 1 uses covtrain.csv and covtest.csv datasets, and there is also sample code in pca.r.

(7 March)
Zoom link.
Class 10 notes.
QPD presentation.
Code for HW2p1.

(14 March)
Zoom link.
Class 11 notes.
The notes include a problem that refers to the Science paper by Zeggini et al.

(21 March)
Zoom link.
MCMC Introduction

Syllabus

The goal of this course is to present some of the unique statistical challenges that the new era of Big Data brings, and discuss their solutions. The course will be a topics course, meaning we will discuss various aspects that may not necessarily be related or linearly organized. Our challenge will be to cover a wide range of topics, while being specific and concrete in describing the statistical aspects of the problems and the proposed solutions, and discussing these solutions critically. We will also keep in mind other practical aspects like computation and demonstrate the ideas and results on real data when possible. Accordingly, the homework and the final project will include a combination of hands-on programming and modeling with theoretical analysis.

Big Data is a general and rather vague term, typically referring to data and problems that share some of the following characteristics:

It is big (obviously): this could mean having many observations (large n), many features/variables (large p), or both
It has additional structure information: temporal, spatial, graph structure (like network data), etc.
It leads to non-traditional modeling problems, like network evolution, collaborative filtering, structured learning, etc.
It presents significant practical challenges in handling the data and modeling it, including:
- The need to maintain privacy and security of the data while sharing it and extracting information from it
- The difficulty in storing and performing calculations at scale
- The difficulty in correctly interpreting the data and generating valid statistical modeling problems from it
The full extent of its potential utility is unclear and subject to research

Some examples of typical Big Data domains gaining importance in recent years:

Internet usage data, including social network information, search and advertising information, etc.
Health records and related information
Scientific databases, including areas like particle physics, electron microscopy and genetics
Images and video surveillance data

A key topic in data modeling in general and Big Data in particular is predictive modeling (regression, classification). Since the course Statistical Learning deals mainly with exposition and statistical analysis of algorithms in this area, it will not be a focus of this course. However, some aspects of this area that are not covered in that course, in particular the p ≫ n case, efficient computation, and deep learning, will be discussed in some detail.

Tentative list of topics to be covered during the semester:

Network modeling: Probabilistic models of network evolution; Parameter estimation and inference
Privacy: Differential privacy; Algorithms to guarantee privacy in different settings; Examples of privacy breaches
Efficient computation in predictive modeling: Regularization path algorithms; Stochastic gradient descent
Statistical validity of scientific research on modern data: Replicability; Sequential testing on public databases
Spectral analysis of large random matrices: statistical and computational issues
p ≫ n: Sparsity and computation
Deep learning: theory and methodology
Turning data into modeling: Competitions and proof of concept projects; Leakage in data mining

We will have 3-5 guest lectures during the semester, but they will be treated as regular classes rather than enrichment classes (specifically, their material will be included in the homework and the final).

Expected background

1. Basic knowledge of mathematical foundations:

Calculus: Integration; Sums of series; Extrema, etc.
Linear algebra/geometry: Basic properties of matrices: inverse, trace, determinant; SVD and eigen decompositions: PCA and its geometrical and statistical interpretations

2. Solid fundamentals in Probability: Discrete/continuous probability definitions; Important distributions: Bernoulli/Binomial, Poisson, Geometric, Hypergeometric, Negative Binomial, Normal, Exponential/Double Exponential (Laplace), Uniform, Beta, Gamma, etc.; Limit laws: large numbers and CLT; Inequalities: Markov, Chebyshev, Hoeffding

Conditional distributions and moments: Basic definitions and Bayes' rule; Definitions and properties of conditional expectation and variance; Laws: Iterated expectation, total variance; Intuition vs mathematics in conditional probabilities: Simpson’s paradox, etc.

3. Solid fundamentals in Statistics:

The equivalent of a course in Statistical Theory:
Basic definitions: Estimation, confidence intervals, hypothesis testing, basic properties of statistical tests and estimators: Level, power, p-values, bias, variance, consistency, etc.; Basic theoretical results: Neyman-Pearson Lemma, Rao-Blackwell, Cramér-Rao, Wilks; Important families of tests: Z, t, F, χ², GLRT; Bayesian inference: Basics and uses
The equivalent of a course in Regression / Analysis of Variance: Algebra and geometry of multivariate regression; Inference in linear regression; Error decompositions: Bias + Variance; Generalizations: Logistic regression, auto-regression, model selection (C_p/AIC); Basic ANOVA; Familiarity with mixed/random effects models
Advantage: Some knowledge of modern/nonparametric statistics and/or statistical learning; Practical experience with R

Books and resources

The course does not have a specific textbook, and most lectures will be on the board and not using slides. Some of the material will closely follow chapters from books or published papers, and when this is the case it will be announced. However, it is critical that all students have all the material presented in class. If you miss classes, make sure to get the material from someone!

Relevant books:
Elements of Statistical Learning by Hastie, Tibshirani & Friedman (including freely available PDF, data and errata)
Modern Applied Statistics with Splus by Venables and Ripley
Frontiers in Massive Data Analysis report from the National Research Council
Computer Age Statistical Inference by Efron and Hastie

Grading

There will be four or five homework assignments, which will count for about 30% of the final grade, and a final take-home project. Both the homework and the project will combine theoretical analysis with hands-on data analysis.

Computing

The course will require use of statistical modeling software. You can use R or Python, or anything else, but class demonstrations and starter code given in HW will be mostly in R (freely available for PC/Unix/Mac).
The R Project website also contains extensive documentation.
A basic “getting you started in R” tutorial. Uses the Boston Housing Data (thanks to Giles Hooker).
Modern Applied Statistics with Splus by Venables and Ripley is an excellent source for statistical computing help for R/Splus.