Statistics of Big Data

Semester 2 2015
Sunday 12-15, Shenkar 204
Home page: ~saharon/BigData.html
Lecturer: Saharon Rosset
Schreiber 022
Office hrs: Thursday 16-18 (with email coordination)

Announcements and handouts

(8 March) Class is canceled for 15/3 (second class of the semester), replacement class will be given on Friday 22/5, 9-12 (room TBD).
(8 March) Homework 0 (warmup) is now available. Due 22/3 in class. Submission in pairs is encouraged (but not in triplets or larger, please).
This homework uses the Nature paper from 2009 introducing Google Flu Trends (GFT), and the Science paper from 2014 describing the failure of GFT since 2011.
(22 March) R code for investigating privacy of summary statistics release in GWAS.
(29 March) Paper by Wasserman and Zhou on statistical theory of differential privacy.
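As a concrete example of the kind of mechanism this theory analyzes, here is a minimal Python sketch (illustrative only; in class we work in R, and none of these names come from the paper) of the Laplace mechanism for releasing the mean of bounded data with epsilon-differential privacy:

```python
import math
import random

def laplace_noise(scale, rng):
    # Inverse-CDF sampling from the Laplace(0, scale) distribution
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def private_mean(data, lo, hi, epsilon, seed=0):
    """Release the mean of data clipped to [lo, hi] with
    epsilon-differential privacy via the Laplace mechanism.
    Sensitivity of the mean of n values in [lo, hi] is (hi - lo) / n."""
    rng = random.Random(seed)
    clipped = [min(max(x, lo), hi) for x in data]
    true_mean = sum(clipped) / len(clipped)
    sensitivity = (hi - lo) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)
```

Smaller epsilon means more noise; the seed is fixed here only so the sketch is reproducible, which a real private release would of course not do.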
(1 April) Homework 1 (Slightly updated 7 April, older versions: 1 2) is now available, due on 19 April in class. Submission in pairs is encouraged. This 2009 paper by Jacobs et al. may be used as reference for problem 1.
(11 April) The next two classes (12/4 and 19/4) will deal with high dimensional modeling (large p, p >> n). We will discuss the statistical and computational challenges that are unique to this setting and some of the most popular solutions. Relevant reading materials include chapters 2-3 of ESL, this recent review I wrote on sparse modeling, and the papers on LARS by Efron et al. and its generalization by Rosset and Zhu.
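For a taste of the computational side of the p >> n setting, here is a minimal Python sketch (illustrative only; not the LARS algorithm and not the course's R code) of coordinate descent for the lasso, whose basic building block is the soft-thresholding operator:

```python
def soft_threshold(z, gamma):
    """Solution of the univariate lasso problem
    argmin_b 0.5*(b - z)**2 + gamma*|b|."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_cd(X, y, lam, n_iter=500):
    # Minimizes (1/2)*||y - X b||^2 + lam*||b||_1 by cycling over
    # coordinates; X is a list of rows, y a list of responses.
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Inner product of column j with the partial residual
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= delta * X[i][j]
                beta[j] = new_b
    return beta
```

Each coordinate update costs O(n), so a full cycle is O(np), which is what makes this approach attractive when p is very large and the solution is sparse.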
(23 April) Homework 2 is now available. Due 10 May in class.
Problem 1 uses train.csv and test.csv datasets, and there is also sample code in sparse.r.
Problem 2 extra credit uses this paper.
On 26 April Yoav Benjamini will give the class, using slides: Part 1 (pdf), Part 2 (pptx)
(2 May) Slides on quality preserving databases, to accompany tomorrow's lecture.
(10 May) Slides1, slides2 on estimating covariance matrices, from Boaz Nadler's lecture today. The last part of the talk was on the board, about sparse covariance estimation, mainly based on Bickel and Levina (2008), referenced from the slides.
(15 May) Slides (ppt) for Sunday's class.
(19 May) HW3 (part 2.(b)ii. updated 22/5, previous version: 1). Problem 1 uses covtrain.csv and covtest.csv datasets, and there is some code help in pca.r. Problem 2 refers to the Science paper by Zeggini et al.
(22 May) Extra class: Friday, 22 May, 9-12 in Kaplun 118. Final version of class notes prepared by Liran Katzir, including a proof of variance reduction for the last algorithm presented.
(28 May) On 31 May the class will be given by Yoel Shkolnisky and will run from 12:10 to 13:45 only. It will deal with electron microscopy data and its analysis. Preliminary version of slides from Yoel.
(5 June) HW4 is now available, due date is 2 July.
Sunday's class will be based on the survey by Goldberg et al. (that appeared in 2010 in the Foundations and Trends in Machine Learning series).
(7 June) Code from class for fitting models to the Sampson monks and E-Coli networks.
The E-Coli models did finish running eventually. You can see an embedding of ecoli1 into two dimensions here. As can be seen, there is a lot of structure (dense area in the middle, sparse sub-networks around), and perhaps some clustering structure as well.
(14 June) Code I showed in class today: PCA projections, MCMC on circles in square.


The goal of this course is to present some of the unique statistical challenges that the new era of Big Data brings, and discuss their solutions. The course will be a topics course, meaning we will discuss various aspects that may not necessarily be related or linearly organized. Our challenge will be to cover a wide range of topics, while being specific and concrete in describing the statistical aspects of the problems and the proposed solutions, and discussing these solutions critically. We will also keep in mind other practical aspects like computation and demonstrate the ideas and results on real data when possible. Accordingly, the homework and the final exam will include a combination of hands-on programming and modeling with theoretical analysis.
Big Data is a general and rather vague term, typically referring to data and problems that share some of the following characteristics:
Some examples of typical Big Data domains gaining importance in recent years:
A key topic in data modeling in general and Big Data in particular is predictive modeling (regression, classification). Since the course Statistical Learning in the previous semester dealt mainly with exposition and statistical analysis of algorithms in this area, it will not be a focus of this course. However, some aspects of this area that were not covered in the previous course, in particular the p >> n case and efficient computation, will be discussed.
Tentative list of topics to be covered during the semester:
We will have 3-5 guest lectures during the semester, but they will be treated as regular classes rather than enrichment classes (specifically, their material will be included in the homework and the final).

Expected background

  1. Basic knowledge of mathematical foundations:
  2. Solid fundamentals in Probability: Discrete/continuous probability definitions; Important distributions: Bernoulli/Binomial, Poisson, Geometric, Hypergeometric, Negative Binomial, Normal, Exponential/Double Exponential (Laplace), Uniform, Beta, Gamma, etc.; Limit laws: large numbers and CLT; Inequalities: Markov, Chebyshev, Hoeffding
  3. Solid fundamentals in Statistics:
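To illustrate why the inequalities in item 2 matter at Big Data scale, here is a small Python comparison of the Chebyshev and Hoeffding tail bounds for the mean of n Bernoulli(p) samples (just the bounds themselves, not their derivations):

```python
import math

def chebyshev_bound(n, p, t):
    # P(|mean - p| >= t) <= Var / (n * t^2), with Var = p*(1-p)
    return p * (1 - p) / (n * t * t)

def hoeffding_bound(n, t):
    # Two-sided Hoeffding bound for i.i.d. [0,1]-valued variables:
    # P(|mean - E| >= t) <= 2 * exp(-2 * n * t^2)
    return 2.0 * math.exp(-2.0 * n * t * t)
```

For n = 1000, p = 0.5, t = 0.1, Chebyshev gives 0.025 while Hoeffding gives about 4e-9: exponential concentration is what makes simultaneous control of huge numbers of estimates possible.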

Books and resources

The course does not have a specific textbook, and most lectures will be on the board and not using slides. Some of the material will closely follow chapters from books or published papers, and when this is the case it will be announced. However, it is critical that all students have all the material presented in class. If you miss classes, make sure to get the material from someone!
Relevant books:
Elements of Statistical Learning by Hastie, Tibshirani & Friedman (including data and errata)
Modern Applied Statistics with S-PLUS by Venables and Ripley
Frontiers in Massive Data Analysis report from the National Research Council
The Algorithmic Foundations of Differential Privacy by Dwork and Roth


There will be four to five homework assignments, which will count for about 30% of the final grade, and a final take-home exam. Both the homework and the exam will combine theoretical analysis with hands-on data analysis.


The course will require use of statistical modeling software. It is strongly recommended to use R (freely available for PC/Unix/Mac).
The R Project website also contains extensive documentation.
A basic "getting you started in R" tutorial. Uses the Boston Housing Data (thanks to Giles Hooker).
Modern Applied Statistics with S-PLUS by Venables and Ripley is an excellent source of statistical computing help for R/S-PLUS.
