(8 March) Class is canceled for 15/3 (the second class of
the semester); a replacement class will be given on Friday 22/5, 9-12.
(8 March) Homework 0
(warmup) is now available. Due 22/3 in class. Submission in pairs is
encouraged (but not in triplets or larger, please).
This homework uses the
2009 paper introducing Google Flu Trends (GFT), and the
2014 paper describing the failure of GFT since 2011.
Code for investigating privacy of summary statistics release, and
the paper by Wasserman and Zhou on statistical theory of differential privacy.
Homework 1 (slightly updated 7 April) is
now available, due on 19 April in class. Submission in pairs is encouraged.
The 2009 paper by Jacobs et al. may be used as a reference for problem 1.
(11 April) The next two classes (12/4 and 19/4) will
deal with high dimensional modeling (large p, p >> n). We will
discuss the statistical and computational challenges that are unique
to this setting and some of the most popular solutions. Relevant
reading materials include chapters 2-3 of a
review I wrote on sparse modeling, and the papers on
LARS by Efron
et al. and its
generalization by Rosset and Zhu.
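As a small illustration of the large-p, small-n setting these classes deal with, here is a minimal sketch (not course code; the data, parameters, and penalty level are all illustrative assumptions) of fitting a sparse linear model with the lasso using scikit-learn:

```python
# Illustrative sketch: lasso regression in the p >> n regime.
# The l1 penalty drives most coefficients exactly to zero,
# making estimation feasible even with far more features than observations.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 50, 500, 5                     # n observations, p >> n features, k truly active
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 3.0                           # sparse ground truth: only k nonzero coefficients
y = X @ beta + 0.5 * rng.standard_normal(n)

model = Lasso(alpha=0.3).fit(X, y)       # alpha controls the strength of the l1 penalty
selected = np.flatnonzero(model.coef_)   # indices of features the lasso kept
print("number of nonzero coefficients:", len(selected))
```

The LARS algorithm by Efron et al. computes the entire path of such solutions over all values of the penalty at roughly the cost of a single least-squares fit.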
Homework 2 is now available. Due 10 May in class.
Problem 1 uses the linked
datasets, and there is also sample code available.
Problem 2 (extra credit) uses this paper.
On 26 April Yoav Benjamini will give the class, using slides (pptx).
(2 May) Slides on
quality preserving databases, to accompany tomorrow's lecture.
Slides on estimating covariance matrices, from Boaz Nadler's lecture today.
The last part of the talk was on the board, about sparse covariance
estimation, mainly based on Bickel and Levina (2008), referenced
from the slides.
Slides (ppt) for Sunday's class.
Homework 3 is now available (part 2.(b)ii updated 22/5).
Problem 1 uses the linked
datasets, and there is some code help available.
Problem 2 refers to the
paper by Zeggini et al.
(22 May) Extra class: Friday, 22 May, 9-12 in Kaplun.
A version of class notes prepared by Liran Katzir is available, including a proof
of variance reduction for the last algorithm presented.
(28 May) On 31 May the class will be given by Yoel
Shkolinsky and will run from 12:10-13:45 only. It will deal with
electron microscopy data and its analysis.
A version of the slides from Yoel is available.
(5 June) HW4 is now
available; due date is 2 July.
Sunday's class will be based on the
survey by Goldberg et
al. (which appeared in 2010 in Foundations and Trends in Machine Learning).
Code from class for fitting models to the Sampson monks and E-Coli
networks. The E-Coli models did finish running eventually. You can see an
embedding of ecoli1 into two dimensions.
As can be seen, there is a lot of structure (a dense area in the middle,
sparse sub-networks around), and perhaps some clustering structure.
(14 June) Code I showed in class today,
on circles in a square.
The goal of this course is to present some of the unique
statistical challenges that the new era of Big Data brings, and
discuss their solutions. The course will be a topics course,
meaning we will discuss various aspects that may not necessarily be
related or linearly organized. Our challenge will be to cover a
wide range of topics, while being specific and concrete in
describing the statistical aspects of the problems and the proposed
solutions, and discussing these solutions critically.
We will also keep in mind other practical aspects like computation
and demonstrate the ideas and results on real data when possible.
Accordingly, the homework and the final exam will include a
combination of hands-on programming and modeling with theoretical analysis.
Big Data is a general and rather vague term, typically referring to
data and problems that share some of the following characteristics:
It is big (obviously): this could mean having many
observations (large n), many features/variables (large p), or both
It has additional structure information: temporal, spatial,
graph structure (like network data), etc.
It leads to non-traditional modeling problems, like network
evolution, collaborative filtering, structured learning, etc.
It presents significant practical challenges in handling the
data and modeling it, including:
The need to maintain privacy and security of the data while
sharing it and extracting information from it
The difficulty in storing and performing calculations at scale
The difficulty in correctly interpreting the data and
generating valid statistical modeling problems from it
The full extent of its potential utility is unclear and subject to research
Some examples of typical Big Data domains gaining
importance in recent years:
Internet usage data, including social network information, search
and advertising information, etc.
Health records and related information
Scientific databases, including areas like particle physics,
electron microscopy and genetics
Images and video surveillance data
A key topic in data modeling in general and Big Data in particular
is predictive modeling (regression, classification). Since the
course Statistical Learning in the previous semester dealt mainly
with exposition and statistical analysis of algorithms in this area,
it will not be a focus of this course. However, some aspects of this
area that were not covered in the previous course, in particular the
p >> n case and efficient computation, will be discussed.
Tentative list of topics to be covered during the semester:
Network modeling: Probabilistic models of network evolution;
Parameter estimation and inference
Privacy: Differential privacy; Algorithms to guarantee privacy
in different settings; Examples of privacy breaches
Statistical validity of scientific research on modern data:
Replicability; Sequential testing on public databases
Streaming and online data analysis
Spectral analysis of large random matrices: statistical and computational aspects
p >> n: Sparsity and computation
Turning data into modeling: Competitions and proof of concept
projects; Leakage in data mining
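Some of the topics above admit short concrete illustrations. For instance, a basic algorithm for guaranteeing differential privacy is the Laplace mechanism: add Laplace noise scaled to the query's sensitivity divided by the privacy parameter epsilon. The sketch below is hypothetical example code, not course material; the data and the choice of epsilon are made up for illustration.

```python
# Illustrative sketch of the Laplace mechanism for differential privacy.
# Releasing a query answer plus Laplace(sensitivity/epsilon) noise gives
# epsilon-differential privacy for that query.
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Add Laplace(sensitivity/epsilon) noise to a query answer."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

rng = np.random.default_rng(1)
data = np.array([1, 0, 1, 1, 0, 1, 0, 1])   # private bits, one per individual
true_count = int(data.sum())                # counting query: sensitivity is 1
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
print(f"true count = {true_count}, private release = {private_count:.2f}")
```

Smaller epsilon means stronger privacy but noisier (less useful) answers; this accuracy/privacy trade-off is one of the themes of the privacy topic.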
We will have 3-5 guest lectures during the semester, but they will
be treated as regular classes rather than enrichment classes
(specifically, their material will be included in the homework and
Basic knowledge of mathematical foundations:
Calculus: Integration; Sums of series; Extrema, etc.
Linear algebra/geometry: Basic properties of matrices:
inverse, trace, determinant; SVD and eigen decompositions: PCA and
its geometrical and statistical interpretations
Solid fundamentals in Probability: Discrete/continuous
probability definitions; Important distributions:
Bernoulli/Binomial, Poisson, Geometric, Hypergeometric, Negative
Binomial, Normal, Exponential/Double Exponential (Laplace), Uniform,
Beta, Gamma, etc.; Limit laws: large numbers and CLT; Inequalities:
Markov, Chebyshev, Hoeffding
Conditional distributions and moments: Basic definitions and
Bayes rules; Definitions and properties of conditional expectation
and variance; Laws: Iterated expectation, total variance; Intuition
vs mathematics in conditional probabilities: Simpson's paradox, etc.
Solid fundamentals in Statistics:
The equivalent of a course in
Statistical Theory: Basic definitions: Estimation, confidence
intervals, hypothesis testing, basic properties of statistical tests
and estimators: Level, power, p-values, bias, variance, consistency
etc.; Basic theoretical results: Neyman-Pearson Lemma,
Rao-Blackwell, Cramér-Rao, Wilks; Important families of tests: Z, t, F, χ², GLRT; Bayesian inference: Basics and uses
The equivalent of a course in Regression / Analysis of
Variance: Algebra and geometry of multivariate regression; Inference
in linear regression; Error decompositions: Bias + Variance;
Generalizations: Logistic regression, auto-regression, model
selection (Cp/AIC); Basic ANOVA; Familiarity with mixed/random effects models
Advantage: Some knowledge in modern/nonparametric statistics
and/or statistical learning; Practical experience with R
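Simpson's paradox, mentioned in the prerequisites above, is a good example of intuition failing in conditional probabilities: a treatment can have the higher success rate within every subgroup yet the lower rate in aggregate. The sketch below demonstrates this with the classic kidney-stone numbers (an illustrative example, not course material):

```python
# Simpson's paradox with the classic kidney-stone success counts:
# small stones: treatment A 81/87  vs treatment B 234/270
# large stones: treatment A 192/263 vs treatment B 55/80
a_small, a_small_n = 81, 87
b_small, b_small_n = 234, 270
a_large, a_large_n = 192, 263
b_large, b_large_n = 55, 80

# Within each subgroup, treatment A has the higher success rate...
assert a_small / a_small_n > b_small / b_small_n
assert a_large / a_large_n > b_large / b_large_n

# ...but aggregated over both subgroups, the ordering reverses,
# because A was applied mostly to the harder (large-stone) cases.
a_rate = (a_small + a_large) / (a_small_n + a_large_n)
b_rate = (b_small + b_large) / (b_small_n + b_large_n)
assert a_rate < b_rate
print(f"A overall: {a_rate:.3f}, B overall: {b_rate:.3f}")
```

The reversal is driven by the confounding between treatment assignment and case difficulty, which is exactly why conditioning matters.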
Books and resources
The course does not have a specific textbook, and most lectures will
be on the board and not using slides. Some of the material will
closely follow chapters from books or published papers, and when
this is the case it will be announced. However, it is critical that
all students have all the material presented in class. If you miss
classes, make sure to get the material from someone!
Relevant books:
Elements of Statistical Learning by Hastie, Tibshirani & Friedman (including data and errata)
Modern Applied Statistics with S-PLUS by Venables and Ripley
Frontiers in Massive Data Analysis, a report from the National Research Council
The Algorithmic Foundations of Differential Privacy by Dwork and Roth
There will be four to five homework assignments, which will count for
about 30% of the final grade, and a final take-home exam. Both the
homework and the exam will combine theoretical analysis with
hands-on data analysis.