Semester 2 2018

Wednesday 13-16, Kaplun 118

Home page: http://www.tau.ac.il/~saharon/BigData.html


Lecturer: Saharon Rosset

Office: Schreiber 022

Email: saharon@post.tau.ac.il

Office hrs: Thursday 16-18 (with email coordination)

Signup link for take home final (deadline 1 June)

This homework uses the Nature paper from 2009 introducing Google Flu Trends (GFT), and the Science paper from 2014 describing the failure of GFT since 2011.

The differential privacy topic, started this week and continuing next week, is based on The Algorithmic Foundations of Differential Privacy by Dwork and Roth.
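For concreteness, the basic building block in Dwork and Roth is the Laplace mechanism. Here is a minimal sketch (illustrative code, not from the book; the function name and parameters are mine):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value plus Laplace noise of scale sensitivity/epsilon,
    which gives epsilon-differential privacy for a query with that L1 sensitivity."""
    if rng is None:
        rng = np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query has sensitivity 1; smaller epsilon means more noise
noisy = laplace_mechanism(true_value=1042, sensitivity=1.0, epsilon=0.1)
```

Note the privacy/accuracy trade-off: the noise scale grows as epsilon shrinks, so stronger privacy guarantees mean less accurate released answers.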

The next two classes (26/3 and 11/4) will deal with high dimensional modeling (large p, p >> n). We will discuss the statistical and computational challenges that are unique to this setting and some of the most popular solutions. Relevant reading materials include chapters 2-3 of ESL, this review I wrote on sparse modeling, and the papers on LARS by Efron et al. and its generalization by Rosset and Zhu.
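As a taste of the sparse-modeling material, here is a minimal coordinate-descent lasso. This is not LARS itself (which traces the entire regularization path), just an illustrative sketch in Python (the course code is in R), assuming standardized columns:

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xb||^2 + lam*||b||_1 by cyclic coordinate descent.
    Assumes each column of X has mean 0 and variance 1."""
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ b + X[:, j] * b[j]          # partial residual excluding j
            rho = X[:, j] @ r / n                   # univariate least-squares fit
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft-thresholding
    return b
```

Larger lam drives more coefficients to exactly zero, which is the source of both the statistical (variable selection) and computational (sparsity) appeal in the p >> n setting.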

Problem 1 uses train.csv and test.csv datasets, and there is also sample code in sparse.r.

Problem 2 extra credit uses this paper.

Recommended reading:

Visualizing and Understanding Convolutional Networks

Efficient Estimation of Word Representations in Vector Space

Generative Adversarial Networks (Important topic we did not get to discuss)

It uses the code HW3-1.r (which requires installing the Keras R package, and also Python if you don't have it) and HW3-2.r.

Today's class uses the survey by Goldberg et al., which appeared in 2010 in the Foundations and Trends in Machine Learning series.

Code from class for fitting models to the Sampson monks and E. coli networks.

Today's class focuses on multiple testing and selective inference in big data settings.

Slides on quality preserving databases.

Some slides from Yoav Benjamini that cover many aspects of the discussion: Part 1 (pdf), Part 2 (pptx)

Why most published research is wrong by Ioannidis
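Since much of this discussion centers on false discovery rate control, here is a minimal sketch of the Benjamini-Hochberg step-up procedure (illustrative code; function name and interface are mine, not from the slides):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean rejection mask controlling the FDR at level q
    (Benjamini-Hochberg step-up procedure, assuming independent p-values)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # find the largest k with p_(k) <= (k/m) * q
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.where(below)[0].max()
        reject[order[:k + 1]] = True   # reject all hypotheses up to rank k
    return reject
```

Unlike Bonferroni's fixed threshold q/m, the step-up rule adapts to the observed p-value distribution, which is what makes it practical at the scales discussed in class.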


- It is big (obviously): this could mean having many observations (large n), many features/variables (large p), or both
- It has additional structure information: temporal, spatial, graph structure (like network data), etc.
- It leads to non-traditional modeling problems, like network evolution, collaborative filtering, structured learning, etc.
- It presents significant practical challenges in handling and modeling the data, including:
  - The need to maintain privacy and security of the data while sharing it and extracting information from it
  - The difficulty in storing and performing calculations at scale
  - The difficulty in correctly interpreting the data and generating valid statistical modeling problems from it

- The full extent of its potential utility is unclear and subject to research

- Internet usage data, including social network information, search and advertising information, etc.
- Health records and related information
- Scientific databases, including areas like particle physics, electron microscopy and genetics
- Images and video surveillance data

- Network modeling: Probabilistic models of network evolution; Parameter estimation and inference
- Privacy: Differential privacy; Algorithms to guarantee privacy in different settings; Examples of privacy breaches
- Efficient computation in predictive modeling: Regularization path algorithms; Stochastic gradient descent
- Statistical validity of scientific research on modern data: Replicability; Sequential testing on public databases
- Spectral analysis of large random matrices: statistical and computational issues
- p >> n: Sparsity and computation
- Deep learning: theory and methodology
- Turning data into modeling: Competitions and proof of concept projects; Leakage in data mining
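To make the "stochastic gradient descent" item above concrete, here is a minimal sketch for least squares (illustrative code; function name and default parameters are mine):

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, n_epochs=50, seed=0):
    """Fit least-squares coefficients by stochastic gradient descent:
    one randomly ordered observation per update instead of the full gradient."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            grad = (X[i] @ b - y[i]) * X[i]   # gradient of (1/2)(x_i'b - y_i)^2
            b -= lr * grad
    return b
```

Each update touches a single row of X, so the cost per step is O(p) regardless of n; this is why SGD and its variants dominate when n is too large for exact solvers.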

- Basic knowledge of mathematical foundations:
  - Calculus: Integration; Sums of series; Extrema, etc.
  - Linear algebra/geometry: Basic properties of matrices: inverse, trace, determinant; SVD and eigendecompositions; PCA and its geometrical and statistical interpretations

- Solid fundamentals in Probability:
  - Discrete/continuous probability definitions
  - Important distributions: Bernoulli/Binomial, Poisson, Geometric, Hypergeometric, Negative Binomial, Normal, Exponential/Double Exponential (Laplace), Uniform, Beta, Gamma, etc.
  - Limit laws: law of large numbers and CLT
  - Inequalities: Markov, Chebyshev, Hoeffding
  - Conditional distributions and moments: Basic definitions and Bayes rule; Definitions and properties of conditional expectation and variance; Laws: iterated expectation, total variance; Intuition vs. mathematics in conditional probabilities: Simpson's paradox, etc.
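As a quick refresher on that last point, the classic kidney-stone treatment numbers exhibit Simpson's paradox in a few lines of arithmetic:

```python
# Classic kidney-stone example: (successes, total) per treatment and stone size
groups = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

# Within every subgroup, treatment A has the higher success rate...
for name, d in groups.items():
    rate_a = d["A"][0] / d["A"][1]
    rate_b = d["B"][0] / d["B"][1]
    print(name, round(rate_a, 3), ">", round(rate_b, 3))

# ...yet aggregated over subgroups, B looks better (A: 273/350, B: 289/350),
# because A was given mostly to the harder, large-stone cases.
overall_a = (81 + 192) / (87 + 263)
overall_b = (234 + 55) / (270 + 80)
print(round(overall_a, 3), "<", round(overall_b, 3))
```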

- Solid fundamentals in Statistics:
  - The equivalent of a course in Statistical Theory: Basic definitions: estimation, confidence intervals, hypothesis testing; basic properties of statistical tests and estimators: level, power, p-values, bias, variance, consistency, etc.; Basic theoretical results: Neyman-Pearson Lemma, Rao-Blackwell, Cramer-Rao, Wilks; Important families of tests: Z, t, F, χ^{2}, GLRT; Bayesian inference: basics and uses
  - The equivalent of a course in Regression / Analysis of Variance: Algebra and geometry of multivariate regression; Inference in linear regression; Error decompositions: bias + variance; Generalizations: logistic regression, auto-regression, model selection (C_{p}/AIC); Basic ANOVA; Familiarity with mixed/random effects models
- Advantage: Some knowledge in modern/nonparametric statistics and/or statistical learning; practical experience with R

Elements of Statistical Learning by Hastie, Tibshirani & Friedman (including freely available pdf, data and errata)

Frontiers in Massive Data Analysis report from the National Research Council

Computer Age Statistical Inference by Efron and Hastie

The R Project website also contains extensive documentation.

A basic "getting you started in R" tutorial. Uses the Boston Housing Data (thanks to Giles Hooker).
