Semester 1 2021-22

Monday, 12-15

Schreiber 007

Home page at http://www.tau.ac.il/~saharon/BigData.html


Lecturer: Saharon Rosset

Office: Schreiber 022

Email: saharon@tauex.tau.ac.il

Office hrs: By email coordination

Class 1 recording. Note that it only includes the first hour and the last half hour, since I unfortunately forgot to turn on the recording for the middle portion (the jump is around minute 52 of the video). However, the missing portion closely follows the class notes.

Class notes are available.

Homework 0 (warmup) is now available. Due 25/10 before class; submission is by email to me. Submission in pairs is encouraged (but not in triplets or larger, please).

This homework uses the Nature paper from 2009 introducing Google Flu Trends (GFT), and the Science paper from 2014 describing the failure of GFT since 2011.

Class 2 recording

Class notes are available.

Code for demonstrating privacy violation in releasing GWAS summaries.

Class 3 recording

Class notes.

Main source for today's material: The Algorithmic Foundations of Differential Privacy by Dwork and Roth

Homework 1 is now available, due before class on 8/11. Submission in pairs is encouraged. This 2009 paper by Jacobs et al. may be used as reference for problem 1.

Class 4 recording

Class notes.

Sources for today's material: The Algorithmic Foundations of Differential Privacy by Dwork and Roth, and A Statistical Framework for Differential Privacy by Wasserman and Zhou
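The central construction in both of these sources is the Laplace mechanism. As a minimal sketch (in Python rather than the R used elsewhere in the course; the sensitivity of 1 below is the standard value for a counting query, not something taken from the class materials):

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value plus Laplace(sensitivity/epsilon) noise.
    For a query with the given L1 sensitivity, this satisfies
    epsilon-differential privacy."""
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=scale)

# Example: privately release a count (a counting query has sensitivity 1)
rng = np.random.default_rng(0)
private_count = laplace_mechanism(true_value=1000, sensitivity=1.0,
                                  epsilon=0.1, rng=rng)
```

Smaller epsilon means a larger noise scale and hence stronger privacy at the cost of accuracy, which is the privacy/utility trade-off discussed in class.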

The next 2-3 classes will deal with high-dimensional modeling (large p, p >> n). We will discuss the statistical and computational challenges that are unique to this setting and some of the most popular solutions. Relevant reading materials include chapters 2-3 of ESL, this review I wrote on sparse modeling, and the papers on LARS by Efron et al. and its generalization by Rosset and Zhu. We will also discuss compressed sensing, with the most relevant references being Candes et al. (2005) and Meinshausen and Yu (2009).

Class 5 recording
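The workhorse computation behind the sparse-modeling references above can be sketched in a few lines: the Lasso solved by cyclic coordinate descent with soft-thresholding (a Python illustration, not the course's R code; it assumes the columns of X are standardized):

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator, the building block of Lasso coordinate descent."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Lasso by cyclic coordinate descent.
    Assumes the columns of X have mean 0 and unit variance, so each
    coordinate update is a single soft-thresholding step."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]  # partial residual excluding x_j
            beta[j] = soft_threshold(X[:, j] @ r_j / n, lam)
    return beta
```

Larger lam drives more coefficients exactly to zero, which is the sparsity property that makes the Lasso attractive when p >> n.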

Class 5 notes.


Homework 2 is now available. Due 29 November, before class.

Problem 1 uses train.csv and test.csv datasets, and there is also sample code in sparse.r.

Problem 2 extra credit uses this paper.

Class 6 recording

Class 6 notes.

Class 7 recording

Class 7 notes.

After completing the LARS-Lasso discussion today, we will start the topic of network modeling.

Class 8 recording

Class 8 notes.

Code for fitting models to the Sampson monks and E. coli networks.

Class 9 recording

Class 9 notes

Class 10 recording

Slides for Boaz Nadler's presentation: Introduction, Parts 1&2 of his talk, Part 3 of his talk.

Class 11 recording

Class 11 notes

QPD presentation

Homework 3 is now available. The first problem uses the files covtrain.csv, covtest.csv with code hints in pca.r. The second problem relies on topics from Yoav Benjamini's talk, and refers to the Science paper by Zeggini et al.

Class 12 recording (unfortunately there are some problems with the video at the beginning, and the screen share did not work properly for most of the talk; however, the full presentation is available).

Yoav Benjamini's presentation
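Benjamini's work on replicability centers on false discovery rate control. As a minimal sketch of the standard Benjamini-Hochberg step-up procedure (a Python illustration of the textbook method, not code from the talk itself):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure at FDR level q.
    Reject the hypotheses with the k smallest p-values, where k is
    the largest i such that p_(i) <= i*q/m. Returns a boolean mask
    of rejections in the original order of pvals."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)
    thresholds = q * np.arange(1, m + 1) / m      # i*q/m for i = 1..m
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.where(below)[0])            # largest passing rank (0-based)
        reject[order[:k + 1]] = True
    return reject
```

The step-up structure (taking the largest passing rank, not the first failure) is what distinguishes BH from a naive per-test threshold and gives it the FDR guarantee under independence.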

Class 13 recording

- It is big (obviously): this could mean having many observations (large n), many features/variables (large p), or both
- It has additional structure information: temporal, spatial, graph structure (like network data), etc.
- It leads to non-traditional modeling problems, like network evolution, collaborative filtering, structured learning, etc.
- It presents significant practical challenges in handling the data and modeling it, including:
- The need to maintain privacy and security of the data while sharing it and extracting information from it
- The difficulty in storing and performing calculations at scale
- The difficulty in correctly interpreting the data and generating valid statistical modeling problems from it

- The full extent of its potential utility is unclear and subject to research

- Internet usage data, including social network information, search and advertising information, etc.
- Health records and related information
- Scientific databases, including areas like particle physics, electron microscopy and genetics
- Images and video surveillance data

- Network modeling: Probabilistic models of network evolution; Parameter estimation and inference
- Privacy: Differential privacy; Algorithms to guarantee privacy in different settings; Examples of privacy breaches
- Efficient computation in predictive modeling: Regularization path algorithms; Stochastic gradient descent
- Statistical validity of scientific research on modern data: Replicability; Sequential testing on public databases
- Spectral analysis of large random matrices: statistical and computational issues
- p >> n: Sparsity and computation
- Deep learning: theory and methodology
- Turning data into modeling: Competitions and proof of concept projects; Leakage in data mining
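One of the topics above, stochastic gradient descent, can be sketched in a few lines: plain least squares fit with one-observation updates (a Python toy illustrating the large-n regime where full-gradient passes are too expensive; not a course implementation):

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=50, seed=0):
    """Minimal SGD for squared-error loss: each update uses the
    gradient of a single observation's loss 0.5*(x_i'beta - y_i)^2,
    visiting the observations in a fresh random order every epoch."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        for i in rng.permutation(n):
            grad_i = (X[i] @ beta - y[i]) * X[i]  # per-observation gradient
            beta -= lr * grad_i
    return beta
```

Each update costs O(p) regardless of n, which is the whole point: a single pass over the data already makes progress that a batch method would need a full O(np) gradient to match.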

- Basic knowledge of mathematical foundations:
- Calculus: Integration; Sums of series; Extrema, etc.
- Linear algebra/geometry: Basic properties of matrices: inverse, trace, determinant; SVD and eigen decompositions: PCA and its geometrical and statistical interpretations
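The SVD-PCA connection listed above can be summarized in a short sketch (Python/numpy, purely illustrative): center the data, take the top right singular vectors as principal directions, and read component variances off the singular values.

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD: for centered X = U S V', the rows of V' are the
    principal directions, and S^2/(n-1) gives the component variances."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                       # top-k principal directions (k x p)
    scores = Xc @ components.T                # projections of the data onto them
    explained_var = s[:k] ** 2 / (X.shape[0] - 1)
    return components, scores, explained_var
```

This is the geometrical interpretation mentioned above: PCA rotates the centered data into the orthonormal basis along which variance decomposes, ordered from largest to smallest.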

- Solid fundamentals in Probability: Discrete/continuous probability definitions; Important distributions: Bernoulli/Binomial, Poisson, Geometric, Hypergeometric, Negative Binomial, Normal, Exponential/Double Exponential (Laplace), Uniform, Beta, Gamma, etc.; Limit laws: large numbers and CLT; Inequalities: Markov, Chebyshev, Hoeffding
- Conditional distributions and moments: Basic definitions and Bayes rules; Definitions and properties of conditional expectation and variance; Laws: Iterated expectation, total variation; Intuition vs mathematics in conditional probabilities: Simpson's paradox etc.

- Solid fundamentals in Statistics:
- The equivalent of a course in Statistical Theory: Basic definitions: Estimation, confidence intervals, hypothesis testing, basic properties of statistical tests and estimators: Level, power, p-values, bias, variance, consistency etc.; Basic theoretical results: Neyman-Pearson Lemma, Rao-Blackwell, Cramer-Rao, Wilks; Important families of tests: Z, t, F, χ², GLRT; Bayesian inference: Basics and uses
- The equivalent of a course in Regression / Analysis of Variance: Algebra and geometry of multivariate regression; Inference in linear regression; Error decompositions: Bias + Variance; Generalizations: Logistic regression, auto-regression, model selection (C_p/AIC); Basic ANOVA; Familiarity with mixed/random effects models
- Advantage: Some knowledge in modern/nonparametric statistics and/or statistical learning; Practical experience with R


Elements of Statistical Learning by Hastie, Tibshirani & Friedman (including freely available pdf, data and errata).

Frontiers in Massive Data Analysis report from the National Research Council

Computer Age Statistical Inference by Efron and Hastie

R Project website also contains extensive documentation.

A basic "getting you started in R" tutorial. Uses the Boston Housing Data (thanks to Giles Hooker).

File translated from TeX on 03 Jan 2022, 15:11.