Sunday 12-15, Shenkar 204

Home page on http://www.tau.ac.il/~saharon/BigData.html

Lecturer: Saharon Rosset
Office: Schreiber 022
Email: saharon@post.tau.ac.il
Office hrs: Thursday 16-18 (with email coordination)

This homework uses the Nature paper from 2009 introducing Google Flu Trends (GFT), and the Science paper from 2014 describing the failure of GFT since 2011.

Problem 1 uses the train.csv and test.csv datasets; there is also sample code in sparse.r.

Problem 2 extra credit uses this paper.

On 26 April Yoav Benjamini will give the class, using slides: Part 1 (pdf), Part 2 (pptx)

Sunday's class will be based on the survey by Goldberg et al. (that appeared in 2010 in the Foundations and Trends in Machine Learning series).

The E. coli models did finish running eventually. You can see a two-dimensional embedding of ecoli1 here. As can be seen, there is a lot of structure (a dense area in the middle, sparse sub-networks around it), and perhaps some clustering structure as well.

- It is big (obviously): this could mean having many observations (large n), many features/variables (large p), or both
- It has additional structure information: temporal, spatial, graph structure (like network data), etc.
- It leads to non-traditional modeling problems, like network evolution, collaborative filtering, structured learning, etc.
- It presents significant practical challenges in handling the data and modeling it, including:
  - The need to maintain privacy and security of the data while sharing it and extracting information from it
  - The difficulty of storing the data and performing calculations at scale
  - The difficulty of correctly interpreting the data and generating valid statistical modeling problems from it
- The full extent of its potential utility is unclear and subject to research

- Internet usage data, including social network information, search and advertising information, etc.
- Health records and related information
- Scientific databases, including areas like particle physics, electron microscopy and genetics
- Images and video surveillance data

- Network modeling: Probabilistic models of network evolution; Parameter estimation and inference
- Privacy: Differential privacy; Algorithms to guarantee privacy in different settings; Examples of privacy breaches
- Efficient computation in predictive modeling: Regularization path algorithms; Stochastic gradient descent
- Statistical validity of scientific research on modern data: Replicability; Sequential testing on public databases
- Streaming and online data analysis
- Spectral analysis of large random matrices: statistical and computational issues
- p >> n: Sparsity and computation
- Turning data into modeling: Competitions and proof of concept projects; Leakage in data mining
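The stochastic gradient descent item above can be made concrete with a minimal sketch (in Python for brevity, though the course materials use R): each update uses the gradient at a single randomly chosen observation of a least-squares regression. The data, step size, and iteration count are all invented for illustration.

```python
import numpy as np

# Stochastic gradient descent for least-squares linear regression:
# each update uses the gradient at one randomly chosen observation.
# Toy data; step size and iteration count are arbitrary choices.
rng = np.random.default_rng(0)
n, p = 1000, 5
X = rng.normal(size=(n, p))
beta_true = np.arange(1.0, p + 1)           # true coefficients 1..5
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta = np.zeros(p)
step = 0.01
for _ in range(20000):
    i = rng.integers(n)                     # pick one observation
    resid = X[i] @ beta - y[i]
    beta -= step * resid * X[i]             # grad of (x_i'b - y_i)^2 / 2

print(np.round(beta, 1))                    # close to [1. 2. 3. 4. 5.]
```

One full pass over the data never happens here; the point of SGD at scale is exactly that each step touches a single row, so cost per step is O(p) regardless of n.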

- Basic knowledge of mathematical foundations:
  - Calculus: Integration; Sums of series; Extrema, etc.
  - Linear algebra/geometry: Basic properties of matrices (inverse, trace, determinant); SVD and eigendecompositions; PCA and its geometrical and statistical interpretations
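As a quick illustration of the SVD/PCA connection listed above, here is a small Python sketch on synthetic data (Python is used only for illustration; the course materials are in R): the squared singular values of the centered data matrix, divided by n−1, equal the eigenvalues of the sample covariance.

```python
import numpy as np

# PCA via the SVD of the centered data matrix Xc = U S V': the columns
# of V are the principal directions, U*S are the scores, and S^2/(n-1)
# are the eigenvalues of the sample covariance. Synthetic data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])
Xc = X - X.mean(axis=0)

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * S                              # principal-component scores
var_explained = S**2 / (len(Xc) - 1)

# The same numbers from the covariance eigendecomposition:
eig = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]
print(np.allclose(var_explained, eig))      # True
```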

- Solid fundamentals in Probability:
  - Discrete/continuous probability definitions; Important distributions: Bernoulli/Binomial, Poisson, Geometric, Hypergeometric, Negative Binomial, Normal, Exponential/Double Exponential (Laplace), Uniform, Beta, Gamma, etc.; Limit laws: the law of large numbers and the CLT; Inequalities: Markov, Chebyshev, Hoeffding
  - Conditional distributions and moments: Basic definitions and Bayes rule; Definitions and properties of conditional expectation and variance; Laws: iterated expectation, total variance; Intuition vs. mathematics in conditional probabilities: Simpson's paradox, etc.
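The Simpson's paradox item above can be seen in a few lines. This Python sketch (illustration only; the course materials are in R) uses the classic kidney-stone treatment counts commonly quoted for the paradox, not course data:

```python
# Simpson's paradox: treatment A has the higher success rate within
# each stratum, yet B looks better when the strata are pooled. Counts
# are the classic kidney-stone illustration, as (successes, trials).
strata = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, trials):
    return successes / trials

for group, arms in strata.items():
    print(group, "A beats B:", rate(*arms["A"]) > rate(*arms["B"]))

totals = {arm: (sum(strata[g][arm][0] for g in strata),
                sum(strata[g][arm][1] for g in strata))
          for arm in ("A", "B")}
print("pooled, B beats A:", rate(*totals["B"]) > rate(*totals["A"]))
# All three comparisons print True: the ordering flips under pooling.
```

The flip happens because A was given mostly to the harder (large-stone) cases, so the stratum is a confounder that pooling ignores.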

- Solid fundamentals in Statistics:
  - The equivalent of a course in Statistical Theory: Basic definitions: estimation, confidence intervals, hypothesis testing; Basic properties of statistical tests and estimators: level, power, p-values, bias, variance, consistency, etc.; Basic theoretical results: Neyman-Pearson Lemma, Rao-Blackwell, Cramer-Rao, Wilks; Important families of tests: Z, t, F, χ², GLRT; Bayesian inference: basics and uses
  - The equivalent of a course in Regression / Analysis of Variance: Algebra and geometry of multivariate regression; Inference in linear regression; Error decompositions: bias + variance; Generalizations: logistic regression, auto-regression, model selection (Cp/AIC); Basic ANOVA; Familiarity with mixed/random effects models
  - Advantage: Some knowledge of modern/nonparametric statistics and/or statistical learning; Practical experience with R
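The bias + variance error decomposition in the regression prerequisite can be verified by simulation. A short Python sketch (illustration only; the course itself uses R) with an invented shrinkage estimator of a normal mean:

```python
import numpy as np

# Checking E[(estimator - theta)^2] = bias^2 + variance by simulation,
# for the shrinkage estimator c*y of a normal mean theta. The setup
# (theta, sigma, c) is invented purely for illustration.
rng = np.random.default_rng(2)
theta, sigma, c, reps = 2.0, 1.0, 0.8, 200_000

y = rng.normal(theta, sigma, size=reps)
est = c * y

mse = np.mean((est - theta) ** 2)          # simulated mean squared error
bias2 = (c * theta - theta) ** 2           # (E[c*y] - theta)^2
var = (c * sigma) ** 2                     # Var(c*y)
print(round(mse, 2), round(bias2 + var, 2))   # both about 0.8
```

Shrinking (c < 1) trades a little bias for a variance reduction; here bias² = 0.16 and variance = 0.64, and the simulated MSE matches their sum.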

The Elements of Statistical Learning by Hastie, Tibshirani & Friedman (including data and errata)

Frontiers in Massive Data Analysis report from the National Research Council

The Algorithmic Foundations of Differential Privacy by Dwork and Roth

The R Project website also contains extensive documentation.

A basic "getting you started in R" tutorial. Uses the Boston Housing Data (thanks to Giles Hooker).

File translated from TeX on 18 Jun 2015, 11:20.