Statistical Learning and Big Data

Course at the University of Milano-Bicocca, October 2019
Home page: http://www.tau.ac.il/~saharon/StatLearn-Milan.html
Lecturer: Saharon Rosset
saharon@post.tau.ac.il
Textbook: Elements of Statistical Learning by Hastie, Tibshirani & Friedman

Announcements and handouts

Homework problems: 0 (warmup)
1 (Given 7 October, submission 8 October before class)
2 (Given 8 October, submission 9 October before class)
3 (uses the code in kNN-prob3.r) (Given 9 October, submission 10 October before class)
4 (Given 10 October, submission 11 October before class)
5+6 (weekend) (Due Monday 14 October before class)
7 (Due Tuesday 15 October before class)
8 (Due Wednesday 16 October by midnight)
9 (uses the code in AdaBoost.r) (Due Thursday 17 October by midnight)
10 (Requires installing the Keras R package, and also Python) (Due Sunday 20 October by midnight)
11 (Due Sunday 20 October by midnight)
(7 October) Slides from class 1
Code for analyzing prostate cancer data
Code for analyzing Netflix data.
(9 October) Slides on bias-variance decomposition of linear regression.
(10 October) Code for running regularized linear regression variants and PCA on Netflix data.
(11 October) Note on quantile regression for modeling conditional quantiles.
Case study presentation on estimating customer wallet and targeting at IBM.
(14 October) Code for running classification methods on Netflix data.
(15 October) Code for running tree methods, bagging and random forest on Netflix data.
(16 October) Code for running boosted trees on Netflix data.
(17 October) Giora Simchoni's blog post on distinguishing The Simpsons from South Park using convolutional neural networks.
(18 October) Notes on Poisson regression and variance stabilizing transformations.
Presentation on KDD-Cup 2007 based on the Netflix competition data.
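For quick reference alongside the 9 October slides: the standard bias-variance decomposition (a textbook result, stated here under the usual model y = f(x) + ε with E[ε] = 0 and Var(ε) = σ²; the slides themselves may use different notation) is

```latex
\mathbb{E}\big[(y_0 - \hat f(x_0))^2\big]
  \;=\; \sigma^2
  \;+\; \big(\mathbb{E}[\hat f(x_0)] - f(x_0)\big)^2
  \;+\; \mathrm{Var}\big(\hat f(x_0)\big)
```

i.e., irreducible noise, plus squared bias, plus the variance of the fitted model at the prediction point x₀.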

Syllabus

The goal of this course is to gain familiarity with the basic ideas and methodologies of statistical (machine) learning. The focus is on supervised learning and predictive modeling, i.e., fitting y ≈ f(x), in regression and classification.
We will start by thinking about some of the simpler, but still highly effective methods, like nearest neighbors and linear regression, and gradually learn about more complex and "modern" methods and their close relationships with the simpler ones.
As time permits, we will also cover one or more industrial "case studies" where we track the process from problem definition, through development of appropriate methodology and its implementation, to deployment of the solution and examination of its success in practice.
The homework and exam will combine hands-on programming and modeling with theoretical analysis. Topics list (we will cover some of these, as time permits).
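To make the starting point concrete, here is a minimal self-contained sketch of the two methods the syllabus names first — k-nearest-neighbor regression and simple linear regression — both fitting y ≈ f(x). It is written in Python rather than the course's R, and the toy data are made up for illustration:

```python
# Toy illustration (hypothetical data): fitting y ~ f(x) with the two
# "simple but effective" methods the course starts from.

def knn_predict(train, x0, k=3):
    """k-nearest-neighbor regression: average y over the k x-closest points."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in neighbors) / k

def linreg_fit(train):
    """Ordinary least squares for a single predictor: y ~ a + b*x."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    sxy = sum((x - mx) * (y - my) for x, y in train)
    b = sxy / sxx
    return my - b * mx, b  # intercept, slope

# Made-up data roughly following y = 2x.
train = [(0, 0.1), (1, 1.9), (2, 4.2), (3, 5.8), (4, 8.1)]

a, b = linreg_fit(train)
print("linear prediction at x=2.5:", round(a + b * 2.5, 2))
print("3-NN prediction at x=2.5:", round(knn_predict(train, 2.5), 2))
```

The two predictions differ because linear regression pools all the data through a global linear form, while kNN averages only locally — the contrast between global structural assumptions and local averaging that the course develops further.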

Prerequisites

Basic knowledge of mathematical foundations: Calculus; Linear Algebra; Geometry
Undergraduate courses in: Probability; Theoretical Statistics
Statistical programming experience in R is not a prerequisite, but it is an advantage

Books and resources

Textbook:
Elements of Statistical Learning by Hastie, Tibshirani & Friedman
Book home page (including downloadable PDF of the book, data and errata)

Other recommended books:
Computer Age Statistical Inference by Efron and Hastie
Modern Applied Statistics with S-PLUS by Venables and Ripley
Neural Networks for Pattern Recognition by Bishop
(Several other books on Pattern Recognition contain similar material)
All of Statistics and All of Nonparametric Statistics by Wasserman

Online Resources:
Data Mining and Statistics by Jerry Friedman
Statistical Modeling: The Two Cultures by the late, great Leo Breiman
Stanford's Machine Learning course on Coursera.
The Netflix Prize competition is now over, but will still play a substantial role in our course.

Course work and grading

The grading will be based on a combination of homework and a final exam. Given the short format of the course, a single homework problem will be given every day after class. The problems will combine theory and applied work in R. Of the roughly eleven problems given, you will have to solve and submit seven, which will account for about 10% of the course grade. If you submit more than seven, the best seven will count.
Those who do not require a grade are strongly encouraged to look at the homework each day, as a way of engaging with the material and challenging their understanding. Some of the problems will be solved in class the day after they are given.

Computing

The course will require use of statistical modeling software. It is strongly recommended to use R (freely available for PC/Unix/Mac) or its commercial kin S-PLUS.
The R Project website also contains extensive documentation.
A basic "getting you started in R" tutorial, using the Boston Housing data (thanks to Giles Hooker).
Modern Applied Statistics with S-PLUS by Venables and Ripley is an excellent source of statistical computing help for R/S-PLUS.


