Statistical/Machine Learning

Semester 2 2016-2017
Monday 16-19, Melamed Hall
Home page: ~saharon/StatLearn.html
Lecturer: Saharon Rosset
Schreiber 022
Office hours: Tuesday 14-16 (with email coordination)
Textbook: Elements of Statistical Learning by Hastie, Tibshirani & Friedman

Announcements and handouts

(13 March) Slides from class 1 and R code from class. You can also get just the raw code.
(20 March) Homework 1 is now available. Due 3/4 in class. Submission in pairs is encouraged (but not in triplets or larger, please).
(27 March) Competition instructions are now available. Associated code demonstrated in class.
Slides on bias-variance decomposition of linear regression.
(3 April) Code for running regularized linear regression and PCA on competition data.
Homework 2 is now available. Uses this writeup on quantile regression. Due 1/5 in class (updated from 24/4!). Submission in pairs is encouraged (but not in triplets or larger, please).
Update: Since there is no class on 1/5, the due date for HW2 has been postponed again, to 8/5.
(8 May) Code for running classification methods on competition data.
Code for running logistic regression on South African Heart dataset.
Homework 3 is now available, due 29/5 in class. Updated on 19/5!
(15 May) Code for running tree-based methods on competition data.
(19 May) Material from extra class:
Wallet estimation case study
Note on Poisson regression and variance stabilization
KDD-Cup 2007 case study
Yehuda Koren's presentation on $1M Netflix competition
***Updated Homework 3, with a problem on these topics added, still due on 29/5***
(4 June) Homework 4, due in the last class on 26/6.
Updated HW4 with problem on boosting added.
(11 June) Code for running boosting and support vector regression on competition data.
(13 June) Updated Homework 4, with problem on boosting added, due in the last class on 26/6.
(19 June) Blog post applying deep learning to image classification by former class student Giora Simchoni.
(22 June) Competition final results: Congratulations to group "The OOBs": Dana Kaner, Aviv Navon, Dor Bank, our winners at 0.7591. We will hear from them on Monday about how they got there.
Group "PowerPuffs" gets an honorable mention for being the only other group below 0.76.
Overall, we had 28 teams, of which 21 reached the bonus.


The goal of this course is to gain familiarity with the basic ideas and methodologies of statistical (machine) learning. The focus is on supervised learning and predictive modeling, i.e., fitting y ≈ f(x), in regression and classification.
We will start by thinking about some of the simpler, but still highly effective methods, like nearest neighbors and linear regression, and gradually learn about more complex and "modern" methods and their close relationships with the simpler ones.
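As a rough illustration of fitting y ≈ f(x) with the two simple methods named above, here is a minimal sketch in Python (the course itself recommends R). The data are synthetic, and the function names fit_linear and predict_knn are hypothetical, chosen for this example only; nothing here is taken from the course materials.

```python
import random


def fit_linear(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict_knn(xs, ys, x0, k=5):
    """k-nearest-neighbor regression: average y over the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x0))[:k]
    return sum(y for _, y in nearest) / len(nearest)


# Synthetic data (an assumption for illustration): y = 2x + 1 + Gaussian noise
rng = random.Random(0)
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

intercept, slope = fit_linear(xs, ys)
x0 = 5.0
print("linear prediction at x0:", intercept + slope * x0)
print("5-NN prediction at x0:  ", predict_knn(xs, ys, x0, k=5))
```

Both functions estimate f at a new point x0. The linear fit assumes a global straight-line form for f, while nearest neighbors only averages the responses observed near x0.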
As time permits, we will also cover one or more industrial "case studies" where we track the process from problem definition, through development of appropriate methodology and its implementation, to deployment of the solution and examination of its success in practice.
The homework and exam will combine hands-on programming and modeling with theoretical analysis. Topics list:


Prerequisites

Basic knowledge of mathematical foundations: Calculus; Linear Algebra; Geometry
Undergraduate courses in: Probability; Theoretical Statistics
Statistical programming experience in R is not a prerequisite, but it is an advantage

Books and resources

Elements of Statistical Learning by Hastie, Tibshirani & Friedman
Book home page (including downloadable PDF of the book, data and errata)

Other recommended books:
Computer Age Statistical Inference by Efron and Hastie
Modern Applied Statistics with Splus by Venables and Ripley
Neural Networks for Pattern Recognition by Bishop
(Several other books on Pattern Recognition contain similar material)
All of Statistics and All of Nonparametric Statistics by Wasserman

Online Resources:
Data Mining and Statistics by Jerry Friedman
Statistical Modeling: The Two Cultures by the late, great Leo Breiman
Machine Learning course from Stanford on Coursera.
The Netflix Prize competition is now over, but will still play a substantial role in our course.


Grading

There will be about four homework assignments, which will count for about 30% of the final grade, and a final take-home exam. Both the homework and the exam will combine theoretical analysis with hands-on data analysis.
We will also have an optional data modeling competition, whose winners will get a boost in grade and present to the whole class.


Computing

The course will require extensive use of statistical modeling software. It is strongly recommended to use R (freely available for PC/Unix/Mac) or its commercial kin Splus.
The R Project website also contains extensive documentation.
A basic "getting you started in R" tutorial. Uses the Boston Housing Data (thanks to Giles Hooker).
Modern Applied Statistics with Splus by Venables and Ripley is an excellent source for statistical computing help for R/Splus.

File translated from TEX by TTH, version 4.08.
On 22 Jun 2017, 12:19.