Statistical/Machine Learning

Semester 2 2013
Tuesday 13-16, Orenstein 111
Home page on ~saharon/StatLearn.html
Lecturer: Saharon Rosset
Schreiber 022
Office hours: Thursday 16-18 or by appointment
Textbook: Elements of Statistical Learning by Hastie, Tibshirani & Friedman

Announcements and handouts

(26 February) Slides from class 1 and R code I demonstrated in class. You can also get just the raw code.
(5 March) Homework 1 is now available. Due 9 April in class.
Competition instructions and competition sample code.
(21 March) Homework 2 is now available. It uses the notes on quantile regression. Due 23 April in class.
Slides on geometry of linear regression and its bias-variance decomposition.
Code for fitting ridge, lasso and PCA regression on our competition data, for those who want an early start. I will discuss this code in the next class on 9 April.
(5 May) Homework 3 is now available. This is now the final version (virtually unchanged from the preliminary one).
(7 May) Class presentations: Wallet estimation using quantile regression and KDD-Cup 2007.
Class notes on Poisson distribution and variance stabilizing transformations.
(19 May) Competition update: Moving week in our competition! We now have four teams in the bonus, with the leader at 0.7656.
(21 May) Code for fitting trees and bagging to our competition data.
(26 May) Competition update: New leaders are well clear of the field at 0.7597. Still four teams total in the bonus.
(10 June) Competition update: leaders are still at 0.7597, still well clear of the field. We already have seven teams in the bonus!
(10 June) Homework 4 is updated with questions about Yehuda Koren's talk.
(12 June) Code for fitting boosted trees and SVR to our competition data.
Note: Teams that have passed the bonus threshold in the competition are welcome to use boosting to improve their scores. However, I will not accept new submissions that reach the bonus threshold by using boosting. In other words, you first have to break the 0.77 barrier without using it.


The goal of this course is to gain familiarity with the basic ideas and methodologies of statistical (machine) learning. The focus is on supervised learning and predictive modeling, i.e., fitting y ≈ f(x), in regression and classification.
We will start by thinking about some of the simpler but still highly effective methods, such as nearest neighbors and linear regression, and gradually learn about more complex and "modern" methods and their close relationships with the simpler ones.
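To make the y ≈ f(x) setup concrete, here is a minimal sketch, not taken from the course materials, of nearest neighbors and linear regression applied to a single predictor. It is written in plain Python only to keep it self-contained (the course itself recommends R). The function names knn_predict and fit_line and the synthetic data are assumptions made up for illustration.

```python
import random


def knn_predict(x_train, y_train, x0, k=5):
    """Predict at x0 by averaging the responses of the k closest training points."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k


def fit_line(x_train, y_train):
    """Ordinary least squares for y = intercept + slope * x with one predictor."""
    n = len(x_train)
    x_bar = sum(x_train) / n
    y_bar = sum(y_train) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(x_train, y_train))
    sxx = sum((x - x_bar) ** 2 for x in x_train)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Synthetic data (hypothetical, for illustration only): y = 2 + 3x + Gaussian noise.
rng = random.Random(0)
x_train = [rng.uniform(0, 1) for _ in range(200)]
y_train = [2 + 3 * x + rng.gauss(0, 0.1) for x in x_train]

intercept, slope = fit_line(x_train, y_train)
x0 = 0.5
print("linear regression prediction at x0:", intercept + slope * x0)
print("5-nearest-neighbor prediction at x0:", knn_predict(x_train, y_train, x0, k=5))
```

Both functions estimate the same f at x0: the linear fit uses all of the data under a global straight-line assumption, while nearest neighbors uses only a local average of the k closest points.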
As time permits, we will also cover one or more industrial "case studies" where we track the process from problem definition, through development of appropriate methodology and its implementation, to deployment of the solution and examination of its success in practice.
The homework and exam will combine hands-on programming and modeling with theoretical analysis. Topics list:


Prerequisites

Basic knowledge of mathematical foundations: Calculus; Linear Algebra; Geometry
Undergraduate courses in: Probability; Theoretical Statistics
Statistical programming experience in R is not a prerequisite, but it is an advantage

Books and resources

Elements of Statistical Learning by Hastie, Tibshirani & Friedman
Book home page (including data and errata)

Other recommended books:
Modern Applied Statistics with Splus by Venables and Ripley
Neural Networks for Pattern Recognition by Bishop
(Several other books on Pattern Recognition contain similar material)
All of Statistics and All of Nonparametric Statistics by Wasserman

Online Resources:
Data Mining and Statistics by Jerry Friedman
Statistical Modeling: The Two Cultures by the late, great Leo Breiman
Course on Machine Learning from Stanford, offered on Coursera.
The Netflix Prize competition is now over, but will still play a substantial role in our course.


Grading

There will be about four homework assignments, which will count for about 30% of the final grade, and a final take-home exam. Both the homework and the exam will combine theoretical analysis with hands-on data analysis.
We will also have an optional data modeling competition, whose winners will get a boost in grade and present to the whole class.


Computing

The course will require extensive use of statistical modeling software. It is strongly recommended to use R (freely available for PC/Unix/Mac) or its commercial kin Splus.
The R Project website also contains extensive documentation.
A basic "getting you started in R" tutorial. Uses the Boston Housing Data (thanks to Giles Hooker).
Modern Applied Statistics with Splus by Venables and Ripley is an excellent source for statistical computing help for R/Splus.
