Statistical/Machine Learning

 
Semester 2 2011
Wednesday 13-16, Schreiber 007
Home page: http://www.tau.ac.il/~saharon/StatLearn.html
Lecturer: Saharon Rosset
Schreiber 022
saharon@post.tau.ac.il
Office hours: Thursday 14-16 or by appointment
Textbook: Elements of Statistical Learning by Hastie, Tibshirani & Friedman
(6 copies in the Exact Sciences library)

Final take home exam

Final instructions and Final rules are now available.
Note: you must choose one of the two optional date ranges for taking the exam by 27 May.

Announcements and handouts

(23 February) Slides from class 1 and R code I demonstrated in class.
(2 March) Homework 1 is now available. Due 23 March in class.
Competition instructions are also available. Sample R code: regression, k-NN. Special thanks to Giles Hooker who originally prepared this data for his class at Cornell.
(9 March) Slides on mathematical setup and geometric interpretation of linear regression and its regularized versions.
(16 March) R code from class to run PCA and regularization on competition data.
(23 March) Homework 2 is now available. Due 13 April (class before Pesach).
(25 March) Special Friday extra class: 25/3, 10-12, Schreiber 7. This class will cover material outside the main flow of the course: quantile regression and its use in an industrial case study.
Notes for this class:
Note on quantile regression.
PowerPoint presentation on customer wallet estimation using quantile regression (PDF version with some minor conversion issues).
(6 April) R code from class to run classification approaches on competition data.
(14 April) Homework 3 is now available. Due 4 May in class.
(27 April) R code from class to run trees and bagging on competition data.
(4 May) Homework 4 is now available. Due 1 June in class (last class).
Slides of Yehuda Koren's talk.
(20 May) Class notes and KDD Cup 2007 presentation from extra class.
(25 May) R code for running boosting and support vector machines on competition data.
(29 May) Final Competition Results: Our clear-cut winners, and the only double-bonus winners, are the team NoFreeLunch (Shachar and Itamar).
Congratulations! They will tell us about their methods on Tuesday in class.
All other teams except one reached the single bonus level.

Syllabus

The goal of this course is to gain familiarity with the basic ideas and methodologies of statistical (machine) learning. The focus is on supervised learning and predictive modeling, i.e., fitting y ≈ f(x), in regression and classification.
We will start by thinking about some of the simpler, but still highly effective methods, like nearest neighbors and linear regression, and gradually learn about more complex and "modern" methods and their close relationships with the simpler ones.
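As a rough illustration of the two simple methods named above, the sketch below fits y ≈ f(x) to a tiny made-up sample, once by least-squares linear regression and once by k-nearest-neighbor averaging. It is not course material: the data, the function names fit_line and knn_predict, and the choice of Python (the course itself uses R) are all assumptions made for this example.

```python
# Toy illustration of fitting y ~ f(x) in two ways.
# All data and helper names here are hypothetical, not taken from the course.

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def knn_predict(xs, ys, x0, k=3):
    """k-NN regression: average y over the k training points closest to x0."""
    nearest = sorted(zip(xs, ys), key=lambda p: abs(p[0] - x0))[:k]
    return sum(y for _, y in nearest) / k

# Made-up training sample, roughly y = 2x + 1 with small perturbations.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]

intercept, slope = fit_line(train_x, train_y)
x_new = 3.5
linear_pred = intercept + slope * x_new                # global linear fit
knn_pred = knn_predict(train_x, train_y, x_new, k=2)   # local average
print(f"linear: {linear_pred:.2f}, 2-NN: {knn_pred:.2f}")
```

The linear fit uses every training point to estimate two global parameters, while k-NN uses only the points near x_new. That contrast between a rigid global model and a flexible local one is the kind of relationship the course goes on to examine.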
As time permits, we will also cover one or more industrial "case studies" where we track the process from problem definition, through development of appropriate methodology and its implementation, to deployment of the solution and examination of its success in practice.
The homework and exam will combine hands-on programming and modeling with theoretical analysis.
Topics list:

Prerequisites

Basic knowledge of mathematical foundations: Calculus; Linear Algebra; Geometry
Undergraduate courses in: Probability; Theoretical Statistics
Statistical programming experience in R is not a prerequisite, but it is an advantage

Books and resources

Textbook:
Elements of Statistical Learning by Hastie, Tibshirani & Friedman
Book home page (including data and errata)

Other recommended books:
Modern Applied Statistics with Splus by Venables and Ripley
Neural Networks for Pattern Recognition by Bishop
(Several other books on Pattern Recognition contain similar material)
All of Statistics and All of Nonparametric Statistics by Wasserman

Online Resources:
Data Mining and Statistics by Jerry Friedman
Statistical Modeling: The Two Cultures by the late, great and highly opinionated Leo Breiman
Two-day course on Machine Learning given by Berkeley RAD Lab
Tutorial on data mining from Two Crows corporation
The Netflix Prize competition is now over, but will still play a substantial role in our course.

Grading

There will be about four homework assignments, which will count for about 30% of the final grade, and a final take-home exam. Both the homework and the exam will combine theoretical analysis with hands-on data analysis.
We will also have an optional data modeling competition, whose winners will get a boost in grade and present to the whole class.

Computing

The course will require extensive use of statistical modeling software. It is strongly recommended to use R (freely available for PC/Unix/Mac) or its commercial kin Splus.
The R Project website also contains extensive documentation.
A basic "getting you started in R" tutorial, which uses the Boston Housing Data (thanks to Giles Hooker).
Modern Applied Statistics with Splus by Venables and Ripley is an excellent source for statistical computing help for R/Splus.


