Elements of Statistical Learning by Hastie, Tibshirani & Friedman
(6 copies in the Exact Sciences library)
Final take-home exam
Final instructions and Final rules are now available. Note that you are required to choose
between the two optional date ranges for taking the exam by 27 May.
Announcements and handouts
(23 February) Slides from class 1 and R code I demonstrated in class.
(2 March) Homework 1 is now available. Due 23 March in class. Competition
instructions are also available. Sample R code: regression, k-NN. Special
thanks to Giles Hooker, who originally prepared this data for his class at
Cornell.
(9 March) Slides on mathematical setup and geometric interpretation of linear
regression and its regularized versions.
(16 March) R code from class to run PCA and regularization on competition
data.
(23 March) Homework 2 is now available. Due 13 April (class before Pesach).
(25 March) Special Friday extra class: 25/3, 10-12, Schreiber
7. This class will cover material outside the main flow of the
course: quantile regression and its use in an industrial case
study.
Notes for this class: Note on quantile regression. PowerPoint presentation on customer wallet estimation using
quantile regression (PDF version with some minor conversion issues).
(6 April) R code from class to run classification approaches on competition
data.
(14 April) Homework 3 is now available. Due 4 May in class.
(27 April) R code from class to run trees and bagging on competition data.
(4 May) Homework 4 is now available. Due 1 June in class (last class). Slides
of Yehuda Koren's talk.
(20 May) Class notes and KDD Cup 2007 presentation from extra class.
(25 May) R code for running boosting and support vector machines on competition data.
(29 May) Final Competition Results: Our clear-cut winners, and the
only double-bonus winners, are team NoFreeLunch (Shachar and Itamar).
Congratulations! They will tell us about their methods on Tuesday in
class.
All other teams except one reached the single-bonus level.
Syllabus
The goal of this course is to gain familiarity with the basic ideas and
methodologies of statistical (machine) learning. The focus is on
supervised learning and predictive modeling, i.e., fitting y ≈ f̂(x), in regression
and classification.
We will start with some of the simpler but still highly effective methods, such as nearest
neighbors and linear regression, and gradually learn about more complex and "modern"
methods and their close relationships with the simpler ones.
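As a rough illustration of the two starting points named above, here is a minimal sketch comparing a global linear fit with a local k-nearest-neighbor fit on simulated one-dimensional data. It is written in plain Python rather than the R used in class, and everything in it (the quadratic truth, the noise level, k = 10, the function names) is an assumption chosen for illustration, not course material.

```python
import random

def fit_linear(xs, ys):
    """Least-squares fit of y = a + b*x with a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_knn(xs, ys, k):
    """k-nearest-neighbor regression: average y over the k closest training x."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(f, xs, ys):
    """Mean squared prediction error of f on the points (xs, ys)."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def truth(x):
    # Hypothetical nonlinear regression function used to simulate data.
    return x * x

random.seed(0)
train_x = [random.uniform(-1, 1) for _ in range(200)]
train_y = [truth(x) + random.gauss(0, 0.1) for x in train_x]
test_x = [random.uniform(-1, 1) for _ in range(200)]
test_y = [truth(x) + random.gauss(0, 0.1) for x in test_x]

linear_mse = mse(fit_linear(train_x, train_y), test_x, test_y)
knn_mse = mse(fit_knn(train_x, train_y, k=10), test_x, test_y)
print("linear test MSE:", linear_mse)
print("10-NN test MSE: ", knn_mse)
```

With a strongly nonlinear truth like this one, the local method should come out ahead; with a truly linear truth and little data, the global linear fit would typically be the better choice.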
As time permits, we will also cover one or more industrial
"case studies" where we track the process from problem definition, through
development of appropriate methodology and its implementation, to deployment
of the solution and examination of its success in practice.
The homework and exam will combine hands-on programming and modeling with
theoretical analysis.
Topics list:
Introduction (text chap. 1,2): Local vs. global modeling; Overview of statistical considerations: Curse of dimensionality, bias-variance tradeoff; Selection of loss functions; Basis expansions and kernels
Linear methods for regression and their extensions (text chap. 3): Regularization, shrinkage and principal components regression; Quantile regression
Linear methods for classification (text chap. 4): Linear discriminant analysis; Logistic regression; Linear support vector machines (SVM)
Classification and regression trees (text chap. 9.2)
Model assessment and selection (text chap. 7): Bias-variance decomposition; In-sample error estimates, including Cp and BIC; Cross validation; Bootstrap methods
Basis expansions, regularization and kernel methods (text chap. 5,6): Splines and polynomials; Reproducing kernel Hilbert spaces and non-linear SVM
Committee methods in embedded spaces (material from chaps 8-10): Bagging and boosting
Case studies: Customer wallet estimation; Netflix prize competition; maybe others...
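To make the "selection of loss functions" and "quantile regression" items above a little more concrete, here is a small sketch of the check (pinball) loss that quantile regression minimizes in place of squared error. It is plain Python rather than R, covers only the simplest case of fitting a single constant, and uses made-up data and hypothetical function names.

```python
def check_loss(residual, tau):
    """Check (pinball) loss: tau * r for r >= 0, (tau - 1) * r for r < 0."""
    return tau * residual if residual >= 0 else (tau - 1) * residual

def total_loss(c, ys, tau):
    """Total check loss when every observation is predicted by the constant c."""
    return sum(check_loss(y - c, tau) for y in ys)

def best_constant(ys, tau):
    """Constant minimizing the total check loss.

    The loss is piecewise linear and convex in c, so a minimizer can always
    be found among the observed values; we simply search over them.
    """
    return min(sorted(ys), key=lambda c: total_loss(c, ys, tau))

ys = [1, 2, 3, 4, 100]          # made-up data with one large outlier
print(best_constant(ys, 0.5))   # tau = 0.5 recovers the sample median
print(best_constant(ys, 0.9))   # a high tau targets the upper tail
print(sum(ys) / len(ys))        # the squared-error minimizer (the mean), for contrast
```

Choosing tau = 0.5 gives a fit that is robust to the outlier, while a large tau deliberately chases the upper tail of the response, which is the property a wallet-estimation style application relies on.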
Prerequisites
Basic knowledge of mathematical foundations: Calculus; Linear Algebra; Geometry
Undergraduate courses in: Probability; Theoretical Statistics
Statistical programming experience in R is not a prerequisite,
but it is an advantage
Books and resources
Textbook: Elements of Statistical Learning by Hastie, Tibshirani & Friedman. Book home page (including data and errata)
Other recommended books: Modern Applied Statistics with Splus by Venables and Ripley; Neural Networks for Pattern Recognition by Bishop (several other books on pattern recognition contain similar material); All of Statistics and All of Nonparametric Statistics by Wasserman
There will be about four homework assignments, which will count for about 30% of the final grade, and a final take-home exam.
Both the homework and the exam will combine theoretical analysis with hands-on data analysis.
We will also have an optional data modeling competition, whose
winners will get a boost in grade and present to the whole class.
Computing
The course will require extensive use of statistical modeling software. It is strongly recommended to use R (freely available for PC/Unix/Mac) or its commercial kin Splus. The R Project website also contains extensive documentation. A basic "getting you started in R" tutorial is available; it uses the Boston Housing Data (thanks to Giles Hooker). Modern Applied Statistics with Splus by Venables and Ripley is an excellent source of statistical computing help for R/Splus.
File translated from TEX by TTHgold, version 4.00, on 29 May 2011, 14:37.