Elements of Statistical Learning by Hastie,
Tibshirani & Friedman
Announcements and handouts
from class 1 and
code I demonstrated in class. You can also get
the raw code.
(3 November) Homework 1 is now available. Due 24/11 in class. Submission in pairs is encouraged (but not in triplets or larger, please).
(10 November) Competition instructions are now available.
code demonstrated in class. Slides on bias-variance decomposition of linear regression.
(24 November) Code for running regularized linear regression and PCA on competition data.
(25 November) Homework 2 is now available. Uses writeup on quantile regression. Due 15/12 in class. Submission in pairs is encouraged (but not in triplets or larger, please).
(8 December) Code for running classification variants on competition data.
Code for running logistic regression on South African Heart data. Writeup on Poisson regression and variance stabilizing transformations.
Homework 3 is now available. Due 5/1 in class. Note that the last problem (6) relies on material not yet presented, and is subject to change.
(22 December) Class presentations:
wallet estimation using quantile regression;
Cup 2007 using Poisson regression.
Code for trees and bagging on competition data, and
of running bagging.
Homework 4 final version is now available. Due 2/2 in my mailbox.
(16 January) Competition final results: Congratulations to Keren, our winner at 0.7613. We will
hear her describe her approach on Monday.
The only other team that managed to break the 0.7655 barrier and get
close was team DanLiat at 0.7619.
Nine other teams broke the 0.77 bonus threshold.
Code for running kernel machines and boosting on the competition data.
of her competition-winning solution.
The goal of this course is to gain familiarity with the basic ideas and
methodologies of statistical (machine) learning. The focus is on
supervised learning and predictive modeling, i.e., fitting y ≈ f̂(x), in regression and classification.
We will start by thinking about some of the simpler, but still highly effective methods, like nearest
neighbors and linear regression, and gradually learn about more complex and "modern"
methods and their close relationships with the simpler ones.
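To make the contrast between these two simple methods concrete, here is a minimal sketch in Python (the course itself uses R). It fits y ≈ f̂(x) on simulated one-dimensional data, once with a global least-squares line and once with a local k-nearest-neighbor average. The data-generating model, the noise level, the sample sizes, and all function names are assumptions made for illustration and are not taken from the course material.

```python
import random

def fit_linear(xs, ys):
    """Least-squares fit of y ~ a + b*x; returns the fitted function f_hat."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_knn(xs, ys, k):
    """k-nearest-neighbor regression: average y over the k closest training x."""
    pairs = list(zip(xs, ys))
    def predict(x):
        nearest = sorted(pairs, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict

def mse(f_hat, xs, ys):
    """Mean squared prediction error of f_hat on the pairs (xs, ys)."""
    return sum((y - f_hat(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def simulate(n, rng):
    """Assumed toy model: y = 1 + 2x + Gaussian noise, x uniform on [0, 1]."""
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [1.0 + 2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = simulate(100, rng)
test_x, test_y = simulate(1000, rng)

linear = fit_linear(train_x, train_y)
knn_1 = fit_knn(train_x, train_y, k=1)
knn_15 = fit_knn(train_x, train_y, k=15)

for name, f_hat in [("linear", linear), ("1-NN", knn_1), ("15-NN", knn_15)]:
    print(name,
          "train MSE:", round(mse(f_hat, train_x, train_y), 4),
          "test MSE:", round(mse(f_hat, test_x, test_y), 4))
```

By construction, 1-NN reproduces every training response, so its training error is zero; its test error is what shows how well it actually predicts. Varying k moves the nearest-neighbor fit between very local (k = 1) and very smooth (large k) behavior, which is one simple way to see the local versus global modeling distinction.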
As time permits, we will also cover one or more industrial
"case studies" where we track the process from problem definition, through
development of appropriate methodology and its implementation, to deployment
of the solution and examination of its success in practice.
The homework and exam will combine hands-on programming and modeling with
theoretical analysis.
Topics list:
Introduction (text chap. 1,2): Local vs. global modeling; Overview of statistical considerations: Curse of dimensionality, bias-variance tradeoff; Selection of loss functions; Basis expansions and kernels
Linear methods for regression and their extensions (text chap. 3): Regularization, shrinkage and principal components regression; Quantile regression
Linear methods for classification (text chap. 4): Linear discriminant analysis; Logistic regression; Linear support vector machines (SVM)
Classification and regression trees (text chap. 9.2)
Model assessment and selection (text chap. 7): Bias-variance decomposition; In-sample error estimates, including Cp and BIC; Cross validation; Bootstrap methods
Basis expansions, regularization and kernel methods (text chap. 5,6): Splines and polynomials; Reproducing kernel Hilbert spaces and non-linear SVM
Committee methods in embedded spaces (material from chaps 8-10): Bagging and boosting
Case studies: Customer wallet estimation; Netflix prize competition; maybe others...
Basic knowledge of mathematical foundations: Calculus; Linear Algebra; Geometry
Undergraduate courses in: Probability; Theoretical Statistics
Statistical programming experience in R is not a prerequisite,
but it is an advantage.
Other recommended books:
Modern Applied Statistics with Splus by Venables and Ripley
Neural Networks for Pattern Recognition by Bishop (Several other books on Pattern Recognition contain similar material)
All of Statistics and All of Nonparametric Statistics by Wasserman
There will be about four homework assignments, which will count for
about 30% of the final grade, and a final take-home exam. Both the
homework and the exam will combine theoretical analysis with
hands-on data analysis.
We will also have an optional data modeling competition, whose
winners will get a boost in grade and present to the whole class.