Semester 1 2022-2023
Tuesday 15-18, Kaplun 118
Announcements and Handouts
Notes for class 1
R code for class 1
Make-up for class 2 (canceled for election day), held on 4 November:
Notes for class 2
Homework 1 (due 15 November before class). Submission in pairs is encouraged, but please no groups of three or more.
Notes for class 3
Slides demonstrating the bias-variance decomposition in Fixed-X linear regression from a geometric perspective
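To make the decomposition concrete, here is a small simulation sketch (not taken from the slides; the fixed design, the true mean function, and the noise level are all invented for illustration). A deliberately misspecified linear fit is applied to a nonlinear truth, and the average squared bias and variance add up to the total estimation error:

    # Fixed-X bias-variance decomposition, estimated by simulation:
    # E||f.hat - mu||^2 = ||E[f.hat] - mu||^2 + E||f.hat - E[f.hat]||^2
    set.seed(1)
    n <- 50
    x <- seq(0, 1, length.out = n)   # fixed design
    mu <- sin(2 * pi * x)            # nonlinear true mean
    sigma <- 0.5                     # noise level
    R <- 2000                        # Monte Carlo replications

    # Redraw y around the fixed mean and refit a (misspecified) line each time
    fits <- replicate(R, {
      y <- mu + rnorm(n, sd = sigma)
      fitted(lm(y ~ x))
    })

    f.bar <- rowMeans(fits)               # estimate of E[f.hat] at each x
    bias2 <- mean((f.bar - mu)^2)         # average squared bias
    vars  <- mean(apply(fits, 1, var))    # average variance
    mse   <- mean((fits - mu)^2)          # average total error
    c(bias2 = bias2, variance = vars, sum = bias2 + vars, mse = mse)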
Competition instructions are now available. Some code to read and examine the training data.
Notes for class 4
Homework 2 due 6 December before class (you may find this writeup on quantile regression helpful). Submission in pairs is encouraged, but please no groups of three or more.
Notes for class 5
R code for running regularized regression methods on the Netflix dataset
In the first part of the class we will complete the discussion of Lasso from last week's notes.
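For reference while reading the notes, here is a generic Lasso sketch (not the course handout), using the glmnet package on simulated data; the dimensions and the sparse coefficient vector are invented:

    # Lasso via glmnet: cross-validate the penalty, inspect sparse coefficients
    library(glmnet)
    set.seed(1)
    n <- 200; p <- 50
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(3, -2, 1.5, rep(0, p - 3))    # sparse truth
    y <- drop(X %*% beta) + rnorm(n)

    cv <- cv.glmnet(X, y, alpha = 1)        # alpha = 1 gives the lasso penalty
    plot(cv)                                # CV error as a function of lambda
    coef(cv, s = "lambda.min")              # most coefficients shrunk to exactly zero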
Notes on classification for class 6
R code for running classification methods on the Netflix dataset
Notes on classification for class 7
Homework 3 due 20 December before class. Submission in pairs is encouraged, but please no groups of three or more.
Notes on trees for class 8
R code for running tree-based methods on the Netflix dataset
Notes on Random Forest and Boosting for class 9
R code for running boosting methods on the Netflix dataset
Notes on AdaBoost and Model Selection for class 10
Homework 4 due 17 January before class (no extensions!). Submission in pairs is encouraged, but please no groups of three or more.
Notes on Deep learning for class 11 (not final)
Notes on kernel machines for class 12 (not final)
R code for running ϵ-SVR on the competition data
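For orientation, a generic ϵ-SVR sketch (not the competition handout), assuming the e1071 package and simulated one-dimensional data:

    # eps-SVR with a radial kernel on a noisy sine curve
    library(e1071)
    set.seed(1)
    x <- matrix(runif(100), ncol = 1)
    y <- sin(2 * pi * x[, 1]) + rnorm(100, sd = 0.2)
    fit <- svm(x, y, type = "eps-regression", kernel = "radial",
               cost = 10, epsilon = 0.1)    # epsilon sets the insensitive tube
    ord <- order(x[, 1])
    plot(x[, 1], y)
    lines(x[ord, 1], predict(fit, x)[ord], col = "red")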
15 January: Competition final results: 15 teams, 14 in the bonus; our clear winner is team AlphFlix (Yuval Alaluf) at 0.7448. No one else is below 0.751!
Note on Poisson regression and variance stabilization
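To see the two ideas side by side, here is a small sketch on simulated counts (the log-linear mean is made up; this is not taken from the note itself):

    # Poisson regression vs. least squares after a variance-stabilizing transform
    set.seed(1)
    n <- 500
    x <- runif(n)
    y <- rpois(n, lambda = exp(1 + 2 * x))      # counts with log-linear mean

    fit.glm <- glm(y ~ x, family = poisson)     # likelihood-based fit
    summary(fit.glm)$coefficients

    # sqrt(y) has approximately constant variance (about 1/4),
    # so ordinary least squares on the transformed counts is reasonable
    fit.vst <- lm(sqrt(y) ~ x)
    summary(fit.vst)$coefficients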
KDD-cup 2007 presentation
Wallet estimation presentation (as time permits)
The goal of this course is to gain familiarity with the basic ideas and
methodologies of statistical (machine) learning. The focus is on
supervised learning and predictive modeling, i.e., fitting y ≈ f̂(x), in regression
and classification settings.
We will start by thinking about some of the simpler, but still highly effective methods, like nearest
neighbors and linear regression, and gradually learn about more complex and "modern"
methods and their close relationships with the simpler ones.
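As a small taste of that contrast, here is a toy comparison on simulated data (invented for illustration, not a course handout): a global linear fit next to a hand-rolled k-nearest-neighbor average.

    # Global vs. local fitting on a noisy sine curve
    set.seed(1)
    x <- runif(100)
    y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)

    fit.lm <- lm(y ~ x)                    # global: one line for all of x

    knn.predict <- function(x0, k = 10) {  # local: average the k nearest y's
      mean(y[order(abs(x - x0))[1:k]])
    }

    grid <- seq(0, 1, length.out = 200)
    plot(x, y)
    lines(grid, predict(fit.lm, data.frame(x = grid)), col = "blue")
    lines(grid, sapply(grid, knn.predict), col = "red")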
As time permits, we will also cover one or more industrial
"case studies" where we track the process from problem definition, through
development of appropriate methodology and its implementation, to deployment
of the solution and examination of its success in practice.
The homework and exam will combine hands-on programming and modeling with
theoretical analysis. Topics list:
- Introduction (text chap. 1,2): Local vs. global modeling; Overview of statistical considerations: Curse of dimensionality, bias-variance tradeoff; Selection of loss functions; Basis expansions and kernels
- Linear methods for regression and their extensions (text chap. 3): Regularization, shrinkage and principal components regression; Quantile regression
- Linear methods for classification (text chap. 4): Linear discriminant analysis; Logistic regression; Linear support vector machines (SVM)
- Classification and regression trees (text chap. 9.2)
- Model assessment and selection (text chap. 7): Bias-variance decomposition; In-sample error estimates, including Cp and BIC; Cross validation (a minimal sketch appears after this list); Bootstrap methods
- Basis expansions, regularization and kernel methods (text chap. 5,6): Splines and polynomials; Reproducing kernel Hilbert spaces and non-linear SVM
- Committee methods in embedded spaces (material from chaps 8-10): Bagging and boosting
- Deep learning and its relation to statistical learning
- Case studies: Customer wallet estimation; Netflix prize competition; maybe others...
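As referenced in the model assessment item above, here is a minimal K-fold cross-validation sketch in base R; the data and the linear model are simulated placeholders, not course material:

    # 5-fold cross-validation estimate of squared prediction error
    set.seed(1)
    n <- 100; K <- 5
    x <- runif(n)
    y <- sin(2 * pi * x) + rnorm(n, sd = 0.3)
    fold <- sample(rep(1:K, length.out = n))    # random fold assignment

    cv.err <- sapply(1:K, function(k) {
      train <- fold != k
      fit <- lm(y ~ x, subset = train)
      mean((y[!train] - predict(fit, data.frame(x = x[!train])))^2)
    })
    mean(cv.err)    # the CV estimate of expected squared prediction error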
Prerequisites
Basic knowledge of mathematical foundations: Calculus; Linear Algebra; Geometry
Undergraduate courses in: Probability; Theoretical Statistics
Statistical programming experience in R is not a prerequisite, but it is an advantage.
Books and resources
Elements of Statistical Learning by Hastie, Tibshirani & Friedman
Book home page (including downloadable PDF of the book, data and errata)
Other recommended books:
Computer Age Statistical Inference by Efron and Hastie
Modern Applied Statistics with S-PLUS by Venables and Ripley
Neural Networks for Pattern Recognition by Bishop
(Several other books on Pattern Recognition contain similar material)
All of Statistics and All of Nonparametric Statistics by Wasserman
Data Mining and Statistics by Jerry Friedman
Statistical Modeling: The Two Cultures by the late, great Leo Breiman
Machine Learning course from Stanford on Coursera.
The Netflix Prize competition
is now over, but will still play a substantial role in our course.
There will be about four homework assignments, which will count for
about 30% of the final grade, and a final take-home project.
We will also have an optional data modeling competition; the winners will receive a grade boost and present their solutions to the whole class.
The course will require extensive use of statistical modeling software. It is recommended to use R (freely available for PC/Unix/Mac).
The R Project website also contains extensive documentation.
A basic "getting you started in R" tutorial. Uses the Boston Housing Data (thanks to Giles Hooker).
Modern Applied Statistics with S-PLUS by Venables and Ripley is an excellent source for statistical computing help for R/S-PLUS.
Using Python is also possible; the main downside is that the code I hand out (which in many cases is also useful for the homework) is in R.