Statistical Learning, Fall 2022-2023

Class Competition: Netflix Data

 

The Netflix Prize

The Netflix Prize was a competition to predict user ratings of movies. Netflix provided ratings of 17,770 movie titles by 480,189 users, along with the date of each rating. The task was to predict ratings for about 2.8 million user-movie-date triples that are not in the training set; all the users and movies in this test set appear in the training set. Netflix judged performance by root mean squared error on the test set and offered a $1,000,000 reward to the first team to improve on the performance of Netflix's own system by more than 10%. The prize was won in 2009. Details of the Netflix Prize are available at:

www.netflixprize.com

Class Competition

Because the Netflix Prize involves a very large data set and a non-standard problem (you could be asked to predict for any movie), the class competition simplifies the problem considerably. The training data provide the ratings of 10,000 users for 99 movies, along with the dates at which the ratings were made. The first 14 of these movies were rated by all users; the remaining 85 may have missing values. The outcome is the rating that each user gave to a further movie ("Miss Congeniality", 2000); you are also given the date on which this rating was made.

The task is to predict the rating for this movie by a further 2931 users in the test set. As with the training set, all users in the test set rated the first 14 movies, while the remaining 85 may have missing values. The test set provides the same information as the training set: the dates and ratings of these 99 movies, along with the date of the rating for "Miss Congeniality". As with the Netflix Prize, performance will be measured by root mean squared error (RMSE) on the test set.
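
For concreteness, RMSE is the square root of the average squared difference between predicted and actual ratings. A minimal sketch in Python (the numbers below are made up for illustration only):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Tiny example: a constant prediction of 3.7 for five users
print(rmse([4, 3, 5, 2, 4], [3.7] * 5))
```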

Data Sets

The data for the competition are available as tab-delimited text files in the class home directory, http://www.tau.ac.il/~saharon/StatsLearn2022/ . They include the following files (a short loading sketch in Python follows the list):

·         train_ratings_all.dat: The ratings that the users in the training data set gave to each of the 99 movies.

·         test_ratings_all.dat: Same info for the test set.

·         train_dates_all.dat: The date at which each of the ratings above was made.

·         test_dates_all.dat: Same info for the test set.

·         train_y_rating.dat: The ratings that the users in the training set gave to "Miss Congeniality".

·         train_y_date.dat: The dates at which the training set users rated "Miss Congeniality".

·         test_y_date.dat: Same info for the test set.

·         movie_titles.txt: Names and release dates for the 99 movies, given in the same order as the columns in the data above.
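
Below is a minimal loading sketch in Python with pandas. It assumes the .dat files are plain tab-delimited matrices with no header row, as the description above suggests; check the first few lines of each file and adjust paths or separators if your copies differ.

```python
import pandas as pd

# Ratings and dates: one row per user, one column per movie,
# columns in the same order as movie_titles.txt
train_X      = pd.read_csv("train_ratings_all.dat", sep="\t", header=None)
train_dates  = pd.read_csv("train_dates_all.dat",   sep="\t", header=None)
train_y      = pd.read_csv("train_y_rating.dat",    header=None).iloc[:, 0]
train_y_date = pd.read_csv("train_y_date.dat",      header=None).iloc[:, 0]

test_X       = pd.read_csv("test_ratings_all.dat",  sep="\t", header=None)
test_dates   = pd.read_csv("test_dates_all.dat",    sep="\t", header=None)
test_y_date  = pd.read_csv("test_y_date.dat",       header=None).iloc[:, 0]

print(train_X.shape, test_X.shape)  # expect (10000, 99) and (2931, 99)
```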

Some notes

·         Ratings are from 1 to 5. A value of 0 indicates a missing entry.

·         For convenience, dates are given as number of days from January 1, 1997.

·         Missing dates are labeled '0000' (a short handling sketch in Python follows these notes).
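
The snippet below sketches one way to apply these conventions: recode the 0 / '0000' markers as NaN and, if useful, convert the day counts to calendar dates. Tiny stand-in matrices are used so it runs on its own; in practice the same steps would be applied to the matrices from the loading sketch above.

```python
import numpy as np
import pandas as pd

# Stand-in for the matrices from the loading sketch above
train_X = pd.DataFrame([[4, 0, 3], [0, 5, 2]])                   # ratings, 0 = missing
train_dates = pd.DataFrame([[1500, 0, 1600], [0, 1700, 1800]])   # day counts, 0 = missing

# Recode the 0 / '0000' missing markers as NaN so they are not treated as real values
ratings = train_X.replace(0, np.nan)
days = train_dates.replace(0, np.nan)

# Dates are day counts from January 1, 1997; convert to calendar dates if useful
calendar = pd.Timestamp("1997-01-01") + pd.to_timedelta(days.stack(), unit="D")
print(calendar.head())
```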

Rules and Procedures

1.      You may work in groups of up to two. Each student may be on one team only.

2.      You may use any modeling technique you like, either parametric or nonparametric.

3.      You may not use information outside the data provided on the class website (no reverse engineering of Netflix data, please).

4.      The competition will be run through an automated system; you must follow the steps detailed below, or your submissions will not count:

a.       Group Google sheet setup:

    i. Go to Google Sheets.

    ii. Set up a new Google sheet for group submissions.

    iii. Generate a shareable, readable link (please follow these instructions carefully!): press the SHARE button at the top right of the sheet (in the Hebrew interface it may be at the top left…) and copy the shareable URL. Make sure the sheet is shared with "Anyone with the link" and not restricted.

b.      Competition signup: Go to the signup Google form and follow the instructions exactly:

    i. Enter your group name.

    ii. Enter member names and emails as instructed (comma separated).

    iii. Enter the shareable URL of the Google sheet you created above.

    iv. You will receive a confirmation email with your registration details.

c.       Every week, on Sunday at noon, the automated script will run over the Google sheets of all registered groups:

    i. The first column of the main sheet must contain exactly 2931 numbers; these are your predictions for that week. The column must not have a header or any other content: only the 2931 numbers, in cells 1-2931 (a format-check sketch in Python appears after this list).

    ii. Each group will receive an email with that week's score. The email will state the current score, whether the team has reached the bonus or double-bonus threshold, and whether it is the current leader. It will also report if the link is unreadable or the format is wrong (as far as the script can tell).

    iii. All other columns, additional sub-sheets, etc. make no difference (so, for example, you can store your old predictions in other columns).

5.      The current best performance on the test set will be posted on the class homepage.

6.      The competition will end on Sunday 15/1/2023 at noon. Each team’s best score will be their competition score.

7.      The team with the best performance will give a brief (~15min) description of their methods in class on Tuesday, 17/1/2023.
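
Since a malformed column means a lost weekly run, a quick self-check before pasting predictions into the sheet can help (see rule 4c above). The sketch below is only one possible workflow: the constant prediction is a placeholder for whatever your model produces, and writing a header-less single-column file for pasting is an assumption, not a requirement.

```python
import numpy as np

# Placeholder: replace with your model's predictions for the 2931 test users
test_pred = np.full(2931, 3.7)

# Checks mirroring the submission rules above
assert test_pred.shape == (2931,), "need exactly 2931 predictions"
assert np.all(np.isfinite(test_pred)), "no NaN or inf values allowed"

# Ratings lie in [1, 5], so clipping predictions to that range cannot hurt RMSE
test_pred = np.clip(test_pred, 1, 5)

# One value per line, no header: ready to paste into the sheet's first column
np.savetxt("submission.txt", test_pred, fmt="%.4f")
```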

Grading

The competition is optional and can only add a bonus to your grade. There are three types of bonus you can get:

·         Any team that beats the simple linear regression performance on the test set (RMSE=0.7769) by more than 0.0169, i.e., achieves RMSE below 0.76, will get a bonus of 5 points on the final class grade (an illustrative baseline sketch follows these bullets).

·         The winning team, after giving a brief but informative talk in class, will receive an additional 5-point bonus on the grade.

·         Any team that beats a much tougher threshold, tentatively set at RMSE=0.745, will get a "double bonus" of another 5 points.
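
For orientation only: the exact model behind the RMSE=0.7769 reference is not specified here, but one plausible reading is ordinary least squares on the 14 fully-observed rating columns. The sketch below works under that assumption and uses a held-out split to estimate RMSE; the random stand-in data keep it self-contained, and the real training matrices should be substituted.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Stand-in data; in practice use the first 14 columns of the training ratings
# and the "Miss Congeniality" ratings as the outcome
rng = np.random.default_rng(0)
X14 = rng.integers(1, 6, size=(10000, 14)).astype(float)
y = rng.integers(1, 6, size=10000).astype(float)

X_tr, X_va, y_tr, y_va = train_test_split(X14, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_tr, y_tr)
pred = np.clip(model.predict(X_va), 1, 5)
print("validation RMSE:", np.sqrt(np.mean((y_va - pred) ** 2)))
```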

Some Points to Ponder

·         Can we gain from treating the ratings as categorical? Ordinal?

·         How should missing ratings be handled? Dummy variables? (An illustrative sketch follows this list.)

·         What would be a good distance measure for k-NN and related methods?

·         Are there efficient k-NN approaches for a data set of this size?

·         How should dates be used? Can we use the release year of the movie?
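
As one concrete illustration of the dummy-variable question above (an option to explore, not a recommendation): impute missing ratings and add a 0/1 indicator per movie recording which entries were actually observed.

```python
import numpy as np
import pandas as pd

# Tiny stand-in ratings matrix with NaN marking missing entries
ratings = pd.DataFrame({"m1": [4, np.nan, 1], "m2": [np.nan, 5, 2], "m3": [3, 2, np.nan]})

# 0/1 indicators: did this user rate this movie at all?
missing = ratings.isna().astype(int).add_suffix("_missing")

# One simple imputation: fill each movie's missing entries with its column mean
filled = ratings.fillna(ratings.mean())

# Combined feature matrix: imputed ratings plus missingness dummies
features = pd.concat([filled, missing], axis=1)
print(features)
```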

Good luck!