Projects
In the course project, groups of three students will work together to create classifiers for an in-class Kaggle prediction competition. The competition training data are available from the uci-cs178-win21 Kaggle site. To give your Kaggle account permission to join the in-class competition and upload results, use the URL posted on Piazza.
Kaggle Competition
The Problem
Our competition data are satellite-based measurements of cloud temperature (infrared imaging), used to predict the presence or absence of rainfall at a particular location. The data are courtesy of the UC Irvine Center for Hydrometeorology and Remote Sensing, and have been pre-processed to extract features corresponding to a model they use actively for predicting rainfall across the globe. Each data point corresponds to a particular latitude/longitude location where the model thinks there might be rain; the extracted features include information such as the IR temperature at that location, and information about the corresponding cloud (area, average temperature, etc.). The target value is a binary indicator of whether there was rain (measured by radar) at that location; you will notice that the data are slightly imbalanced (positives make up about 30% of the training data).
The Evaluation
Scoring of predictions is done using AUC, the area under the ROC (receiver operating characteristic) curve. This measures your learner's average performance across many levels of sensitivity to positive data. You will therefore likely do better if, instead of simply predicting the target class, you also report your confidence in that class value, so that the ROC curve can be evaluated at different levels of specificity. To do so, report your confidence that it is raining (class +1) as a real number for each test point. Your predictions will then be sorted in order of confidence, and the ROC curve evaluated.
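To see why real-valued confidences help, here is a minimal sketch using scikit-learn's `roc_auc_score` (one convenient AUC implementation; the labels and scores below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical true labels for six test points.
y_true = np.array([0, 0, 1, 1, 0, 1])

# Hard 0/1 predictions throw away ranking information...
hard_preds = np.array([0, 1, 1, 1, 0, 0])

# ...while real-valued confidences let the ROC curve be swept
# across every possible threshold.
soft_preds = np.array([0.1, 0.4, 0.8, 0.9, 0.3, 0.6])

print(roc_auc_score(y_true, hard_preds))  # coarse: only one operating point
print(roc_auc_score(y_true, soft_preds))  # → 1.0 (every positive outranks every negative)
```

The same classifier can thus earn a much higher AUC simply by submitting its probability estimates instead of thresholded labels.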
Using Kaggle
Download the training features X_train, the training category labels Y_train, and the test features X_test. You will learn classifiers using the training data, make predictions based on the test features, and upload your predictions to Kaggle for evaluation. Kaggle will then score your predictions, and report your performance on a random subset of the test data to place your team on the public leaderboard. After the competition, the score on the remainder of the test data will be used to determine your final standing; this ensures that your scores are not affected by overfitting to the leaderboard data.
Kaggle will limit you to at most 2 uploads per day, so you cannot simply upload every possible classifier and check their leaderboard quality. You will need to do your own validation, for example by splitting the training data into multiple folds, to tune the parameters of learning algorithms before uploading predictions for your top models. The competition closes (uploads will no longer be accepted or scored) on March 17, 2021 at 11:59pm Pacific time.
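A simple way to do this local validation is stratified k-fold cross-validation, scoring each fold by AUC. The sketch below uses scikit-learn and synthetic data as a stand-in for the real X_train / Y_train arrays (the model and fold count are illustrative choices, not requirements):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the real training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 14))
y = (X[:, 0] + rng.normal(size=500) > 0.5).astype(int)

# 5-fold cross-validation, scoring each held-out fold by AUC.
aucs = []
for tr, va in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    scores = model.predict_proba(X[va])[:, 1]
    aucs.append(roc_auc_score(y[va], scores))

print(f"mean validation AUC: {np.mean(aucs):.3f}")
```

Tuning hyperparameters against this local estimate, and uploading only your best candidates, makes much better use of the two daily submissions.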
Submission Format
Your submission must be a file containing two columns separated by a comma. The first column should be the instance number (a positive integer), and the second column should be the score for that instance (the probability that it belongs to class +1). The first line of the file should be "ID,Prob1", the names of the two columns. We have released a sample submission file, containing random predictions, named Y_random.txt.
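Writing a file in this format takes only a few lines. A sketch, assuming `scores` holds your confidence values (one per test point, in the same order as X_test; the values below are placeholders):

```python
import numpy as np

# Hypothetical confidence scores, one per test instance.
scores = np.array([0.82, 0.15, 0.47])

# Header "ID,Prob1", then one "id,score" row per instance, IDs starting at 1.
with open("Y_submit.txt", "w") as f:
    f.write("ID,Prob1\n")
    for i, s in enumerate(scores, start=1):
        f.write(f"{i},{s}\n")
```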
Forming a Project Team
Students will work in teams of three students to complete the project. We encourage you to start looking for teammates now; one option is to use the "Search for Teammates!" page on Piazza. In exceptional circumstances, if you are not able to form a team of three students, smaller teams are allowed. However, the same grading standards are applied to all teams, so smaller teams should expect a larger workload.
Once you've identified your teammates, on the Team tab in Kaggle, merge with your teammates to form an integrated team. (We know that merging may make your individual HW4 score disappear from the leaderboard, and will not penalize you for this when grading.) You are required to form a merged team, and report the team members to the course staff, by March 4, 2021. After this date, you may not use individual Kaggle accounts to submit predictions for evaluation, only your merged team account.
To receive credit for forming your team on time, you must submit the "Group Project Team" assignment on Gradescope. One team member should complete this assignment, and Gradescope will then allow that person to select the other team members. Use the "View or edit group" option on Gradescope to be sure this is done correctly. Do not complete the assignment multiple times; only one team member should submit.
Project Requirements
Each project team will learn several different classifiers for the Kaggle data, as well as an ensemble “blend” of them, to try to predict class labels as accurately as possible. We expect you to experiment with at least three (more is good) different types of classification models. Suggestions include:
- K-Nearest Neighbors. KNN models for this data will need to overcome two issues: the large number of training & test examples, and the data dimension. As noted in class, distance-based methods often do not work well in high dimensions, so you may need to perform some kind of feature selection process to decide which features are most important. Also, computing distances between all pairs of training and test instances may be too slow; you may need to reduce the number of training examples somehow (for example by clustering), or use more efficient algorithms to find nearest neighbors. Finally, the right “distance” for prediction may not be Euclidean in the original feature scaling (these are raw numbers); you may want to experiment with scaling features differently.
- Linear models. Since you have relatively few input features but a large amount of training data, you will probably need to define non-linear features for top performance, for example using polynomials or radial basis functions.
- Kernel methods. libSVM is one efficient implementation of SVM training algorithms. But like KNN classifiers, SVMs (with non-linear kernels) can be challenging to learn from large datasets, and some data pre-processing or subsampling may be required.
- Random forests. You will explore decision tree classifiers for this data on homework 4, and random forests would be a natural way to improve accuracy.
- Boosted learners. Use AdaBoost, gradient boosting, or another boosting algorithm to train a boosted ensemble of some base learner (perceptrons, shallow decision trees, Gaussian naive Bayes models, etc.).
- Neural networks. The key to learning a good NN model on these data will be to ensure that your training algorithm does not become trapped in poor local optima. You should monitor its performance across backpropagation iterations on training/validation data, and verify that predictive performance improves to reasonable values. Start with few layers (2-3) and moderate numbers of hidden nodes (100-1000) per layer, and verify improvements over baseline linear models.
- Other. You tell us! Apply another class of learners, or a variant or combination of methods like the above. You can use existing libraries or modify course code. The only requirement is that you understand the model you are applying, and can clearly explain its properties in the project report.
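As a starting point, several of the suggestions above can be sketched in a few lines with scikit-learn (one option among many; you are free to use any library or course code). Synthetic data stands in for the real training set here; note the feature scaling before KNN and the polynomial expansion for the linear model, as discussed above:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the real training data.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 14))
y = (X[:, 0] - X[:, 1] ** 2 + rng.normal(size=600) > 0).astype(int)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.25, random_state=1)

models = {
    "knn (scaled)": make_pipeline(StandardScaler(), KNeighborsClassifier(15)),
    "linear + poly": make_pipeline(PolynomialFeatures(2), StandardScaler(),
                                   LogisticRegression(max_iter=2000)),
    "random forest": RandomForestClassifier(200, random_state=1),
    "grad. boosting": GradientBoostingClassifier(random_state=1),
}

val_auc = {}
for name, m in models.items():
    m.fit(Xtr, ytr)
    val_auc[name] = roc_auc_score(yva, m.predict_proba(Xva)[:, 1])
    print(f"{name:15s} validation AUC = {val_auc[name]:.3f}")
```

The specific models and hyperparameters here are illustrative defaults, not recommendations; the point is that each learner exposes `predict_proba`, so all of them can be compared on the same AUC validation metric before you decide which to tune further.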
For each learner, you should do enough work to make sure that it achieves “reasonable” performance, with accuracy similar to (or better than) baselines like logistic regression or decision trees. Then, take your best learned models, and combine them using a blending or stacking technique. This could be done via a simple average/vote, or a weighted vote based on another learning algorithm. Feel free to experiment and see what performance gains are possible.
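The simplest blend is an unweighted average of each model's predicted probabilities on held-out data. A sketch with two illustrative base models (again using synthetic stand-in data; a weighted vote or a stacking learner trained on these columns would follow the same pattern):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for the real training data.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 14))
y = (X[:, 0] + X[:, 1] + rng.normal(size=600) > 0).astype(int)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.25, random_state=2)

# Two base models' confidence scores on the validation set.
p1 = LogisticRegression(max_iter=1000).fit(Xtr, ytr).predict_proba(Xva)[:, 1]
p2 = RandomForestClassifier(200, random_state=2).fit(Xtr, ytr).predict_proba(Xva)[:, 1]

# Simple blend: average the two models' confidences.
blend = 0.5 * p1 + 0.5 * p2

for name, p in [("logistic", p1), ("forest", p2), ("blend", blend)]:
    print(f"{name:8s} validation AUC = {roc_auc_score(yva, p):.3f}")
```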
Project Report
By March 18, 2021, each team must submit a single two-page PDF document describing your learned classifiers and overall prediction ensemble. Please include:
- A table listing each model, as well as your best blended/stacked model ensembles, and their performance on training, validation, and leaderboard data. Also include your final performance on the Private Leaderboard, which becomes visible after the Kaggle competition closes, and your Kaggle team name.
- For each model, a paragraph or two describing: what features you gave it (raw inputs, selected inputs, non-linear feature expansions, etc.); how it was trained (learning algorithm and software source); and key hyperparameter settings (plus your approach to choosing those settings).
- A paragraph or two describing your overall prediction ensemble: how you combined the individual models, and why you picked that technique.
- A conclusion paragraph highlighting the methods/algorithms that you think worked particularly well for this data, the methods/algorithms that worked poorly, and your hypotheses as to why.
Your project grade will be based mostly on the quality of your written report; groups whose final prediction accuracy is mediocre may still receive a high grade if their results are described and analyzed carefully. However, some additional points will also be awarded to the teams at the top of the leaderboard.
One team member should upload your PDF to Gradescope, and Gradescope will then allow that person to select the other team members. Use the "View or edit group" option on Gradescope to be sure this is done correctly. Do not upload multiple copies of the project report; only one team member should upload.