Group Final Project

The ultimate goal of this project is to perform exploratory analysis on large datasets using the technologies discussed in class and to discuss your findings.

The Final Project is worth 30% of the total grade for the course.

There are six assignments associated with the final project, each described below.

Dataset Selection

Your choice of dataset is important to the success of your project. Part of your final project grade depends on the novelty of your application, so try to pick a dataset that you expect to have interesting characteristics, or on which you can perform interesting analysis.

Your dataset must meet the following criteria:

You may choose your own dataset, or use one of the datasets from the sources below:

The course staff will give you space to place your dataset (either on the cluster or in S3). Your group will be responsible for loading the dataset onto the provided space.

Additionally, you must ensure that you have the appropriate rights to use your selected dataset. In practice, this means confirming that the dataset's license permits educational or non-commercial use.

You may not use any of the datasets used in the course’s MPs unless you get explicit approval from the course staff and your plan is sufficiently different from the work done during the MP assignments.

Group Selection

You are free to choose your own group members, and we will create a Piazza post for finding group members. The Google form where you will commit to your team selection will be sent out in Week 8.

The form is due on Saturday, March 10th, at 11:59pm. Your group must have either 3 or 4 students (no fewer, no more).

Following your group selection submission, the course staff will create a GitLab repository for your group.

Project Proposal

The proposal should include your team’s choice of datasets and a detailed plan of action for evaluating the data. We highly recommend examining your data while writing the proposal so that you are familiar with it.
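One way to get familiar with a dataset before writing is to profile a small sample of it. The following is a minimal sketch, assuming your data is a CSV file with a header row; the function name `profile_csv`, the `sample_rows` parameter, and the in-memory sample are hypothetical and only for illustration. It uses nothing beyond Python's standard library.

```python
import csv
import io
from collections import Counter


def profile_csv(handle, sample_rows=1000):
    """Summarize the first `sample_rows` data rows of a CSV with a header row.

    For each column, report how many sampled values were non-empty,
    how many were empty, and how many distinct non-empty values appeared.
    """
    reader = csv.DictReader(handle)
    columns = list(reader.fieldnames or [])
    non_empty = Counter()
    empty = Counter()
    distinct = {name: set() for name in columns}
    rows = 0
    for row in reader:
        if rows >= sample_rows:
            break
        rows += 1
        for name in columns:
            # Short rows yield None for missing fields; treat those as empty.
            value = (row.get(name) or "").strip()
            if value:
                non_empty[name] += 1
                distinct[name].add(value)
            else:
                empty[name] += 1
    return {
        "rows_sampled": rows,
        "columns": {
            name: {
                "non_empty": non_empty[name],
                "empty": empty[name],
                "distinct": len(distinct[name]),
            }
            for name in columns
        },
    }


# Hypothetical in-memory sample standing in for a real dataset file.
sample = io.StringIO("city,temp\nUrbana,12\nChicago,\nUrbana,15\n")
summary = profile_csv(sample)
print(summary)
```

For a real file, you would pass an open file handle instead, for example `open(path, newline="")`. Columns with many empty values or very few distinct values are worth noting in the proposal, since they affect which analyses are feasible.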

In your proposal, you should address:

Your group’s project proposal is due on Friday, March 16th, at 11:59pm.

Your proposal should be detailed enough to make clear the direction you will take for your final project. You may deviate from your proposal if you find that to be necessary.

The proposal will be submitted through Moodle. Only one member per group needs to submit the proposal.

Project Report

The project report should address the following topics:

Your project report will be graded with respect to the following rubric.

There are no strict lower or upper bounds on the length of the report. However, as a rough estimate, your group should aim for about 1500 words. Feel free to include visuals and appendices as needed.

The final report is due on Reading Day, Wednesday, May 2nd, at 11:59pm, and is submitted through your group's GitLab repository.


Project Presentation

The presentation should address your group’s methodologies, the technologies you used, and your findings. Essentially, you should walk the audience through your project report and present high-level findings. The presentation should be 8 to 10 minutes long. All group members should be involved in the presentation in some fashion.

The following dates are available for presentations:

The sign-up form will be given out a few weeks prior to these dates. Accommodations will be made as necessary for valid conflicts, and the course staff will try to schedule groups so that all group members are able to present on their assigned presentation day.


Peer Evaluation

There will be two types of evaluation:

You are required to evaluate (and will be evaluated by) your group members. This form is due on Wednesday, May 2nd, at 11:59pm.

You are also required to attend class on two presentation dates and fill out peer evaluations for other groups’ presentations. You will receive credit for evaluating your peers, and your own grade will depend in part on your peers’ evaluations.