
Batch 2 Capstone Project

This capstone project is meant to individually test students for the following:

  • Their grasp of the model development workflow
  • Their ability to do basic EDA
  • Their ability to train and evaluate a predictive model
  • Their ability to professionally communicate their findings
  • Their ability to understand the paradigm of providing predictions on unseen data
    • Robustness to unseen data that could be quite dirty
    • Providing predictions over time rather than kaggle-style all at once

This is quite different from the previous specializations because students will not be able to submit as many times as they want, nor work in teams where they could mask a lack of understanding of the material by brute-force search or by blending into the crowd.

This specialization will be the primary way in which we certify students as entry-level data scientists, so it is very important that this capstone is both difficult and fair.

Another thing to note is that the EDA and model development portion of this project is not the primary focus of this specialization. Students should already know how to do this, and it should NOT take long to do EDA and train a model that performs quite well.

Components

  • A single BLU describing how to deploy a model to Heroku while saving observations to a database
  • 1 binary classification dataset that is split into 3 parts (see image below for details)
  • An initial report that they must submit describing their EDA on the dataset and the model that they will deploy
  • A simulator that feeds the test set and some true outcomes to the students' deployed models over the course of a week or two (see the sketch after this list)
  • A final report describing the test set and any updates to the model that they deployed
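Since the simulator is the one component that does not exist yet, here is a minimal sketch of the shape it could take, assuming each student app exposes `/predict` and `/update` endpoints that accept JSON. The endpoint names, payload fields, and URLs below are illustrative assumptions, not decisions.

```python
import json
import time

import pandas as pd
import requests

# Hypothetical simulator sketch: drip-feeds test observations to each
# student's deployed /predict endpoint, then reveals some true outcomes
# via /update. Endpoint paths and payload shapes are assumptions.

STUDENT_ENDPOINTS = [
    "https://student-one.herokuapp.com",
    "https://student-two.herokuapp.com",
]

def feed_observations(X_test: pd.DataFrame, pause_seconds: float = 60.0) -> None:
    """POST one observation at a time to every student app."""
    for obs_id, row in X_test.iterrows():
        # to_json/loads round-trip converts numpy types into plain JSON types
        payload = {"id": str(obs_id), "observation": json.loads(row.to_json())}
        for base_url in STUDENT_ENDPOINTS:
            try:
                requests.post(f"{base_url}/predict", json=payload, timeout=5)
            except requests.RequestException:
                # A dead or misbehaving app should not halt the simulation.
                pass
        time.sleep(pause_seconds)

def feed_outcomes(y_known: pd.Series) -> None:
    """Reveal a subset of true labels so students can monitor performance."""
    for obs_id, label in y_known.items():
        payload = {"id": str(obs_id), "true_class": int(label)}
        for base_url in STUDENT_ENDPOINTS:
            try:
                requests.post(f"{base_url}/update", json=payload, timeout=5)
            except requests.RequestException:
                pass
```

In practice this would run on a schedule (cron or similar) so that observations arrive over the course of the one-to-two-week window rather than all at once.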

The BLU

This can probably be the same learning material as last year with a few minor updates. An update for Windows users, or a requirement to use the Windows Subsystem for Linux, is almost certainly needed.
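For reference, here is a minimal sketch of the kind of app the BLU would walk students through, assuming a pickled scikit-learn pipeline served with Flask and observations persisted to SQLite. File names, routes, and the table layout are illustrative assumptions.

```python
import json
import pickle
import sqlite3

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Assumed artifact produced during training.
with open("pipeline.pickle", "rb") as f:
    pipeline = pickle.load(f)

# Store every incoming observation so predictions can be audited later.
conn = sqlite3.connect("observations.db", check_same_thread=False)
conn.execute(
    "CREATE TABLE IF NOT EXISTS observations "
    "(id TEXT PRIMARY KEY, observation TEXT, proba REAL, true_class INTEGER)"
)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    obs = pd.DataFrame([payload["observation"]])
    proba = float(pipeline.predict_proba(obs)[0, 1])
    conn.execute(
        "INSERT OR REPLACE INTO observations (id, observation, proba) VALUES (?, ?, ?)",
        (payload["id"], json.dumps(payload["observation"]), proba),
    )
    conn.commit()
    return jsonify({"proba": proba})

@app.route("/update", methods=["POST"])
def update():
    payload = request.get_json()
    conn.execute(
        "UPDATE observations SET true_class = ? WHERE id = ?",
        (payload["true_class"], payload["id"]),
    )
    conn.commit()
    return jsonify({"status": "ok"})

if __name__ == "__main__":
    app.run()
```

On Heroku the app would likely sit behind a production server such as gunicorn and use a hosted database instead of a local SQLite file, but the structure stays the same.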

The dataset

This should be a binary classification dataset that we can logically split into the following parts:

With the following additional requirements

  • X_test_1 and X_test_2 must contain both numerical and categorical values that X_train did not have
  • Model performance on X_test_2 must be demonstrably better, regardless of the model, when it is re-trained on X_test_1 and y_test_1
  • There must be noticeable shifts in the populations or distributions of at least 2 features that can be detected using statistical tests (see the sketch after this list).
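To sanity-check the last requirement while preparing the dataset, something like the following could be used to verify that the planted shifts are actually detectable. Feature names are placeholders for whatever dataset we end up choosing.

```python
import pandas as pd
from scipy import stats

def numerical_shift_pvalue(train: pd.Series, test: pd.Series) -> float:
    """Two-sample Kolmogorov-Smirnov test on a numerical feature."""
    return stats.ks_2samp(train.dropna(), test.dropna()).pvalue

def categorical_shift_pvalue(train: pd.Series, test: pd.Series) -> float:
    """Chi-squared test on category frequencies across the two splits."""
    counts = pd.DataFrame(
        {"train": train.value_counts(), "test": test.value_counts()}
    ).fillna(0)
    chi2, pvalue, dof, expected = stats.chi2_contingency(counts.values)
    return pvalue

# Example usage (column names are hypothetical):
# numerical_shift_pvalue(X_train["age"], X_test_2["age"])
# categorical_shift_pvalue(X_train["region"], X_test_2["region"])
# A very small p-value means the shift is detectable, as required.
```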

The reports

Both reports should be of professional quality. This means that in order for a report to receive a passing grade, we should feel comfortable submitting it to our boss or to a client. Developing guidelines for these reports that give clear guidance without prescribing a cookie-cutter recipe is a non-trivial task that will require quite a bit of judgement.

Evaluation

God willing, by the time we finish, we will have 30-40 more students who have submitted all of the material, which must be graded quickly and with quality. By quickly I mean that it cannot take too much time per report, because of limited instructor hours. By quality I mean grading must be harsh but fair, since we cannot certify anyone who does not deserve it, while at the same time making sure that there are no surprises for the students.

In order to do this, we must have clear guidelines for the reports that are clearly communicated to both students and instructors. We must also derive a simple and efficient grading rubric from these guidelines that allows instructors to quickly evaluate all of the material.

Schedule

Taken from this spreadsheet, here is a breakdown of what components of the capstone will be covered on which days:

With a work breakdown estimate of the following

Distribution of instructor work

  1. Sam - Selection and preparation of dataset
  2. Manu - Review and prep of BLU
  3. Hugo L. and Pedro F. - Development of reporting guidelines and evaluation criteria
  4. ? - Deployment of simulator
  5. ? - Collection of reports from students
    • This one is just
  6. ? - Calculating the roc_auc_score of each student's model (see the sketch below)
  7. As many of us as possible - Grading the reports
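For item 6, here is a minimal sketch of the scoring step, assuming each student's app logged its predicted probabilities keyed by observation id, and that we hold the full set of true labels. Column names and file layout are assumptions.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def score_student(predictions_csv: str, y_true: pd.Series) -> float:
    """Score one student's logged predictions against the held-out labels."""
    preds = pd.read_csv(predictions_csv, index_col="id")           # expects a "proba" column
    joined = preds.join(y_true.rename("true_class"), how="inner")  # align on observation id
    return roc_auc_score(joined["true_class"], joined["proba"])
```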