This capstone project is meant to individually test students for the following:
- Their grasp of the model development workflow
- Their ability to do basic EDA
- Their ability to train and evaluate a predictive model
- Their ability to professionally communicate their findings
- Their ability to understand the paradigm of providing predictions on unseen data
- The robustness of their models to unseen data that may be quite dirty
- Their ability to provide predictions over time, rather than all at once, Kaggle-style
This is quite different from the previous specializations: students will not be able to submit as many times as they want, nor work in teams where they can mask a lack of understanding of the material by brute-force search or by blending into a crowd.
This specialization will be the primary way in which we certify them as entry-level data scientists, so it is very important that this capstone is both difficult and fair.
Another thing to note is that the EDA and model development portion of this project is not the primary focus of this specialization. The students should already know how to do this, and it should NOT take them long to do the EDA and train a model that performs quite well.
The capstone will consist of the following components:
- A single BLU describing how to deploy a model to Heroku while saving incoming observations to a database (a minimal sketch of such a service follows this list)
- 1 binary classification dataset that is split into 3 parts (see the split described below)
- An initial report that they must submit, describing their EDA on the dataset and the model that they will deploy
- A simulator that feeds the test set and some true outcomes to the students' deployed models over the course of a week or two (also sketched after this list)
- A final report describing the test set and any updates to the model that they deployed
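
To make the BLU's deliverable concrete, here is a minimal sketch of the kind of service it could teach, assuming Flask, scikit-learn, and SQLite. The endpoint name, table schema, and `model.pickle` file are all hypothetical, and on Heroku itself the database would have to be a managed one such as Postgres, since the dyno filesystem is ephemeral.

```python
import json
import pickle
import sqlite3

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical: a model trained offline on X_train and serialized to disk.
with open("model.pickle", "rb") as f:
    model = pickle.load(f)

# SQLite keeps the sketch self-contained; a real Heroku deployment would
# point at a managed Postgres instance instead.
conn = sqlite3.connect("observations.db", check_same_thread=False)
conn.execute(
    "CREATE TABLE IF NOT EXISTS observations (id TEXT, payload TEXT, proba REAL)"
)

@app.route("/predict", methods=["POST"])
def predict():
    obs = request.get_json()
    features = pd.DataFrame([obs["observation"]])
    proba = float(model.predict_proba(features)[0, 1])
    # Save every incoming observation so the model can be re-trained
    # later, once true outcomes start arriving.
    conn.execute(
        "INSERT INTO observations VALUES (?, ?, ?)",
        (obs["id"], json.dumps(obs["observation"]), proba),
    )
    conn.commit()
    return jsonify({"proba": proba})

if __name__ == "__main__":
    app.run()
```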
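The simulator, in turn, could be little more than a script that trickles observations to each deployed app and later reveals the `y_test_1` outcomes. The `/predict` and `/update` endpoints, the payload shape, and the file names below are assumptions for illustration, not a spec:

```python
import time

import pandas as pd
import requests

student_urls = ["https://some-student-app.herokuapp.com"]  # hypothetical

X_test_1 = pd.read_csv("X_test_1.csv", index_col="id")
y_test_1 = pd.read_csv("y_test_1.csv", index_col="id")["outcome"]

# Feed observations one at a time to every deployed model.
for obs_id, row in X_test_1.iterrows():
    payload = {"id": str(obs_id), "observation": row.to_dict()}
    for url in student_urls:
        try:
            requests.post(f"{url}/predict", json=payload, timeout=10)
        except requests.RequestException:
            pass  # one student's app being down should not halt the run
    time.sleep(1)  # in the real run, observations trickle in over days

# Later, reveal the true outcomes for X_test_1 so students can re-train.
for obs_id, outcome in y_test_1.items():
    for url in student_urls:
        requests.post(
            f"{url}/update",
            json={"id": str(obs_id), "true_class": int(outcome)},
            timeout=10,
        )
```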
This can probably be the same learning material as last year, with a few minor updates. An update for Windows users, or a requirement to use the Windows Subsystem for Linux, is almost certainly required.
This should be a binary classification dataset that we can logically split into the following parts:
- `X_train` and `y_train`, given to the students up front for EDA and model development
- `X_test_1` and `y_test_1`, fed to the deployed models by the simulator, with the true outcomes revealed along the way
- `X_test_2`, fed to the deployed models afterwards, with `y_test_2` held back for scoring

With the following additional requirements:
- `X_test_1` and `X_test_2` must contain both numerical and categorical values that `X_train` did not have
- Model performance on `X_test_2` must be demonstrably better, regardless of the model, if it is re-trained on `X_test_1` and `y_test_1`
- There must be noticeable shifts in the populations or distributions of 2 features that can be detected using statistical tests (see the sketch after this list)
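
As a sanity check while preparing the dataset, the shift requirement can be verified with off-the-shelf tests from scipy: a two-sample Kolmogorov-Smirnov test for a numerical feature and a chi-squared test on a contingency table for a categorical one. The feature names here are placeholders:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ks_2samp

X_train = pd.read_csv("X_train.csv")
X_test_2 = pd.read_csv("X_test_2.csv")

# Numerical feature: the two-sample KS test compares the two
# empirical distributions directly.
ks_stat, ks_p = ks_2samp(X_train["numerical_feature"], X_test_2["numerical_feature"])
print(f"KS test: statistic={ks_stat:.3f}, p-value={ks_p:.3g}")

# Categorical feature: a chi-squared test on the table of category
# counts per split. A small p-value means the shift is detectable.
values = pd.concat([X_train["categorical_feature"], X_test_2["categorical_feature"]])
splits = np.repeat(["train", "test_2"], [len(X_train), len(X_test_2)])
chi2, chi_p, _, _ = chi2_contingency(pd.crosstab(values.to_numpy(), splits))
print(f"Chi-squared test: statistic={chi2:.3f}, p-value={chi_p:.3g}")
```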
Both reports should be professional quality: in order to receive a passing grade, we should feel comfortable submitting the report to our boss or to a client. Developing guidelines for these reports that give clear guidance without just handing out a cookie-cutter recipe is a non-trivial task that will require quite a bit of judgement.
God willing, by the time we finish, we will have 30-40 more students who have submitted all of the material, and it must be graded quickly and with quality. By quickly, I mean that it cannot take too much time per report, because instructor hours are limited. By quality, I mean that grading must be harsh but fair: we cannot certify anyone who does not deserve it, while at the same time making sure that there are no surprises for the students.
In order to do this, we must have clear guidelines for the reports that are clearly communicated to both the students and the instructors. We must also derive a simple and efficient grading rubric from these guidelines that allows instructors to quickly evaluate all of the material.
Taken from this spreadsheet, here is a breakdown of which components of the capstone will be covered on which days:
With the following work breakdown estimate:
- Sam - Selection and preparation of dataset
- Manu - Review and prep of BLU
- Hugo L. and Pedro F. - Development of reporting guidelines and evaluation criteria
- ? - Deployment of simulator
- ? - Collection of reports from students
  - This one is just
- ? - Calculating the `roc_auc_score` of each of the student models (sketched below)
- As many of us as possible - Grading the reports
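
The scoring step itself is short once the predictions are collected. A minimal sketch, assuming each student's predicted probabilities end up in a CSV keyed by observation id (the file layout and column names are assumptions):

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical layout: the held-back labels plus one predictions file
# per student, all indexed by observation id.
y_test_2 = pd.read_csv("y_test_2.csv", index_col="id")["outcome"]

for student in ["student_a", "student_b"]:  # placeholder identifiers
    preds = pd.read_csv(f"predictions/{student}.csv", index_col="id")["proba"]
    # Join on observation id so missing or out-of-order rows cannot
    # silently skew the score.
    joined = pd.concat([y_test_2.rename("y_true"), preds], axis=1).dropna()
    print(student, roc_auc_score(joined["y_true"], joined["proba"]))
```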