Transit peer group research: explore possible feature columns #1675
base: main
Conversation
nbviewer URLs for impacted notebooks:
```python
preprocessor = ColumnTransformer(
    [
        ("numerical", StandardScaler(), num_cols),
        ("categorical", OneHotEncoder(drop="first", sparse_output=False), cat_cols),
```
Investigate the drop parameter options. When I was in econ school, we learned that best practice is to drop the value that shows up most often. Presumably we can leave the drop parameter empty and drop it ourselves too.
After researching: `OneHotEncoder(drop="first")` would be equivalent to `pd.get_dummies(drop_first=True)`, in that if you encode a feature column (service: PT, DO, TX, TN into service_DO, service_PT, etc.), drop="first" drops the first column alphabetically. This is to prevent the dummy variable trap with collinear variables? That part I'm still trying to understand.

Here is what it looks like for this example. service has 4 unique values ['PT', 'DO', 'TX', 'TN']. drop="first" creates the encoded columns 'categorical__service_PT', 'categorical__service_TN', and 'categorical__service_TX'; 'categorical__service_DO' is dropped (it drops the first column alphabetically).

The resulting feature array would look like:

service_PT | service_TX | service_TN
---|---|---
0 | 0 | 0
1 | 0 | 0
0 | 1 | 0
0 | 0 | 1

where [0, 0, 0] implies service_DO.

It also looks like we can tell drop to remove whatever value we want, by passing it an array-like of one category per feature instead of "first". So we can drop the most frequent value ourselves. Looking at the data, we can probably drop service_PT.
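On the dummy variable trap question above: the k one-hot columns always sum to 1, so keeping all k makes them perfectly collinear with an intercept term; dropping one removes that redundancy. And per the scikit-learn docs, drop does accept an array-like with one category per encoded feature, so we can pick which value is dropped. A minimal sketch (the toy frame below is illustrative, not from the notebook):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame with the four service values discussed above.
df = pd.DataFrame({"service": ["PT", "DO", "TX", "TN"]})

# drop takes one category per encoded feature, so we can drop the
# most frequent value ("PT" here) instead of the alphabetical first.
enc = OneHotEncoder(drop=["PT"], sparse_output=False)
encoded = enc.fit_transform(df[["service"]])

print(enc.get_feature_names_out())  # ['service_DO' 'service_TN' 'service_TX']
```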
```python
    ]
)

data_fit = data.copy()  # why do i need to copy/clone?

data_fit["cluster_name"] = pipeline.fit_predict(data_fit)
```
Does this mean the cluster name is the predicted value?
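For what it's worth: in scikit-learn, `fit_predict` on a clustering pipeline returns an integer cluster label for each row, so `cluster_name` would hold the predicted cluster index rather than a target value (assuming a KMeans-style final step, which the notebook appears to use). A minimal sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0], [0.1], [5.0], [5.1]])
labels = KMeans(n_clusters=2, random_state=0, n_init="auto").fit_predict(X)
print(labels)  # e.g. [0 0 1 1]: one integer cluster label per row
```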
- Trolleybus is fixed guideway: it draws power through poles attached to overhead wires.
- You should simply group the purchased transportation taxi/TNC with purchased transportation generally.
- It was interesting to see in the multi-year data that Metro is in one cluster pre-pandemic and another post-pandemic. This, plus the correlation matrices, tells me that we should consolidate to one year of data or aggregate across years, but we definitely shouldn't leave all years loose. To be explored more in Explore aggregating numerical and categorical data for multiple years #1683.
The last thing I'd like to see on this PR (a rough sketch follows the list below):
- Clustering on 2023 data only
- Re-categorize the modes into fixed-route, non-fixed-route, and other
- Re-categorize the service into DO or PT generally
- Sum up the numerical columns based on the id cols and the new categorical variables you just created (not the old ones), so you have one row per agency-groupedmode-groupedservice
- Try running the clustering; 10 clusters is still fine for now
- Print out a sample of 5 rows from each cluster with their characteristics
- Print summary stats grouped by cluster (like mean UPT, for example)
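A rough sketch of that flow, under stated assumptions: `data` is the notebook's combined frame, and the id/numeric column names below (`agency`, `grouped_mode`, `grouped_service`, `upt`, `vrm`, `vrh`) are hypothetical stand-ins for whatever the real columns are called:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

id_cols = ["agency", "grouped_mode", "grouped_service"]  # hypothetical names
num_cols = ["upt", "vrm", "vrh"]                         # hypothetical names

# One row per agency-groupedmode-groupedservice, numeric columns summed.
flat = data.groupby(id_cols, as_index=False)[num_cols].sum()

preprocessor = ColumnTransformer(
    [
        ("numerical", StandardScaler(), num_cols),
        ("categorical", OneHotEncoder(drop="first", sparse_output=False),
         ["grouped_mode", "grouped_service"]),
    ]
)
pipeline = make_pipeline(preprocessor, KMeans(n_clusters=10, random_state=42))
flat["cluster_name"] = pipeline.fit_predict(flat)

# 5 sample rows per cluster (replace=True guards against tiny clusters)...
print(flat.groupby("cluster_name").sample(5, replace=True))
# ...and summary stats by cluster, e.g. mean UPT.
print(flat.groupby("cluster_name")[num_cols].mean())
```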
Issue:
Notebook that holds notes from the Google Machine Learning Crash Course, specifically about categorical and numerical feature columns (feature crosses, log scaling, binning).
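As a quick illustration of two of those techniques, a sketch not taken from the notebook (the ridership numbers are made up):

```python
import numpy as np
import pandas as pd

upt = pd.Series([120, 3_400, 58_000, 1_200_000], name="upt")

# Log scaling compresses a heavy-tailed numeric feature.
upt_log = np.log1p(upt)

# Binning buckets the values into quantile-based categories.
upt_binned = pd.qcut(upt, q=2, labels=["low", "high"])

print(pd.concat([upt, upt_log.rename("upt_log"), upt_binned.rename("upt_bin")], axis=1))
```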
Categorized NTD feature columns into different groups. Also explored separating modes by fixed guideway and non-fixed guideway. Explored what "flattening" NTD data looks like after aggregating by:
Tested a 3D scatter plot using Plotly Express.
09/16/2025