
Conversation

Contributor

@csuyat-dot csuyat-dot commented Sep 9, 2025

Issue:

Notebook that holds notes from the Google Machine Learning Crash Course, specifically about categorical and numerical feature columns (feature crosses, log scaling, binning).

Categorized NTD feature columns into different groups. Also explored separating modes by fixed guideway and non-fixed guideway. Explored what "flattening" NTD data looks like after aggregating by (a rough sketch of these aggregations follows the list):

  • mode/service: each row is total upt/vrh/vrm per agency per year
  • mode/service/year: each row is total upt/vrh/vrm per agency across the 5 years
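
A minimal sketch of what those two aggregations could look like in pandas; the column names (agency, mode, service, year, upt, vrh, vrm) and the toy data are assumptions for illustration, not the notebook's actual schema:

```python
import pandas as pd

# Toy stand-in for the NTD extract; columns and values are illustrative only.
df = pd.DataFrame({
    "agency": ["A", "A", "A", "B"],
    "mode": ["MB", "LR", "MB", "MB"],
    "service": ["DO", "DO", "PT", "DO"],
    "year": [2022, 2022, 2023, 2023],
    "upt": [100, 50, 80, 30],
    "vrh": [10, 5, 8, 3],
    "vrm": [120, 60, 90, 40],
})
metrics = ["upt", "vrh", "vrm"]

# Collapse mode/service: one row per agency per year
per_agency_year = df.groupby(["agency", "year"], as_index=False)[metrics].sum()

# Collapse mode/service/year: one row per agency, totals across all years
per_agency = df.groupby("agency", as_index=False)[metrics].sum()
```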

Tested a 3D scatter plot using Plotly Express.
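
For reference, a hedged sketch of the kind of Plotly Express call that produces such a 3D scatter, reusing the per_agency_year frame from the aggregation sketch above (the axis and color choices are assumptions):

```python
import plotly.express as px

# One point per agency-year, with the three NTD metrics on the axes.
fig = px.scatter_3d(
    per_agency_year,
    x="upt",
    y="vrh",
    z="vrm",
    color="agency",
    hover_data=["year"],
)
fig.show()
```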

09/16/2025

  • New section that compares 1-year (2023) data to multi-year (2018-2023) clustering.
  • Created a util module for easier data preparation for clustering and the dendrogram (a rough sketch of this kind of helper follows the list).
  • Moved the Google Machine Learning course notes to a separate markdown file.
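
A rough, hypothetical sketch of the kind of helper such a util module might expose; the function name, signature, and column lists are illustrative, not the actual module API:

```python
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def prep_for_clustering(df: pd.DataFrame, num_cols: list, cat_cols: list):
    """Scale numeric columns and one-hot encode categorical columns,
    returning a dense feature matrix ready for clustering or linkage."""
    preprocessor = ColumnTransformer(
        [
            ("numerical", StandardScaler(), num_cols),
            ("categorical", OneHotEncoder(drop="first", sparse_output=False), cat_cols),
        ]
    )
    return preprocessor.fit_transform(df[num_cols + cat_cols])


# Example usage for a dendrogram (Ward linkage shown as one option):
# X = prep_for_clustering(ntd_df, ["upt", "vrh", "vrm"], ["mode", "service"])
# dendrogram(linkage(X, method="ward"))
```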

@csuyat-dot csuyat-dot self-assigned this Sep 9, 2025

github-actions bot commented Sep 9, 2025


preprocessor = ColumnTransformer(
    [
        ("numerical", StandardScaler(), num_cols),
        ("categorical", OneHotEncoder(drop="first", sparse_output=False), cat_cols),
Member


Investigate the drop parameter options. When I was in econ school, we learned that the best practice is to drop the value that shows up most often. Presumably we can leave the drop parameter empty and drop it ourselves too.

Contributor Author


After researching: OneHotEncoder(drop="first") is equivalent to pd.get_dummies(drop_first=True). If you encode a feature column (service: PT, DO, TX, TN into service_DO, service_PT, etc.), drop="first" drops the first column alphabetically. This is to prevent the dummy variable trap with collinear variables? That part I'm still trying to understand.

Here is what it looks like for this example. service has 4 unique values ['PT', 'DO', 'TX', 'TN']. drop="first" creates the encoded columns 'categorical__service_PT', 'categorical__service_TN', and 'categorical__service_TX'; 'categorical__service_DO' is dropped (it drops the first category alphabetically).

The resulting feature array would look like:

service_PT  service_TX  service_TN
         0           0           0
         1           0           0
         0           1           0
         0           0           1

where [0,0,0] implies service_DO.

It also looks like we can tell drop exactly which category to remove (by passing the value explicitly rather than "first"), so we can drop the most frequent value ourselves. Looking at the data, we can probably drop service_PT.
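
A small runnable demo of both behaviors (assuming scikit-learn >= 1.2 for sparse_output); the service values are the ones from the example above:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

service = pd.DataFrame({"service": ["DO", "PT", "TX", "TN"]})

# drop="first": drops the alphabetically first category (DO here),
# leaving columns for PT, TN, TX
enc_first = OneHotEncoder(drop="first", sparse_output=False)
print(enc_first.fit_transform(service))
print(enc_first.get_feature_names_out())   # ['service_PT' 'service_TN' 'service_TX']

# drop=[...]: one category per feature, chosen by us, e.g. the most frequent (PT)
enc_manual = OneHotEncoder(drop=["PT"], sparse_output=False)
print(enc_manual.fit_transform(service))
print(enc_manual.get_feature_names_out())  # ['service_DO' 'service_TN' 'service_TX']
```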

)
data_fit = data.copy() # why do i need to copy/clone?

data_fit["cluster_name"] = pipeline.fit_predict(data_fit)
Member


Does this mean the cluster name is the predicted value?

Contributor Author


Correct, we are predicting the cluster name for each agency. Each row in the dataframe gets a cluster_name. The result is something like this:
[screenshot: sample rows of the dataframe with the predicted cluster_name column appended]

The grain of this example dataset is ntd_metric (upt/vrh, etc.) per agency/service/mode/year. We'll flatten/aggregate into a better dataset in the next issue.
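
To make the wiring concrete, here is a self-contained sketch of how fit_predict hands back one cluster label per row; the toy data, column names, and the KMeans step are assumptions (the notebook's actual estimator and pipeline may differ):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame at the same grain as the example: one row per agency/service/mode/year.
data = pd.DataFrame({
    "agency": ["A", "A", "B", "B"],
    "service": ["DO", "PT", "DO", "TX"],
    "mode": ["MB", "MB", "LR", "DR"],
    "year": [2023, 2023, 2023, 2023],
    "upt": [100.0, 80.0, 500.0, 20.0],
    "vrh": [10.0, 8.0, 40.0, 2.0],
    "vrm": [120.0, 90.0, 600.0, 25.0],
})
num_cols = ["upt", "vrh", "vrm"]
cat_cols = ["service", "mode"]

preprocessor = ColumnTransformer([
    ("numerical", StandardScaler(), num_cols),
    ("categorical", OneHotEncoder(drop="first", sparse_output=False), cat_cols),
])
pipeline = make_pipeline(preprocessor, KMeans(n_clusters=2, n_init="auto", random_state=0))

# fit_predict returns an array with one cluster label per row;
# assigning it back gives each agency/service/mode/year row its cluster_name.
data_fit = data.copy()
data_fit["cluster_name"] = pipeline.fit_predict(data_fit)
print(data_fit[["agency", "service", "mode", "cluster_name"]])
```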

Member


  • Trolleybus is fixed guideway - it has a pantograph attached to overhead wires.
  • You should simply group the purchased transportation - taxi/TNC with purchased transportation generally.
  • It was interesting to see in the multi-year data that Metro is in one cluster pre-pandemic and another post-pandemic. This plus the correlation matrices tells me that we should consolidate to one year of data or aggregate across years, but we definitely shouldn't leave all years loose. To be explored more in Explore aggregating numerical and categorical data for multiple years #1683

The last thing I'd like to see on this PR is (a rough sketch of these steps follows the list):

  • Clustering on 2023 data only
  • Re-categorize the modes into fixed route, non-fixed route, and other
  • Re-categorize the service into DO or PT generally
  • Sum up the numerical columns based on the id cols and the new categorical variables you just created (not the old ones), so you have one row per agency-groupedmode-groupedservice
  • Try running the clustering; 10 is still fine for now
  • Print out a sample of 5 rows from each cluster with their characteristics
  • Print summary stats grouped by cluster (like mean UPT, for example)
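
A rough sketch of what these steps might look like; the mode/service groupings, column names, and toy data below are assumptions for illustration, not the definitive mapping:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy multi-year frame; real column names and mode/service codes may differ.
ntd = pd.DataFrame({
    "agency": ["A", "A", "B", "B", "C", "C"],
    "mode": ["MB", "LR", "DR", "MB", "HR", "DR"],
    "service": ["DO", "DO", "TX", "PT", "DO", "TN"],
    "year": [2023, 2023, 2023, 2023, 2023, 2022],
    "upt": [100.0, 50.0, 20.0, 80.0, 900.0, 15.0],
    "vrh": [10.0, 5.0, 2.0, 8.0, 70.0, 1.5],
    "vrm": [120.0, 60.0, 25.0, 90.0, 800.0, 18.0],
})
metrics = ["upt", "vrh", "vrm"]

# 1) 2023 data only
df23 = ntd[ntd["year"] == 2023].copy()

# 2) re-categorize modes (assumed grouping, needs domain review)
fixed_route = {"MB", "LR", "HR", "CR", "SR", "TB"}
non_fixed_route = {"DR", "VP"}
df23["mode_group"] = df23["mode"].map(
    lambda m: "fixed_route" if m in fixed_route
    else ("non_fixed_route" if m in non_fixed_route else "other")
)

# 3) re-categorize service: DO stays DO, everything else (PT/TX/TN) folds into PT
df23["service_group"] = df23["service"].map(lambda s: "DO" if s == "DO" else "PT")

# 4) one row per agency / grouped mode / grouped service
grouped = df23.groupby(
    ["agency", "mode_group", "service_group"], as_index=False
)[metrics].sum()

# 5) cluster (10 clusters in the notebook; 2 here so the toy data fits)
preprocessor = ColumnTransformer([
    ("numerical", StandardScaler(), metrics),
    ("categorical", OneHotEncoder(drop="first", sparse_output=False),
     ["mode_group", "service_group"]),
])
pipeline = make_pipeline(preprocessor, KMeans(n_clusters=2, n_init="auto", random_state=0))
grouped["cluster_name"] = pipeline.fit_predict(grouped)

# 6) sample rows per cluster and 7) summary stats per cluster
print(grouped.groupby("cluster_name").head(5))
print(grouped.groupby("cluster_name")[metrics].mean())
```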
