
Conversation

Contributor

@csuyat-dot csuyat-dot commented Sep 9, 2025

Issue:

Notebook that holds notes from the Google Machine Learning Crash Course, specifically about categorical and numerical feature columns (feature crosses, log scaling, binning).

Categorized NTD feature columns into different groups. Also explored separating modes by fixed guideway and non-fixed guideway. Explored what "flattening" NTD data looks like after aggregating by (a rough sketch of these aggregations follows the list):

  • mode/service: each row is total upt/vrh/vrm per agency per year
  • mode/service/year: each row is total upt/vrh/vrm per agency across the 5 years
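
A minimal sketch of what those two aggregations could look like in pandas; the column names (agency, mode, service, year, upt, vrh, vrm) and the toy data are assumptions for illustration, not the notebook's actual schema:

```python
import pandas as pd

# Toy stand-in for the NTD extract; columns and values are illustrative only.
df = pd.DataFrame({
    "agency": ["A", "A", "A", "B"],
    "mode": ["MB", "LR", "MB", "MB"],
    "service": ["DO", "DO", "PT", "DO"],
    "year": [2022, 2022, 2023, 2023],
    "upt": [100, 50, 80, 30],
    "vrh": [10, 5, 8, 3],
    "vrm": [120, 60, 90, 40],
})
metrics = ["upt", "vrh", "vrm"]

# Collapse mode/service: one row per agency per year
per_agency_year = df.groupby(["agency", "year"], as_index=False)[metrics].sum()

# Collapse mode/service/year: one row per agency, totals across all years
per_agency = df.groupby("agency", as_index=False)[metrics].sum()
```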

Tested a 3D scatter plot using Plotly Express.
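
For reference, a hedged sketch of the kind of Plotly Express call that produces such a 3D scatter, reusing the per_agency_year frame from the aggregation sketch above (the axis and color choices are assumptions):

```python
import plotly.express as px

# One point per agency-year, with the three NTD metrics on the axes.
fig = px.scatter_3d(
    per_agency_year,
    x="upt",
    y="vrh",
    z="vrm",
    color="agency",
    hover_data=["year"],
)
fig.show()
```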

09/16/2025

  • New section that compares 1-year (2023) data to multi-year (2018-2023) clustering.
  • Created a util module for easier data preparation for clustering and the dendrogram (a rough sketch of this kind of helper follows the list).
  • Moved the Google Machine Learning course notes to a separate markdown file.
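
A rough, hypothetical sketch of the kind of helper such a util module might expose; the function name, signature, and column lists are illustrative, not the actual module API:

```python
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def prep_for_clustering(df: pd.DataFrame, num_cols: list, cat_cols: list):
    """Scale numeric columns and one-hot encode categorical columns,
    returning a dense feature matrix ready for clustering or linkage."""
    preprocessor = ColumnTransformer(
        [
            ("numerical", StandardScaler(), num_cols),
            ("categorical", OneHotEncoder(drop="first", sparse_output=False), cat_cols),
        ]
    )
    return preprocessor.fit_transform(df[num_cols + cat_cols])


# Example usage for a dendrogram (Ward linkage shown as one option):
# X = prep_for_clustering(ntd_df, ["upt", "vrh", "vrm"], ["mode", "service"])
# dendrogram(linkage(X, method="ward"))
```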

@csuyat-dot csuyat-dot self-assigned this Sep 9, 2025

github-actions bot commented Sep 9, 2025


preprocessor = ColumnTransformer(
    [
        ("numerical", StandardScaler(), num_cols),
        ("categorical", OneHotEncoder(drop="first", sparse_output=False), cat_cols),
Member


Investigate the drop parameter options. When I was in econ school, we learned that the best practice is to drop the value that shows up most often. Presumably we can leave the drop parameter empty and drop it ourselves too.

Contributor Author


After researching: OneHotEncoder(drop="first") is equivalent to pd.get_dummies(drop_first=True). If you encode a feature column (service: PT, DO, TX, TN into service_DO, service_PT, etc.), drop="first" drops the first column alphabetically. This is to prevent the dummy variable trap with collinear variables? That part I'm still trying to understand.

Here is what it looks like for this example. service has 4 unique values ['PT', 'DO', 'TX', 'TN']. drop="first" creates the encoded columns 'categorical__service_PT', 'categorical__service_TN', and 'categorical__service_TX'; 'categorical__service_DO' is dropped (it drops the first category alphabetically).

The resulting feature array would look like:

service_PT  service_TX  service_TN
         0           0           0
         1           0           0
         0           1           0
         0           0           1

where [0,0,0] implies service_DO.

It also looks like we can tell drop exactly which category to remove (by passing the value explicitly rather than "first"), so we can drop the most frequent value ourselves. Looking at the data, we can probably drop service_PT.
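
A small runnable demo of both behaviors (assuming scikit-learn >= 1.2 for sparse_output); the service values are the ones from the example above:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

service = pd.DataFrame({"service": ["DO", "PT", "TX", "TN"]})

# drop="first": drops the alphabetically first category (DO here),
# leaving columns for PT, TN, TX
enc_first = OneHotEncoder(drop="first", sparse_output=False)
print(enc_first.fit_transform(service))
print(enc_first.get_feature_names_out())   # ['service_PT' 'service_TN' 'service_TX']

# drop=[...]: one category per feature, chosen by us, e.g. the most frequent (PT)
enc_manual = OneHotEncoder(drop=["PT"], sparse_output=False)
print(enc_manual.fit_transform(service))
print(enc_manual.get_feature_names_out())  # ['service_DO' 'service_TN' 'service_TX']
```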

)
data_fit = data.copy() # why do i need to copy/clone?

data_fit["cluster_name"] = pipeline.fit_predict(data_fit)
Member


Does this mean the cluster name is the predicted value?

Contributor Author


Correct, we are predicting the cluster name for each agency. Each row in the dataframe gets a cluster_name. The result is something like this:
[screenshot: sample rows of the dataframe with the predicted cluster_name column appended]

The grain of this example dataset is ntd_metric (upt/vrh, etc.) per agency/service/mode/year. We'll flatten/aggregate into a better dataset in the next issue.
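
To make the wiring concrete, here is a self-contained sketch of how fit_predict hands back one cluster label per row; the toy data, column names, and the KMeans step are assumptions (the notebook's actual estimator and pipeline may differ):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame at the same grain as the example: one row per agency/service/mode/year.
data = pd.DataFrame({
    "agency": ["A", "A", "B", "B"],
    "service": ["DO", "PT", "DO", "TX"],
    "mode": ["MB", "MB", "LR", "DR"],
    "year": [2023, 2023, 2023, 2023],
    "upt": [100.0, 80.0, 500.0, 20.0],
    "vrh": [10.0, 8.0, 40.0, 2.0],
    "vrm": [120.0, 90.0, 600.0, 25.0],
})
num_cols = ["upt", "vrh", "vrm"]
cat_cols = ["service", "mode"]

preprocessor = ColumnTransformer([
    ("numerical", StandardScaler(), num_cols),
    ("categorical", OneHotEncoder(drop="first", sparse_output=False), cat_cols),
])
pipeline = make_pipeline(preprocessor, KMeans(n_clusters=2, n_init="auto", random_state=0))

# fit_predict returns an array with one cluster label per row;
# assigning it back gives each agency/service/mode/year row its cluster_name.
data_fit = data.copy()
data_fit["cluster_name"] = pipeline.fit_predict(data_fit)
print(data_fit[["agency", "service", "mode", "cluster_name"]])
```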

Member


  • Trolleybus is fixed guideway - it has a pantograph attached to overhead wires.
  • You should simply group the purchased transportation - taxi/TNC with purchased transportation generally.
  • It was interesting to see in the multi-year data that Metro is in one cluster pre-pandemic and another post-pandemic. This plus the correlation matrices tells me that we should consolidate to one year of data or aggregate across years, but we definitely shouldn't leave all years loose. To be explored more in Explore aggregating numerical and categorical data for multiple years #1683

The last thing I'd like to see on this PR is (a rough sketch of these steps follows the list):

  • Clustering on 2023 data only
  • Re-categorize the modes into fixed route, non-fixed route, and other
  • Re-categorize the service into DO or PT generally
  • Sum up the numerical columns based on the id cols and the new categorical variables you just created (not the old ones), so you have one row per agency-groupedmode-groupedservice
  • Try running the clustering; 10 is still fine for now
  • Print out a sample of 5 rows from each cluster with their characteristics
  • Print summary stats grouped by cluster (like mean UPT, for example)
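
A rough sketch of what these steps might look like; the mode/service groupings, column names, and toy data below are assumptions for illustration, not the definitive mapping:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy multi-year frame; real column names and mode/service codes may differ.
ntd = pd.DataFrame({
    "agency": ["A", "A", "B", "B", "C", "C"],
    "mode": ["MB", "LR", "DR", "MB", "HR", "DR"],
    "service": ["DO", "DO", "TX", "PT", "DO", "TN"],
    "year": [2023, 2023, 2023, 2023, 2023, 2022],
    "upt": [100.0, 50.0, 20.0, 80.0, 900.0, 15.0],
    "vrh": [10.0, 5.0, 2.0, 8.0, 70.0, 1.5],
    "vrm": [120.0, 60.0, 25.0, 90.0, 800.0, 18.0],
})
metrics = ["upt", "vrh", "vrm"]

# 1) 2023 data only
df23 = ntd[ntd["year"] == 2023].copy()

# 2) re-categorize modes (assumed grouping, needs domain review)
fixed_route = {"MB", "LR", "HR", "CR", "SR", "TB"}
non_fixed_route = {"DR", "VP"}
df23["mode_group"] = df23["mode"].map(
    lambda m: "fixed_route" if m in fixed_route
    else ("non_fixed_route" if m in non_fixed_route else "other")
)

# 3) re-categorize service: DO stays DO, everything else (PT/TX/TN) folds into PT
df23["service_group"] = df23["service"].map(lambda s: "DO" if s == "DO" else "PT")

# 4) one row per agency / grouped mode / grouped service
grouped = df23.groupby(
    ["agency", "mode_group", "service_group"], as_index=False
)[metrics].sum()

# 5) cluster (10 clusters in the notebook; 2 here so the toy data fits)
preprocessor = ColumnTransformer([
    ("numerical", StandardScaler(), metrics),
    ("categorical", OneHotEncoder(drop="first", sparse_output=False),
     ["mode_group", "service_group"]),
])
pipeline = make_pipeline(preprocessor, KMeans(n_clusters=2, n_init="auto", random_state=0))
grouped["cluster_name"] = pipeline.fit_predict(grouped)

# 6) sample rows per cluster and 7) summary stats per cluster
print(grouped.groupby("cluster_name").head(5))
print(grouped.groupby("cluster_name")[metrics].mean())
```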
