Validation data generation, removal of loss function overrides #234
kaueltzen wants to merge 10 commits into ppdebreuck:master from
Conversation
…t target instead.
Hi @ppdebreuck I have a question regarding this comment: Do you think it would make sense to enable the passing of a loss function for classification tasks in
ppdebreuck
left a comment
Thanks @kaueltzen ! Looks good, there just might be something off in the stratified split, please let me know if you agree.
modnet/models/vanilla.py
Outdated
    return validate_model(**kwargs)


def generate_shuffled_and_stratified_val_split(
I prefer to have this function under modnet.utils and import it as from modnet.utils import generate_shuffled_and_stratified_val_split. This will avoid having it in __all__
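For illustration, the suggested layout might look like the sketch below; the function body is elided and the signature is an assumption, only the import pattern is the point:

```python
# modnet/utils.py  -- the helper lives here, so vanilla.py's namespace / __all__ stays clean
def generate_shuffled_and_stratified_val_split(y, val_fraction, classification=False, random_state=None):
    ...

# modnet/models/vanilla.py
from modnet.utils import generate_shuffled_and_stratified_val_split
```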
modnet/models/vanilla.py
Outdated
if isinstance(y[0][0], list) or isinstance(y[0][0], np.ndarray):
    ycv = np.argmax(y[0], axis=1)
else:
    ycv = y[0]
This is most likely wrong. It's picking the first row, but we need the first column: ycv = y[:,0]
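As a small illustration of the row/column mix-up (array layout assumed to be samples-by-targets, as for data.df_targets.values):

```python
import numpy as np

y = np.array([[0, 10],
              [1, 20],
              [0, 30]])   # 3 samples, 2 targets

print(y[0])     # [ 0 10] -> first *row*: one sample, all targets
print(y[:, 0])  # [0 1 0] -> first *column*: one target for all samples, what stratification needs
```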
Hi @ppdebreuck thanks for pointing that out!
It was written for MODNetModel.fit() to handle a list of props from targets_groups, but outside of vanilla.py / generate_shuffled_and_stratified_val_data the targets are indeed not correctly handled, I will change it.
modnet/models/vanilla.py
Outdated
else:
    ycv = y[0]
return train_test_split(
    range(len(y[0])),
this can simply become range(len(ycv))
Edit: let's do range(len(y)), so it's compatible with my last comment.
modnet/models/vanilla.py
Outdated
| """ | ||
| if classification: | ||
| if isinstance(y[0][0], list) or isinstance(y[0][0], np.ndarray): | ||
| ycv = np.argmax(y[0], axis=1) |
Stratifying a multilabel case is a bit tricky, and you probably don't need it? So we can skip it: ycv=None
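A hedged sketch of what combining the suggestions in this thread could look like (the helper name and exact branching are assumptions, not the PR's final code; y is assumed to be a samples-by-targets array):

```python
import numpy as np

def stratification_labels(y, classification):
    """Sketch: labels to pass as `stratify`; None means no stratification."""
    if not classification:
        return None              # regression: never stratify
    if isinstance(y[0][0], (list, np.ndarray)):
        return None              # multilabel / one-hot targets: skip stratification
    return y[:, 0]               # single-label: stratify on the first target column
```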
)
return (
    x[train_idx],
    [t[train_idx] for t in y],
This doesn't seem correct? y[train_idx] should work, right? It would pick the wrong columns now.
The y of MODNetModel.fit() and of generate_shuffled_and_stratified_val_data has its first two dimensions as (n_target_groups, n_samples), while the y of generate_shuffled_and_stratified_val_split that is used elsewhere (data.df_targets.values) has its first two dimensions as (n_samples, n_targets).
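To make the two layouts concrete, a small sketch (shapes and variable names are illustrative, based on the description above):

```python
import numpy as np

# Layout inside MODNetModel.fit(): (n_target_groups, n_samples)
y_grouped = [np.array([0, 1, 0, 1]), np.array([1.2, 3.4, 5.6, 7.8])]
train_idx = [0, 2]
y_train_grouped = [t[train_idx] for t in y_grouped]   # index each target group separately

# Layout used elsewhere (data.df_targets.values): (n_samples, n_targets)
y_flat = np.array([[0, 1.2], [1, 3.4], [0, 5.6], [1, 7.8]])
y_train_flat = y_flat[train_idx]                       # rows are samples, so plain row indexing works
```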
No problem with me, as you can put ROC_AUC as the default metric to keep the current behavior, while adding flexibility if you need to change it :) (I would put it in a separate PR)
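Purely as an illustration of the "default but overridable" idea (the helper and its signature are hypothetical, not MODNet's actual API):

```python
from sklearn.metrics import roc_auc_score

def evaluate_classification(y_true, y_prob, metric=roc_auc_score):
    """Hypothetical helper: ROC AUC by default, but any callable metric can be passed instead."""
    return metric(y_true, y_prob)
```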
…rected axes of y in generate_shuffled_and_stratified_val_split and generate_shuffled_and_stratified_val_data
ppdebreuck
left a comment
Thanks for the modifications @kaueltzen, I think this can be merged?
Resolves #228.
Overview
Replacement of train_test_split with explicit generation of validation data that is reproducible, shuffled and stratified (if classification tasks are present) in fit of MODNetModel
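A minimal sketch of such a split, assuming scikit-learn's train_test_split underneath; the signature and defaults are assumptions, not necessarily the PR's implementation (the fixed random_state is what makes it reproducible):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def generate_shuffled_and_stratified_val_split(y, val_fraction=0.1, classification=False, random_state=42):
    """Return (train_idx, val_idx); stratify on the first target column for classification."""
    stratify = y[:, 0] if classification else None
    return train_test_split(
        np.arange(len(y)),        # split sample indices, not the data itself
        test_size=val_fraction,
        shuffle=True,
        stratify=stratify,
        random_state=random_state,
    )
```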