
Refactored unit testing #1

Merged

F3z11 merged 7 commits into F3z11:main from Dr4k3z:main on Dec 16, 2025

Conversation

Dr4k3z (Contributor) commented Dec 4, 2025

Hey guys,
what’s up?

First of all, great work — I really like your implementation! 🙂

I took the liberty of slightly refactoring the code. In particular, I focused on the following points:

  1. Type hinting and formatting
    Using ruff and mypy, I added some basic type-hinting and formatting rules directly in the pyproject.toml. I feel these make the code more readable and maintainable. I also added GitHub Actions workflows to run these checks automatically on every push/PR.

  2. Unit testing
    I refactored the unit test you provided to use pytest.

  3. Scikit-learn integration
    I noticed that you inherit from BaseEstimator and ClassifierMixin. Scikit-learn provides a useful guide on how to correctly implement estimators that remain coherent with the rest of the library. You can find it here.
    Unfortunately, the guide is somewhat outdated and misses a few changes introduced in recent versions (>= 1.6), such as the recommended use of the validate_data API — which I talk about more here.

    I’ve implemented a basic testing pipeline to verify the estimator’s consistency using scikit-learn’s check_estimator. You can find these tests in test_sklearn_compatibility.py.
    The key test is:

    https://github.com/Dr4k3z/RegularizedDiscriminantAnalysis/blob/5010f0122a87cfd2c1a5e306b2781268347afe41/tests/test_sklearn_compatibility.py#L41-L43

    which passes in the latest commit of this PR.

    In the same file, you’ll also find other tests (_test_pipeline_usage and _test_grid_search_cv) that are currently not being executed. They come from another project of mine. I think they could be useful because they show how the estimator behaves in a real pipeline, but they would need to be adapted to your workflow.

    All modifications to fit() and predict() were made solely to comply with scikit-learn’s API standards; they do not alter the core logic of your implementation.
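For reference, the ruff and mypy rules from point 1 might look roughly like the following pyproject.toml excerpt (a hypothetical sketch; the rule selection and version are assumptions, not the actual configuration in this PR):

```toml
# Hypothetical pyproject.toml excerpt -- illustrative, not the PR's actual config
[tool.ruff]
line-length = 100

[tool.ruff.lint]
# E/F: pycodestyle and pyflakes errors, I: import sorting, ANN: missing annotations
select = ["E", "F", "I", "ANN"]

[tool.mypy]
python_version = "3.11"
disallow_untyped_defs = true
```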
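The pytest refactor from point 2 follows the usual pytest shape; here is a minimal, hypothetical sketch of that style, with scikit-learn's LinearDiscriminantAnalysis standing in for the RDA estimator (names and data are illustrative, not taken from the repo):

```python
import numpy as np
import pytest
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


def make_toy_data(seed: int = 0):
    # Small, reproducible two-class dataset
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(60, 4))
    y = np.repeat([0, 1], 30)
    return X, y


def test_fit_predict_shapes():
    X, y = make_toy_data()
    clf = LinearDiscriminantAnalysis().fit(X, y)
    y_pred = clf.predict(X)
    assert y_pred.shape == y.shape
    assert set(np.unique(y_pred)) <= set(np.unique(y))


def test_predict_before_fit_raises():
    # Calling predict on an unfitted estimator must raise
    X, _ = make_toy_data()
    with pytest.raises(Exception):
        LinearDiscriminantAnalysis().predict(X)
```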
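For point 3, a compatibility test built on check_estimator typically looks like the sketch below (hypothetical; scikit-learn's own QuadraticDiscriminantAnalysis stands in for the RDA estimator, and parametrize_with_checks is one common way to wire the checks into pytest):

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.utils.estimator_checks import check_estimator, parametrize_with_checks


# One pytest case per scikit-learn API check (fit/predict contracts,
# input validation, cloning, pickling, ...)
@parametrize_with_checks([QuadraticDiscriminantAnalysis()])
def test_sklearn_compatible_estimator(estimator, check):
    check(estimator)


# Equivalent one-shot form: raises if any check fails
def test_check_estimator():
    check_estimator(QuadraticDiscriminantAnalysis())
```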

One point we may want to discuss further is variable naming. Scikit-learn's API conventions only allow fit() to set public attributes that carry a trailing underscore (estimated attributes such as classes_); anything else assigned during fitting should be private, so I’ve changed most of them to private. In my opinion, some of these do not need to be class members at all and could simply be local variables. Let me know what you think.

I hope you find the changes useful — I really enjoyed working on this!
Have a great day ❤️

F3z11 (Owner) left a comment


Hi Matteo! Nice to hear from you :))

First of all, I am glad you appreciated our implementation; that really matters to us. Secondly, and no less important, thank you for your contribution from a software-engineering point of view. We really appreciate it.

I reviewed your PR, and you made a lot of improvements in stability and API compliance that make our RDA implementation stronger.
I noticed just a couple of minor issues, which you will find commented in the related files, but nothing substantial.
Regarding the variable names in the fit method, I proposed a schema at the beginning of the function. I’m happy to discuss this further if you have questions.

If you have any other questions, feel free to ask.
Thank you again for your contribution, we really appreciate it! 😊

"""
return hasattr(self, "_is_fitted") and self._is_fitted

def fit(self, X: np.ndarray, y: np.ndarray) -> RegularizedDiscriminantAnalysis:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About the variable naming in fit, I propose this framework:

  • Public attributes: I think we can start from the public attributes exposed by the LDA and QDA implementations in sklearn: classes_, n_features_in_, means_, priors_, pooled_covariances_ (we can think about renaming it to just covariance_ to match sklearn style).
  • Local variables: the per-class covariances (QDA exposes these raw, but in RDA we use the pooled version, which we make public), class_counts, and n_samples. The last two are not used after fit and are not very informative, since the priors are rarely known in advance and the number of samples can be obtained by inspecting the data.
  • Private attributes: I do not see any attribute needed after fitting the model that should be kept private.

Dr4k3z (Contributor, Author) replied:

I'm afraid self._class_counts must remain a private field: it is defined within the fit() method and later used in _apply_regularization(). We could also opt to pass it as a parameter instead.

F3z11 (Owner) replied:

Alright, I reviewed it, and I agree with you. Let's keep it private.
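To summarize the agreed layout, here is a hypothetical sketch of how fit() could assign the attributes discussed above (names follow scikit-learn's trailing-underscore convention; the class and the covariance math are illustrative stand-ins, not the actual RDA implementation):

```python
import numpy as np


class RDAFitSketch:
    """Illustrative only: the attribute layout agreed in this thread."""

    def fit(self, X: np.ndarray, y: np.ndarray) -> "RDAFitSketch":
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        # Public fitted attributes (trailing underscore, sklearn style)
        self.classes_, class_counts = np.unique(y, return_counts=True)
        self.n_features_in_ = X.shape[1]
        self.priors_ = class_counts / y.shape[0]
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        # Local variable: per-class covariances only feed the pooled estimate
        covariances = [np.cov(X[y == c], rowvar=False, bias=True)
                       for c in self.classes_]
        self.pooled_covariances_ = np.average(covariances, axis=0,
                                              weights=class_counts)
        # Private: kept on the instance because _apply_regularization()
        # needs it after fit(), as agreed above
        self._class_counts = class_counts
        return self
```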

Dr4k3z (Contributor, Author) commented Dec 15, 2025

I've addressed your reviews; thanks for the feedback. About the tests that are not currently being run, test_pipeline_usage() and test_grid_search_cv(): are they relevant? Can you think of a use case in which it would be useful to run your classifier inside a sklearn.Pipeline or sklearn.GridSearchCV? If not, then I'll remove them in the next commit.

Also, I suggest adding badges to the README to highlight that the Ruff, MyPy, and pytest actions are passing. Something like:

[![mypy](https://github.com/USERNAME/RegularizedDiscriminantAnalysis/actions/workflows/mypy.yml/badge.svg)](https://github.com/USERNAME/RegularizedDiscriminantAnalysis/actions/workflows/mypy.yml)

P.S. If you're looking for an alternative to mypy for type-checking, check out ty

F3z11 (Owner) commented Dec 15, 2025

Yes! They would be very useful, especially because RDA is a distance-based method, so standardization is recommended as preprocessing and its integration into a pipeline would be valuable. The GridSearchCV test can also be helpful, since RDA has two hyperparameters to tune (sorry, I forgot to mention this part of your commit last time, my bad).
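As a hypothetical illustration of both points, a pipeline and grid-search test could take roughly this shape (scikit-learn's QuadraticDiscriminantAnalysis and its reg_param stand in for the RDA estimator and its two hyperparameters; data and parameter grid are made up):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-class problem
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=4, n_redundant=0, random_state=0)

# Standardize first (RDA is distance-based), then classify
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", QuadraticDiscriminantAnalysis()),
])

# Tune the classifier's hyperparameters through the pipeline; the RDA
# estimator would expose its two parameters the same way ("clf__<param>")
grid = GridSearchCV(pipe, {"clf__reg_param": [0.0, 0.1, 0.5]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```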

For the Ruff, MyPy, and pytest badges, feel free to add them in the next (and, I believe, final) commit :))

Thank you for the suggestion about ty, I will definitely look into it; it seems interesting! For now, let's keep mypy for type checking :)

Thanks again for your work, I appreciate it!

Dr4k3z (Contributor, Author) commented Dec 16, 2025

That should do it :)
The README badges will start working once the PR is merged.

@F3z11 F3z11 merged commit efbf429 into F3z11:main Dec 16, 2025