- 
          
- 
                Notifications
    You must be signed in to change notification settings 
- Fork 260
Populate make_classification_df date functionality with random dates #851
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Populate make_classification_df date functionality with random dates #851
Conversation
        
          
                dask_ml/datasets.py
              
                Outdated
          
        
      | dd.from_array( | ||
| np.array([random_date(*dates)] * len(X_df)), | ||
| chunksize=chunks, | ||
| np.array([random_date(*dates) for i in range(len(X_df))]), | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like there's some test errors because random_dates doesn't have a random seed. Maybe this signature would help?
def random_dates(start, end, random_state=None):
    rng = check_random_state(random_state)
    ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that could be a probable reason, I'll work on it. And also, the test written for the function(make_classification_df) doesn't check for true randomness. I'm thinking of adding a new test/modifying the existing one for the same, will that suffice?
link to test_make_classification_df
| Here's what I've done in the latest commit : 
 Another observation I made earlier today was that: the main repository when cloned and existing tests run, I observed the same number of errors that I previously got when running the tests after making my modification in the earlier commit. Could the maintainers of the repository look into the same? | 
| chunks=100, | ||
| dates=(date(2014, 1, 1), date(2015, 1, 1)), | ||
| ) | ||
| check_randomness = np.unique((X_df["date"] == X_df1["date"]).compute()) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good, this checks the random state.
Shouldn't this also check that there's more than one unique value? That's what #845 is focused on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the code I've written checks for repeatability, on account of the seed. Since the the numpy's randint function is a deterministic pseudo random number generator, we can be sure that it will produce a random set of numbers. I've ensured in the code that the function random_date is called multiple times, each with a different seed. Hence I feel the line np.unique(X["date"]).size >= threshold would be redundant. Open to any thoughts you might have here @stsievert
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same random_seed=123 is passed to both calls to make_classification_df; I would expect every value in X_df["date"] and X_df1["date"] to be the same.
I think X["date"].nunique() >= threshold would be a lot simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that sounds good. Now I've to figure out, what would be a good threshold. Will n_samples/2 be good?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, threshold=n_samples/2 is more than good. I think threshold=2 would suffice; that'd make sure #845 is resolved.
| chunksize=chunks, | ||
| np.array( | ||
| [ | ||
| random_date(*dates, random_state + i) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens when random_state isn't an integer?  Scikit-learn allows for random_state to be an integer, None or an instance of np.random.RandomState (source).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I will have to raise a ValueError exception there, on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use this code?
rng = check_random_state(random_state)
dates = [random_date(*dates, rng) for i in range(len(X_df))]
...There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code above will produce the same random number since the seed(rng) remains the same in subsequent calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main point: I think np.random.RandomState and None should be acceptable types for random_state. I'm fine expanding the for-loop, though I don't think that needs to happen:
[ins] In [193]: def random_dates(random_state):
           ...:     return random_state.randint(100)
           ...:
[ins] In [194]: rng = np.random.RandomState(42)
[ins] In [196]: [random_dates(rng) for _ in range(20)]
Out[196]: [51, 92, 14, 71, 60, 20, 82, 86, 74, 74, 87, 99, 23, 2, 21, 52, 1, 87, 29, 37]There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can maybe just check if random_state is one of the accepted values like this and accordingly proceed -
if random_state is not None or not isinstance(random_state, np.random.RandomState) or not isinstance(random_state,int): print("random_state is not to be accepted")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That runs counter to the use of random_state in Scikit-learn. random_date is public, so it should accept all types of random_state that Scikit-learn accepts.
If random_date were a private function, I wouldn't really care
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
random_state : int, RandomState instance or None, optional (default=None), these are values accepted by Scikit-Learn's random_state. I think I can check if the random_state is in neither of the accepted values(Scitkit and Numpy) and set is as the default None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scikit-learn's check_random_state function will likely be useful: https://scikit-learn.org/stable/modules/generated/sklearn.utils.check_random_state.html
It takes those values and produces the correct random seed generator.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stsievert the accepted types of random_state in Scikit and Numpy appear to be the same.
Refer this: Scikit's version also accepts Numpy's accepted values. Refer this: https://scikit-learn.org/dev/glossary.html#term-random_state
        
          
                dask_ml/datasets.py
              
                Outdated
          
        
      |  | ||
| def random_date(start, end): | ||
| def random_date(start, end, random_state=None): | ||
| rng_random_date = dask_ml.utils.check_random_state(random_state) | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit:
| rng_random_date = dask_ml.utils.check_random_state(random_state) | |
| rng_random_date = sklearn.utils.check_random_state(random_state) | 
That way the .compute() can be avoided (especially relevant on repeated calls.).
        
          
                dask_ml/datasets.py
              
                Outdated
          
        
      | or not isinstance(random_state, np.random.RandomState) | ||
| or not isinstance(random_state, int) | ||
| ): | ||
| random_state = None | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the block necessary? None is already the default value for random_state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was to address the issue you mentioned earlier, "what if random_state is not an integer or any of the accepted values"
This fixes the date functionality of make_classification_df as mentioned in #845
Overview of the problem :
dask_ml.datasets.make_classification_dfwith a date range provided, fills the date column with just one unique datechunksizeto be equal tochunks(passed in through make_classification_df), populates the date column with NaN values.Findings :
The helper function
random_dateworks perfectly fine, generating a random date given the start and endthis line populates the list with the same date value, rather than calling the
random_datefunctionlen(X_df)time, which is the required fix.Run Existing Tests
Code Formatting (black, flake8, isort)
Custom Tests: Added
Seeking the maintainers and @ScottMGustafson to review/provide feedback on the proposed changes.