Skip to content
This repository was archived by the owner on Jul 16, 2019. It is now read-only.

Inherit from Dask Dataframe classes#43

Closed
mrocklin wants to merge 6 commits intorapidsai:masterfrom
mrocklin:dask-inherit
Closed

Inherit from Dask Dataframe classes#43
mrocklin wants to merge 6 commits intorapidsai:masterfrom
mrocklin:dask-inherit

Conversation

@mrocklin
Copy link
Copy Markdown
Collaborator

This allows us to remove a large amount of boilerplate code.

This requires changes upstream both in Dask DataFrame to enable better
subclassing and in PyGDF to better match the Pandas API.

This allows us to remove a large amount of boilerplate code.

This requires changes upstream both in Dask DataFrame to enable better
subclassing and in PyGDF to better match the Pandas API.
@mrocklin
Copy link
Copy Markdown
Collaborator Author

raises.match(r"\s+\|\s+".join(['bad', 'float32', 'float64']))
# print("out")
# raises.match(r"^Metadata mismatch found in `from_delayed`.")
# raises.match(r"\s+\|\s+".join(['bad', 'float32', 'float64']))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove dead code

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't dead. I had to comment it out to get tests to pass. I prefer to leave it here for now until we decide what the plan is on this.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! Can you please put a comment here to that effect. This is something we do in libgdf etc for "future code"

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to actually fix the test. Sorry, I should have mentioned above that this PR is a work-in-progress. I expect that I'll end up doing more changes before it's done. I've pushed up the first round of PRs just so that folks can take a look at what I'm proposing. I don't think that the time has come for scrutiny on the level of style or comments.

@mrocklin
Copy link
Copy Markdown
Collaborator Author

So, some reasons not to do this:

  1. dask-gdf becomes more strongly tied to upstream
  2. we'll need to implement NotImplemented methods to cover functionality that we don't implement. These classes just got a ton of new methods, many of which don't currently work.
  3. ... ?

@kkraus14
Copy link
Copy Markdown
Contributor

So, some reasons not to do this:

  1. dask-gdf becomes more strongly tied to upstream
  2. we'll need to implement NotImplemented methods to cover functionality that we don't implement. These classes just got a ton of new methods, many of which don't currently work.
  3. ... ?
  1. Given that we're already heavily tied to dask and distributed and will likely be tied very tightly to upstream as we add functionality to dask and distributed for GPUs I don't see this as an issue.
  2. I'm okay with this and it will give us additional functionality targets that we should have anyway. That being said there is some places we diverge from the Pandas API and those situations would need to be handled: multi-index, additional parameters for things like Joins / GroupBys, etc.
  3. I'd be +1 in favor of extending Dask.DataFrame to work with additional DataFrame classes underneath.

@mrocklin
Copy link
Copy Markdown
Collaborator Author

OK. I'll work on cleaning up things on the dask side and I'll add a couple checks to the pygdf side. Would we want to merge this if both of those go in? That would require us to wait on a release until the next time dask-core releases (probably at least a week or two away) and would pin us above that version for the future.

@kkraus14
Copy link
Copy Markdown
Contributor

Does dask-core publish nightlies to Conda? Given the "alpha" state of dask-gdf I'd be happy to have a dependency on pre-release Dask versions as we've done with Numba in the past for PyGDF as we identify functionality that should be added or bugs get fixed.

@mrocklin
Copy link
Copy Markdown
Collaborator Author

Does dask-core publish nightlies to Conda

No, but it's pure Python, so we typically pip install git+https://github.com/dask/dask

@kkraus14
Copy link
Copy Markdown
Contributor

I think relying on master until needed changes / features make it into a release is okay at this point.

@mrocklin
Copy link
Copy Markdown
Collaborator Author

mrocklin commented Sep 27, 2018 via email

@mrocklin
Copy link
Copy Markdown
Collaborator Author

Closing in favor of #48

@mrocklin mrocklin closed this Dec 18, 2018
@mrocklin mrocklin deleted the dask-inherit branch December 18, 2018 19:06
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants