-
Notifications
You must be signed in to change notification settings - Fork 7
[WIP] Add support for RDataFrame info operations #65
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
56511a4 to
9830f2a
Compare
| node_py = self.head_node | ||
| else: | ||
| if node_py.operation.is_action(): | ||
| if node_py.operation.is_action() or node_py.operation.is_info(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What would this imply when we will add support for instant actions? Something like this would maybe start to be to long
if node_py.operation.is_action() or node_py.operation.is_info() or node_py.operation.is_instant_action():
Instead could it be better to type the following?
if not node_py.operation.is_transformation():
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this function is called get_action_nodes so we need to remember to change the name afterwards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the explicit expression, for now we do not require three conditions and we could even include instant_actions in info or maybe a more suitable name if we need to support it.
| parent_node = pyroot_node | ||
|
|
||
| if node_py.operation.is_action(): | ||
| if node_py.operation.is_action() or node_py.operation.is_info(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, maybe use if not self.node_py.operation.is_transformation():
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same answer as above, indeed this could be refactored and use one single function to return nodes and values.
| # 'RResultPtr's of action nodes | ||
| nodes[i].value = values[i].GetValue() | ||
| if trigger_loop and hasattr(value, 'GetValue'): | ||
| # Info actions do not have GetValue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't call them "info actions". Info operations do not trigger the event loop and they have a different scope than the action operations
| # those should be in scope while doing | ||
| # a 'GetValue' call on them | ||
| nodes[i].ResultPtr = values[i] | ||
| node.ResultPtr = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not all of these outputs are RResultPtrs. Of all the "other operations" of the ROOT RDataFrame class, only Display returns an RResultPtr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Furthermore, for those nodes that actually return RResultPtrs, we are storing both the pointer and the pointed result. Why do we need that?
| node_py.pyroot_node = pyroot_node | ||
|
|
||
| # The new pyroot_node becomes the parent_node for the next | ||
| # recursive call |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now this part becomes a little fuzzy. While before we were actually returning PyROOT class instances in the form of lazy transformations and booked actions, now we can also return fundamental types like strings and vectors.
|
|
||
| @abstractmethod | ||
| def execute(self, generator): | ||
| def execute(self, generator, trigger_loop=False): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, now that I see it I wish it were added before. One counterargument to it though: in this way we are creating pyroot nodes even when no computation is involved, or not ?
I am thinking a situation in which the user first wants to find out the name of a column
df = PyRDF.RDataFrame("some_tree","some_file.root")
colnames = df.GetColumnNames() # this will create pyroot nodes and return a list of strings
and then executes some operations
filter1 = df.Filter("valid c++ filter")
histo1 = filter1.Histo1D("col1")
histo1.Draw() # this triggers the event loop, recreating the same pyroot objects
A way to solve this would be to implement some checks in the mapper function of the CallableGenerator, e.g.
if not node_py.pyroot_node:
RDFOperation...
[...]
| return FriendInfo(friend_names, friend_file_names) | ||
|
|
||
| def execute(self, generator): | ||
| def execute(self, generator, trigger_loop=True): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
distributed info operations won't work now, will they? Similar to how Count doesn't work in spark, it returns a fundamental type so we can't merge it. We should add all info operations to the operations not supported by Dist
| except TypeError as e: | ||
| self.proxied_node.children.remove(newNode) | ||
| raise e | ||
| return newNode.ResultPtr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case newNode.ResultPtr will not hold a RResultPtr but one of the types returned by the "other operations" of ROOT RDataFrame, such as strings or vectors. Only Display makes an exception, but it is not included in this PR.
Furthermore, while I get that creating a new node also for info operations can guarantee the execution of the same event loop only once via the generator.execute function, I don't think that these kind of operations are suited to be "nodes" in the graph, they are just querying the dataframe for metadata
Allows to call the following operations:
GetColumnNamesGetDefinedColumnNamesGetColumnTypeGetFilterNamesThis PR adds a new parameter
trigger_loopto theexecutemethod in theBackendclass. However, we still need to avoid the creation of all the nodes when callingexecutemultiple times.TO-DO's