
Conversation

@JavierCVilla JavierCVilla commented Jul 3, 2019

Allows calling the following operations:

  • GetColumnNames
  • GetDefinedColumnNames
  • GetColumnType
  • GetFilterNames

This PR adds a new parameter trigger_loop to the execute method in the Backend class. However, we still need to avoid recreating all the nodes when execute is called multiple times.

TO-DOs

  • Avoid duplicated nodes
  • Define operations in a single place; they are currently duplicated (maybe in a separate PR)
  • Mark these as unsupported operations in Spark.
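The trigger_loop idea described above could be sketched roughly as follows. This is an illustrative assumption, not the PR's actual implementation: Backend and trigger_loop come from the PR, while FakeResultPtr, Local, and the (node, value) pair shape are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """Sketch of the Backend interface with the new trigger_loop flag."""
    @abstractmethod
    def execute(self, generator, trigger_loop=False):
        ...

class FakeResultPtr:
    # Stand-in for ROOT's RResultPtr: GetValue triggers the event loop.
    def __init__(self, v):
        self._v = v
        self.triggered = False
    def GetValue(self):
        self.triggered = True
        return self._v

class Local(Backend):
    def execute(self, generator, trigger_loop=False):
        # 'generator' is assumed here to yield (node, value) pairs.
        results = []
        for node, value in generator:
            if trigger_loop and hasattr(value, 'GetValue'):
                # Action results need the event loop; info results
                # (plain strings, lists) have no GetValue and pass through.
                node['value'] = value.GetValue()
            results.append(value)
        return results
```

With trigger_loop=False, info operations can return their fundamental-type results without ever starting the event loop.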

node_py = self.head_node
else:
- if node_py.operation.is_action():
+ if node_py.operation.is_action() or node_py.operation.is_info():
Owner:

What would this imply when we add support for instant actions? Something like this would maybe start to get too long:

if node_py.operation.is_action() or node_py.operation.is_info() or node_py.operation.is_instant_action():

Instead, could it be better to write the following?

if not node_py.operation.is_transformation():
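For context, the two styles being compared might look like this. The Operation class below is a toy sketch; the real PyRDF operation lists and classification differ.

```python
class Operation:
    # Toy classification for illustration; the real PyRDF lists differ.
    ACTIONS = {'Count', 'Histo1D'}
    INFOS = {'GetColumnNames', 'GetDefinedColumnNames',
             'GetColumnType', 'GetFilterNames'}
    TRANSFORMATIONS = {'Filter', 'Define'}

    def __init__(self, name):
        self.name = name

    def is_action(self):
        return self.name in self.ACTIONS

    def is_info(self):
        return self.name in self.INFOS

    def is_transformation(self):
        return self.name in self.TRANSFORMATIONS

op = Operation('GetColumnNames')
explicit = op.is_action() or op.is_info()   # grows with every new kind
concise = not op.is_transformation()        # single negated check
```

The negated check stays stable as new non-transformation kinds (e.g. instant actions) are added, at the cost of being less explicit about what the branch handles.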

Owner:

Also, this function is called get_action_nodes, so we need to remember to change its name afterwards.

Collaborator (Author):

I prefer the explicit expression; for now we do not require three conditions, and we could even include instant_actions under info (or a more suitable name) if we need to support them.

parent_node = pyroot_node

- if node_py.operation.is_action():
+ if node_py.operation.is_action() or node_py.operation.is_info():
Owner:

Same as above; maybe use if not self.node_py.operation.is_transformation():

Collaborator (Author):

Same answer as above. Indeed, this could be refactored to use a single function that returns both nodes and values.

# 'RResultPtr's of action nodes
- nodes[i].value = values[i].GetValue()
+ if trigger_loop and hasattr(value, 'GetValue'):
+ # Info actions do not have GetValue
Owner:

I wouldn't call them "info actions". Info operations do not trigger the event loop, and they have a different scope from the action operations.

# those should be in scope while doing
# a 'GetValue' call on them
- nodes[i].ResultPtr = values[i]
+ node.ResultPtr = value
Owner:

Not all of these outputs are RResultPtrs. Of all the "other operations" of the ROOT RDataFrame class, only Display returns an RResultPtr.

Owner:

Furthermore, for those nodes that actually return RResultPtrs, we are storing both the pointer and the pointed result. Why do we need that?

node_py.pyroot_node = pyroot_node

# The new pyroot_node becomes the parent_node for the next
# recursive call
Owner:

Now this part becomes a little fuzzy. While before we were actually returning PyROOT class instances in the form of lazy transformations and booked actions, now we can also return fundamental types like strings and vectors.
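The distinction raised here could be sketched as follows: transformations and booked actions come back as lazy proxy objects, while info operations now return plain Python values eagerly. All names below (Proxy, dispatch, INFO_OPS) are illustrative assumptions, not PyRDF's actual API.

```python
class Proxy:
    """Lazy handle to a booked node; evaluated only on demand."""
    def __init__(self, node):
        self.node = node

# Hypothetical list of operations that return fundamental types.
INFO_OPS = {'GetColumnNames', 'GetDefinedColumnNames',
            'GetColumnType', 'GetFilterNames'}

def dispatch(operation_name, result):
    if operation_name in INFO_OPS:
        # Info operations hand back fundamental types (strings,
        # vectors) immediately.
        return result
    # Transformations and actions stay lazy behind a proxy.
    return Proxy(result)
```

This mixed return contract is exactly what makes the code path "fuzzy": callers can no longer assume every result is a PyROOT-style lazy object.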


@abstractmethod
- def execute(self, generator):
+ def execute(self, generator, trigger_loop=False):
Owner:

I like this; now that I see it, I wish it had been added before. One counterargument to it, though: in this way we are creating pyroot nodes even when no computation is involved, aren't we?

I am thinking of a situation in which the user first wants to find out the names of the columns:

df = PyRDF.RDataFrame("some_tree","some_file.root")

colnames = df.GetColumnNames() # this will create pyroot nodes and return a list of strings

and then executes some operations

filter1 = df.Filter("valid c++ filter")
histo1 = filter1.Histo1D("col1")

histo1.Draw() # this triggers the event loop, recreating the same pyroot objects

A way to solve this would be to implement some checks in the mapper function of the CallableGenerator, e.g.

if not node_py.pyroot_node:
    RDFOperation...
    [...]
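The check suggested above amounts to caching the created pyroot node on the PyRDF node, so that repeated execute calls reuse it instead of rebuilding the graph. Below is a sketch under that assumption; NodePy, map_node, and the tuple stand-in for the ROOT-side node are all hypothetical.

```python
class NodePy:
    """Stand-in for a PyRDF graph node."""
    def __init__(self, operation):
        self.operation = operation
        self.pyroot_node = None   # cached ROOT-side node, None until built

created = []  # records how many ROOT-side nodes we actually create

def map_node(node_py):
    # Build the pyroot node only once; later event loops reuse it.
    if node_py.pyroot_node is None:
        created.append(node_py.operation)          # simulate the RDF call
        node_py.pyroot_node = ('pyroot', node_py.operation)
    return node_py.pyroot_node

node = NodePy('Filter')
first = map_node(node)    # creates the pyroot node
second = map_node(node)   # reuses the cached one
```

With this memoization, a GetColumnNames call followed by histo1.Draw() would walk the same graph without recreating the pyroot objects.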

return FriendInfo(friend_names, friend_file_names)

- def execute(self, generator):
+ def execute(self, generator, trigger_loop=True):
Owner:

Distributed info operations won't work now, will they? Similar to how Count doesn't work in Spark: it returns a fundamental type, so we can't merge it. We should add all info operations to the operations not supported by Dist.
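Marking the info operations as unsupported in the distributed backend could follow the pattern presumably already used for Count. The sketch below is an assumption: DistBackend, check_supported, and the exception type are illustrative, not PyRDF's real names.

```python
class DistBackend:
    # Operations whose results are fundamental types that cannot be
    # merged across workers (illustrative; the real PyRDF list may differ).
    UNSUPPORTED = {'Count', 'GetColumnNames', 'GetDefinedColumnNames',
                   'GetColumnType', 'GetFilterNames'}

    def check_supported(self, operation_name):
        # Fail early, before any work is distributed to the cluster.
        if operation_name in self.UNSUPPORTED:
            raise RuntimeError(
                operation_name + " is not supported in distributed mode")

backend = DistBackend()
backend.check_supported('Histo1D')   # mergeable action: passes
```

Mergeable results (histograms, sums) can be reduced across workers; a list of column names or a plain count from each worker has no meaningful merge, hence the early rejection.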

except TypeError as e:
    self.proxied_node.children.remove(newNode)
    raise e
return newNode.ResultPtr
Owner:

In this case newNode.ResultPtr will not hold an RResultPtr but one of the types returned by the "other operations" of ROOT RDataFrame, such as strings or vectors. Only Display is an exception, but it is not included in this PR.

Furthermore, while I get that creating a new node also for info operations guarantees, via the generator.execute function, that the same event loop is executed only once, I don't think these kinds of operations are suited to be "nodes" in the graph; they are just querying the dataframe for metadata.
