
Conversation

@JavierCVilla JavierCVilla commented Jul 3, 2019

Allows calling the following operations:

  • GetColumnNames
  • GetDefinedColumnNames
  • GetColumnType
  • GetFilterNames

This PR adds a new parameter trigger_loop to the execute method in the Backend class. However, we still need to avoid recreating all the nodes when execute is called multiple times.

TO-DOs

  • Avoid duplicated nodes
  • Define operations in a single place; they are currently duplicated (maybe in a separate PR)
  • Mark these as unsupported operations in Spark.
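The trigger_loop idea described above could be sketched roughly as follows. This is an illustrative assumption, not the PR's actual implementation: Backend and trigger_loop come from the PR, while FakeResultPtr, Local, and the (node, value) pair shape are hypothetical stand-ins.

```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """Sketch of the Backend interface with the new trigger_loop flag."""
    @abstractmethod
    def execute(self, generator, trigger_loop=False):
        ...

class FakeResultPtr:
    # Stand-in for ROOT's RResultPtr: GetValue triggers the event loop.
    def __init__(self, v):
        self._v = v
        self.triggered = False
    def GetValue(self):
        self.triggered = True
        return self._v

class Local(Backend):
    def execute(self, generator, trigger_loop=False):
        # 'generator' is assumed here to yield (node, value) pairs.
        results = []
        for node, value in generator:
            if trigger_loop and hasattr(value, 'GetValue'):
                # Action results need the event loop; info results
                # (plain strings, lists) have no GetValue and pass through.
                node['value'] = value.GetValue()
            results.append(value)
        return results
```

With trigger_loop=False, info operations can return their fundamental-type results without ever starting the event loop.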

node_py = self.head_node
else:
- if node_py.operation.is_action():
+ if node_py.operation.is_action() or node_py.operation.is_info():
Owner:

What would this imply when we add support for instant actions? Something like this would maybe start to get too long:

if node_py.operation.is_action() or node_py.operation.is_info() or node_py.operation.is_instant_action():

Instead, could it be better to write the following?

if not node_py.operation.is_transformation():
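For context, the two styles being compared might look like this. The Operation class below is a toy sketch; the real PyRDF operation lists and classification differ.

```python
class Operation:
    # Toy classification for illustration; the real PyRDF lists differ.
    ACTIONS = {'Count', 'Histo1D'}
    INFOS = {'GetColumnNames', 'GetDefinedColumnNames',
             'GetColumnType', 'GetFilterNames'}
    TRANSFORMATIONS = {'Filter', 'Define'}

    def __init__(self, name):
        self.name = name

    def is_action(self):
        return self.name in self.ACTIONS

    def is_info(self):
        return self.name in self.INFOS

    def is_transformation(self):
        return self.name in self.TRANSFORMATIONS

op = Operation('GetColumnNames')
explicit = op.is_action() or op.is_info()   # grows with every new kind
concise = not op.is_transformation()        # single negated check
```

The negated check stays stable as new non-transformation kinds (e.g. instant actions) are added, at the cost of being less explicit about what the branch handles.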

Owner:

Also, this function is called get_action_nodes, so we need to remember to change its name afterwards.

Collaborator (Author):

I prefer the explicit expression; for now we do not require three conditions, and we could even include instant_actions under info (or a more suitable name) if we need to support them.

parent_node = pyroot_node

- if node_py.operation.is_action():
+ if node_py.operation.is_action() or node_py.operation.is_info():
Owner:

Same as above; maybe use if not self.node_py.operation.is_transformation():

Collaborator (Author):

Same answer as above. Indeed, this could be refactored to use a single function that returns both nodes and values.

# 'RResultPtr's of action nodes
- nodes[i].value = values[i].GetValue()
+ if trigger_loop and hasattr(value, 'GetValue'):
+ # Info actions do not have GetValue
Owner:

I wouldn't call them "info actions". Info operations do not trigger the event loop, and they have a different scope from the action operations.

# those should be in scope while doing
# a 'GetValue' call on them
- nodes[i].ResultPtr = values[i]
+ node.ResultPtr = value
Owner:

Not all of these outputs are RResultPtrs. Of all the "other operations" of the ROOT RDataFrame class, only Display returns an RResultPtr.

Owner:

Furthermore, for those nodes that actually return RResultPtrs, we are storing both the pointer and the pointed result. Why do we need that?

node_py.pyroot_node = pyroot_node

# The new pyroot_node becomes the parent_node for the next
# recursive call
Owner:

Now this part becomes a little fuzzy. While before we were actually returning PyROOT class instances in the form of lazy transformations and booked actions, now we can also return fundamental types like strings and vectors.
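The distinction raised here could be sketched as follows: transformations and booked actions come back as lazy proxy objects, while info operations now return plain Python values eagerly. All names below (Proxy, dispatch, INFO_OPS) are illustrative assumptions, not PyRDF's actual API.

```python
class Proxy:
    """Lazy handle to a booked node; evaluated only on demand."""
    def __init__(self, node):
        self.node = node

# Hypothetical list of operations that return fundamental types.
INFO_OPS = {'GetColumnNames', 'GetDefinedColumnNames',
            'GetColumnType', 'GetFilterNames'}

def dispatch(operation_name, result):
    if operation_name in INFO_OPS:
        # Info operations hand back fundamental types (strings,
        # vectors) immediately.
        return result
    # Transformations and actions stay lazy behind a proxy.
    return Proxy(result)
```

This mixed return contract is exactly what makes the code path "fuzzy": callers can no longer assume every result is a PyROOT-style lazy object.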


@abstractmethod
- def execute(self, generator):
+ def execute(self, generator, trigger_loop=False):
Owner:

I like this; now that I see it, I wish it had been added before. One counterargument to it, though: in this way we are creating pyroot nodes even when no computation is involved, aren't we?

I am thinking of a situation in which the user first wants to find out the names of the columns:

df = PyRDF.RDataFrame("some_tree","some_file.root")

colnames = df.GetColumnNames() # this will create pyroot nodes and return a list of strings

and then executes some operations

filter1 = df.Filter("valid c++ filter")
histo1 = filter1.Histo1D("col1")

histo1.Draw() # this triggers the event loop, recreating the same pyroot objects

A way to solve this would be to implement some checks in the mapper function of the CallableGenerator, e.g.

if not node_py.pyroot_node:
    RDFOperation...
    [...]
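The check suggested above amounts to caching the created pyroot node on the PyRDF node, so that repeated execute calls reuse it instead of rebuilding the graph. Below is a sketch under that assumption; NodePy, map_node, and the tuple stand-in for the ROOT-side node are all hypothetical.

```python
class NodePy:
    """Stand-in for a PyRDF graph node."""
    def __init__(self, operation):
        self.operation = operation
        self.pyroot_node = None   # cached ROOT-side node, None until built

created = []  # records how many ROOT-side nodes we actually create

def map_node(node_py):
    # Build the pyroot node only once; later event loops reuse it.
    if node_py.pyroot_node is None:
        created.append(node_py.operation)          # simulate the RDF call
        node_py.pyroot_node = ('pyroot', node_py.operation)
    return node_py.pyroot_node

node = NodePy('Filter')
first = map_node(node)    # creates the pyroot node
second = map_node(node)   # reuses the cached one
```

With this memoization, a GetColumnNames call followed by histo1.Draw() would walk the same graph without recreating the pyroot objects.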

return FriendInfo(friend_names, friend_file_names)

- def execute(self, generator):
+ def execute(self, generator, trigger_loop=True):
Owner:

Distributed info operations won't work now, will they? Similar to how Count doesn't work in Spark: it returns a fundamental type, so we can't merge it. We should add all info operations to the operations not supported by Dist.
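Marking the info operations as unsupported in the distributed backend could follow the pattern presumably already used for Count. The sketch below is an assumption: DistBackend, check_supported, and the exception type are illustrative, not PyRDF's real names.

```python
class DistBackend:
    # Operations whose results are fundamental types that cannot be
    # merged across workers (illustrative; the real PyRDF list may differ).
    UNSUPPORTED = {'Count', 'GetColumnNames', 'GetDefinedColumnNames',
                   'GetColumnType', 'GetFilterNames'}

    def check_supported(self, operation_name):
        # Fail early, before any work is distributed to the cluster.
        if operation_name in self.UNSUPPORTED:
            raise RuntimeError(
                operation_name + " is not supported in distributed mode")

backend = DistBackend()
backend.check_supported('Histo1D')   # mergeable action: passes
```

Mergeable results (histograms, sums) can be reduced across workers; a list of column names or a plain count from each worker has no meaningful merge, hence the early rejection.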

except TypeError as e:
    self.proxied_node.children.remove(newNode)
    raise e
return newNode.ResultPtr
Owner:

In this case newNode.ResultPtr will not hold an RResultPtr but one of the types returned by the "other operations" of ROOT RDataFrame, such as strings or vectors. Only Display is an exception, but it is not included in this PR.

Furthermore, while I get that creating a new node also for info operations guarantees, via the generator.execute function, that the same event loop is executed only once, I don't think these kinds of operations are suited to be "nodes" in the graph; they are just querying the dataframe for metadata.
