
Conversation

@vepadulano
Owner

The new run_graphs function is inspired by the logic of ROOT::RDF::RunGraphs. Its main use for now is to trigger the execution of multiple Spark jobs concurrently; each job is submitted in a different thread, as suggested by the Spark docs.

This implementation is open to suggestions, as are the name of the function and where it should live in PyRDF. A first test has been added to check that the functionality works.
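A minimal sketch of the threading approach described above. The proxy interface and the `GetValue` trigger method are assumptions for illustration, not the actual PyRDF API:

```python
import threading


def run_graphs(proxies):
    """Sketch: trigger each computation graph in its own thread.

    `proxies` is assumed to be a list of objects exposing a
    `GetValue()`-style method that triggers the event loop
    (hypothetical interface, not the actual PyRDF API).
    """
    # One thread per graph, so all Spark jobs are submitted concurrently.
    threads = [threading.Thread(target=proxy.GetValue) for proxy in proxies]
    for thread in threads:
        thread.start()
    # Wait until every event loop has finished before returning.
    for thread in threads:
        thread.join()
```

Spark's own documentation notes that jobs submitted from separate threads can run concurrently within one SparkContext, which is what this pattern relies on.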


@stwunsch stwunsch left a comment

Awesome! 🥳

@vepadulano vepadulano requested a review from etejedor January 12, 2021 08:23
logger = logging.getLogger(__name__)


def run_graphs(proxies, numthreads=multiprocessing.cpu_count()):
Collaborator

@etejedor etejedor Jan 12, 2021

Does this numthreads parameter really determine the number of event loops that are going to run in parallel?

Owner Author

Correct, the function spawns exactly numthreads Python threading.Thread objects. Each thread submits a separate Spark job, which in turn runs a distributed event loop on the cluster. So numthreads is also the number of event loops running concurrently on the cluster at any given time.

The main reason is to avoid spawning too many Python threads at once, for example in an analysis with more samples than available threads on the driver.
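One common way to cap the number of live threads is a fixed-size pool. This sketch (hypothetical helper name, assumed `GetValue` trigger) submits at most numthreads jobs at a time, even when the analysis has many more graphs than that:

```python
from concurrent.futures import ThreadPoolExecutor


def run_graphs_capped(proxies, numthreads=4):
    """Sketch: trigger the graphs through a pool of `numthreads` workers.

    The pool reuses its threads, so no more than `numthreads` Spark
    jobs are in flight at once; remaining graphs queue up until a
    worker becomes free. (`GetValue` is an assumed trigger method.)
    """
    with ThreadPoolExecutor(max_workers=numthreads) as pool:
        # map preserves the input order of the results.
        return list(pool.map(lambda proxy: proxy.GetValue(), proxies))
```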

Collaborator

Yes, I agree it's good to have it as a limiting factor. What I'm not so sure about is the default, since a priori the number of cores on my client machine is not necessarily related to the number of concurrent jobs I can submit to a particular Spark cluster.

Owner Author

Fair point, I just put that number in as a first approximation. I think it should be somewhere between 4 and 8; in any case it's a parameter, so a user could also ask for more threads to submit more concurrent jobs.

Owner Author

RunGraphs defaults to 4 concurrent job submissions for the moment

The RunGraphs function is inspired by ROOT::RDF::RunGraphs. In PyRDF, this function dispatches the concurrent execution of multiple computation graphs to the backend in use. If the backend doesn't implement this functionality, it falls back to running the graphs sequentially.
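The backend-dispatch-with-sequential-fallback idea could be sketched like this. The class and method names are illustrative only, not PyRDF's actual interface:

```python
from concurrent.futures import ThreadPoolExecutor


class BaseBackend:
    """Fallback behaviour: backends that don't override this method
    run the graphs one after another (hypothetical interface)."""

    def execute_graphs(self, proxies):
        return [proxy.GetValue() for proxy in proxies]


class SparkBackend(BaseBackend):
    """A backend that supports concurrency overrides the hook and
    submits the jobs from separate threads (sketch, not PyRDF code)."""

    def execute_graphs(self, proxies, numthreads=4):
        with ThreadPoolExecutor(max_workers=numthreads) as pool:
            return list(pool.map(lambda proxy: proxy.GetValue(), proxies))


def RunGraphs(proxies):
    # Assume all proxies share the same backend and let it decide
    # how (or whether) to run the graphs concurrently.
    backend = proxies[0].backend
    return backend.execute_graphs(proxies)
```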
@vepadulano vepadulano changed the title Trigger multiple graph executions with multithreading RunGraphs: Trigger the execution of multiple graphs Jan 14, 2021