
Conversation

@vepadulano
Owner

The new run_graphs function is inspired by the logic of ROOT::RDF::RunGraphs. Its main use for now is to trigger the execution of multiple Spark jobs concurrently; each job is submitted in a different thread, as suggested by the Spark docs.

This implementation is open to suggestions, as are the name of the function and where it should live in PyRDF. A first test has been added to check that the functionality works.
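A minimal sketch of the threading approach described above. The proxy interface and the `GetValue` trigger method are assumptions for illustration, not the actual PyRDF API:

```python
import threading


def run_graphs(proxies):
    """Sketch: trigger each computation graph in its own thread.

    `proxies` is assumed to be a list of objects exposing a
    `GetValue()`-style method that triggers the event loop
    (hypothetical interface, not the actual PyRDF API).
    """
    # One thread per graph, so all Spark jobs are submitted concurrently.
    threads = [threading.Thread(target=proxy.GetValue) for proxy in proxies]
    for thread in threads:
        thread.start()
    # Wait until every event loop has finished before returning.
    for thread in threads:
        thread.join()
```

Spark's own documentation notes that jobs submitted from separate threads can run concurrently within one SparkContext, which is what this pattern relies on.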


@stwunsch stwunsch left a comment

Awesome! 🥳

@vepadulano vepadulano requested a review from etejedor January 12, 2021 08:23
logger = logging.getLogger(__name__)


def run_graphs(proxies, numthreads=multiprocessing.cpu_count()):
Collaborator

@etejedor etejedor Jan 12, 2021

Does this numthreads parameter really determine the number of event loops that are going to run in parallel?

Owner Author

Correct, the function spawns exactly numthreads Python threading.Thread objects. Each thread submits a separate Spark job, which in turn runs a distributed event loop on the cluster. So numthreads is also the number of event loops running concurrently on the cluster at any given time.

The main reason is to avoid spawning too many Python threads at once, for example in an analysis with more samples than available threads on the driver.
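One common way to cap the number of live threads is a fixed-size pool. This sketch (hypothetical helper name, assumed `GetValue` trigger) submits at most numthreads jobs at a time, even when the analysis has many more graphs than that:

```python
from concurrent.futures import ThreadPoolExecutor


def run_graphs_capped(proxies, numthreads=4):
    """Sketch: trigger the graphs through a pool of `numthreads` workers.

    The pool reuses its threads, so no more than `numthreads` Spark
    jobs are in flight at once; remaining graphs queue up until a
    worker becomes free. (`GetValue` is an assumed trigger method.)
    """
    with ThreadPoolExecutor(max_workers=numthreads) as pool:
        # map preserves the input order of the results.
        return list(pool.map(lambda proxy: proxy.GetValue(), proxies))
```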

Collaborator

Yes, I agree it's good to have it as a limiting factor. What I'm not so sure about is the default, since a priori the number of cores on my client machine is not necessarily related to the number of concurrent jobs I can submit to a particular Spark cluster.

Owner Author

Fair point, I just put that number in as a first approximation. I think it should be somewhere between 4 and 8; in any case it's a parameter, so a user could also ask for more threads to submit more concurrent jobs.

Owner Author

RunGraphs defaults to 4 concurrent job submissions for the moment

The RunGraphs function is inspired by ROOT::RDF::RunGraphs. In PyRDF, this function dispatches the concurrent execution of multiple computation graphs to the backend in use. If the backend doesn't implement this functionality, it falls back to running the graphs sequentially.
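The backend-dispatch-with-sequential-fallback idea could be sketched like this. The class and method names are illustrative only, not PyRDF's actual interface:

```python
from concurrent.futures import ThreadPoolExecutor


class BaseBackend:
    """Fallback behaviour: backends that don't override this method
    run the graphs one after another (hypothetical interface)."""

    def execute_graphs(self, proxies):
        return [proxy.GetValue() for proxy in proxies]


class SparkBackend(BaseBackend):
    """A backend that supports concurrency overrides the hook and
    submits the jobs from separate threads (sketch, not PyRDF code)."""

    def execute_graphs(self, proxies, numthreads=4):
        with ThreadPoolExecutor(max_workers=numthreads) as pool:
            return list(pool.map(lambda proxy: proxy.GetValue(), proxies))


def RunGraphs(proxies):
    # Assume all proxies share the same backend and let it decide
    # how (or whether) to run the graphs concurrently.
    backend = proxies[0].backend
    return backend.execute_graphs(proxies)
```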
@vepadulano vepadulano changed the title Trigger multiple graph executions with multithreading RunGraphs: Trigger the execution of multiple graphs Jan 14, 2021