
Commit aff8c1d

[DF] Add distributed FromSpec documentation.
Parent commit: ba5375b

tree/dataframe/src/RDataFrame.cxx

Lines changed: 36 additions & 1 deletion
@@ -791,7 +791,6 @@ of the cluster schedulers supported by Dask (more information in the
~~~{.py}
import ROOT
from dask.distributed import Client
# In a Python script the Dask client needs to be initialized in a context
# Jupyter notebooks / Python sessions don't need this
if __name__ == "__main__":
@@ -839,6 +838,42 @@ if __name__ == "__main__":
Note that when processing a TTree or TChain dataset, the `npartitions` value should not exceed the number of clusters in
the dataset. The number of clusters in a TTree can be retrieved by typing `rootls -lt myfile.root` at a command line.
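
The cluster count can also be obtained programmatically. A minimal sketch, assuming a hypothetical file `myfile.root`
that contains a TTree named `mytree`:
~~~{.py}
import ROOT

f = ROOT.TFile.Open("myfile.root")
tree = f.Get("mytree")

# Each call to TClusterIterator::Next() returns the start entry of the next
# cluster: count clusters until the iterator runs past the end of the tree
it = tree.GetClusterIterator(0)
nclusters = 0
while it.Next() < tree.GetEntries():
    nclusters += 1
print(f"mytree has {nclusters} clusters")
~~~
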
### Distributed FromSpec

RDataFrame can also be built from a JSON sample specification file using the FromSpec function. In distributed mode, two
arguments need to be provided: the path to the specification `jsonFile` (the same as in the local RDF case) and an
additional `executor` argument which, in the same manner as for the RDataFrame constructors above, can be either a Spark
connection or a Dask client. If no second argument is given, the local version of FromSpec will be run. Below are examples
of FromSpec usage in distributed RDF with either the Spark or the Dask backend. For more information on the FromSpec
functionality itself please refer to the [FromSpec](\ref rdf-from-spec) documentation.
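
The examples below read a specification file called `myspec.json`. As a rough illustration only - the authoritative schema
and the full set of supported keys are described in the FromSpec documentation, and the sample, tree and file names here
are hypothetical - such a file could be written from Python as follows:
~~~{.py}
import json

# Hypothetical two-sample specification: each sample lists its trees, its
# files and, optionally, arbitrary metadata key-value pairs
spec = {
    "samples": {
        "sampleA": {
            "trees": ["events"],
            "files": ["sampleA.root"],
            "metadata": {"lumi": 1.0, "sample_category": "data"}
        },
        "sampleB": {
            "trees": ["events"],
            "files": ["sampleB.root"],
            "metadata": {"lumi": 0.5, "sample_category": "MC"}
        }
    }
}

with open("myspec.json", "w") as f:
    json.dump(spec, f, indent=3)
~~~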

Using Spark:
~~~{.py}
import pyspark
import ROOT

# Create a SparkContext with the configuration of your cluster
# ("appName" and "master" are placeholders to be filled in)
conf = pyspark.SparkConf().setAppName(appName).setMaster(master)
sc = pyspark.SparkContext(conf=conf)

# The FromSpec function accepts the SparkContext through the optional
# "executor" parameter and will distribute the application to the connected cluster
df_fromspec = ROOT.RDF.Experimental.FromSpec("myspec.json", executor=sc)
# Proceed as usual
df_fromspec.Define("x", "someoperation").Histo1D(("name", "title", 10, 0, 10), "x")
~~~

Using Dask:
~~~{.py}
import ROOT
from dask.distributed import Client

if __name__ == "__main__":
    client = Client("dask_scheduler.domain.com:8786")

    # The FromSpec function accepts the Dask Client object through the optional
    # "executor" parameter
    df_fromspec = ROOT.RDF.Experimental.FromSpec("myspec.json", executor=client)
    # Proceed as usual
    df_fromspec.Define("x", "someoperation").Histo1D(("name", "title", 10, 0, 10), "x")
~~~
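
Once the dataframe is built from a specification file, the per-sample metadata can be exploited in the analysis. A brief
sketch, continuing the example above and assuming that the spec defines a "lumi" metadata field and that the ROOT version
in use supports DefinePerSample in distributed mode:
~~~{.py}
# Continuing from df_fromspec above: the rdfsampleinfo_ magic variable gives
# access to the metadata of the sample each entry belongs to ("lumi" is
# assumed to be defined in myspec.json)
df_lumi = df_fromspec.DefinePerSample("lumi", 'rdfsampleinfo_.GetD("lumi")')
~~~
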
### Distributed Snapshot
The Snapshot operation behaves slightly differently when executed in distributed mode. First off, it requires the path
