Description
Currently, HistProducerFileTask can quickly end up submitting a very large number of jobs.
The job (branch) map scales as nInputFiles * nPlots.
Since it is common to have thousands of input files (each dataset is split down to individual nano files) and tens of plots, the branch map can become very large; see the sketch after the status output below.
(flaf_env) [daebi@lxplus930 HH_bbWW]$ law run HistProducerFileTask --period Run3_2022 --version may22 --print-status 0,1
print task status with max_depth 0 and target_depth 1
0 > HistProducerFileTask(effective_workflow=htcondor, branch=-1, version=may22, period=Run3_2022, customisations=, test=False, n_cpus=1, workflow=htcondor)
jobs: LocalFileTarget(fs=local_fs, path=/afs/cern.ch/work/d/daebi/diHiggs/HH_bbWW/data/HistProducerFileTask/may22/Run3_2022/htcondor_jobs_0To12780.json, optional)
existent
collection: TargetCollection(len=12780, threshold=12780.0)
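For context, here is a minimal sketch (not the actual HistProducerFileTask code; the function and argument names are assumptions) of how a branch map built as the cartesian product of input files and plots reaches this size:

```python
# Hypothetical sketch of a branch map built as the cartesian product of
# input files and plots; not taken from the actual HistProducerFileTask code.
import itertools

def create_branch_map(input_files, plots):
    """Map each branch index to one (input file, plot) pair."""
    return {
        i: {"file": f, "plot": p}
        for i, (f, p) in enumerate(itertools.product(input_files, plots))
    }

# Thousands of nano files times tens of plots quickly reaches the ~12780
# branches reported by --print-status above.
```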
The accepted solution was to use --tasks-per-job 10 or so, reducing the number of condor jobs by a factor of 10.
But this has another issue: since --tasks-per-job groups adjacent branches, it bundles similar files together, in order.
The problem is that a job full of quick 'small' files finishes very quickly, while a job full of slow 'large' files finishes very, very slowly.
Take TTto4Q vs TTto2L2Nu as an example -- this method will put 10 TTto4Q files together, which finish quickly because almost no events pass the lepton requirements, while 10 TTto2L2Nu files grouped together finish very slowly because each file has many events passing the lepton requirements.
I do not have a solution to this, but any better branch organization would help. Even running --print-status 0,1 takes 20 minutes, since it must index roughly 13,000 jobs. Could HistProducerFileTask run on a hadded file for the whole dataset instead? Or could --tasks-per-job mix branches randomly to smooth out per-job runtimes? A rough sketch of the latter idea is below.
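To illustrate that second idea, here is a minimal sketch (not FLAF or law code; the function and argument names are assumptions) of grouping branches into jobs after a deterministic shuffle, so that fast and slow datasets end up mixed within each job:

```python
# Hypothetical sketch of a grouping strategy that mixes branches from
# different datasets into each condor job, instead of chunking them in order.
import random

def grouped_branches(branch_map, tasks_per_job, seed=0):
    """Shuffle branch ids before chunking so fast and slow datasets mix."""
    branch_ids = list(branch_map)
    random.Random(seed).shuffle(branch_ids)  # seeded, so the grouping is reproducible
    return [
        branch_ids[i:i + tasks_per_job]
        for i in range(0, len(branch_ids), tasks_per_job)
    ]

# Each resulting group holds ~tasks_per_job branches drawn from across datasets,
# so a single job is unlikely to get 10 slow TTto2L2Nu files at once.
```

A round-robin over datasets would give a similar effect if a fixed, non-random ordering is preferred.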