FakeFactor framework for the estimation of jets misidentified as taus, using PyROOT.
The environment can be set up with conda via

    conda env create --file environment.yaml

This framework is designed for n-tuples produced with CROWN as input.
All information for the preselection step is defined in a configuration file in the configs/ folder.
The preselection config has the following parameters:
- The expected input folder structure is `NTUPLE_PATH/ERA/SAMPLE_TAG/CHANNEL/*.root`

  | parameter | type | description |
  | --- | --- | --- |
  | `ntuple_path` | string | absolute path to the folder with the n-tuples on the dcache, a remote path is expected like "root://cmsxrootd-kit.gridka.de//store/user/USER/..." |
  | `era` | string | data taking era ("2018", "2017", "2016preVFP", "2016postVFP") |
  | `channel` | string | tau pair decay channels ("et", "mt", "tt") |
  | `tree` | string | name of the tree in the n-tuple files ("ntuple" in CROWN) |
  | `analysis` | string | analysis name, needed to get the output features which are saved/needed for the later steps e.g. "smhtt_ul" |

- The output folder structure is `OUTPUT_PATH/preselection/ERA/CHANNEL/*.root`

  | parameter | type | description |
  | --- | --- | --- |
  | `output_path` | string | absolute path where the files with the preselected events will be stored, a local path is expected like "/ceph/USER/..." |
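The two folder structures above can be illustrated with a few lines of Python. This is only a sketch of how the placeholders fit together, not the framework's own code; the function names are hypothetical, and `USER` and `SAMPLE_TAG` are placeholders rather than real names.

```python
def input_file_pattern(ntuple_path, era, sample_tag, channel):
    """Build NTUPLE_PATH/ERA/SAMPLE_TAG/CHANNEL/*.root as a plain string."""
    # Plain string joining keeps the double slash of remote
    # "root://...//store" paths intact.
    return "/".join([ntuple_path.rstrip("/"), era, sample_tag, channel, "*.root"])


def output_folder(output_path, era, channel):
    """Build OUTPUT_PATH/preselection/ERA/CHANNEL."""
    return "/".join([output_path.rstrip("/"), "preselection", era, channel])


# Placeholder values only; USER and SAMPLE_TAG are not real names.
pattern = input_file_pattern(
    "root://cmsxrootd-kit.gridka.de//store/user/USER", "2018", "SAMPLE_TAG", "mt"
)
folder = output_folder("/ceph/USER", "2018", "mt")
print(pattern)
print(folder)
```

Printing both strings before a run is a cheap way to confirm that the config points to the intended input and output locations.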
- In `processes` all the processes that should be preprocessed are defined. The names are also used for the output file naming after the processing. Each process needs two specifications:

  | parameter | type | description |
  | --- | --- | --- |
  | `tau_gen_modes` | list | split of the events corresponding to the origin of the hadronic tau |
  | `samples` | list | list of all sample tags corresponding to the specific process |

  The `tau_gen_modes` have the following modes:

  | parameter | type | description |
  | --- | --- | --- |
  | `T` | string | genuine tau |
  | `J` | string | jet misidentified as a tau |
  | `L` | string | lepton misidentified as a tau |
  | `all` | string | if no split should be performed |
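As an illustration of the `processes` block, the following sketch writes it as a Python dictionary (the equivalent of the YAML content) and checks it against the rules listed above. The process names, the sample tags and the helper `check_processes` are hypothetical and not part of the framework.

```python
ALLOWED_TAU_GEN_MODES = {"T", "J", "L", "all"}


def check_processes(processes):
    """Return a list of human-readable problems found in a `processes` block."""
    problems = []
    for name, spec in processes.items():
        modes = spec.get("tau_gen_modes")
        samples = spec.get("samples")
        if not isinstance(modes, list) or not modes:
            problems.append(f"{name}: 'tau_gen_modes' must be a non-empty list")
        else:
            unknown = [m for m in modes if m not in ALLOWED_TAU_GEN_MODES]
            if unknown:
                problems.append(f"{name}: unknown tau_gen_modes {unknown}")
        if not isinstance(samples, list) or not samples:
            problems.append(f"{name}: 'samples' must be a non-empty list")
    return problems


# Hypothetical process names and sample tags, for illustration only.
processes = {
    "DYjets": {"tau_gen_modes": ["T", "J", "L"], "samples": ["DY_SAMPLE_TAG"]},
    "data": {"tau_gen_modes": ["all"], "samples": ["DATA_SAMPLE_TAG"]},
}
print(check_processes(processes))
```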
- In `event_selection` the parameters for all selections that should be applied are defined. This is a dictionary of cuts where the key is the name of a cut and the value is the cut itself as a string, e.g. `had_tau_pt: "pt_2 > 30"`. The name of a cut can be chosen freely; it is only used for the output information in the terminal. A cut can only use variables which are in the n-tuples.
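To make the structure of such a cut dictionary concrete, here is a small sketch that joins all cuts into one selection string. It assumes that the cuts are combined with a logical AND in C++ syntax, as is common for ROOT filter expressions; whether the framework applies them one by one or as a single expression is not stated here. The second cut and the helper name are hypothetical.

```python
def combine_cuts(event_selection):
    """Join all cut strings with a logical AND, wrapping each in parentheses."""
    return " && ".join(f"({cut})" for cut in event_selection.values())


event_selection = {
    "had_tau_pt": "pt_2 > 30",           # example taken from the text above
    "had_tau_eta": "abs(eta_2) < 2.3",   # hypothetical cut and variable name
}

# The cut names only serve as labels, e.g. for terminal output.
for name, cut in event_selection.items():
    print(f"applying {name}: {cut}")

selection = combine_cuts(event_selection)
print(selection)
```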
- In `mc_weights` all weights that should be applied for simulated samples are defined. There are two types of weights.
  - Like for `event_selection`, a weight can be specified directly and is then applied to all samples the same way, e.g. `lep_id: "id_wgt_mu_1"`
  - Some weights are either sample specific or need additional information. Currently implemented options are:

    | parameter | type | description |
    | --- | --- | --- |
    | `generator` | string | `""` if a normal generator weight should be applied to all samples; if `"stitching"`, a special stitching weight is applied for DY+jets and W+jets |
    | `lumi` | string | luminosity scaling; this depends on the era and uses the `era` parameter of the config to get the correct weight, so the content of the string is not relevant |
    | `Z_pt_reweight` | string | reweighting of the Z boson pt, the weight in the ntuple is used and only applied to DY+jets |
    | `Top_pt_reweight` | string | reweighting of the top quark pt, the weight in the ntuple is used and only applied to ttbar |
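The distinction between weights applied to all samples and process-specific weights can be sketched as follows. This is a simplified illustration under assumptions, not the framework's logic: the process names `DYjets` and `ttbar`, the helper `applicable_weights` and the weight strings marked as placeholders are hypothetical.

```python
# Weights that, according to the table above, are only applied to one process.
# The process names used as values are assumptions for this sketch.
PROCESS_RESTRICTED = {"Z_pt_reweight": "DYjets", "Top_pt_reweight": "ttbar"}


def applicable_weights(mc_weights, process):
    """Return the names of the configured weights that apply to `process`."""
    names = []
    for name in mc_weights:
        restricted_to = PROCESS_RESTRICTED.get(name)
        if restricted_to is None or restricted_to == process:
            names.append(name)
    return names


mc_weights = {
    "generator": "stitching",
    "lumi": "",
    "Z_pt_reweight": "ZPT_WEIGHT_PLACEHOLDER",    # placeholder, not a real branch name
    "Top_pt_reweight": "TOPPT_WEIGHT_PLACEHOLDER",  # placeholder, not a real branch name
    "lep_id": "id_wgt_mu_1",                      # example taken from the text above
}

for process in ("DYjets", "ttbar", "Wjets"):
    print(process, applicable_weights(mc_weights, process))
```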
- In `emb_weights` all weights that should be applied for embedded samples are defined. Like for `event_selection`, a weight can be specified directly and is then applied to all samples the same way, e.g. `single_trigger: "trg_wgt_single_mu24ormu27"`

Scale factors for b-tagging and tau ID vs jet are applied on the fly during the FF calculation step.
To run the preselection step, execute the python script and specify the config file (relative path possible):

    python preselection.py --config-file PATH/CONFIG.yaml

In this step the fake factors are calculated. This should be run after the preselection step.
All information for the FF calculation step is defined in a configuration file in the configs/ folder.
The FF calculation config has the following parameters:
- The expected input folder structure is `FILE_PATH/preselection/ERA/CHANNEL/*.root`

  | parameter | type | description |
  | --- | --- | --- |
  | `file_path` | string | absolute path to the folder with the preselected files |
  | `era` | string | data taking era ("2018", "2017", "2016preVFP", "2016postVFP") |
  | `channel` | string | tau pair decay channels ("et", "mt", "tt") |
  | `tree` | string | name of the tree in the preselected files (same as in preselection e.g. "ntuple") |

- The output folder structure is `workdir/WORKDIR_NAME/ERA/fake_factors/CHANNEL/outputfiles`

  | parameter | type | description |
  | --- | --- | --- |
  | `workdir_name` | string | relative path where the output files will be stored |

- General options for the calculation:

  | parameter | type | description |
  | --- | --- | --- |
  | `use_embedding` | bool | `True` if embedded samples should be used, `False` if only MC samples should be used |
- In `target_processes` the processes for which FFs should be calculated (normally for QCD, Wjets, ttbar) are defined. Each target process needs some specifications:

  | parameter | type | description |
  | --- | --- | --- |
  | `split_categories` | dict | names of variables for the fake factor measurement in different phase space regions |
  | `split_categories_binedges` | dict | bin edge values for each `split_categories` variable; the number of bin edges should always be N(variable cuts)+1 |
  | `SRlike_cuts` | dict | event selections for the signal-like region of the target process |
  | `ARlike_cuts` | dict | event selections for the application-like region of the target process |
  | `SR_cuts` | dict | event selections for the signal region (normally only needed for ttbar) |
  | `AR_cuts` | dict | event selections for the application region (normally only needed for ttbar) |
  | `var_dependence` | string | variable the FF measurement should depend on (normally pt of the hadronic tau e.g. "pt_2") |
  | `var_bins` | list | bin edges for the variable specified in `var_dependence` |

  Notes on `split_categories`:
  - the FF measurement can be split based on variables in 1D or 2D (1 or 2 variables)
  - each category/variable has a `list` of orthogonal cuts (e.g. "njets" with "==1", ">=2")
  - implemented split variables are "njets", "nbtag" or "deltaR_ditaupair"
  - at least one inclusive category needs to be specified

  Event selections can be defined the same way as in the preselection step `event_selection`. Only the tau vs jet ID cut is special because the name should always be `had_tau_id_vs_jet` (or `had_tau_id_vs_jet_*` in the tt channel); this is needed to read out the working points from the cut string and apply the correct tau vs jet ID weights.
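The relation between `split_categories` and `split_categories_binedges` can be checked with a short sketch. It verifies the "N(variable cuts)+1" rule and lists the resulting 1D or 2D category selections. The helper names and the bin edge values are illustrative assumptions, and the way the framework builds its category cuts internally may differ.

```python
from itertools import product


def check_bin_edges(split_categories, split_categories_binedges):
    """Raise if a variable does not have exactly N(cuts) + 1 bin edges."""
    for var, cuts in split_categories.items():
        edges = split_categories_binedges[var]
        if len(edges) != len(cuts) + 1:
            raise ValueError(
                f"{var}: expected {len(cuts) + 1} bin edges, got {len(edges)}"
            )


def category_cuts(split_categories):
    """List one selection string per category (1D) or per combination (2D)."""
    variables = list(split_categories)
    if len(variables) not in (1, 2):
        raise ValueError("the split is only possible in 1D or 2D")
    combos = product(*(split_categories[v] for v in variables))
    return [
        " && ".join(f"{v} {c}" for v, c in zip(variables, combo))
        for combo in combos
    ]


# Illustrative values only.
split_categories = {"njets": ["==0", "==1", ">=2"], "nbtag": ["==0", ">=1"]}
split_categories_binedges = {"njets": [-0.5, 0.5, 1.5, 2.5], "nbtag": [-0.5, 0.5, 1.5]}

check_bin_edges(split_categories, split_categories_binedges)
for cut in category_cuts(split_categories):
    print(cut)
```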
- In `process_fractions` specifications for the calculation of the process fractions are defined.

  | parameter | type | description |
  | --- | --- | --- |
  | `processes` | list | sample names (from the preprocessing step) of the processes for which the fractions should be stored in the correctionlib json; the sum of fractions of the specified samples is 1 |
  | `split_categories` | dict | see `target_processes` (only in 1D) |
  | `AR_cuts` | list | see `target_processes` |
  | `SR_cuts` | list | see `target_processes`, (optional) not needed for the fraction calculation |
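The statement that the fractions of the specified processes sum to 1 corresponds to a simple normalization. The sketch below shows that arithmetic for one category with made-up yields; the process names and numbers are purely illustrative, and the framework's actual calculation (per category, in the application region) is not reproduced here.

```python
def process_fractions(yields):
    """Normalize per-process yields so that the fractions sum to 1."""
    total = sum(yields.values())
    if total <= 0:
        raise ValueError("the total yield must be positive")
    return {process: value / total for process, value in yields.items()}


# Made-up yields for a single category, for illustration only.
yields = {"QCD": 50.0, "Wjets": 30.0, "ttbar_J": 20.0}
fractions = process_fractions(yields)
print(fractions)
```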

To run the FF calculation step, execute the python script and specify the config file (relative path possible):

    python ff_calculation.py --config-file PATH/CONFIG.yaml

In this step the corrections for the fake factors are calculated. This should be run after the FF calculation step.
Currently two different correction types are implemented:
- non closure correction depending on a specific variable
- DR to SR interpolation correction depending on a specific variable
All information for the FF correction calculation step is defined in a configuration file in the configs/ folder. Additional information is loaded from the used config in the previous FF calculation step (this is done automatically).
The FF correction config has the following parameters:
- The expected input folder structure is `workdir/WORKDIR_NAME/ERA/fake_factors/CHANNEL/*`

  | parameter | type | description |
  | --- | --- | --- |
  | `workdir_name` | string | the name of the work directory for which the corrections should be calculated (normally the same as in the FF calculation step) |
  | `era` | string | data taking era ("2018", "2017", "2016preVFP", "2016postVFP") |
  | `channel` | string | tau pair decay channels ("et", "mt", "tt") |
- In `target_processes` the processes for which FF corrections should be calculated (normally for QCD, Wjets, ttbar) are defined. Each target process needs some specifications:

  | parameter | type | description |
  | --- | --- | --- |
  | `non_closure` | dict | one or two non closure corrections can be specified, indicated by the variable the correction should be calculated for (e.g. `leading_lep_pt`); if more than one correction is specified, `leading_lep_pt` should come first (due to code specifics) because the second correction is calculated with the first already applied |
  | `DR_SR` | dict | this correction should be specified only once per process in `target_processes` |

  Each correction has the following specifications:

  | parameter | type | description |
  | --- | --- | --- |
  | `var_dependence` | string | variable the FF correction measurement should depend on (e.g. `"pt_1"` for "leading_lep_pt") |
  | `var_bins` | list | bin edges for the variable specified in `var_dependence` |
  | `SRlike_cuts` | dict | event selections for the signal-like region of the target process that should be replaced compared to the selection used in the previous FF calculation step |
  | `ARlike_cuts` | dict | event selections for the application-like region of the target process that should be replaced compared to the selection used in the previous FF calculation step |
  | `AR_SR_cuts` | dict | event selections for a switch from the determination region to the signal/application region; this is only relevant for `DR_SR` corrections |
  | `non_closure` | dict | this is only relevant for `DR_SR` corrections; since additional fake factors are calculated for these corrections, it is possible to calculate and apply non closure corrections to these fake factors before calculating the actual DR to SR correction |
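The ordering rule for `non_closure` corrections can be checked before running the step. The following sketch assumes the config has been loaded into a Python dictionary that preserves the order of the file; the helper name and the second correction name are hypothetical.

```python
def check_non_closure_order(non_closure):
    """Return the correction names in order, enforcing the documented rules."""
    names = list(non_closure)  # dicts keep insertion order (Python 3.7+)
    if len(names) > 2:
        raise ValueError("at most two non closure corrections are expected")
    if len(names) == 2 and "leading_lep_pt" in names and names[0] != "leading_lep_pt":
        raise ValueError("'leading_lep_pt' should be the first non closure correction")
    return names


non_closure = {
    "leading_lep_pt": {"var_dependence": "pt_1"},
    # Hypothetical second correction, for illustration only.
    "SECOND_CORRECTION": {"var_dependence": "SECOND_VARIABLE"},
}
print(check_non_closure_order(non_closure))
```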

To run the FF correction step, execute the python script and specify the config file (relative path possible):

    python ff_corrections.py --config-file PATH/CONFIG.yaml

An optional parameter is `--only-main-corrections`. With this parameter the precalculation step for the DR to SR corrections is skipped. This is helpful if the precalculation step is already done.
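For readers who want to wrap or mimic this command line, the two options could be parsed with `argparse` roughly as below. This is a sketch only: how `ff_corrections.py` actually defines its options is not shown in this document, and treating `--only-main-corrections` as a boolean switch is an assumption.

```python
import argparse


def build_parser():
    """Build a parser with the two options named in the text above."""
    parser = argparse.ArgumentParser(description="FF correction step (sketch)")
    parser.add_argument(
        "--config-file", required=True, help="path to the config file"
    )
    parser.add_argument(
        "--only-main-corrections",
        action="store_true",  # assumption: a switch without a value
        help="skip the precalculation step for the DR to SR corrections",
    )
    return parser


# Parse an explicit argument list instead of sys.argv to keep the sketch self-contained.
args = build_parser().parse_args(
    ["--config-file", "PATH/CONFIG.yaml", "--only-main-corrections"]
)
print(args.config_file, args.only_main_corrections)
```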
- Check out `configs/general_definitions.py`; this file has many relevant definitions for preselection, plotting or correctionlib output information.
- Check `ntuple_path` and `output_path` (preselection) or `file_path` and `workdir_name` (fake factors, corrections) in the used config files to avoid wrong inputs or outputs.