Skip to content

Conversation

@Evildoor
Copy link
Contributor

Split extract_scope() in two functions: specific operation for dealing with composite dataset names and more generic scope extraction.

@Evildoor Evildoor self-assigned this Aug 11, 2020
@Evildoor Evildoor requested a review from mgolosova August 11, 2020 09:45
@Evildoor Evildoor mentioned this pull request Aug 11, 2020
Copy link
Collaborator

@mgolosova mgolosova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. New function remove_explicit_scope is a bit wierd.
    It's description says that it "removes ... from the dataset name", while in fact it "tries to split 'mixed' ds name into scope and pure ds name, and sets scope to None if the ds name is not 'mixed'".

  2. On the other hand, the second function (extract_scope) will fail if passed ds name is 'mixed'. In other words, you do not provide a single function that will return the scope: if one needs scope, he/she will have to use both functions explicitly. If extract_scope was using the remove_explicit_scope "under the hood", it would be better -- but still the first comment above would be fair.

My suggestion is to split things into separate (and independently usable) functions like these:

def get_dataset_name(dsn):
    """ Get 'pure' DS name (not mixed with e.g. scope prefix). """
    <...>


def get_dataset_scope(dsn):
    """ Get dataset scope from DS name (mixed or pure). """
    <...>

Maybe use different names, it's not the point here.

@Evildoor
Copy link
Contributor Author

  1. On the other hand, the second function (extract_scope) will fail if passed ds name is 'mixed'. In other words, you do not provide a single function that will return the scope: if one needs scope, he/she will have to use both functions explicitly. If extract_scope was using the remove_explicit_scope "under the hood", it would be better -- but still the first comment above would be fair.

My suggestion is to split things into separate (and independently usable) functions like these:

def get_dataset_name(dsn):
    """ Get 'pure' DS name (not mixed with e.g. scope prefix). """
    <...>


def get_dataset_scope(dsn):
    """ Get dataset scope from DS name (mixed or pure). """
    <...>

Maybe use different names, it's not the point here.

I would've agreed if both types of dataset names would've been (more or less) equal, but they are not, in my understanding.
'Mixed' dataset names consist of a normal dataset name and a scope explicitly attached to it. To begin with, ATLAS dataset nomenclature forbids : in dataset names, so are these really valid? [1] Will other stages, such as 095, be able to correctly process records with them? At the moment they don't have to because extract_scope() in 091 turns 'mixed' names into normal ones - and this, in my opinion, is the right thing to do, to replace these names ASAP and only work with normal ones after it.

If this is correct, then we need a function to turn a 'mixed' name into normal one, and only need it in 091, because we are not getting any new dataset names after 091. This function also gets the scope in the process as a byproduct. We also need a function to get a scope from a normal name, and this function should be used in 091 after the first one (when necessary). It can also be used in 095, or wherever it is needed after 091.

By the way, both our samples include zero occurences of 'mixed' names.

The same logic applies...

  1. New function remove_explicit_scope is a bit wierd.
    It's description says that it "removes ... from the dataset name", while in fact it "tries to split 'mixed' ds name into scope and pure ds name, and sets scope to None if the ds name is not 'mixed'".

... Here. Name/description can be altered a bit, but the main goal of the function is 'purification' of a name, scope is a byproduct. In theory we can throw this scope away, and use the other function for 'purified' names as well.

[1] It seems that what we call a 'mixed' name here is called DID in Rucio:

https://readthedocs.org/projects/rucio/downloads/pdf/next/ , point 1.1.2.

Shame that they don't provide any examples of complete filenames to compare them to our logic...

@mgolosova
Copy link
Collaborator

@Evildoor,

  1. the second function (extract_scope) will fail if passed ds name is 'mixed'.

My suggestion is to split things into separate (and independently usable) functions like these:

def get_dataset_name(dsn):
    """ Get 'pure' DS name (not mixed with e.g. scope prefix). """
    <...>


def get_dataset_scope(dsn):
    """ Get dataset scope from DS name (mixed or pure). """
    <...>

I would've agreed if both types of dataset names would've been (more or less) equal, but they are not, in my understanding.
'Mixed' dataset names consist of a normal dataset name and a scope explicitly attached to it. To begin with, ATLAS dataset nomenclature forbids : in dataset names, so are these really valid? [1]

If we get such values in fields we expect to contain dataset names -- they are valid: if some ATLAS system has accepted this value as a valid dataset name, then in terms of original task metadata it is a valid value.

Will other stages, such as 095, be able to correctly process records with them?

Hint: to answer this question one can check the stages' code ;)

At the moment they don't have to because extract_scope() in 091 turns 'mixed' names into normal ones

...which it does not, in fact. Stage 091 uses "purified" DS name only to query Rucio, that's it. For the output messages it uses values that were obtained from the input ones:

It is possible that handling of : did not came from some realistic input metadata, but from some documentation read by the function's author, I do not know for sure.
Hint: to make sure of this, one can query our ES and/or ProdSys DB for DS names with : ;) And check the commit that adds : handling, if it was not before we get to GitHub. Just in case the author was so nice to explain the origins of his/her decisions.

So what you say below:

By the way, both our samples include zero occurences of 'mixed' names.

is fair -- but these samples are not the ultimate source of information in such cases.

this <...> is the right thing to do, to replace these names ASAP and only work with normal ones after it.

Fine, let's find out how things are going now and keep/introduce (if actually needed) this functionality -- here/in another PR, respectively.

If this is correct, then we need a function to turn a 'mixed' name into normal one, and only need it in 091, because we are not getting any new dataset names after 091.

Fine; let it stay in 091, I won't mind for now.

This function also gets the scope in the process as a byproduct.

Normally, people do not expect any "byproducts" to be returned -- they expect some... expected result. For example: functipon str.strip() does not return the length of original string, or number of removed whitespace symbols -- or even two numbers, for those removed from the left and from the right. It can, I bet, and this information cannot be obtained from the function's result, just as in this case -- but no one expects it to do so. If this information is needed -- the original string is processed independently from the strip() function call. It is not intuitive and user-friendly, to return tuple (or any other structure) with some additional "byproduct" values when one calls a function which name and even brief description both claim that it merely transforms a string.

We also need a function to get a scope from a normal name, and this function should be used in 091 after the first one (when necessary). It can also be used in 095, or wherever it is needed after 091.

What we need, originally, is to get scope from the value we have. The value itself may be a string with a fair or extended DS name, or None,or invalid string, or not even a string/None object. How paranoid we are and what we'll check is defined by the probability of this or that situation, as well as the amount of time we can (and want to) spend on this question.

If we want things to be totally normalized (atomic functions with intuitive responses) in terms of splitting ds name processing logic into functions, then we have to define a set of functions:

  • normalize_ds_name() -- turn "extended" name to "normal" or leave "normal" one as is (and throw an exception if the passed value is not a string with DS name, normal or extended);
  • get_scope_from_fair_ds_name() -- must throw en exception if ds name is not a string containing the "normal" ds name;
  • get_scope_from_extended_ds_name() -- same for the "extended" one.

(+ something like is_{fair,extended}_ds_name(), but it's not the point here).

For the sake of usability, very soon we will decide to add one more function:

  • get_scope_from_fair_or_extended_ds_name().

What I suggest is to de-normalize things and leave only the first and the last one -- to speed up the development process. Why is it fine to do:

  • the internal logic of get_scope_from_{fair,extended}_ds_name() is not that complicated so things won't get less readable and more confusing;
  • for now I do not know in which situation (except for the Rucio query) it will be critical to know for sure if we're working with the extended name or the fair one. And by "I do not know" I mean that I saw no errors related to this issue. Maybe because there indeed are no "extended" names in original data and this conversation has no sence at all.

The same logic applies...

...and deserves same answer as above.

[1] It seems that what we call a 'mixed' name here is called DID in Rucio:

https://readthedocs.org/projects/rucio/downloads/pdf/next/ , point 1.1.2.

Shame that they don't provide any examples of complete filenames to compare them to our logic...

Use the force:

$ VIRTUAL_ENV=~dkb/.rucio rucio --help | grep file
<...>
    list-files          List DID contents
<...>
$ VIRTUAL_ENV=~dkb/.rucio rucio list-files mc15_13TeV:mc15_13TeV.448000.MGPy8EG_A14NNPDF23LO_TT_RPVdirectBL_1000_tau_0p01ns.evgen.EVNT.e6768_tid14381408_00
+----------------------------------------------+--------------------------------------+-------------+------------+----------+
| SCOPE:NAME                                   | GUID                                 | ADLER32     | FILESIZE   |   EVENTS |
|----------------------------------------------+--------------------------------------+-------------+------------+----------|
| mc15_13TeV:EVNT.14381408._000002.pool.root.1 | F052A56B-A1F4-EB4C-B9A1-01055FCB9AA4 | ad:b9afe23a | 66.087 MB  |     2000 |
+----------------------------------------------+--------------------------------------+-------------+------------+----------+
Total files : 1
Total size : 66.087 MB
Total events : 2000

The function, in fact, performs two operations:
- Normalizes dataset name, if necessary. Returns name either way.
- Determines and returns scope.
@Evildoor
Copy link
Contributor Author

  1. the second function (extract_scope) will fail if passed ds name is 'mixed'.

My suggestion is to split things into separate (and independently usable) functions like these:

def get_dataset_name(dsn):
    """ Get 'pure' DS name (not mixed with e.g. scope prefix). """
    <...>


def get_dataset_scope(dsn):
    """ Get dataset scope from DS name (mixed or pure). """
    <...>

I would've agreed if both types of dataset names would've been (more or less) equal, but they are not, in my understanding.
'Mixed' dataset names consist of a normal dataset name and a scope explicitly attached to it. To begin with, ATLAS dataset nomenclature forbids : in dataset names, so are these really valid? [1]

If we get such values in fields we expect to contain dataset names -- they are valid: if some ATLAS system has accepted this value as a valid dataset name, then in terms of original task metadata it is a valid value.

Will other stages, such as 095, be able to correctly process records with them?

Hint: to answer this question one can check the stages' code ;)

At the moment they don't have to because extract_scope() in 091 turns 'mixed' names into normal ones

...which it does not, in fact. Stage 091 uses "purified" DS name only to query Rucio, that's it. For the output messages it uses values that were obtained from the input ones:

My mistake then, sorry.

It is possible that handling of : did not came from some realistic input metadata, but from some documentation read by the function's author, I do not know for sure.
Hint: to make sure of this, one can query our ES and/or ProdSys DB for DS names with : ;) And check the commit that adds : handling, if it was not before we get to GitHub. Just in case the author was so nice to explain the origins of his/her decisions.

I failed to find any datasets with : in them in our ES, and while the code is added and changed already on GitHub, there are no explanations that answer this question. There is, however, 3931d66:

"Fortunately, we didn't have dataset names like scope:name in the stage input. At all ;)"

All things considered, this : case looks more puzzling than I expected and I'd like to leave this puzzle for some other time (note to self - make a trello card when this PR is concluded), while agreeing to your suggestion of two functions working with both name types and...

If this is correct, then we need a function to turn a 'mixed' name into normal one, and only need it in 091, because we are not getting any new dataset names after 091.

Fine; let it stay in 091, I won't mind for now.

... Keeping name normalization in 091 and moving scope extraction to the library in another PR, either new one or #284.

[1] It seems that what we call a 'mixed' name here is called DID in Rucio:
https://readthedocs.org/projects/rucio/downloads/pdf/next/ , point 1.1.2.
Shame that they don't provide any examples of complete filenames to compare them to our logic...

Use the force:

$ VIRTUAL_ENV=~dkb/.rucio rucio --help | grep file
<...>
    list-files          List DID contents
<...>
$ VIRTUAL_ENV=~dkb/.rucio rucio list-files mc15_13TeV:mc15_13TeV.448000.MGPy8EG_A14NNPDF23LO_TT_RPVdirectBL_1000_tau_0p01ns.evgen.EVNT.e6768_tid14381408_00
+----------------------------------------------+--------------------------------------+-------------+------------+----------+
| SCOPE:NAME                                   | GUID                                 | ADLER32     | FILESIZE   |   EVENTS |
|----------------------------------------------+--------------------------------------+-------------+------------+----------|
| mc15_13TeV:EVNT.14381408._000002.pool.root.1 | F052A56B-A1F4-EB4C-B9A1-01055FCB9AA4 | ad:b9afe23a | 66.087 MB  |     2000 |
+----------------------------------------------+--------------------------------------+-------------+------------+----------+
Total files : 1
Total size : 66.087 MB
Total events : 2000

For the record: this dataset name occurs in our ES, but without : and scope duplication.


Implemented the original suggestion, also rebased due to layout changes in master and force-pushed once more to correct a mistake.

@mgolosova
Copy link
Collaborator

It is possible that handling of : did not came from some realistic input metadata, but from some documentation read by the function's author, I do not know for sure.

I failed to find any datasets with : in them in our ES, and while the code is added and changed already on GitHub, there are no explanations that answer this question.

Ok, then it must have come to our code from some documentation and is not actually needed.

All things considered, this : case looks more puzzling than I expected and I'd like to leave this puzzle for some other time

It may be done later, but if we understand what's to be done, it will help to make a right choise now.
I mean that there are some options, each having different consequences for what we're discussing.

  1. We say, that DS names in the original data may have this "mixed" form. Then we can:
  2. We say, that DS names in the original data will never have this "mixed" form -- and remove :-related stuff.

The very first option is what we have now; in this case we need get_scope() to accept both forms, and get_name() is better to be used at Stage 095 as well (I guess AMI, just as Rucio, don't accept mixed names).

Next two options -- we can not get 'scope' from the original mixed name at least at stage 095, and (with early normalization) at stage 091 as well.
To keep things as they are for now -- we:

  • separate these functions;
  • make get_scope() accepting only the "normalized" names;
  • leave scope extration from the mixed names at the stage 091's level.

Or, let get_scope() work with mixed names as well; when it's not needed, it just won't be used (but will be a bit confusing in the context of other functions, that operate with DS names but don't worry about mixed names -- which is not good, but can definitely be settled later).

In the last case, we can remove it right now and forget for good. But if we do not want to change the behaviour now... we can do whatever we want, in fact: it is a bit more work now so that later we could simply remove a couple of functions completely, or a bit more work later -- to separate this finctionality from what's needed.

To summarize, the "safest" way is:

  • get_scope with both forms;
  • get_name used at both stages.

Then, if we decide to remove :-related stuff, we'll need to remove get_name and modify get_scope; and if we choose the "early normalization" -- we'll only modify get_scope.

Which is close, but not exactly your conclusion:

two functions working with both name types and...
... Keeping name normalization in 091 and moving scope extraction to the library in another PR, either new one or #284.

The difference is about get_name used or not used at Stage 095; which can be solved by checking if AMI accepts "mixed" names or not. But it is not the question for this PR of the pure refactoring (as well as other improvements that might have crossed my mind ;) ), so I'll approve and merge it.

@mgolosova mgolosova merged commit c15e404 into master Aug 17, 2020
@Evildoor
Copy link
Contributor Author

To summarize, the "safest" way is:

  • get_scope with both forms;
  • get_name used at both stages.

Then, if we decide to remove :-related stuff, we'll need to remove get_name and modify get_scope; and if we choose the "early normalization" -- we'll only modify get_scope.

Which is close, but not exactly your conclusion:

two functions working with both name types and...
... Keeping name normalization in 091 and moving scope extraction to the library in another PR, either new one or #284.

The difference is about get_name used or not used at Stage 095; which can be solved by checking if AMI accepts "mixed" names or not.

It does not (empty lines added for readability):

[vaulov@aiatlas171 095_datasetInfoAMI]$ ./amiDatasets.py -c 095.cfg -m s
...
2020-08-17 18:37:35 (INFO) (ProcessorStage) Starting stage execution.
{"datasetname": "mc15_13TeV.308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00", "deleted": true, "_type": "output_dataset", "data_format": ["AOD"], "_parent": 14266403, "_id": "mc15_13TeV.308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00"}
{"datasetname": "mc15_13TeV.308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00", "_type": "output_dataset", "gen_filt_eff": "1.0000E+00", "deleted": true, "data_format": ["AOD"], "process_group": "Exotics_Other", "_parent": 14266403, "_id": "mc15_13TeV.308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00"}

{"datasetname": "mc15_13TeV:mc15_13TeV.308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00", "deleted": true, "_type": "output_dataset", "data_format": ["AOD"], "_parent": 14266403, "_id": "mc15_13TeV:mc15_13TeV.308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00"}
(WARN) No values found in AMI for dataset 'mc15_13TeV:mc15_13TeV.308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00'
{"datasetname": "mc15_13TeV:mc15_13TeV.308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00", "deleted": true, "_type": "output_dataset", "data_format": ["AOD"], "_parent": 14266403, "_id": "mc15_13TeV:mc15_13TeV.308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00"}

{"datasetname": "mc15_13TeV:308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00", "deleted": true, "_type": "output_dataset", "data_format": ["AOD"], "_parent": 14266403, "_id": "mc15_13TeV:308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00"}
(WARN) No values found in AMI for dataset 'mc15_13TeV:308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00'
{"datasetname": "mc15_13TeV:308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00", "deleted": true, "_type": "output_dataset", "data_format": ["AOD"], "_parent": 14266403, "_id": "mc15_13TeV:308373.ParticleGun_single_magneticMonopole_3gD_1500GeV.recon.AOD.e6024_s3318_r9862_tid14266403_00"}

Alright then, will move name normalization into library as well.

@mgolosova mgolosova deleted the 091-scope-rework branch August 18, 2020 10:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants