091 scope functions rework #382
mgolosova
left a comment
- New function remove_explicit_scope is a bit weird. Its description says that it "removes ... from the dataset name", while in fact it "tries to split 'mixed' ds name into scope and pure ds name, and sets scope to None if the ds name is not 'mixed'".
- On the other hand, the second function (extract_scope) will fail if the passed ds name is 'mixed'. In other words, you do not provide a single function that will return the scope: if one needs the scope, they will have to use both functions explicitly. If extract_scope were using remove_explicit_scope "under the hood", it would be better -- but the first comment above would still be fair.
My suggestion is to split things into separate (and independently usable) functions like these:
def get_dataset_name(dsn):
    """ Get 'pure' DS name (not mixed with e.g. scope prefix). """
    <...>

def get_dataset_scope(dsn):
    """ Get dataset scope from DS name (mixed or pure). """
    <...>
Maybe use different names, it's not the point here.
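A minimal sketch of how two such independent functions could look. It assumes that a 'mixed' name has the form `scope:name`, and that for a pure name the scope is the first dot-separated field (the first two fields for `user.*` and `group.*` names). Both conventions are assumptions made for illustration and are not confirmed in this thread; the dataset names in the usage note are made up.

```python
def get_dataset_name(dsn):
    """ Get 'pure' DS name (not mixed with e.g. scope prefix). """
    # Assumption: a 'mixed' name looks like '<scope>:<name>'.
    if ':' in dsn:
        return dsn.split(':', 1)[1]
    return dsn


def get_dataset_scope(dsn):
    """ Get dataset scope from DS name (mixed or pure). """
    # Mixed name: the scope is given explicitly before the colon.
    if ':' in dsn:
        return dsn.split(':', 1)[0]
    # Pure name (assumption): the scope is the first field,
    # or the first two fields for 'user.*' and 'group.*' names.
    fields = dsn.split('.')
    if fields[0] in ('user', 'group'):
        return '.'.join(fields[:2])
    return fields[0]
```

With these definitions, `get_dataset_scope('user.jdoe.test.dataset')` gives `'user.jdoe'`, and either function can be called on its own without the other having been applied first.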
I would've agreed if both types of dataset names were (more or less) equal, but they are not, in my understanding. If this is correct, then we need a function to turn a 'mixed' name into a normal one, and only need it in 091, because we are not getting any new dataset names after 091. This function also gets the scope in the process as a byproduct. We also need a function to get a scope from a normal name, and this function should be used in 091 after the first one (when necessary). It can also be used in 095, or wherever it is needed after 091. By the way, both our samples include zero occurrences of 'mixed' names. The same logic applies...
... Here. Name/description can be altered a bit, but the main goal of the function is 'purification' of a name, scope is a byproduct. In theory we can throw this scope away, and use the other function for 'purified' names as well. [1] It seems that what we call a 'mixed' name here is called DID in Rucio: https://readthedocs.org/projects/rucio/downloads/pdf/next/ , point 1.1.2. Shame that they don't provide any examples of complete filenames to compare them to our logic...
If we get such values in fields we expect to contain dataset names -- they are valid: if some ATLAS system has accepted this value as a valid dataset name, then in terms of original task metadata it is a valid value.
Hint: to answer this question one can check the stages' code ;)
...which it does not, in fact. Stage 091 uses "purified" DS name only to query Rucio, that's it. For the output messages it uses values that were obtained from the input ones:
It is possible that handling of So what you say below:
is fair -- but these samples are not the ultimate source of information in such cases.
Fine, let's find out how things are going now and keep/introduce (if actually needed) this functionality -- here/in another PR, respectively.
Fine; let it stay in 091, I won't mind for now.
Normally, people do not expect any "byproducts" to be returned -- they expect some... expected result. For example: function
What we need, originally, is to get scope from the value we have. The value itself may be a string with a fair or extended DS name, or If we want things to be totally normalized (atomic functions with intuitive responses) in terms of splitting ds name processing logic into functions, then we have to define a set of functions:
(+ something like For the sake of usability, very soon we will decide to add one more function:
What I suggest is to de-normalize things and leave only the first and the last one -- to speed up the development process. Why is it fine to do:
...and deserves same answer as above.
Use the force:
8cbf292 to
ce28be8
The function, in fact, performs two operations: - Normalizes dataset name, if necessary. Returns name either way. - Determines and returns scope.
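A hedged sketch of a single function performing both operations listed above and returning both results as a tuple. The function name `normalize_dataset_name`, the `scope:name` form of a composite name and the rule for deriving the scope from a normal name are assumptions for illustration, not the code from this PR.

```python
def normalize_dataset_name(dsn):
    """ Return (name, scope): normalized DS name and its scope. """
    if ':' in dsn:
        # Assumption: composite name '<scope>:<name>'; strip the prefix.
        scope, name = dsn.split(':', 1)
        return name, scope
    # Already a normal name: it is returned unchanged.
    # Assumption: the scope is the first dot-separated field,
    # or the first two fields for 'user.*' and 'group.*' names.
    fields = dsn.split('.')
    n_fields = 2 if fields[0] in ('user', 'group') else 1
    return dsn, '.'.join(fields[:n_fields])
```

A caller interested only in the name can discard the second element, e.g. `name, _ = normalize_dataset_name(dsn)`, which is the "byproduct" usage discussed in this thread.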
ce28be8 to
cec0ab8
My mistake then, sorry.
I failed to find any datasets with "Fortunately, we didn't have dataset names like All things considered, this
... Keeping name normalization in 091 and moving scope extraction to the library in another PR, either new one or #284.
For the record: this dataset name occurs in our ES, but without Implemented the original suggestion, also rebased due to layout changes in master and force-pushed once more to correct a mistake.
Ok, then it must have come to our code from some documentation and is not actually needed.
It may be done later, but if we understand what's to be done, it will help to make the right choice now.
The very first option is what we have now; in this case we need Next two options -- we can not get 'scope' from the original mixed name at least at stage 095, and (with early normalization) at stage 091 as well.
Or, let In the last case, we can remove it right now and forget for good. But if we do not want to change the behaviour now... we can do whatever we want, in fact: it is a bit more work now so that later we could simply remove a couple of functions completely, or a bit more work later -- to separate this functionality from what's needed. To summarize, the "safest" way is:
Then, if we decide to remove Which is close, but not exactly your conclusion:
The difference is about
It does not (empty lines added for readability): Alright then, will move name normalization into the library as well.
Split extract_scope() into two functions: a specific operation for dealing with composite dataset names and a more generic scope extraction.