Conversation
|
Some thoughts:
So anyway, I'd say 'testcase' should never merge files with different basenames unless we have a very good use case where this is helpful. |
Input validator flags possibly, output validator flags usually not. The big benefit is reducing judging effort by allowing e.g. test cases in a group with N <= 100 to be reused in a group with N <= 100,000, where all inputs to the process of judging the test case are identical. This can make a big difference (>3x wall clock time for some problems) in judging time for e.g. IOI problems. |
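For concreteness, here is a minimal sketch of what result reuse keyed on the inputs to the judging process could look like. This is purely hypothetical: the names, the cache, and the `run` callback are made up for illustration and are not from the spec or from any real judge; whether the `submission_id` field may be dropped from the key (i.e. caching across submissions) is exactly what is debated further down.

```python
# Purely hypothetical sketch of result reuse inside a judging system; the names
# and cache structure are illustrative only, not any real judge's API.
from typing import Callable, NamedTuple

class CaseKey(NamedTuple):
    submission_id: str           # dropping this field is what would allow the
                                 # (contested) caching of results across submissions
    input_hash: str              # hash of the test case input file
    output_validator_flags: str  # flags passed to the output validator

_cache: dict[CaseKey, str] = {}  # CaseKey -> verdict

def judge_case(submission_id: str, input_hash: str,
               flags: str, run: Callable[[], str]) -> str:
    """Judge one test case, reusing an earlier identical run if determinism may be assumed."""
    key = CaseKey(submission_id, input_hash, flags)
    if key not in _cache:      # first time we see this (input, flags) pair
        _cache[key] = run()    # actually run submission + output validator
    return _cache[key]         # otherwise reuse the earlier verdict
```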
Allowing an assumption to be made does not imply forcing the assumption to be made. |
An ugly workaround is a dummy output validator flag per case. |
This makes it even worse, because neither the problem setter nor the participant knows what choice the judging system made... but it impacts them. |
AFAIK Kattis already performs deduplication of (input, output validator flags) |
Well, you can certainly force a non-deduplication by the suggested workarounds, but you can't allow the optimization without letting judging systems perform it. |
I would say:
With the current spec there is NO way to fix https://open.kattis.com/problems/evolutionaryexcerpt |
then we need a different solution for that. But this is not it. It breaks older problems and makes certain types of problems that existed in the past impossible. |
Yeah ok, in this case I do get that you want to reuse results. Or did we already add some metadata somewhere in the spec saying that groups depend on other groups, and are only allowed to run if the groups/cases they depend on pass?
Yes that'd be super annoying.
(I'm assuming you meant |
We have |
Ah, I haven't kept up with the development recently. Do note that it's very common for cases to be reused in ways that are not strict subsets, e.g. where the second two groups reuse just subsets of the first. |
I'm not sure I agree with that quantifier: I can count on one hand the number of problems I've written where I want to enforce this behavior.
No, output validator flags are what affect judging, input validator flags are irrelevant after install, no? |
Hmm yeah that's currently not supported, although I think it easily could be?
Yeah I haven't needed it much, but making a list of one
Oh right indeed, my bad. I thought you were suggesting that Kattis only runs the submission once, ignoring output validator flags, and then only checks the team output multiple times against different output_validator_flags. But that's not the case then. |
Can't you add dummy output validator flags per test case? That's how I would assume it's implemented now, since the current spec formulation is how Kattis does it today. I'd argue the legacy spec allows it specifically because it's not explicit. :)
I.e., this is clearly not true since Kattis today works under this assumption (unless I'm mistaken, @niemela) |
No. Even the exact same submission should be rejudged since the input is truly random every time. And with the current formulation a judging system could decide that if the submission is identical it would not need to be rerun.
No, that's a false conclusion. This means that Kattis today has a bug, since it clearly violates what's written in the statement of that problem as hosted on Kattis. (In other words, either the statement or the implementation is wrong, but both are Kattis's responsibility here?)
I would argue that a judging system is not allowed to change the judging process in any way that is observable to a user, and this is very much observable. |
|
Also as @RagnarGrootKoerkamp pointed out, the following sentence has very weird consequences.
It allows a judging system to run the same submission on the same input multiple times and pick the "worst" verdict? For a deterministic submission this makes no difference and therefore should be fine? In general we should not assume things about submissions that are not necessarily true. And we should not allow a judging system to make decisions that influence the verdict of a submission or are observable to participants. The arguments in favor of this (that I have read so far) also show that you do not actually want to reuse a test case, but rather the outcome (verdict) of a submission on a test case. If you want this, you should do it directly and not in some hacky way by reusing input files. |
|
I see what you're saying, but I just don't think it's a very big problem, and in particular not enough to outweigh the benefit of having the sameness of test cases implicitly result in the same verdict.
Specifically for randomized solutions, it's trivial to derandomize in most languages, and especially in all languages at e.g. the ICPC and IOI: you fix a seed, and if you assume your solution passes e.g. 95% of submission attempts, you submit with two different seeds. In fact, repeating a test case is something I as a jury member would strongly discourage - it's easily defeated by selecting as seed a hash of the input. I can kind of buy the point that knowledge of this behavior benefits those who are aware of it. At e.g. the IOI, the rules were very clear that it's your responsibility to make your solution deterministic.
However, I would argue this is always the case, and that the implications of not assuming determinism are worse. At any contest, your solution may be rejudged at the discretion of the judges for a number of reasons - a discovered hardware problem, invalid test data, etc. As such, any nondeterministic solution could change its verdict on a rejudgment. Making problems that explicitly count on the non-determinism of solutions, rather than requiring determinism and informing contestants of that requirement as e.g. the IOI does in its rules, makes it such that your verdict might suddenly change on a rejudge. I think it's deeply problematic for this to be the case for how the problem is expected to be solved. |
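To make the two tricks mentioned above concrete, here is a sketch assuming a Python submission that reads its whole input up front. It is illustrative only; a real submission would pick one of the two seeding strategies.

```python
# Illustration only: how a submission can (a) derandomize with a fixed seed, or
# (b) defeat a repeated identical test case by deriving its seed from the input,
# so that identical inputs always lead to identical behaviour.
import hashlib
import random
import sys

data = sys.stdin.buffer.read()  # assume the whole input is read up front

# (a) fixed seed: every run of this submission behaves the same way
random.seed(12345)

# (b) seed from a hash of the input: repeating a test case no longer
#     re-samples the solution's randomness, only distinct inputs do
#     (in a real submission you would use either (a) or (b), not both)
random.seed(int.from_bytes(hashlib.sha256(data).digest()[:8], "big"))
```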
Of course it's not a false conclusion? It doesn't "break older problems" - at most, older problems are today relying on unspecified behavior that makes them broken. Defining undefined behavior is not breaking backwards compatibility.
I don't think anything in the old spec guarantees any sources of randomness being random: a sandbox may for example always return a fixed timestamp, always return the same /dev/random output etc. |
I disagree :)
I don't see any benefit of this. As mentioned before, it seems like you want to express something like "test group
That is fine for them. The IOI can add whatever rules they would like. We, on the other hand, should not add any rules or any unnecessary restrictions.
This is not true for ICPC?
Yes, but what are you arguing here? You do the rejudging because you expect a deterministic solution to get a new verdict... Obviously any solution could get a different verdict here?!
This is beside the point. If the competition has such a rule, then the judging system can make that assumption. But again, we should not add such assumptions. |
So actually, in our randomized-input problems we use the non-determinism of the interactor and very much rely on this. So also here rejudging will be broken, and if we can't even require our own code to be deterministic, it probably doesn't add much to require that from submissions. Also, there is stuff like [...]. So if you're requiring deterministic output, you're basically forcing submissions to avoid a bunch of common language features, which seems completely beside the point. Regarding rejudging: generally if something is accepted once it should remain accepted, and there's not much one can do about it anyway. The other case is when, on a rejudge, a WA submission becomes AC only due to randomness. But in that case you still have the option to manually run it a few more times and/or to just not apply the rejudging. |
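One concrete instance of this kind of language-level non-determinism (the comment's own example is elided above, so this Python one is a stand-in chosen for illustration):

```python
# An everyday source of unintended non-determinism: CPython randomizes string
# hashing per process (unless PYTHONHASHSEED is fixed), so iterating over a set
# of strings can produce a different order on two runs of the exact same
# program on the exact same input.
words = {"apple", "banana", "cherry", "durian"}
print(" ".join(words))  # the order, and hence the output, may differ per run
```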
Yeah, so actually even if we accept the former, the latter is still wrong, since there is no word about the validator being deterministic... |
|
> And with the current formulation a judging system could decide that if the submission is identical it would not need to be rerun.

Yes, and the way you should rerun that is to change your random seed. I do not think that problems with test data generated randomly for each new submission are common enough to be what should dictate this. And to be honest I'm not sure I'm completely sold on the idea of fully random test data either. If I made such a problem, I'd request the solution to give me a seed instead (that I e.g. xor with a per-test-case seed if I wanted multiple random test cases; a sketch of this handshake follows below). That gives you both the behaviour you want and allows assuming determinism.

> So if you're requiring deterministic output

That's not, at least according to me, the point, nor what the text is doing. It's about allowing the judging system to *assume* deterministic output. Clearly it can never *require* this. As you say, language features or bugs can introduce unintended non-determinism. The reason that the IOI has in its rules that solutions must be deterministic is not to forbid randomized solutions: it's to make a fact of the judging process clear, namely that if your submission has *unintended non-determinism*, your verdict is not guaranteed. I argue that unintended non-determinism is a bug, and that you should not be guaranteed any verdict in that case.

> Regarding rejudging: generally if something is accepted once it should remain accepted

I mean, that's an opinion just as valid as mine that if you're non-deterministic you might not always be accepted. :-)

> You do the rejudging because you expect a deterministic solution to get a new verdict... [snip] And I want to add the rejudging should probably only happen to cases where this could happen...

You mean that the judge should skip rejudging of test cases that didn't change, because the submissions can be assumed to be deterministic on the other cases? ;)

> That is true (at least if the person who wrote the code did not make it intentionally hard...) but also irrelevant. A derandomized solution is a different solution and can get a different verdict.

I think it's totally relevant, since it clearly shows that you as a problem author don't gain anything by e.g. repeating a test case in the hope of having it run multiple times: it's trivial to make a randomized solution random only across different test cases rather than across repeated instances of the same test case, which really is the argument that was used most for why a problem might want non-determinism (in addition to the random-test-data one, which I think is bad practice and should instead be handled by the validator and submission together seeding a generator).
*mzuenni* left a comment (Kattis/problem-package-format#575):
Judge systems may assume that the result of running a program on a test case is deterministic.
[...]
The assumption of determinism means that a judge system could choose to reuse the result of a previous run, or to re-run the equivalent test case.
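A sketch of the seed handshake suggested in the reply above, as hypothetical jury-side code: the per-case constant, the instance generator, and the protocol details are all made up for illustration and are not part of the spec or any existing problem.

```python
# Hypothetical jury-side sketch of a seed handshake: the submission sends a
# seed, the jury xors it with a fixed per-test-case seed, and the combined seed
# drives the "random" instance deterministically.
import random
import sys

PER_CASE_SEED = 0xC0FFEE  # fixed per test case by the problem author

def generate_instance(rng: random.Random) -> list[int]:
    # stand-in for the problem-specific random instance generator
    return [rng.randint(1, 10**9) for _ in range(10)]

submission_seed = int(sys.stdin.readline())  # first line from the submission
rng = random.Random(PER_CASE_SEED ^ submission_seed)
print(*generate_instance(rng), flush=True)   # send the instance to the submission
```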
|
First, it would not need to, but with the current formulation we would allow this, which feels very wrong.
Very much no. For me, the judging process is used to generate a verdict according to the "verdict distribution" of the (randomized) submission. That means that for each test case the submission is sampled once and the results are aggregated. Obviously a judge can still rejudge whatever they want; that is the right of a judge. But we are only describing a problem package format here, a data format. IMO we should not write these assumptions about the submissions at all? |
niemela
left a comment
There was a problem hiding this comment.
I'm strongly against removing this. We had a long discussion about this back when...
...maybe we should schedule a call?
|
I think we're not going to find agreement on this @mzuenni , but I want to point out that my view on
Is different, but I'm not sure you agree with yourself on this either. There are three options:
I think most of your arguments for why this assumption should not be allowed very much mean we must explicitly forbid it. It is not out of place for the PPF to dictate this: if parts of the judging process dictate how you get a verdict, we must specify them for a problem to mean the same thing across judges. And in fact I will now in some sense disagree with myself by saying that perhaps judges shouldn't be free to choose whether or not to make the assumption? (Although I still would prefer the defined behavior to be that judges should evaluate a single input only once, and as I've explained, no, you really can't and in my opinion shouldn't use multiple identical inputs to sample an output distribution: it's trivial to work around in submissions.) Anyways, I think @niemela's idea of moving this to a live discussion is the right call. |
I am also not in favor of identical test cases (and BAPCtools warns about identical inputs unless silenced), but this happened in the past. Also, the core issue is much larger: not only is the assumption currently in the spec not strong enough to actually allow caching, but right now it would also allow a judging system to cache stuff across submissions (for example if the submitted files are identical). IMO this is not intended and should never be allowed.
Yeah, I am happy with that |
The assumption of determinism was added to the new spec in af7ac80, but I think it is a mistake. It breaks backwards compatibility and also breaks the assumptions of all kinds of user groups.
First off: non-deterministic submissions do exist and are also sometimes required. The assumption of determinism clearly breaks such randomized problems in one way or another.