
Do not assume determinism#575

Open
mzuenni wants to merge 1 commit into Kattis:master from mzuenni:master

Conversation

@mzuenni
Contributor

@mzuenni mzuenni commented Jan 29, 2026

The assumption of determinism was added to the new spec in af7ac80, but I think it is a mistake. It breaks backwards compatibility and also breaks the assumptions of all kinds of user groups.

First of all: non-deterministic submissions do exist and are also sometimes required. The assumption of determinism clearly breaks such randomized problems in one way or another.

  1. Whether or not results are reused can heavily influence the acceptance probability for submissions in various ways (see the sketch after this list):
    • a problem setter may assume that duplicate test cases can be used to reduce the acceptance probability
    • a participant who knows that results are reused might get an advantage, because they know that their submission is only run once per distinct test case
    • a participant who does not know this would resubmit their time-/hardware-seeded submission expecting it to be rejudged
    • No user group can actually be sure whether or not results are reused...
  2. This assumption did not exist in legacy
    • This will confuse problem setters
    • This is an issue when upgrading problems
  3. This makes problems with randomized inputs impossible
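
To make the probability point concrete, here is a minimal sketch with made-up numbers (my own illustration, not anything from the spec): if a randomized submission passes a particular case with probability p per run, duplicating that case k times only lowers the acceptance probability when the duplicates are actually rerun.

```python
# Hypothetical numbers: a randomized submission passes one particular
# test case with probability p on each independent run.
p = 0.95   # assumed per-run pass probability
k = 5      # number of identical copies of that case in the test data

# Without result reuse: every copy is a fresh, independent run.
p_fresh = p ** k

# With result reuse: the judge runs the case once and copies the verdict.
p_reused = p

print(f"fresh runs : P(pass all {k} copies) = {p_fresh:.3f}")   # ~0.774
print(f"reused run : P(pass all {k} copies) = {p_reused:.3f}")  # 0.950
```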

@RagnarGrootKoerkamp
Collaborator

Some thoughts:

  • It seems that this was added in Define test cases #153, which seems to be purely about defining things, not practical reasons, so maybe this is an oversight?
  • What benefit would we get anyway from identifying identical cases? For test groups, validator flags will be different, and without groups, identical cases usually shouldn't be used anyway (and both problemtools and BAPCtools warn for it, I think?)
  • The implication of 'once we run a submission on a testcase, we fix the result forever, even for resubmissions' is clearly not intended, right?
  • Within one submission, this could be worked around by adding a counter to the input that the input validator ignores. But that's of course not backwards-compatible.
  • There might be situations where you want to have 100x the same input, but without the need for an input validator, because the only known solutions are truly random anyway? (This is a bit of a stretch though.)

So anyway, I'd say 'testcase' should never merge files with different basenames as long as we don't have a very good use case where this is helpful.

@jsannemo
Contributor

What benefit would we get anyway from identifying identical cases? For test groups, validator flags will be different,

Input validator flags possibly, output validator flags usually not.

The big benefit is reducing judging effort by allowing e.g test cases in a group with N <= 100 to be reused in a group with N <= 100,000, where all inputs to the process of judging the test case are identical.

This can make a big difference (>3x wall clock time for some problems) in judging time for e.g. IOI problems.

@jsannemo
Contributor

The implication of 'once we run a submission on a testcase, we fix the result forever, even for resubmissions' is clearly not intended, right?

Allowing an assumption to be made does not imply forcing the assumption to be made.

@jsannemo
Contributor

There might be situations where you want to have 100x the same input,

An ugly workaround is a dummy output validator flag per case.

@mzuenni
Contributor Author

mzuenni commented Jan 29, 2026

Allowing an assumption to be made does not imply forcing the assumption to be made.

This makes it even worse, because neither the problem setter nor the participant knows what choice the judging system made... but it impacts them.

@jsannemo
Contributor

But that's of course not backwards-compatible.

AFAIK Kattis already performs deduplication of (input, output validator flags)

@jsannemo
Contributor

Allowing an assumption to be made does not imply forcing the assumption to be made.

This makes it even worse, because neither the problem setter nor the participant knows what choice the judging system made... but it impacts them.

Well, you can certainly force a non-deduplication by the suggested workarounds, but you can't allow the optimization without letting judging systems perform it.

@mzuenni
Contributor Author

mzuenni commented Jan 29, 2026

But that's of course not backwards-compatible.

AFAIK Kattis already performs deduplication of (input, output validator flags)

I would say:

  1. this is not allowed by the legacy spec (even though this is not explicit)
  2. this obviously violates the guarantees given in https://open.kattis.com/problems/evolutionaryexcerpt which is a problem on kattis

With the current spec there is NO way to fix https://open.kattis.com/problems/evolutionaryexcerpt

@mzuenni
Contributor Author

mzuenni commented Jan 29, 2026

but you can't allow the optimization without letting judging systems perform it.

Then we need a different solution for that, but this is not it. It breaks older problems and makes certain types of problems that existed in the past impossible.

@RagnarGrootKoerkamp
Collaborator

The big benefit is reducing judging effort by allowing e.g test cases in a group with N <= 100 to be reused in a group with N <= 100,000, where all inputs to the process of judging the test case are identical.

Yeah ok, in this case I do get that you want to reuse results.
In practice we use symlinks for this. Can that not be the go-to way to prevent rerunning? (I guess Windows is a problem...)
But at least in our generators, we are very explicit about this.

Or did we already add some metadata somewhere also in the spec that groups depend on other groups, and are only allowed to run if those dependent groups/cases pass?

An ugly workaround is a dummy output validator flag per case.

Yes that'd be super annoying.

AFAIK Kattis already performs deduplication of (input, output validator flags)

(I'm assuming you meant (input, input validator flags), right?)
In that case: for interactive/multipass problems, output validator flags can affect what is sent to the program, so assuming that output validator flags do not influence the result is tricky, right?

@RagnarGrootKoerkamp
Collaborator

Or did we already add some metadata somewhere also in the spec that groups depend on other groups.

We have require_pass for this. Doesn't that make the re-running of testcases across groups completely irrelevant?
And so basically you should never have identical testcases for this purpose?

@jsannemo
Contributor

Or did we already add some metadata somewhere also in the spec that groups depend on other groups.

We have require_pass for this. Doesn't that make the re-running of testcases across groups completely irrelevant?
And so basically you should never have identical testcases for this purpose?

Ah, I haven't kept up with the development recently.

Do note that it's very common for cases to be reused in ways that are not strict subsets.
E.g:
Group 1: N <= 100, B <= 10
Group 2: N <= 1000
Group 3: B <= 1

Where the second two groups reuse just subsets of the first.

@jsannemo
Contributor

The big benefit is reducing judging effort by allowing e.g test cases in a group with N <= 100 to be reused in a group with N <= 100,000, where all inputs to the process of judging the test case are identical.

Yeah ok, in this case I do get that you want to reuse results.
In practice we use symlinks for this. Can that not be the go-to way to prevent rerunning? (I guess Windows is a problem...)
But at least in our generators, we are very explicit about this.

Or did we already add some metadata somewhere also in the spec that groups depend on other groups, and are only allowed to run if those dependent groups/cases pass?

An ugly workaround is a dummy output validator flag per case.

Yes that'd be super annoying.

I'm not sure I agree with that quantifier: I can count on one hand the number of problems I've written where I want to enforce this behavior.

AFAIK Kattis already performs deduplication of (input, output validator flags)

(I'm assuming you meant (input, input validator flags), right?)
In that case: for interactive/multipass problems, output validator flags can affect what is sent to the program, so assuming that output validator flags do not influence the result is tricky, right?

No, output validator flags are what affect judging, input validator flags are irrelevant after install, no?

@RagnarGrootKoerkamp
Collaborator

Where the second two groups reuse just subsets of the first.

Hmm yeah that's currently not supported, although I think it easily could be?

I'm not sure I agree with that quantifier: I can count on one hand the number of problems I've written where I want to enforce this behavior.

Yeah I haven't needed it much, but making a list of one output_validator_flags per test case would be annoying. Then it's easier to just add some dummy integer to the input files.

No, output validator flags are what affect judging, input validator flags are irrelevant after install, no?

Oh right, indeed, my bad. I thought you were suggesting that Kattis only runs the submission once, ignoring output validator flags, and then only checks the team output multiple times against different output_validator_flags. But that is not the case then.

@jsannemo
Contributor

jsannemo commented Jan 29, 2026

But that's of course not backwards-compatible.

AFAIK Kattis already performs deduplication of (input, output validator flags)

I would say:

  1. this is not allowed by the legacy spec (even though this is not explicit)
  2. this obviously violates the guarantees given in https://open.kattis.com/problems/evolutionaryexcerpt which is a problem on kattis

With the current spec there is NO way to fix https://open.kattis.com/problems/evolutionaryexcerpt

Can't you add dummy output validator flags per test case? That's how I would assume it's implemented now, since the current spec formulation is how Kattis does it today.

I'd argue the legacy spec allows it specifically because it's not explicit. :)

Then we need a different solution for that, but this is not it. It breaks older problems and makes certain types of problems that existed in the past impossible.

I.e., this is clearly not true since Kattis today works under this assumption (unless I'm mistaken, @niemela)

@mzuenni
Contributor Author

mzuenni commented Jan 29, 2026

But that's of course not backwards-compatible.

AFAIK Kattis already performs deduplication of (input, output validator flags)

I would say:

  1. this is not allowed by the legacy spec (even though this is not explicit)
  2. this obviously violates the guarantees given in https://open.kattis.com/problems/evolutionaryexcerpt which is a problem on kattis

With the current spec there is NO way to fix https://open.kattis.com/problems/evolutionaryexcerpt

Can't you add dummy output validator flags per test case? That's how I would assume it's implemented now, since the current spec formulation is how Kattis does it today.

No. Even the exact same submission should be rejudged since the input is truly random every time. And with the current formulation a judging system could decide that if the submission is identical it would not need to be rerun.

Then we need a different solution for that, but this is not it. It breaks older problems and makes certain types of problems that existed in the past impossible.

I.e., this is clearly not true since Kattis today works under this assumption

No, that's a false conclusion. This means that Kattis today has a bug, since it clearly violates what's written in the statement of that problem that is hosted on Kattis. (In other words, either the statement or the implementation is wrong, but both are Kattis's responsibility here?)

I'd argue the legacy spec allows it specifically because it's not explicit. :)

I would argue that a judging system is not allowed to change the judging process in any way that is observable to a user, and this is clearly observable.

@mzuenni
Contributor Author

mzuenni commented Jan 29, 2026

Also as @RagnarGrootKoerkamp pointed out, the following sentence has very weird consequences.

Judge systems may assume that the result of running a program on a test case is deterministic.

It allows a judging system to run the same submission on the same input multiple times and pick the "worst" verdict? For a deterministic submission this makes no difference and therefore should be fine?

In general, we should not assume things about submissions that are not necessarily true. And we should not allow a judging system to make decisions that influence the verdict of a submission or are observable for participants.

The arguments in favor of this (that I have read so far) also show that you do not actually want to reuse a testcase but rather the outcome (verdict) of a submission on a testcase. If you want this, you should express it directly and not do it in some hacky way by reusing input files.

@jsannemo
Contributor

I see what you're saying, but I just don't think it's a very big problem, and in particular not enough to outweigh the benefit of having the sameness of test cases result in the same verdict implicitly.

Specifically for randomized solutions, it's in most languages, and especially all languages at e.g. the ICPC and IOI, trivial to derandomize: you fix a seed, and if you assume your solution passes e.g. 95% of submission attempts, you submit with two different seeds. In fact, repeating your test case is something I as a jury member would strongly discourage: it's easily defeated by selecting as seed a hash of the input.
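
For illustration, a minimal sketch of the derandomization described above (my own hypothetical code; the function name and salt are made up): seed the RNG from a hash of the input plus a salt chosen by the contestant, so each attempt is deterministic per test case while different salts behave like independent random attempts.

```python
import hashlib
import random
import sys

def seeded_rng(data: bytes, salt: str = "attempt-1") -> random.Random:
    """Derive a deterministic RNG from the test input and a chosen salt.

    The program behaves identically on identical inputs (so reruns are
    harmless), while resubmitting with a different salt is effectively a
    fresh random attempt.
    """
    digest = hashlib.sha256(salt.encode() + data).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

data = sys.stdin.buffer.read()
rng = seeded_rng(data)
# ... use rng.random(), rng.shuffle(), etc. inside the randomized algorithm ...
```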

I can kind of buy the point that knowledge of this behavior benefits those who know of it. At e.g. the IOI, the rules were very clear that it's your responsibility to make your solution deterministic. However, I would argue this is always the case, and that the implications of not assuming determinism are worse.

At any contest, your solution may be rejudged at the discretion of the judges for a number of reasons. As such, any nondeterministic solution could change its verdict on a rejudgment: for example, a discovered hardware problem, invalid test data, etc.

Making problems that explicitly count on the non-determinism of solutions rather than requiring determinism - and informing contestants of this, as e.g. the IOI does in its rules - means that your verdict might suddenly change on a rejudge. I think it's deeply problematic for this to be the case for how the problem is expected to be solved.

@jsannemo
Contributor

But that's of course not backwards-compatible.

AFAIK Kattis already performs deduplication of (input, output validator flags)

I would say:

  1. this is not allowed by the legacy spec (even though this is not explicit)
  2. this obviously violates the guarantees given in https://open.kattis.com/problems/evolutionaryexcerpt which is a problem on kattis

With the current spec there is NO way to fix https://open.kattis.com/problems/evolutionaryexcerpt

Can't you add dummy output validator flags per test case? That's how I would assume it's implemented now, since the current spec formulation is how Kattis does it today.

No. Even the exact same submission should be rejudged since the input is truly random every time. And with the current formulation a judging system could decide that if the submission is identical it would not need to be rerun.

Then we need a different solution for that, but this is not it. It breaks older problems and makes certain types of problems that existed in the past impossible.

I.e., this is clearly not true since Kattis today works under this assumption

No, that's a false conclusion. This means that Kattis today has a bug, since it clearly violates what's written in the statement of that problem that is hosted on Kattis. (In other words, either the statement or the implementation is wrong, but both are Kattis's responsibility here?)

Of course it's not a false conclusion? It doesn't "break older problems" - at most, older problems today rely on unspecified behavior that makes them broken. Defining undefined behavior is not breaking backwards compatibility.

I'd argue the legacy spec allows it specifically because it's not explicit. :)

I would argue that a judging system is not allowed to change the judging process in any way that is observable to a user, and this is clearly observable.

I don't think anything in the old spec guarantees any sources of randomness being random: a sandbox may for example always return a fixed timestamp, always return the same /dev/random output etc.

@mzuenni
Contributor Author

mzuenni commented Jan 30, 2026

I just don't think it's a very big problem

I disagree :)

outweigh the benefit of having the sameness of test cases result in the same verdict implicitly

I don't see any benefit of this. As mentioned before, it seems like you want to express something like "test group A relies on the verdict/score of test case X", but the current "solution" for this just does something entirely different.

Specifically for randomized solutions, it's in most languages, and especially all languages at e.g. the ICPC and IOI, trivial to derandomize

That is true (at least if the person who wrote the code did not make it intentionally hard...) but also irrelevant. A derandomized solution is a different solution and can get a different verdict.

At e.g. the IOI, the rules were very clear in that it's your responsibility to make your solution deterministic.

That is fine for them. The IOI can add whatever rules they would like. We, on the other hand, should not add any rules or any unnecessary restrictions.

However, I would argue this is always the case, and that the implications of not assuming determinism are worse.

This is not true for ICPC?

At any contest, your solution may be rejudged at the discretion of the judges for a number of reasons. As such, any nondeterminstic solution could change its verdict on a rejudgment. For example, a discovered hardware problem, invalid test data, etc etc.

Yes, but what are you arguing here? You do the rejudging because you expect a deterministic solution to get a new verdict... Obviously any solution could get a different verdict here?!
And I want to add that rejudging should probably only happen in cases where this could happen... but whoever does the rejudging has the right to do so. That is just irrelevant for this discussion.

Making problems that explicitly count on the non-determinism of solutions rather than requiring determinism - and, informing contestants of this as e.g the IOI does in its rule - suddenly makes it such that your verdict might suddenly change on a rejudge. I think it's deeply problematic for this to be the case for how the problem is expected to be solved.

This is beside the point. If the competition has such a rule, then the judging system can make that assumption. But again, we should not add such assumptions.

@RagnarGrootKoerkamp
Collaborator

non-determinism of solutions

So actually in our randomized-input problems, we use the non-determinism of the interactor, and very much rely on this.
We want to insist that re-submissions of the same code will get a new random input.
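
For context, this is roughly what such an interactor looks like: a toy guess-the-number game whose secret is drawn fresh on every run. Purely illustrative and not the real output-validator interface; it only borrows the 42/43 accept/reject exit-code convention.

```python
# Toy randomized interactor: the secret is drawn fresh on every run
# (intentionally NOT seeded), so resubmitting identical code faces a new
# random instance. Illustration only, not the real validator interface.
import random
import sys

HI = 10**9
secret = random.randrange(1, HI + 1)
print(HI, flush=True)                      # announce the range to the submission

for _ in range(40):                        # ~log2(1e9) queries plus some slack
    line = sys.stdin.readline()
    if not line:
        sys.exit(43)                       # submission stopped talking: reject
    guess = int(line)
    if guess == secret:
        print("correct", flush=True)
        sys.exit(42)                       # accept
    print("lower" if secret < guess else "higher", flush=True)

sys.exit(43)                               # too many queries: reject
```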

So also here, rejudging will be broken, and if we can't even require our own code to be deterministic, it probably doesn't add much to require that from submissions.

Also, there is stuff like PYTHONHASHSEED, which influences the order in which things are iterated over in a set, and supposedly cannot be changed from inside the program.
Similarly, Rust also randomizes the hash function for all HashSet instances. (Not sure if the random state is global or per instance, but it definitely changes across independent runs of the executable.)
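
A quick illustration of the Python point (a throwaway script of my own): string hashing is randomized per process, so the iteration order of a set can change between runs of the exact same program unless PYTHONHASHSEED is pinned from outside.

```python
# Run this script twice: unless PYTHONHASHSEED is fixed, the printed order
# will typically differ between runs, even though the program contains no
# explicit randomness at all.
words = {"apple", "banana", "cherry", "date", "elderberry"}
print(list(words))

# Determinism can only be restored from *outside* the program, e.g.:
#   PYTHONHASHSEED=0 python3 this_script.py
```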

So if you're requiring deterministic output, you're basically forcing submissions to avoid a bunch of common language features, which seems completely beside the point.
Sure, the IOI may want to enforce this, but the spec absolutely should not (since it's also used in uni courses and such).


Regarding rejudging: generally, if something is accepted once it should remain accepted, and there's not much one can do about that anyway. The other case is when a rejudged WA submission becomes AC only due to randomness. But in that case you still have the option to manually run it a few more times and/or to just not apply the rejudging.

@mzuenni
Contributor Author

mzuenni commented Jan 30, 2026

Judge systems may assume that the result of running a program on a test case is deterministic.
[...]
The assumption of determinism means that a judge system could choose to reuse the result of a previous run, or to re-run the equivalent test case.

Yeah, so actually even if we accepted the former, the latter is still wrong, since there is no word about the validator being deterministic...

@jsannemo
Contributor

jsannemo commented Jan 30, 2026 via email

@mzuenni
Contributor Author

mzuenni commented Jan 30, 2026

And with the current formulation a judging system could decide that if the submission is identical it would not need to be rerun.

First, it would not need to, but with the current formulation we would allow this, which feels very wrong.
Second, as already pointed out, the verdict does not only depend on the submission: the output validator could also be non-deterministic and should be rerun.

You mean that the judge should skip rejudging of test cases that didn't change because the submissions can be assumed to be deterministic on the other cases? ;)

Very much no. For me, the judging process is used to generate a verdict according to the "verdict distribution" of the (randomized) submission. That means for each test case the submission is sampled once, and that is aggregated.
If there are two identical test cases, the clear interpretation is that that test case should be used twice to sample the distribution. (And I would define a judging process as OK if its final verdict obeys the same distribution. This allows stuff like lazy judging, but clearly not caching.)
Now, resampling must happen if a sample was generated the wrong way (invalid test data, hardware problems, whatever), and by default nothing else should be resampled, since that would again allow stuff like "rerun this X times and take the worst verdict", which changes the observable verdict distribution.
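
A rough Monte-Carlo sketch of what I mean, using a toy model of my own with made-up per-case pass probabilities: lazy judging preserves the acceptance probability, while caching identical cases changes it.

```python
import random

# Toy model: three distinct cases, one of which ("b") appears twice.
CASES = ["a", "b", "b", "c"]
P_PASS = {"a": 0.99, "b": 0.90, "c": 0.95}   # assumed per-run pass rates

def run(case):                               # one fresh sample of the submission
    return random.random() < P_PASS[case]

def judge_full():                            # run every listed case, then aggregate
    results = [run(c) for c in CASES]
    return all(results)

def judge_lazy():                            # stop at the first failure
    for c in CASES:
        if not run(c):
            return False
    return True

def judge_cached():                          # reuse the verdict of identical inputs
    cache = {}
    for c in CASES:
        if c not in cache:
            cache[c] = run(c)
        if not cache[c]:
            return False
    return True

N = 200_000
for name, judge in [("full", judge_full), ("lazy", judge_lazy), ("cached", judge_cached)]:
    acc = sum(judge() for _ in range(N)) / N
    print(f"{name:6s} acceptance ~ {acc:.3f}")
# full and lazy both converge to 0.99 * 0.90**2 * 0.95 ~ 0.762,
# while cached converges to 0.99 * 0.90 * 0.95 ~ 0.846.
```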

Obviously a judge can still rejudge whatever he wants. That is the right of a judge. But we only describe a problem package format here. A data format. IMO we should not write these assumptions about the submissions at all?

Member

@niemela niemela left a comment


I'm strongly against removing this. We had a long discussion about this back when...

...maybe we should schedule a call?

@jsannemo
Contributor

I think we're not going to find agreement on this @mzuenni , but I want to point out that my view on

Obviously a judge can rejudge whatever he wants. That is the right of a judge. But we only describe a problem package format here. A data format. We should not write assumptions about the submissions at all?

is different, but I'm not sure you agree with yourself on this either. There are three options:

  • specify that when evaluating a submission this assumption can be made
  • specify that it mustn't be made
  • leave any assumption on this up to the judge itself.

I think most of your arguments for why this assumption should not be allowed very much means we must explicitly forbid it. It is not out of place for the PPF to dictate this: if parts of the judging process dictate how you get a verdict, we must specify it for a problem to mean the same thing across judges. And in fact I will now in some sense disagree with myself in saying that perhaps judges mustn't be allowed to either make or not make the assumption? (Although I still would prefer the defined behavior to be that judges should evaluate a single input only once, and as I've explained, no, you really can't and in my opinion shouldn't use multiple identical inputs to sample an output distribution: it's trivial to work around in submissions).

Anyways, I think @niemela's idea of moving this to a live discussion is the right call.

@mzuenni
Contributor Author

mzuenni commented Jan 30, 2026

you really can't and in my opinion shouldn't use multiple identical inputs to sample an output distribution: it's trivial to work around in submissions).

I am also not in favor of identical test cases (and BAPCtools warns for identical inputs unless silenced), but this has happened in the past.
But I would prefer the much more natural definition of "one input file means one run" instead of needing to first define what identical test cases mean, then assuming that submissions and output validators are deterministic, just for some judging systems to be able to cache them...

Also the core issue is much larger. Not only is the assumption currently in the spec not strong enough to actually allow caching, but also right now it would allow a judging system to cache stuff across submissions? (For example if the submitted files are identical). IMO this is not intended and should never be allowed.

Anyways, I think @niemela's idea of moving this to a live discussion is the right call.

Yes, I am happy with that.
