I hypothesize that the VIMA model can discern the chosen task among its 13 variations by observing the objects present in the environment. Please note that this is a speculative idea and may be subject to validation. I'm looking forward to hearing back from you on this hypothesis.
So, if my hypothesis is correct, even if any prompts are not given, this model can perform the task based on the environment.