TRON-2477: Do some more validation of TimeSpecs #1072
jfongatyelp wants to merge 4 commits into master
Conversation
nemacysts
left a comment
generally lgtm, just two questions
```python
try:
    return factory.build(config)
except (ValueError, Exception) as e:
    log.error(f"Failed to build job '{config.name}': {e}. Skipping job.")
```
i think ideally we'd be able to see that this has happened during a config update and show the failure then (so that we're aware), but having the job disappear from tronweb seems fine as well - someone will complain relatively quickly if this is impacting them :p
that said - when this happens: do we lose the previous runs in tronweb upon fixing the misconfiguration?
i do also kinda wonder what happens if there's a dependency on a job with this misconfiguration
> i think ideally we'd be able to see that this has happened during a config update and show the failure then (so that we're aware), but having the job disappear from tronweb seems fine as well - someone will complain relatively quickly if this is impacting them :p
So when I attempt to upload a config with a bad schedule, this does fail and the API returns an error (though it doesn't seem to generate any logs about the failure on the server side 🤔), so at least setup_tron_namespace should alert us if this happens. I'll try to see how I can make sure the error is logged, though.
> that said - when this happens: do we lose the previous runs in tronweb upon fixing the misconfiguration?
Great point, I'll run some more tests on my local setup to verify.
> i do also kinda wonder what happens if there's a dependency on a job with this misconfiguration
For this one, I suspect we'll be OK, since I thought the way we do inter-job 'dependencies' is really just based on trigger names, rather than any actual tie between the jobs/job runs? But I will definitely also verify this.
Drat, good call -- if the tron daemon starts with a bad config (since we can no longer send a bad config through the API while it's running) and skips the job on startup, then if someone pushes a correction, tron will restart that job from 0. I'm guessing that's because recovery from state is only done at startup time, not when we re-initialize the job scheduler.
I'll take a dive into the scheduler to see whether recovery at scheduling time is easy to achieve; otherwise I've half a mind to leave this as Expected Behavior if you've sent an impossible schedule.
We ran into an infinite loop in our current TimeSpecification.get_match when asked for the next 30th of February, for example.
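The failure mode above can be sketched as follows. This is not Tron's actual `get_match` implementation; it's a minimal stand-in showing why a day-by-day forward search never terminates for an impossible date like Feb 30, and how a search bound (hypothetical here, not the fix this PR takes) would turn the spin into an explicit failure:

```python
import datetime

def next_match(month, day, start=None, max_days=366 * 4):
    """Search forward one day at a time for the next date matching
    (month, day). Without the max_days cap, this loop never exits for an
    impossible spec like (2, 30), because no real date ever matches."""
    current = start or datetime.date.today()
    for _ in range(max_days):
        current += datetime.timedelta(days=1)
        if current.month == month and current.day == day:
            return current
    return None  # no such date exists within the search window

# Feb 30 never matches, so the unbounded version would spin forever:
assert next_match(2, 30) is None
# Feb 28 exists every year, so a match is found quickly:
assert next_match(2, 28) is not None
```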
Other similarly impossible job schedules, such as April 31, would have run into the same issue. I considered changing this in the schedule-parsing logic, but this will also catch the (extremely slim) chance that someone specifies an impossible date using our complex scheduling format (groc: daily, value: 30th day in feb).
In this PR, we no longer allow an invalid TimeSpecification to be created, and when initially scheduling the next job runs, we check whether we get a valid TimeSpecification back; if not, we simply print an error in our logs and skip that job in our JobCollection.
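A satisfiability check along these lines can be sketched as below. The function name and signature are hypothetical (not Tron's actual TimeSpecification API); the idea is just that a (month, day) spec is valid iff some requested day fits in some requested month in at least one year of a leap cycle:

```python
import calendar

def spec_is_satisfiable(months, monthdays):
    """Hypothetical validation, analogous to rejecting an invalid
    TimeSpecification at construction time: check whether any requested
    day can ever occur in any requested month."""
    for year in (2024, 2025):  # one leap and one non-leap year cover all cases
        for month in months:
            _, last_day = calendar.monthrange(year, month)
            if any(day <= last_day for day in monthdays):
                return True
    return False

assert spec_is_satisfiable(months=[4], monthdays=[30])      # April 30 exists
assert not spec_is_satisfiable(months=[2], monthdays=[30])  # Feb 30 never does
assert spec_is_satisfiable(months=[2], monthdays=[29])      # leap years only
```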
In general, I do think our yelpsoa-configs validation should prevent us from ever getting here again, and it provides a better interface that shows job owners the invalid config. But if config on disk/in memory ever got into a bad state like this, these changes allow tron to continue gracefully, simply skipping the job with the invalid schedule. In addition, config change requests will succeed when provided with a new, valid schedule.
In addition to the TimeSpecification tests, I've tested locally and ensured that if tron starts with an invalid schedule, it simply skips the job:
One unintuitive behavior resulting from this change is that the job will also no longer be shown in tronweb, even though it persists in config, which could lead to confusion. That said, I think trying to add the job while ensuring everything down the line can properly handle and display appropriate warnings for an impossible schedule may be more trouble than this edge case is worth, tbh. I'm happy to give it a shot, though, if people think otherwise.