
Conversation

zx80

@zx80 zx80 commented Jun 29, 2025

This removes duplicate instances from the test files and makes minor adjustments so that values are valid with respect to the schema's intent, not only its letter.

The total number of remaining test cases is 28,188, down from 35,407 initially.
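For illustration only (this is not the PR's actual script), deduplication of JSON test instances can be sketched by comparing canonical serializations, so that values differing only in key order still count as duplicates:

```python
import json

def dedup_instances(instances):
    """Drop duplicate JSON values, keeping the first occurrence.

    Two instances are considered duplicates when their canonical
    serializations (sorted keys, compact separators) are identical.
    """
    seen = set()
    unique = []
    for value in instances:
        key = json.dumps(value, sort_keys=True, separators=(",", ":"))
        if key not in seen:
            seen.add(key)
            unique.append(value)
    return unique

# Example: three instances, the first two identical modulo key order.
print(len(dedup_instances([{"a": 1, "b": 2}, {"b": 2, "a": 1}, {"a": 3}])))  # 2
```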

@zx80 zx80 mentioned this pull request Jun 30, 2025
@jviotti
Member

jviotti commented Jun 30, 2025

@michaelmior Does this mean we should look for more instances now, or is it OK?

@michaelmior
Contributor

@jviotti I'd like to know more about what was changed. There may be some cases where what was written wasn't what was intended, but at the same time, these are real documents found in the wild so I'd personally lean towards keeping any "errors."

@michaelmior
Contributor

Also, as far as the duplicate instances go, these instances were collected independently as opposed to being duplicated on our end, so I'm personally not too concerned about it. Although it could be worth looking into the amount of duplication.

@zx80
Author

zx80 commented Jul 1, 2025

> Also, as far as the duplicate instances go, these instances were collected independently as opposed to being duplicated on our end, so I'm personally not too concerned about it. Although it could be worth looking into the amount of duplication.

My guess is that duplicate values come from file copies, not necessarily two distinct projects that just happened to use the very same JSON value written independently of one another. The amount of duplication is about 20%.
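For reference, the roughly 20% figure is consistent with the counts given in the PR description (35,407 initial cases, 28,188 remaining):

```python
# Duplication rate implied by the PR's before/after test-case counts.
initial, remaining = 35_407, 28_188
removed = initial - remaining
rate_pct = round(removed / initial * 100, 1)
print(removed, rate_pct)  # 7219 20.4
```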

@zx80
Author

zx80 commented Jul 1, 2025

> @jviotti I'd like to know more about what was changed.

They are listed in jmc/README.md: raw strings in yamllint, a value that is not a URI in openapi, property name typos in ansible meta caused by a misplaced additionalProperties, and one strange character in a string in lazygit.

> There may be some cases where what was written wasn't what was intended, but at the same time, these are real documents found in the wild so I'd personally lean towards keeping any "errors."

I do not understand the logic of keeping broken values as well as broken schemas in a benchmark: for me, the point of such tools is to show how fast JSON validation can be, not to produce great performance figures because checks are silently skipped or because obviously invalid values are rejected very quickly. That does not serve the use case well.

Anyway, if you want to keep them, I'll add an option to my scripts so that obvious user errors are not fixed when generating and compiling models. One such option already exists, but more is needed to reach a 100% pass rate.
