@AngledLuffa
Contributor

Save the errors in the state, not just the error counts. This makes it easier for a calling program to use the results.

@dan-zeman
Member

Should we worry about memory in cases where a treebank has over 260k incidents?

@ellepannitto
Contributor

Good point. We also haven't considered this when saving errors to dump into JSON.

I'm thinking about two possible strategies (but it's just a random idea, I have to give it a little more thought):

  • We could add a command-line flag with a default value that caps the number of errors retained and dumped, so that people can still customize it
  • Otherwise, we could keep at most one error per line, ideally the one with the lowest level (so that if line X has an issue at level 2, we don't show issues at levels 3, 4, 5 for that same line, and the number of errors is always at most the number of lines)
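
To make the second strategy concrete, here is a minimal sketch. The `Incident` class, its fields, and the function name are hypothetical and chosen only for illustration; they are not taken from the validator's code.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    lineno: int   # input line the incident was reported on
    level: int    # validation level (lower = more basic)
    message: str


def keep_lowest_level_per_line(incidents):
    """Return at most one incident per line: the one with the lowest level."""
    best = {}
    for inc in incidents:
        current = best.get(inc.lineno)
        # Keep the first incident seen for a line, or replace it
        # when a strictly lower level shows up later.
        if current is None or inc.level < current.level:
            best[inc.lineno] = inc
    return [best[n] for n in sorted(best)]


incidents = [
    Incident(3, 3, "level-3 issue"),
    Incident(3, 2, "level-2 issue"),
    Incident(7, 4, "level-4 issue"),
]
kept = keep_lowest_level_per_line(incidents)
```

With this approach the retained list can never be longer than the number of input lines, at the cost of hiding higher-level issues on lines that also have a lower-level one.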

@harisont

I like @ellepannitto's first idea (using the list as an array of length n, unbounded if the user passes, say, -1). Should we add this to the to-do list in #132 too?

(For context: our rewrite addresses the exact same problem, although it does not use the state to do so. If I recall correctly, our solution is described in the text of our draft pull request).

@dan-zeman
Member

> • We could add a command-line flag with a default value that caps the number of errors retained and dumped, so that people can still customize it

There is already an option called --max-err (followed by an integer; 0 means unlimited). It affects the number of errors printed. The help says "How many errors to output before exiting", but in fact the validator does not exit: it still processes the rest of the input and reports the complete number of errors. Maybe it should actually exit if people want a limited number of errors; otherwise they wait for a very long time.

Now, either this option could also regulate the number of errors saved and returned, or there could be a similar option so that errors printed and errors returned are regulated separately.
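
The behavior described above (printing stops at the limit, counting does not) can be sketched as follows. Only the option name `--max-err` and the "0 means unlimited" convention come from this thread; the default of 20, the help text, and the function names are assumptions made for the example.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # The default value here is an assumption for illustration.
    parser.add_argument("--max-err", type=int, default=20,
                        help="How many errors to print; 0 means unlimited.")
    return parser


def print_limited(messages, max_err):
    """Print at most max_err messages (all if max_err == 0).

    Returns (printed, total): the input is always consumed to the end,
    so the total is complete even when printing was cut off.
    """
    printed = 0
    total = 0
    for msg in messages:
        total += 1
        if max_err == 0 or printed < max_err:
            print(msg)
            printed += 1
    return printed, total


args = build_parser().parse_args(["--max-err", "2"])
printed, total = print_limited(["e1", "e2", "e3"], args.max_err)
```

Exiting early instead would mean breaking out of the loop once `printed` reaches the limit, which saves time but gives up the complete count.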

@AngledLuffa
Contributor Author

> Should we worry about memory in cases where a treebank has over 260k incidents?

Sure, it'd be easy enough to redo it so the counts are still kept separately, but there's a field that keeps the errors up to --max_err and then stops
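
A rough sketch of that idea, assuming a state object with a counter and a capped list. The class layout and attribute names (`error_counter`, `errors`, `record`) are hypothetical and may not match the actual change in this PR.

```python
from collections import Counter


class State:
    def __init__(self, max_err=20):
        self.max_err = max_err           # 0 means "keep everything"
        self.error_counter = Counter()   # complete counts per error type
        self.errors = []                 # retained errors, capped at max_err

    def record(self, error_type, message):
        # Counting is unconditional, so totals stay exact.
        self.error_counter[error_type] += 1
        # Storing stops once the cap is reached.
        if self.max_err == 0 or len(self.errors) < self.max_err:
            self.errors.append((error_type, message))


state = State(max_err=2)
for i in range(5):
    state.record("Syntax", f"problem {i}")
```

Memory use for the stored errors is then bounded by the cap, while a calling program can still read the full totals from the counter.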

…ier for a calling program to use the results

Only keep track of --max_err errors, but still count all the errors
@AngledLuffa
Contributor Author

(updated the PR)

@dan-zeman
Member

> > Should we worry about memory in cases where a treebank has over 260k incidents?
>
> Sure, it'd be easy enough to redo it so the counts are still kept separately, but there's a field that keeps the errors up to --max_err and then stops

Actually, after a bit more thinking, I would control the two requirements separately. In the current on-line validation, I print all errors to the log (i.e., --max-err 0), but I don't want to collect them in a data structure in memory. (And I suspect that even if I switch to printing JSON, I will print the errors and forget them before reaching the end of input.)
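
One way to picture the "print and forget" use case with a separate storage limit is the sketch below. The function name, the `max_store` parameter, and the JSON-lines output format are all assumptions for illustration. Note that here `max_store=0` means "store nothing", which differs from the "0 means unlimited" convention of --max-err.

```python
import io
import json


def emit(incidents, out, max_store=0):
    """Write every incident as one JSON line; keep only the first
    max_store incidents in memory (none by default)."""
    stored = []
    count = 0
    for inc in incidents:
        out.write(json.dumps(inc) + "\n")   # printed immediately
        count += 1
        if len(stored) < max_store:
            stored.append(inc)              # retained only up to the cap
    return count, stored


buf = io.StringIO()
# A generator, so incidents are not held in memory before being emitted.
incidents = ({"line": n, "level": 2, "msg": "example"} for n in range(1, 4))
count, stored = emit(incidents, buf, max_store=1)
```

Because printing and storing are controlled independently, a long-running service can log everything while keeping its memory footprint flat.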

@AngledLuffa
Contributor Author

Added a flag for that, as well

@dan-zeman dan-zeman merged commit ba41ef9 into master Sep 15, 2025
@dan-zeman dan-zeman deleted the save_errors branch September 15, 2025 20:45
5 participants