feat(core)!: more explicit handling of case-sensitivity in dictionaries#2630
feat(core)!: more explicit handling of case-sensitivity in dictionaries#263086xsk wants to merge 25 commits intoAutomattic:masterfrom
Conversation
`SpellCheck` shouldn't handle capitalization if `OrthographicConsistency` is going to do it anyway.
- Splits `WordId` into `CanonicalWordId` and `CaseFoldedWordId`. - Updates dictionary functions to more explicitly handle casing. There are now functions to get a specific word case-sensitively, multiple words case-insensitively, and ditto but merge all metadata (which was the old behavior). - Fixes issue where `SpellCheck` would sometimes mark words as incorrect if an identical entry with different casing existed in the dictionary (e.g. OS, PR, etc.). - Makes `SpellCheck` no longer care about casing, since that is handled by `OrthographicConsistency`.
…mattic#2476)" This partially reverts commit 5230d6a. Returns the word to the dictionary, since removing it should no longer be necessary.
Since casing-related issues are now handled by `OrthographicConsistency`, not `SpellCheck`.
Expands the criteria in which `OrthographicConsistency` will lint for incorrect capitalization. Makes the `no_improper_suggestion_for_macos` test pass.
Allow all word casing/orthography that are defined in the dictionary. If the dictionary contains the exact word, `OrthographicConsistency` will skip it.
Both variants are defined in the dictionary, and appear to be valid in this case.
The test would expect 'Al' to be linted by `OrthographicConsistency` for not being all-caps.
Remove needless borrow.
|
Does it handle the same word having multiple This happens a lot more than expected and causes some bugs of this type to crop up. I had a PR for it for many months that just tracked them but didn't really do anything with them as I wasn't sure what was and wasn't planned etc. |
|
I don't recall everything perfectly at the moment but both I think both cases are possible and only one is handled:
I believe you're talking about 1. which works but is behind a few bugs. But 2. is distinct. I found my old PR: #1035 Apologies if I've mixed up any concepts myself - it's been a while (-: |
|
Oh, I see what you mean now. Yeah no, at the moment I think this PR mostly sticks with the old behavior where once a I did change the field to store a I'll take another look though. I wasn't too certain with that change as is, and the PR you mentioned does bring up an issue I haven't even considered. |
If I recall correctly the old behaviour is for each new one to blindly stomp on the previous
Yes I never took on the case folding as I was worried about stepping on the toes of the main devs thinking it might've been an intentional design choice plus I didn't find out it was definitely case folding until months after I did the multi-
It looks like you're fixing up a whole bunch of sticky issues at once - good stuff! |
Support a word having multiple words that it's derived from.
I could be wrong, but I believe it will just keep the first value that was set, since in derived_from: self.derived_from.or(other.derived_from),...it sets the I threw together the changes so that it now supports a word having multiple |
Cool. I remember when I was working on that stuff that it turned out I missed some cases where a word could have multiple |
Avoid unnecessary cloning when using `DictWordMetadata::append`. This reduces the performance regression introduced by storing a Vec inside `DictWordMetadata`.
Remove `EitherWordId` since it's not being used anymore.

Issues
Fixes #2585
Related to #1688
Related to #2411
(Probably a few more, this has been a long-running issue.)
Description
WordIdwithCanonicalWordIdandCaseFoldedWordId.CanonicalWordIdhashes the input word as is, without lowercasing or normalization.CaseFoldedWordIdlowercases and normalizes the word before hashing it (this is the current behavior ofWordId).WordIdPairandto make it easier to work withEitherWordIdCanonicalWordId/CaseFoldedWordId.Vecof words, not an individual word.get_word_metadata_combinedfunctions.SpellCheckdespite being present in the dictionary.SpellCheckno longer care about capitalization, fully transferring that responsibility toOrthographicConsistency(which already handles it anyway for the most part).derived_fromfield inDictWordMetadatato contain multiple words.DerivedFromstruct to contain this data without exposing implementation details. The current implementation uses a Vec which allows for the insertion of unique values only.Running
cargo bench --bench parse_essayshowed a performance regression of about ~3% across the board with these changes. I didn't notice any performance issues when testing the modified build ofharper-ls.How Has This Been Tested?
cargo testChecklist