Skip to content

feat(core)!: more explicit handling of case-sensitivity in dictionaries#2630

Open
86xsk wants to merge 25 commits intoAutomattic:masterfrom
86xsk:fix-dict-casing2
Open

feat(core)!: more explicit handling of case-sensitivity in dictionaries#2630
86xsk wants to merge 25 commits intoAutomattic:masterfrom
86xsk:fix-dict-casing2

Conversation

@86xsk
Copy link
Contributor

@86xsk 86xsk commented Jan 30, 2026

Issues

Fixes #2585
Related to #1688
Related to #2411

(Probably a few more, this has been a long-running issue.)

Description

  • Changes the dictionary API to allow more explicitly handling capitalization/orthographic word variations.
    • Replaces WordId with CanonicalWordId and CaseFoldedWordId.
      • CanonicalWordId hashes the input word as is, without lowercasing or normalization.
      • CaseFoldedWordId lowercases and normalizes the word before hashing it (this is the current behavior of WordId).
      • Adds WordIdPair and EitherWordId to make it easier to work with CanonicalWordId/CaseFoldedWordId.
    • Querying the dictionary case-insensitively now means you will get a Vec of words, not an individual word.
      • However, the current behavior, where a word is queried case-insensitively and returns the merged metadata across all of its casing/orthographic variants is preserved in the get_word_metadata_combined functions.
    • Querying the dictionary case-sensitively will now return a specific entry in the curated dictionary, allowing differentiation for case/orthographic variants when desired.
  • Fixes the issue where some words would be linted by SpellCheck despite being present in the dictionary.
  • Makes SpellCheck no longer care about capitalization, fully transferring that responsibility to OrthographicConsistency (which already handles it anyway for the most part).
  • Re-adds "Pr" to the curated dictionary, since it no longer conflicts with "PR".
  • Allows the derived_from field in DictWordMetadata to contain multiple words.
    • Adds the DerivedFrom struct to contain this data without exposing implementation details. The current implementation uses a Vec which allows for the insertion of unique values only.

Running cargo bench --bench parse_essay showed a performance regression of about ~3% across the board with these changes. I didn't notice any performance issues when testing the modified build of harper-ls.

How Has This Been Tested?

  • cargo test
  • Manually building and testing the modified harper-ls on a few Markdown files

Checklist

  • I have performed a self-review of my own code
  • I have added tests to cover my changes

86xsk added 19 commits January 20, 2026 01:12
`SpellCheck` shouldn't handle capitalization if
`OrthographicConsistency` is going to do it anyway.
- Splits `WordId` into `CanonicalWordId` and `CaseFoldedWordId`.

- Updates dictionary functions to more explicitly handle casing. There
  are now functions to get a specific word case-sensitively, multiple
  words case-insensitively, and ditto but merge all metadata (which was
  the old behavior).

- Fixes issue where `SpellCheck` would sometimes mark words as incorrect
  if an identical entry with different casing existed in the dictionary
  (e.g. OS, PR, etc.).

- Makes `SpellCheck` no longer care about casing, since that is handled
  by `OrthographicConsistency`.
…mattic#2476)"

This partially reverts commit 5230d6a.

Returns the word to the dictionary, since removing it should no longer
be necessary.
Since casing-related issues are now handled by
`OrthographicConsistency`, not `SpellCheck`.
Expands the criteria in which `OrthographicConsistency` will lint for
incorrect capitalization.

Makes the `no_improper_suggestion_for_macos` test pass.
Allow all word casing/orthography that are defined in the dictionary. If
the dictionary contains the exact word, `OrthographicConsistency` will
skip it.
Both variants are defined in the dictionary, and appear to be valid in
this case.
The test would expect 'Al' to be linted by `OrthographicConsistency` for
not being all-caps.
Remove needless borrow.
@hippietrail
Copy link
Collaborator

hippietrail commented Jan 30, 2026

Does it handle the same word having multiple WordIds? For instance you might have dictionary entries like:

- unpack/Vd # also creates "unpacked"
- packed/VtTU # also creates "unpacked
- unpacked/J # specifically creates "unpacked"

This happens a lot more than expected and causes some bugs of this type to crop up. I had a PR for it for many months that just tracked them but didn't really do anything with them as I wasn't sure what was and wasn't planned etc.

@86xsk
Copy link
Contributor Author

86xsk commented Jan 30, 2026

I believe it will merge the metadata for words that have identical canonical word IDs, as seems to be the case here.

image

The code in expand_annotated_word adds or appends metadata based on the canonical word ID (i.e. the hash of the word exactly as stated in the dictionary or as expanded from another word, without lowercasing or normalization). "unpacked" will always have the same word ID regardless of whether it was specifically defined or came from an expansion, so all instances/manifestations of "unpacked" will have their metadata merged.

If in theory we had, say, a proper noun "Unpacked" (whether specifically defined or coming from an expansion), that would not be merged. It would remain a separate entry that can be queried with the appropriate dictionary functions.

However, tokens that originate from a Document still currently use the merged/combined metadata. So an "unpacked" token would have both the "unpacked" and "Unpacked" metadata merged.

// In Document::parse()
// Annotate DictWord metadata
for token in sent.iter_mut() {
    if let TokenKind::Word(meta) = &mut token.kind {
        let word_source = token.span.get_content(&self.source);
        let mut found_meta = dictionary
            .get_word_metadata_combined(word_source) // <---
            .map(|c| c.into_owned());

        if let Some(inner) = &mut found_meta {
            inner.pos_tag = token_tags[i].or_else(|| inner.infer_pos_tag());
            inner.np_member = Some(np_flags[i]);
        }

        *meta = found_meta;
        i += 1;
    } else if !token.kind.is_whitespace() {
        i += 1;
    }
}

@hippietrail
Copy link
Collaborator

hippietrail commented Jan 31, 2026

I don't recall everything perfectly at the moment but both I think both cases are possible and only one is handled:

  1. One WordID can refer to multiple words - due to case folding
  2. One word can have multiple WordIDs

I believe you're talking about 1. which works but is behind a few bugs. But 2. is distinct. I found my old PR: #1035

Apologies if I've mixed up any concepts myself - it's been a while (-:

@86xsk
Copy link
Contributor Author

86xsk commented Jan 31, 2026

Oh, I see what you mean now. Yeah no, at the moment I think this PR mostly sticks with the old behavior where once a DictWordMetadata has its one and onlyderived_from set, it's not updated again.

I did change the field to store a WordIdPair (storing both a canonical and case-folded word ID for its sole "parent" word), but I don't think that helps too much in this case. If I recall correctly, I did this so I could mimic the previous case-folded lookup in the DerivedFrom pattern, while still storing a canonical parent word ID too.

I'll take another look though. I wasn't too certain with that change as is, and the PR you mentioned does bring up an issue I haven't even considered.

@86xsk 86xsk marked this pull request as draft January 31, 2026 17:30
@hippietrail
Copy link
Collaborator

Oh, I see what you mean now. Yeah no, at the moment I think this PR mostly sticks with the old behavior where once a DictWordMetadata has its one and onlyderived_from set, it's not updated again.

If I recall correctly the old behaviour is for each new one to blindly stomp on the previous derived_from?

I did change the field to store a WordIdPair (storing both a canonical and case-folded word ID for its sole "parent" word), but I don't think that helps too much in this case. If I recall correctly, I did this so I could mimic the previous case-folded lookup in the DerivedFrom pattern, while still storing a canonical parent word ID too.

Yes I never took on the case folding as I was worried about stepping on the toes of the main devs thinking it might've been an intentional design choice plus I didn't find out it was definitely case folding until months after I did the multi-derived_from patch.

I'll take another look though. I wasn't too certain with that change as is, and the PR you mentioned does bring up an issue I haven't even considered.

It looks like you're fixing up a whole bunch of sticky issues at once - good stuff!

Support a word having multiple words that it's derived from.
@86xsk
Copy link
Contributor Author

86xsk commented Feb 1, 2026

If I recall correctly the old behaviour is for each new one to blindly stomp on the previous derived_from?

I could be wrong, but I believe it will just keep the first value that was set, since in DictWordMetadata::or...

derived_from: self.derived_from.or(other.derived_from),

...it sets the derived_from to the result of Option::or, which keeps the value if the option is already Some. (I can't say for sure that the value doesn't get overwritten elsewhere though.)


I threw together the changes so that it now supports a word having multiple derived_from words. Not too certain on all the details though. For instance, there would now be two types named DerivedFrom in harper-core.

@hippietrail
Copy link
Collaborator

If I recall correctly the old behaviour is for each new one to blindly stomp on the previous derived_from?

I could be wrong, but I believe it will just keep the first value that was set, since in DictWordMetadata::or...

derived_from: self.derived_from.or(other.derived_from),

...it sets the derived_from to the result of Option::or, which keeps the value if the option is already Some. (I can't say for sure that the value doesn't get overwritten elsewhere though.)

I threw together the changes so that it now supports a word having multiple derived_from words. Not too certain on all the details though. For instance, there would now be two types named DerivedFrom in harper-core.

Cool. I remember when I was working on that stuff that it turned out I missed some cases where a word could have multiple derived_froms - I found the code a bit hard to follow. I think I got some bikeshedding patches in that hopefully made it a bit easier though.

86xsk added 2 commits February 2, 2026 12:47
Avoid unnecessary cloning when using `DictWordMetadata::append`.

This reduces the performance regression introduced by storing a Vec
inside `DictWordMetadata`.
@86xsk 86xsk marked this pull request as ready for review February 3, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

"OS" flagged even though it's in dictionary.dict

2 participants