feat(core)!: more explicit handling of case-sensitivity in dictionaries by 86xsk · Pull Request #2630 · Automattic/harper

86xsk · 2026-01-30T02:59:18Z

Issues

Fixes #2585
Related to #1688
Related to #2411

(Probably a few more, this has been a long-running issue.)

Description

Changes the dictionary API to allow more explicitly handling capitalization/orthographic word variations.
- Replaces WordId with CanonicalWordId and CaseFoldedWordId.
  - CanonicalWordId hashes the input word as is, without lowercasing or normalization.
  - CaseFoldedWordId lowercases and normalizes the word before hashing it (this is the current behavior of WordId).
  - Adds WordIdPair ~~and EitherWordId~~ to make it easier to work with CanonicalWordId/CaseFoldedWordId.
- Querying the dictionary case-insensitively now means you will get a Vec of words, not an individual word.
  - However, the current behavior, where a word is queried case-insensitively and returns the merged metadata across all of its casing/orthographic variants is preserved in the get_word_metadata_combined functions.
- Querying the dictionary case-sensitively will now return a specific entry in the curated dictionary, allowing differentiation for case/orthographic variants when desired.
Fixes the issue where some words would be linted by SpellCheck despite being present in the dictionary.
Makes SpellCheck no longer care about capitalization, fully transferring that responsibility to OrthographicConsistency (which already handles it anyway for the most part).
Re-adds "Pr" to the curated dictionary, since it no longer conflicts with "PR".
Allows the derived_from field in DictWordMetadata to contain multiple words.
- Adds the DerivedFrom struct to contain this data without exposing implementation details. The current implementation uses a Vec which allows for the insertion of unique values only.

Running cargo bench --bench parse_essay showed a performance regression of about ~3% across the board with these changes. I didn't notice any performance issues when testing the modified build of harper-ls.

How Has This Been Tested?

cargo test
Manually building and testing the modified harper-ls on a few Markdown files

Checklist

I have performed a self-review of my own code
I have added tests to cover my changes

`SpellCheck` shouldn't handle capitalization if `OrthographicConsistency` is going to do it anyway.

- Splits `WordId` into `CanonicalWordId` and `CaseFoldedWordId`. - Updates dictionary functions to more explicitly handle casing. There are now functions to get a specific word case-sensitively, multiple words case-insensitively, and ditto but merge all metadata (which was the old behavior). - Fixes issue where `SpellCheck` would sometimes mark words as incorrect if an identical entry with different casing existed in the dictionary (e.g. OS, PR, etc.). - Makes `SpellCheck` no longer care about casing, since that is handled by `OrthographicConsistency`.

…mattic#2476)" This partially reverts commit 5230d6a. Returns the word to the dictionary, since removing it should no longer be necessary.

Since casing-related issues are now handled by `OrthographicConsistency`, not `SpellCheck`.

Expands the criteria in which `OrthographicConsistency` will lint for incorrect capitalization. Makes the `no_improper_suggestion_for_macos` test pass.

Allow all word casing/orthography that are defined in the dictionary. If the dictionary contains the exact word, `OrthographicConsistency` will skip it.

Both variants are defined in the dictionary, and appear to be valid in this case.

The test would expect 'Al' to be linted by `OrthographicConsistency` for not being all-caps.

Remove needless borrow.

hippietrail · 2026-01-30T09:47:33Z

Does it handle the same word having multiple WordIds? For instance you might have dictionary entries like:

- unpack/Vd # also creates "unpacked"
- packed/VtTU # also creates "unpacked
- unpacked/J # specifically creates "unpacked"

This happens a lot more than expected and causes some bugs of this type to crop up. I had a PR for it for many months that just tracked them but didn't really do anything with them as I wasn't sure what was and wasn't planned etc.

86xsk · 2026-01-30T18:29:09Z

I believe it will merge the metadata for words that have identical canonical word IDs, as seems to be the case here.

The code in expand_annotated_word adds or appends metadata based on the canonical word ID (i.e. the hash of the word exactly as stated in the dictionary or as expanded from another word, without lowercasing or normalization). "unpacked" will always have the same word ID regardless of whether it was specifically defined or came from an expansion, so all instances/manifestations of "unpacked" will have their metadata merged.

If in theory we had, say, a proper noun "Unpacked" (whether specifically defined or coming from an expansion), that would not be merged. It would remain a separate entry that can be queried with the appropriate dictionary functions.

However, tokens that originate from a Document still currently use the merged/combined metadata. So an "unpacked" token would have both the "unpacked" and "Unpacked" metadata merged.

// In Document::parse()
// Annotate DictWord metadata
for token in sent.iter_mut() {
    if let TokenKind::Word(meta) = &mut token.kind {
        let word_source = token.span.get_content(&self.source);
        let mut found_meta = dictionary
            .get_word_metadata_combined(word_source) // <---
            .map(|c| c.into_owned());

        if let Some(inner) = &mut found_meta {
            inner.pos_tag = token_tags[i].or_else(|| inner.infer_pos_tag());
            inner.np_member = Some(np_flags[i]);
        }

        *meta = found_meta;
        i += 1;
    } else if !token.kind.is_whitespace() {
        i += 1;
    }
}

hippietrail · 2026-01-31T05:28:23Z

I don't recall everything perfectly at the moment but both I think both cases are possible and only one is handled:

One WordID can refer to multiple words - due to case folding
One word can have multiple WordIDs

I believe you're talking about 1. which works but is behind a few bugs. But 2. is distinct. I found my old PR: #1035

Apologies if I've mixed up any concepts myself - it's been a while (-:

86xsk · 2026-01-31T16:50:25Z

Oh, I see what you mean now. Yeah no, at the moment I think this PR mostly sticks with the old behavior where once a DictWordMetadata has its one and onlyderived_from set, it's not updated again.

I did change the field to store a WordIdPair (storing both a canonical and case-folded word ID for its sole "parent" word), but I don't think that helps too much in this case. If I recall correctly, I did this so I could mimic the previous case-folded lookup in the DerivedFrom pattern, while still storing a canonical parent word ID too.

I'll take another look though. I wasn't too certain with that change as is, and the PR you mentioned does bring up an issue I haven't even considered.

hippietrail · 2026-01-31T18:02:22Z

Oh, I see what you mean now. Yeah no, at the moment I think this PR mostly sticks with the old behavior where once a DictWordMetadata has its one and onlyderived_from set, it's not updated again.

If I recall correctly the old behaviour is for each new one to blindly stomp on the previous derived_from?

I did change the field to store a WordIdPair (storing both a canonical and case-folded word ID for its sole "parent" word), but I don't think that helps too much in this case. If I recall correctly, I did this so I could mimic the previous case-folded lookup in the DerivedFrom pattern, while still storing a canonical parent word ID too.

Yes I never took on the case folding as I was worried about stepping on the toes of the main devs thinking it might've been an intentional design choice plus I didn't find out it was definitely case folding until months after I did the multi-derived_from patch.

I'll take another look though. I wasn't too certain with that change as is, and the PR you mentioned does bring up an issue I haven't even considered.

It looks like you're fixing up a whole bunch of sticky issues at once - good stuff!

Support a word having multiple words that it's derived from.

86xsk · 2026-02-01T21:02:58Z

If I recall correctly the old behaviour is for each new one to blindly stomp on the previous derived_from?

I could be wrong, but I believe it will just keep the first value that was set, since in DictWordMetadata::or...

derived_from: self.derived_from.or(other.derived_from),

...it sets the derived_from to the result of Option::or, which keeps the value if the option is already Some. (I can't say for sure that the value doesn't get overwritten elsewhere though.)

I threw together the changes so that it now supports a word having multiple derived_from words. Not too certain on all the details though. For instance, there would now be two types named DerivedFrom in harper-core.

hippietrail · 2026-02-02T01:59:25Z

If I recall correctly the old behaviour is for each new one to blindly stomp on the previous derived_from?

I could be wrong, but I believe it will just keep the first value that was set, since in DictWordMetadata::or...
derived_from: self.derived_from.or(other.derived_from),
...it sets the derived_from to the result of Option::or, which keeps the value if the option is already Some. (I can't say for sure that the value doesn't get overwritten elsewhere though.)

I threw together the changes so that it now supports a word having multiple derived_from words. Not too certain on all the details though. For instance, there would now be two types named DerivedFrom in harper-core.

Cool. I remember when I was working on that stuff that it turned out I missed some cases where a word could have multiple derived_froms - I found the code a bit hard to follow. I think I got some bikeshedding patches in that hopefully made it a bit easier though.

Avoid unnecessary cloning when using `DictWordMetadata::append`. This reduces the performance regression introduced by storing a Vec inside `DictWordMetadata`.

Remove `EitherWordId` since it's not being used anymore.

86xsk added 19 commits January 20, 2026 01:12

test(core): add failing test

008c534

test(comments): don't expect 'lin' to be marked as a spelling error

a5f4a1b

test(core): move tests

5face23

`SpellCheck` shouldn't handle capitalization if `OrthographicConsistency` is going to do it anyway.

test(core): don't expect SpellCheck to mark capitalization issues

fe37ba6

deps(core): add indexmap

6f02acd

chore: update snapshots

ce666c5

Partially revert "fix(core): PR getting flagged as 'misspelled' (Auto…

4443a4e

…mattic#2476)" This partially reverts commit 5230d6a. Returns the word to the dictionary, since removing it should no longer be necessary.

test(core): merge tests and add test

e982f23

Merge branch 'master' into fix-dict-casing2

5688aee

test(core): move test

0a1e2d4

Since casing-related issues are now handled by `OrthographicConsistency`, not `SpellCheck`.

fix(core): fix logic in OrthographicConsistency

cf9a90a

Expands the criteria in which `OrthographicConsistency` will lint for incorrect capitalization. Makes the `no_improper_suggestion_for_macos` test pass.

test(core): add failing test

007df6e

fix(core): allow all case-variants in OrthographicConsistency

7518350

Allow all word casing/orthography that are defined in the dictionary. If the dictionary contains the exact word, `OrthographicConsistency` will skip it.

test(core): remove Lego -> LEGO test in OrthographicConsistency

3c9d54e

Both variants are defined in the dictionary, and appear to be valid in this case.

chore: update snapshots

8b426d9

test(core): add test

3381fba

test(core): fix incorrect test expectation

e11a2d6

The test would expect 'Al' to be linted by `OrthographicConsistency` for not being all-caps.

refactor(core): appease Clippy

b23f652

Remove needless borrow.

86xsk marked this pull request as draft January 31, 2026 17:30

feat(core): support multiple derived_from

3f068bb

Support a word having multiple words that it's derived from.

86xsk added 2 commits February 2, 2026 12:47

perf(core): reduce Vec cloning

18ba296

Avoid unnecessary cloning when using `DictWordMetadata::append`. This reduces the performance regression introduced by storing a Vec inside `DictWordMetadata`.

refactor(core): reuse code from similar function

aeba563

86xsk added 3 commits February 3, 2026 14:13

Merge branch 'master' into fix-dict-casing2

f30cfff

refactor(core): remove dead code

32ce68c

Remove `EitherWordId` since it's not being used anymore.

Merge branch 'master' into fix-dict-casing2

5bd11c1

86xsk marked this pull request as ready for review February 3, 2026 22:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(core)!: more explicit handling of case-sensitivity in dictionaries#2630

feat(core)!: more explicit handling of case-sensitivity in dictionaries#2630
86xsk wants to merge 25 commits intoAutomattic:masterfrom
86xsk:fix-dict-casing2

86xsk commented Jan 30, 2026 •

edited

Loading

Uh oh!

hippietrail commented Jan 30, 2026 •

edited

Loading

Uh oh!

86xsk commented Jan 30, 2026 •

edited

Loading

Uh oh!

hippietrail commented Jan 31, 2026 •

edited

Loading

Uh oh!

86xsk commented Jan 31, 2026

Uh oh!

hippietrail commented Jan 31, 2026

Uh oh!

86xsk commented Feb 1, 2026

Uh oh!

hippietrail commented Feb 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

86xsk commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Issues

Description

How Has This Been Tested?

Checklist

Uh oh!

hippietrail commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

86xsk commented Jan 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hippietrail commented Jan 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

86xsk commented Jan 31, 2026

Uh oh!

hippietrail commented Jan 31, 2026

Uh oh!

86xsk commented Feb 1, 2026

Uh oh!

hippietrail commented Feb 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

86xsk commented Jan 30, 2026 •

edited

Loading

hippietrail commented Jan 30, 2026 •

edited

Loading

86xsk commented Jan 30, 2026 •

edited

Loading

hippietrail commented Jan 31, 2026 •

edited

Loading