
feat: 4-tier definition system with kaikki.org + parser hardening #132

Merged
Hugo0 merged 7 commits into main from feat/definition-quality-overhaul on Feb 24, 2026
Conversation

@Hugo0 (Owner) commented Feb 24, 2026

Summary

Overhauls the definition system from a fragile 2-tier approach (Wiktionary parser → LLM) to a robust 4-tier pipeline:

  1. kaikki.org native definitions — pre-built JSON from native Wiktionary editions (de.wiktionary.org, fr.wiktionary.org, etc.). Instant, offline, definitions in the word's own language.
  2. Wiktionary parser — now only invoked for languages where it's been tested and graded CONFIDENT or PARTIAL (39 of 65). Unreliable parsers are skipped entirely.
  3. kaikki.org English glosses — fallback from en.wiktionary.org. Better than nothing.
  4. LLM (gpt-4o-mini) — last resort, expensive.
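The tier order above amounts to a simple fallback chain. The sketch below is illustrative only, not the actual fetch_definition_cached() implementation: the file layout, the PARSER_CONFIDENCE excerpt, and the function signatures are all assumptions.

```python
import json
from pathlib import Path

# Illustrative sketch of the 4-tier chain; names and the PARSER_CONFIDENCE
# excerpt below are assumptions, not the real wiktionary.py code.
DATA_DIR = Path("webapp/data/definitions")
CONFIDENT, PARTIAL, UNRELIABLE = "CONFIDENT", "PARTIAL", "UNRELIABLE"
PARSER_CONFIDENCE = {"de": CONFIDENT, "fi": PARTIAL, "fo": UNRELIABLE}  # excerpt only


def load_tier(filename: str) -> dict:
    """Load one pre-built kaikki JSON file, or return {} when absent."""
    path = DATA_DIR / filename
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}


def get_definition(word: str, lang: str, parser_fetch, llm_fetch):
    # Tier 1: native kaikki definitions ({lang}.json) -- instant, own language
    native = load_tier(f"{lang}.json").get(word)
    if native:
        return native
    # Tier 2: Wiktionary parser, only for CONFIDENT/PARTIAL languages
    if PARSER_CONFIDENCE.get(lang) in (CONFIDENT, PARTIAL):
        parsed = parser_fetch(word, lang)
        if parsed:
            return parsed
    # Tier 3: English glosses from en.wiktionary.org ({lang}_en.json)
    gloss = load_tier(f"{lang}_en.json").get(word)
    if gloss:
        return gloss
    # Tier 4: LLM as the expensive last resort
    return llm_fetch(word, lang)
```

Note that an UNRELIABLE language skips tier 2 entirely rather than risking a garbage parse.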

What's new

  • scripts/build_definitions.py — Downloads kaikki.org data extracts, processes them into compact JSON. Outputs 70 files (12MB, 206K definitions) covering all 65 languages.
  • PARSER_CONFIDENCE map — 65 languages graded by automated golden tests (201 fixtures). Embedded in wiktionary.py so unreliable parsers are never called at runtime.
  • Garbage definition filter — Regex-based _is_garbage_definition() catches pronunciation data, etymology, "no definition" placeholders, and leaked wiki markup that the parser occasionally returns.
  • German parser fixes — Skips Nebenformen/Aussprache/Worttrennung metadata lines, handles Konjugierte/Deklinierte Form sections.
  • Language section filtering — _filter_language_section() isolates the correct language on shared Wiktionary pages (e.g. Norwegian "no" edition has both Bokmål and Nynorsk).
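A regex-based garbage filter of the kind described above fits in a few lines. This is a simplified sketch, not the real _is_garbage_definition(); the pattern list covers only a sample of the cases mentioned (Swedish "uttal:", Korean "어원:", leaked wiki markup, placeholders).

```python
import re

# Simplified illustration of a garbage filter; the real
# _is_garbage_definition() covers more languages and patterns.
_GARBAGE_PATTERNS = [
    re.compile(r"^uttal\s*:", re.IGNORECASE),   # Swedish pronunciation lines
    re.compile(r"^어원\s*:"),                    # Korean etymology lines
    re.compile(r"\{\{|\}\}|\[\["),              # leaked wiki template/link markup
    re.compile(r"^(no definition|definition needed)", re.IGNORECASE),  # placeholders
]


def is_garbage_definition(text) -> bool:
    """Return True when a candidate definition is metadata, not meaning."""
    stripped = (text or "").strip()
    if not stripped:
        return True
    return any(p.search(stripped) for p in _GARBAGE_PATTERNS)
```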

Also included (small fixes from same session)

  • Stats endpoint returns JSON so client can compute percentile (was returning empty 200)
  • Dedup switched from IP-based to client_id (localStorage UUID) — works behind shared NATs
  • Share text formatting improved (em dash, double newline, full URL)
  • communityIsTopScore flag replaces fragile percentile === 0 check
  • stats.ts — extracted calculateCommunityPercentile() with 21 unit tests
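The extracted percentile logic lends itself to a small sketch. The actual implementation is calculateCommunityPercentile() in frontend/src/stats.ts (TypeScript); the Python below only illustrates one plausible "better than X%" definition, and the counts-by-attempts input shape is an assumption, not the real WordStats type.

```python
# Hedged sketch of the community-percentile logic. counts_by_attempts maps
# attempt count -> number of players who finished with that many attempts;
# this shape and the rounding choice are assumptions.
def community_percentile(attempts, counts_by_attempts):
    total = sum(counts_by_attempts.values())
    if total == 0:
        return None, False  # no data yet: no percentile, not top score
    worse = sum(n for a, n in counts_by_attempts.items() if a > attempts)
    better = sum(n for a, n in counts_by_attempts.items() if a < attempts)
    percentile = round(100 * worse / total)
    # isTopScore: nobody solved it in fewer attempts -- this flag replaces
    # the fragile percentile == 0 check mentioned above
    return percentile, better == 0
```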

Coverage breakdown (approximate)

Tier | Source            | Language | Words covered
1    | kaikki native     | Native   | ~10% (13 langs with native editions)
2    | Wiktionary parser | Native   | ~43% (39 CONFIDENT/PARTIAL langs)
3    | kaikki English    | English  | ~15% (all 65 langs)
4    | LLM               | Native   | ~32% (everything else)

Every word has some definition path. Native-language definitions preferred when available.

Areas requesting careful review

1. Definition JSON files committed to repo (12MB)

We're committing 70 JSON files totaling 12MB. These are build artifacts from build_definitions.py but we ship them as data (loaded at runtime by Flask). Alternative would be to generate on deploy, but that requires downloading ~1.7GB of kaikki data. Is 12MB of JSON acceptable in the repo, or should this be a deploy-time build step?
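If the 12MB stays in the repo, one way to keep its runtime cost down is to load the files lazily, one language at a time, instead of reading all 70 at startup. A minimal sketch, assuming per-language files under webapp/data/definitions/; the loader name is hypothetical.

```python
import json
from functools import lru_cache
from pathlib import Path

# Hypothetical lazy loader: each definition file is read once per process,
# on first use, so a running instance only pays for the languages it serves.
DEFINITIONS_DIR = Path("webapp/data/definitions")


@lru_cache(maxsize=None)
def definitions_for(lang: str, english_gloss: bool = False) -> dict:
    """Load {lang}.json (native) or {lang}_en.json (gloss); {} if missing."""
    name = f"{lang}_en.json" if english_gloss else f"{lang}.json"
    path = DEFINITIONS_DIR / name
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)
```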

2. PARSER_CONFIDENCE map is hardcoded

The confidence map is generated by test_wiktionary_parser.py but manually copy-pasted into wiktionary.py. If parser quality changes over time, the map goes stale. We could auto-generate it, but that adds complexity. Is the manual approach OK for now?

3. Garbage filter is regex-based

_is_garbage_definition() catches known patterns (Swedish "uttal:", Hungarian placeholders, Korean etymology "어원:", etc.) but is inherently incomplete — new garbage patterns from other languages could slip through. Should we add a more comprehensive approach, or is regex + iteration sufficient?

4. kaikki native vs parser overlap

For the 13 languages with native kaikki editions (de, fr, es, etc.), kaikki always wins over the parser. The parser was CONFIDENT for most of these, so we might be trading parser definitions (sometimes more detailed) for kaikki ones (sometimes shorter but cleaner). Is "kaikki native first" the right priority, or should parser-CONFIDENT languages prefer the parser?

5. English glosses as tier 3

For ~26 UNRELIABLE-parser languages without native kaikki, tier 3 returns English glosses (e.g. Faroese word → English definition). This is suboptimal — the user sees English instead of their language. But it's better than LLM hallucination. Is this acceptable or should we skip English glosses and go straight to LLM?

6. Stats/percentile changes bundled with definition work

These are logically separate features that happened in the same session. Happy to split into a separate PR if preferred.

Test plan

  • uv run pytest tests/ -x -q — 2250 passed, 675 skipped, 7 xfailed
  • pnpm test — 81 passed (5 test files)
  • pnpm build — compiles clean
  • Play a game, win → verify "Better than X% of players" in TODAY tab
  • Verify definitions load for: English, Finnish (parser), German (kaikki native), Faroese (English gloss)
  • Verify DALL-E images use improved definitions
  • Verify dark mode on definition cards
  • Check no console errors on game page

Summary by CodeRabbit

  • New Features

    • Community percentile and top-score indicators; share text updated with full URL and en‑dash; word image prompts enhanced with definition hints.
  • Data

    • Large expansion of bilingual definition resources and Wiktionary fixtures covering 60+ languages.
  • Tests

    • Extensive unit and integration tests for percentile logic and Wiktionary parsing/coverage added.
  • Bug Fixes

    • Improved submission deduplication and more graceful image/error responses.

…r 65 languages

- build_definitions.py: downloads kaikki.org data extracts, processes them
  into compact {lang}.json (native) and {lang}_en.json (English gloss) files
- Outputs 70 JSON files (12MB total, 206K definitions)
- Native editions for 13 languages (cs, de, el, en, es, fr, it, nl, pl, pt, ru, tr, vi)
- English glosses for all 65 languages from en.wiktionary.org
- Cleans wiki markup, truncates long entries, filters self-referential defs
- capture_wiktionary_fixtures.py: captures parser test fixtures from live APIs
- .gitignore: exclude bulk kaikki download data
- test_wiktionary_parser.py: 201 golden fixture tests per language, generates
  PARSER_CONFIDENCE map (CONFIDENT/PARTIAL/UNRELIABLE) used at runtime
- test_wiktionary_definitions.py: validates definition coverage across all
  65 languages against cached definitions
- tests/fixtures/wiktionary/: 65 fixture files with real Wiktionary API
  responses for reproducible parser testing
…rbage filter

Rewrites fetch_definition_cached() to use a priority chain:
  1. kaikki native defs ({lang}.json) — instant, in the word's own language
  2. Wiktionary parser — only for CONFIDENT/PARTIAL languages (39 of 65)
  3. kaikki English glosses ({lang}_en.json) — fallback
  4. LLM (gpt-4o-mini) — last resort

Key changes to wiktionary.py:
- Embed PARSER_CONFIDENCE map (65 languages) to skip unreliable parsers
- Add _is_garbage_definition() filter catching pronunciation, etymology,
  placeholders, and leaked wiki markup from parser output
- Add _filter_language_section() to isolate correct language on shared
  Wiktionary pages (e.g. Norwegian "no" edition has Bokmål + Nynorsk)
- Improve German parser: skip Nebenformen/Aussprache/Worttrennung metadata,
  handle Konjugierte/Deklinierte Form sections
- Split _fetch_wiktionary_definition() from fetch_definition_cached() for
  cleaner tier separation
…text

app.py:
- Return JSON from submit_word_stats() so client can compute percentile
- Switch dedup from IP-based to client_id (localStorage UUID)
- Handle OSError on stats write (disk full graceful degradation)
- Fix word_image error handling (404 not 500 for generation failure)

game.ts:
- Extract calculateCommunityPercentile() to stats.ts with proper edge cases
- Add communityIsTopScore flag for "Top score today!" display
- Add getClientId() for stable dedup across sessions
- Re-fetch community stats on page load when game already complete
- Improve share text format (em dash, double newline, full URL)

game.html:
- Use communityIsTopScore flag instead of percentile === 0 check

coderabbitai bot commented Feb 24, 2026

📝 Walkthrough


Adds scripts and large multilingual definition data, many Wiktionary fixtures and parser/quality tests, a stats module and frontend changes to compute/submit community percentiles with client_id, and backend updates to word-image prompts and submit_word_stats error/dedup handling.

Changes

  • Build & Processing Scripts (scripts/build_definitions.py, scripts/capture_wiktionary_fixtures.py, .gitignore): New CLI/script to download/process Wiktionary JSONL and capture API extracts; adds .gitignore entry scripts/.kaikki_data/.
  • Frontend: community stats (frontend/src/stats.ts, frontend/src/game.ts, frontend/src/__tests__/percentile.test.ts): New WordStats types and calculateCommunityPercentile(); game.ts persists client_id, exposes communityPercentile/communityIsTopScore, calls percentile logic, and includes client_id when submitting stats; comprehensive unit tests added.
  • Backend: image & stats handling (webapp/app.py): submit_word_stats now deduplicates by client_id (fallback to IP), returns existing stats on duplicates, adds disk/OSError handling and broader exception logging; word_image prompt includes a definition hint and uses 404 for some generation failures.
  • Wiktionary parser tests & utils (tests/test_wiktionary_parser.py, tests/test_wiktionary_definitions.py, tests/wiktionary_test_utils.py): New offline and network-enabled test suites: golden parser tests, definition-quality checks, confidence mapping and coverage reporting; adds is_quality_definition utility and regex helpers.
  • Fixtures, many languages (tests/fixtures/wiktionary/*.json): ~65+ per-language JSON fixtures added/updated with extract, parsed, word_type, and tried_word entries for golden/quality tests.
  • Definition data resources (webapp/data/definitions/*_en.json): 40+ large language→English JSON dictionaries added (e.g., az_en.json, bg_en.json, eo_en.json) as static data assets used by processing/UI.
  • Integration tests (tests/test_wiktionary_definitions.py, tests/test_wiktionary_parser.py): New test modules that validate fetching/parsing across languages and compute per-language parser confidence metrics.

Sequence Diagram(s)

sequenceDiagram
    participant Player as Player
    participant Game as frontend/src/game.ts
    participant Stats as frontend/src/stats.ts
    participant Storage as LocalStorage
    participant Backend as webapp/app.py

    Player->>Game: game ends
    Game->>Storage: getClientId()
    Storage-->>Game: client_id
    Game->>Stats: calculateCommunityPercentile(attempts, WordStats)
    Stats-->>Game: {percentile, isTopScore}
    Game->>Game: set communityPercentile, communityIsTopScore
    Game->>Backend: submitWordStats(payload incl. client_id)
    Backend->>Backend: deduplicate by client_id (fallback IP)
    Backend-->>Game: updated stats response

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

I nibble at glosses, hop through each tongue,
I stash a client id, sing percentiles strung,
Fixtures grow like carrots in neat little rows,
Scripts fetch definitions where the big wind blows,
A rabbit applauds — data wherever it goes! 🐰

🚥 Pre-merge checks: ✅ 3 passed
  • Description Check: Passed (check skipped because CodeRabbit's high-level summary is enabled).
  • Title Check: Passed. The title clearly and specifically summarizes the main change: introducing a 4-tier definition system with kaikki.org integration and parser hardening.
  • Docstring Coverage: Passed. Docstring coverage is 84.78%, above the required 80.00% threshold.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 13

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
frontend/src/game.ts (1)

1326-1348: ⚠️ Potential issue | 🟡 Minor

Unhandled case: resp.json() may throw on non-JSON responses.

If the server returns a 200 with empty body (line 1494 in app.py: return "", 200), calling resp.json() will throw. The .catch(() => {}) at line 1348 silently swallows it, so there's no crash — but the user gets no percentile data. This happens when _load_word_stats returns None after a successful update.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontend/src/game.ts` around lines 1326 - 1348, The fetch handling can throw
when resp.json() is called on an empty (200) response; update the success branch
that currently does (resp.ok ? resp.json() : null) to safely parse only when
there is a non-empty JSON body: call resp.text(), check if the returned text is
non-empty, then JSON.parse it inside a try/catch (or reject to null on
SyntaxError) so empty or non-JSON responses yield null; keep the downstream
logic that only uses stats when non-null (calculateCommunityPercentile,
communityPercentile, communityIsTopScore, communityTotal, communityStatsLink)
unchanged.
♻️ Duplicate comments (4)
tests/fixtures/wiktionary/ga.json (1)

1-26: Same all-null concern already noted for eu.json.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/wiktionary/ga.json` around lines 1 - 26, The fixture entries
"horás", "forar", "diasa", and "bríde" in ga.json currently have extract and
parsed set to null (same issue as eu.json); fix by either supplying the expected
extracted text and parsed structure for each word or removing these placeholder
entries so tests don't rely on all-null records—make the change consistent with
how eu.json was resolved (populate with real extraction/parsed values or delete
the entries).
tests/fixtures/wiktionary/ne.json (1)

1-26: Same all-null concern already noted for eu.json.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/wiktionary/ne.json` around lines 1 - 26, The fixture ne.json
contains several entries ("घोपाइ", "बिसाइ", "सजिवन", "ठोसाउ") where extract and
parsed are null and word_type is "unknown" — same all-null issue noted for
eu.json; update the fixture by either removing these all-null entries or
replacing them with the expected parsed/extract/word_type values produced by the
Wiktionary parser (or regenerate the fixture from the parser output), and ensure
the test that loads ne.json expects the non-null structure produced by the
parsing functions so the entries no longer appear as all-null placeholders.
tests/fixtures/wiktionary/pau.json (1)

1-26: Same all-null concern already noted for eu.json.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/wiktionary/pau.json` around lines 1 - 26, The fixture entries
for keys "chull", "ikrir", "burek", and "modil" contain all-null values for
"extract" and "parsed" and have "word_type":"unknown" which duplicates the same
all-null problem noted for eu.json; either replace those placeholder entries
with real parsed fixture data (fill "extract" and "parsed" with the actual
scraped content and set a correct "word_type" and "tried_word") or remove the
stub entries so tests don't rely on empty fixtures; ensure the updates mirror
the resolution you applied to the eu.json case and keep the JSON schema fields
"extract", "parsed", "word_type", and "tried_word" consistent.
tests/fixtures/wiktionary/ie.json (1)

1-26: Same all-null concern already noted for eu.json.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/wiktionary/ie.json` around lines 1 - 26, The entries "datil",
"salva", "jaspe", and "nudar" are all-null fixtures (extract=null, parsed=null,
word_type="unknown"), which duplicates the issue noted for eu.json; fix by
either (a) populating each key with realistic fixture data (non-null "extract"
text and a valid "parsed" structure and accurate "word_type"), or (b) remove
these all-null entries from tests/fixtures/wiktionary/ie.json if they are not
needed, and update any test expectations that reference these keys to match the
change (apply the same approach used to fix eu.json).
🧹 Nitpick comments (15)
tests/fixtures/wiktionary/eu.json (1)

1-26: All-null fixtures provide no positive-path test coverage for the parser.

Every entry across all added fixture files (eu, ie, la, pau, ne, nds, ga) has "extract": null and "parsed": null. These only assert that the system gracefully handles "word not found" — they do not verify that the parser correctly extracts a definition when Wiktionary data is present. If a regression causes the parser to silently return null for a word it previously parsed correctly, these tests would still pass.

Consider seeding each fixture with at least one word that has a known, non-null definition for the language so the golden tests also cover the happy path.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/wiktionary/eu.json` around lines 1 - 26, The fixtures
currently only contain all-null entries for languages eu, ie, la, pau, ne, nds,
and ga which only assert “not found” behavior; add at least one positive-path
entry per fixture by replacing or adding a word object with non-null "extract"
and "parsed" values (a real Wiktionary extract and the expected parsed
structure) so the golden tests validate successful parsing; ensure the added
entry uses the same JSON shape as existing objects (keys "extract", "parsed",
"word_type", "tried_word") and pick a stable, common word for each language to
avoid flaky updates.
webapp/data/definitions/et_en.json (1)

1-1698: Consider lazy-loading and compression for this large static asset.

Given the size of this dictionary file, ensure it’s fetched on demand (not bundled into initial JS) and served with gzip/brotli + strong caching to avoid frontend load regressions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/et_en.json` around lines 1 - 1698, The large static
JSON (et_en.json) should be removed from the initial JS bundle and loaded
lazily: create a dedicated async loader (e.g. loadEtEnDictionary or
fetch('/data/definitions/et_en.json')) and replace any direct imports/usages
with await loadEtEnDictionary() in components that need it; ensure the file
lives as a static asset (webapp/data/definitions/et_en.json) rather than
bundled. On the server/static hosting, enable gzip and brotli compression and
add strong caching headers (Cache-Control with long max-age + immutable and
proper ETag/Last-Modified) for that asset so subsequent loads are cached. Also
update build/static config (assetCompression/serveStatic middleware) to exclude
this file from client bundle and verify network waterfall to confirm it's
fetched on demand.
scripts/capture_wiktionary_fixtures.py (1)

103-200: Remove the unused word parameter (or use it).

guess_word_type doesn’t use word, which triggers the lint warning and adds API noise.

♻️ Suggested cleanup
-def guess_word_type(extract, word):
+def guess_word_type(extract):
     """Guess the word type from Wiktionary extract headers."""
     if not extract:
         return "unknown"
-                word_type = guess_word_type(extract, tried_word)
+                word_type = guess_word_type(extract)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/capture_wiktionary_fixtures.py` around lines 103 - 200, The
guess_word_type function declares an unused parameter word which triggers linter
warnings and clutters the API; remove the unused parameter from the function
signature (change def guess_word_type(extract, word): to def
guess_word_type(extract):) and update any callers that pass a second argument to
only pass extract, or alternatively use the word parameter inside
guess_word_type if contextual matching is needed; ensure you update references
to the function name guess_word_type accordingly so signatures and calls remain
consistent.
webapp/data/definitions/bg_en.json (1)

1-40: Consider safe-display handling for sensitive lexicon entries.

Comprehensive dictionaries often include offensive or sensitive terms; if this data is user-facing, ensure the UI/content policy (filters, warnings, or opt-in visibility) is explicit.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/bg_en.json` around lines 1 - 40, This JSON contains
potentially sensitive/offensive entries (e.g., "абдал", "аборт", "аллах") and
needs metadata for safe-display; add a per-entry flag or field (e.g.,
"sensitive": true or "sensitivity": "offensive|religious|medical") alongside
offending keys in bg_en.json and update the UI rendering path to check that
metadata (or a centralized filter) before showing entries—ensure entries like
"абдал", "аборт", and "аллах" are marked appropriately and that the frontend
uses those flags to apply content warnings, opt-in visibility, or filtering.
webapp/data/definitions/eo_en.json (1)

1-5: Confirm lazy-loading/compression for this large asset.

This dictionary is sizable; please ensure it’s fetched on demand or served as a compressed static asset rather than bundled into initial client payloads.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/eo_en.json` around lines 1 - 5, The eo_en.json large
dictionary (keys like "abajo", "abela", "abelo", "abioj") must not be bundled
into the initial client payload; instead either (A) serve it as a compressed
static asset (enable gzip/brotli on the server or upload to the CDN and set
proper Content-Encoding/Cache-Control) or (B) lazy-load it at runtime (remove
any static imports and fetch it on demand via dynamic import or
fetch('/assets/eo_en.json') from the client when needed) so the bundle size
stays small and the file is delivered compressed and cached.
webapp/app.py (1)

1469-1484: Dedup still uses _stats_seen_ips variable name despite switching to client_id.

The dict _stats_seen_ips and the constant _STATS_MAX_IPS (lines 807-808) are named for IP-based dedup, but now store client_id-based keys. Consider renaming for clarity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/app.py` around lines 1469 - 1484, The in-memory dedup variables are
still named for IPs even though dedup now uses client_id; rename _stats_seen_ips
(and all references) to something like _stats_seen_clients (or _stats_seen_keys)
and rename _STATS_MAX_IPS to _STATS_MAX_CLIENTS (or _STATS_MAX_KEYS), and update
uses in this block (variables client_id, dedup_key, checks like "if dedup_key in
_stats_seen_*", insertions "_stats_seen_*[dedup_key] = True", and any other
references elsewhere in the file) so names accurately reflect client_id-based
deduping while keeping behavior unchanged.
frontend/src/game.ts (1)

377-382: Re-fetch on page reload triggers a POST to word-stats on every load.

When a completed game is loaded from localStorage, submitWordStats sends a POST request each time the page is mounted. The backend dedup (via client_id) should catch these as duplicates and return existing stats without double-counting, so this is functionally safe — but it generates unnecessary traffic. Consider using a GET endpoint for fetching current stats separately from submitting results, or caching the percentile in localStorage.

Also, the typeof this.attempts === 'number' check on line 379 will always be false since this.attempts is typed as string in the GameData interface (line 113). The parseInt fallback handles this correctly, but the dead branch is confusing.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frontend/src/game.ts` around lines 377 - 382, The current mount path always
calls submitWordStats(this.game_won, attempts) causing a POST to word-stats on
every page load and includes a dead branch checking typeof this.attempts ===
'number'; update the logic in the component that runs on mount to avoid
re-submitting completed-game results: either call a new GET endpoint to fetch
existing stats (instead of POST) or check/cache a submission marker or cached
percentile in localStorage and only call submitWordStats when no cached
submission exists; also simplify attempts parsing by removing the unreachable
typeof this.attempts === 'number' branch and always using
parseInt(String(this.attempts), 10) || 0; touch symbols: submitWordStats,
attempts (GameData), and the mount code that currently calls submitWordStats to
implement this behavior.
tests/fixtures/wiktionary/ru.json (1)

1-26: Russian parsed definitions include example sentences and appear truncated.

The parsed values include citation markers () and full example sentences (e.g., "гнейс" includes a Vernadsky diary quote). Additionally, "холин" and "бланш" appear truncated mid-word ("химическая ст", "под правым г"), suggesting a character limit during fixture capture. If these are intended to document current parser behavior, this is fine — but the truncation might cause fragile test assertions if the cutoff point ever shifts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/wiktionary/ru.json` around lines 1 - 26, The fixture's
"parsed" fields for entries like "гнейс", "холин", and "бланш" contain citation
markers (◆), full example sentences and are truncated mid-word, indicating a
capture/serialization length limit; fix by either (A) regenerating fixtures with
a longer capture limit so "parsed" contains full intended text and citation
markers are normalized, or (B) update the parser/fixture generation to strip
citation markers and example sentences before writing "parsed" (or produce a
cleaned "parsed_normalized" field), and change tests to assert on stable anchors
(e.g., startsWith/contains of the definition text) rather than exact truncated
strings; locate the code that generates/writes the "parsed" JSON for these
entries and apply one of these fixes.
tests/fixtures/wiktionary/cs.json (1)

1-26: Czech parser extracts grammatical gender instead of actual definitions.

The parsed values for "brňan", "palaš", and "motor" are all grammatical gender strings (e.g., "rod mužský neživotný") rather than semantic definitions. For instance, "palaš" means "heavy cavalry sword" (těžký jezdecký meč...) per the extract, but the parser only captures the gender line. If these fixtures reflect the expected parser behavior, this is fine as-is — but it suggests the Czech parser would benefit from improvement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/fixtures/wiktionary/cs.json` around lines 1 - 26, The fixtures show the
Czech parser is populating the "parsed" field with grammatical gender strings
instead of semantic definitions; update the Czech parsing logic (e.g.,
parseCzechEntry / extractDefinition / getDefinitionFromExtract) to parse the
"extract" text and pull the "==== význam ====" block as the primary value for
"parsed" (use the "extract" field to locate and trim the meaning), fall back to
the gender line only if no "význam" section exists, and ensure outputs still
populate "parsed", "word_type", and "tried_word" consistently for entries like
Brňan, palaš, and motor; then update the fixtures or tests to match the
corrected behavior.
tests/test_wiktionary_parser.py (4)

278-288: Unused loop variables in the categorization loop.

good, has_ext, and details are unpacked but not used in this loop body (lines 282-288). Prefix them with _ to signal intent and silence linter warnings.

Proposed fix
-        for lang, (level, good, has_ext, details) in sorted(confidence.items()):
+        for lang, (level, _good, _has_ext, _details) in sorted(confidence.items()):
             if level == CONFIDENT:
                 confident_langs.append(lang)
             elif level == PARTIAL:
                 partial_langs.append(lang)
             else:
                 unreliable_langs.append(lang)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_wiktionary_parser.py` around lines 278 - 288, The loop that
categorizes languages unpacks (level, good, has_ext, details) but only uses
level and lang; rename the unused unpacked variables to _good, _has_ext, and
_details in the for statement (the loop that builds confident_langs,
partial_langs, and unreliable_langs) so linters understand they are
intentionally unused and to silence warnings.

118-144: Fixture files are loaded from disk 3-4 times at module import time.

generate_golden_test_cases(), generate_quality_test_cases(), and generate_confidence_dict() (via compute_confidence_map()) each independently glob and read all *.json fixture files. With 65 fixture files this happens at import/collection time, slowing down pytest --collect-only and any test run.

Consider loading fixtures once and reusing them:

Proposed approach
+# Load all fixtures once at module level
+ALL_FIXTURES = load_all_fixtures()
+
+
 def generate_golden_test_cases():
     cases = []
-    for f in sorted(FIXTURES_DIR.glob("*.json")):
-        lang = f.stem
-        with open(f, "r", encoding="utf-8") as fh:
-            data = json.load(fh)
+    for lang, data in sorted(ALL_FIXTURES.items()):
         for word, info in data.items():
             extract = info.get("extract")
             if extract is not None:
                 cases.append(
                     (lang, word, extract, info.get("parsed"), info.get("tried_word", word))
                 )
     return cases

Apply the same pattern to generate_quality_test_cases() and compute_confidence_map().

Also applies to: 179-193, 327-336

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_wiktionary_parser.py` around lines 118 - 144, The tests are
re-reading fixture JSONs multiple times at import; centralize loading by calling
load_all_fixtures() once and reusing its result in the generators. Modify
generate_golden_test_cases(), generate_quality_test_cases(), and
compute_confidence_map()/generate_confidence_dict() to accept a fixtures dict
parameter (or read from a single module-level FIXTURES variable populated by
load_all_fixtures()), and change GOLDEN_CASES to be built from that single
fixtures object instead of re-globbing; update references to use fixture_data
lookups (e.g., in generate_golden_test_cases use fixtures[lang][word]) so no
function performs its own file I/O.

298-310: Unused loop variable snippet.

snippet is unpacked at line 300 but never referenced. Rename to _snippet per Python convention.

Proposed fix
-                for word, status, wtype, snippet in details:
+                for word, status, wtype, _snippet in details:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_wiktionary_parser.py` around lines 298 - 310, The loop unpacks
four values in "for word, status, wtype, snippet in details" but never uses
snippet; rename it to _snippet to follow Python convention for unused variables.
Update that loop header in tests/test_wiktionary_parser.py (the block that
builds detail_parts for each entry in confidence) to use _snippet so linters
won't flag an unused variable; no other logic changes needed.

318-319: Hardcoded assertion of exactly 65 languages is brittle.

If a language is added or removed, this test will fail with a confusing message. Consider deriving the expected count from a canonical source (e.g., a config file listing supported languages) rather than a magic number. Same applies to line 343.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_wiktionary_parser.py` around lines 318 - 319, The test currently
asserts a hardcoded count (assert len(confidence) == 65) which is brittle;
change both occurrences (the assertion at len(confidence) and the similar check
around line 343) to compute the expected count from the canonical source of
supported languages used by the app (e.g., a module constant or config such as
SUPPORTED_LANGUAGES or languages_list) and compare with len(confidence) so the
test uses len(SUPPORTED_LANGUAGES) instead of the magic number 65; ensure you
import or reference that canonical variable in tests and update the assertion
messages accordingly.
tests/test_wiktionary_definitions.py (2)

164-183: Network tests that find no definition are silently skipped — could mask regressions.

When fetch_native_wiktionary returns None, the test is skipped (line 169). If an API change or parser bug causes all words to return None, the entire suite would show as "skipped" with no failures. Consider tracking skip rates or asserting a minimum hit rate in the coverage summary test.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_wiktionary_definitions.py` around lines 164 - 183, The test
silently skips when fetch_native_wiktionary returns None, which can hide
regressions; modify test_wiktionary_definition to increment module-level
counters (e.g., WIKTIONARY_HITS and WIKTIONARY_SKIPS) or record outcomes in a
shared list whenever fetch_native_wiktionary(word, lang_code) returns None vs a
dict, then keep the skip call but still record the skip; add a new test (e.g.,
test_wiktionary_definition_coverage) that asserts a minimum hit rate (or maximum
skip count) across the collected results (use the same module-level symbols you
added) to fail the suite if too many lookups return None.

155-155: generate_test_cases() runs at module import time, hitting the filesystem for every language directory.

This means any pytest --collect-only or import of this module triggers the full scan. If fixtures or language dirs are missing (e.g., in a CI job that doesn't check out data), this silently produces zero test cases rather than failing loudly. Consider adding a guard or a warning when TEST_CASES is empty.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_wiktionary_definitions.py` at line 155, The module currently calls
generate_test_cases() at import (TEST_CASES = generate_test_cases()), causing
filesystem hits and silent empty results; modify this by removing the eager
module-level call and either (a) lazily load cases via a helper like
load_test_cases() invoked from pytest collection hooks (e.g.,
pytest_generate_tests) or (b) keep the module-level call but immediately guard
it: after TEST_CASES = generate_test_cases(), add a check if not TEST_CASES:
emit a clear warnings.warn(...) and raise RuntimeError("No test cases found:
check language dirs or CI checkout") so missing fixtures/dirs fail loudly;
update any test parametrization to use the new load_test_cases() or the guarded
TEST_CASES variable and reference generate_test_cases(), TEST_CASES, and
load_test_cases() to locate the change.
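Option (b) can be sketched as a small wrapper (`guarded_test_cases` is a hypothetical helper name):

```python
import warnings


def guarded_test_cases(cases):
    """Fail loudly when fixture discovery finds nothing (e.g. a CI checkout
    without the data directories) instead of silently collecting zero tests."""
    if not cases:
        warnings.warn("No Wiktionary test cases found; check language dirs")
        raise RuntimeError("No test cases found: check language dirs or CI checkout")
    return cases


# At module level: TEST_CASES = guarded_test_cases(generate_test_cases())
```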
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/build_definitions.py`:
- Around line 526-553: cmd_stats currently loads only the native definitions
file (DEFINITIONS_DIR/{lc}.json) and thus undercounts coverage when fallback
English gloss files (DEFINITIONS_DIR/{lc}_en.json or similar *_en.json) exist;
update cmd_stats to look for the fallback file alongside the native file, load
both (using load_word_list for words as before), merge the two definition dicts
(or otherwise deduplicate keys) before computing n_defs/total_defs/coverage and
printing so that coverage reflects native+English glosses; ensure you reference
and update the variables n_defs, total_defs, coverage, and the printed status
accordingly in the cmd_stats function.
- Around line 220-255: In download_file, avoid catching broad Exception and
remove the unnecessary f-string on the 404 message: replace the final broad
except block with specific network/IO exception handlers (e.g., except
(urllib.error.URLError, socket.timeout, OSError) as e:) that log the error and
return False, leaving other exceptions to propagate, and change print(f"  Not
found (404) — skipping") to print("  Not found (404) — skipping") so you don't
use an f-string with no placeholders; keep the existing urllib.error.HTTPError
handling intact.
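The narrower except clauses could look like this (a sketch, not the script's actual `download_file`; the real one keeps its existing `HTTPError` handling):

```python
import socket
import urllib.error
import urllib.request


def download_file(url, dest, timeout=30):
    """Sketch: handle 404 and network/IO errors, let everything else propagate."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            with open(dest, "wb") as out:
                out.write(resp.read())
        return True
    except urllib.error.HTTPError as e:
        if e.code == 404:
            print("  Not found (404) — skipping")  # plain string, no f-prefix
            return False
        raise
    except (urllib.error.URLError, socket.timeout, OSError) as e:
        print(f"  Download failed: {e}")
        return False
```

Programming errors (`TypeError`, `KeyError`, and so on) now surface instead of being reported as download failures.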

In `@scripts/capture_wiktionary_fixtures.py`:
- Around line 84-99: The try/except around the MediaWiki fetch (the block that
creates req, calls urlreq.urlopen and json.loads and iterates pages to return
extract/try_word) currently swallows all exceptions; change it to log the
failure instead of silently passing: add or use a module logger (import
logging), and in the except Exception as e: call logger.exception or
logger.error with the URL/try_word and the exception details, then continue (do
not re-raise) so failures are visible while preserving behavior.

In `@tests/fixtures/wiktionary/de.json`:
- Around line 2-6: The fixture entry for "buben" is inconsistent: the Wiktionary
heading shows "=== Verb ===" but the JSON field "word_type" is set to "noun";
update the "buben" object to set "word_type" to "verb" (or "unknown" if the
parser cannot classify) so "word_type" matches the heading and parsed content,
keeping other fields ("parsed", "tried_word") unchanged.

In `@tests/test_wiktionary_definitions.py`:
- Around line 34-105: Extract BAD_START_PATTERNS, SECTION_HEADER,
PHONETIC_PATTERN and the is_quality_definition implementation into a shared test
helper module (e.g., tests/wiktionary_test_utils.py), unify the function
signature to is_quality_definition(word, definition, lang_code=None) so both
callers can pass/ignore lang_code, incorporate the extra checks present in the
parser version (pronunciation line and headword-with-IPA checks) into this
single implementation, and update both test files to import and call the shared
is_quality_definition and regex constants instead of duplicating them.
- Around line 108-141: The comment describing chosen indices in
pick_test_words() is incorrect for the implemented formula (target_indices = [0,
n // 10, n // 5, n // 2, min(n - 1, 1000)]); update the inline comments to
reflect the actual outputs for example n values (e.g., for n=1000 -> [0, 100,
200, 500, 999]; for n=10000 -> [0, 1000, 2000, 5000, 1000]) and note the
min(n-1,1000) behavior, or alternatively change the formula to match the
original intended indices—refer to pick_test_words and target_indices when
making the edit.
- Line 63: The function is_quality_definition currently has an unused parameter
lang_code; remove lang_code from the function signature (change def
is_quality_definition(word, definition):) and update all call sites (e.g., in
test_wiktionary_parser.py) to pass only the two required args, or alternatively
implement language-specific checks inside is_quality_definition using lang_code
if language heuristics are desired—choose one approach and keep the function
signature and all callers consistent with that choice.
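The `pick_test_words()` arithmetic quoted above can be checked numerically; this is an illustrative reconstruction of the described formula, not the test's actual helper:

```python
def pick_target_indices(n):
    """The index formula the review describes: fixed fractions of the word
    list, with the last pick capped at index 1000 (hence the duplicate for
    large n)."""
    return [0, n // 10, n // 5, n // 2, min(n - 1, 1000)]
```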

In `@tests/test_wiktionary_parser.py`:
- Around line 228-234: The variable "total" is assigned but never used inside
the loop "for lang, words_data in sorted(fixtures.items())"; remove the unused
assignment "total = len(words_data)" (or, if intended for assertions or logging,
replace with a used reference such as using len(words_data) directly in an
assertion/log or store it in a clearly used name) so the unused variable F841 is
eliminated; update the loop block that contains "has_extract", "good_count", and
"details" accordingly.
- Around line 353-357: In test_at_least_some_confident replace the ambiguous
list-comprehension variable `l` with a clear name (e.g., `lang` or
`language_code`) when iterating over PARSER_CONFIDENCE.items() so the
comprehension reads like `[lang for lang, v in PARSER_CONFIDENCE.items() if v ==
CONFIDENT]` and update any references (the f-string uses {confident})
accordingly to avoid the E741 warning; keep the test name, PARSER_CONFIDENCE,
and CONFIDENT unchanged.

In `@webapp/app.py`:
- Around line 1475-1476: Validate and limit the size of client_id before using
it in dedup_key to prevent memory abuse: when building client_id (the variable
used in dedup_key = f"{lang_code}:{day_idx}:{client_id}"), enforce a maximum
length (e.g., 128 characters) and truncate (or hash) any longer input, and
optionally reject empty/invalid values; apply this truncation to both
user-supplied data.get("client_id") and fallback request.remote_addr so
dedup_key always uses the bounded client_id.
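A bounded `client_id` might be derived like this (`MAX_CLIENT_ID_LEN` and `bound_client_id` are illustrative names; hashing over-long ids keeps distinct clients distinct while capping memory per key):

```python
import hashlib

MAX_CLIENT_ID_LEN = 128  # illustrative cap from the comment


def bound_client_id(raw, fallback="unknown"):
    """Return a dedup-safe client id: non-empty and never longer than the cap."""
    client_id = (raw or "").strip() or fallback
    if len(client_id) > MAX_CLIENT_ID_LEN:
        # SHA-256 hex digest is a fixed 64 characters.
        return hashlib.sha256(client_id.encode("utf-8")).hexdigest()
    return client_id
```

The handler would apply this to both `data.get("client_id")` and the `request.remote_addr` fallback before building `dedup_key`.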

In `@webapp/data/definitions/ckb_en.json`:
- Line 99: The JSON definition files contain entries whose values start with
phrases like "alternative form of" or "misspelling of" which bypass the existing
_is_garbage_definition() filter and produce unhelpful game hints; update the
pipeline to remove them by either (A) running the _is_garbage_definition() check
when loading *_en.json at lookup time (call _is_garbage_definition(word,
definition) inside the loader that reads webapp/data/definitions/*_en.json and
skip fields where it returns True) or (B) strip these entries during compilation
inside build_definitions.py (filter out any key/value where the value matches
the garbage regex used by _is_garbage_definition() before writing the compiled
JSON); modify the loader or build script accordingly and ensure tests/fixtures
still pass.

In `@webapp/data/definitions/de_en.json`:
- Line 292: The entries "regle" and "trade" in de_en.json have incomplete
form-of definitions — update each value from "first-person singular present" to
include the base-verb reference (e.g., "first-person singular present of regeln"
for "regle" and "first-person singular present of trade" or the correct German
infinitive for "trade") so the form-of string matches the pattern used elsewhere
and the garbage-filter can recognize the base word; locate the "regle" and
"trade" keys and append the proper "of <base-word>" phrase to their values.

In `@webapp/data/definitions/fo_en.json`:
- Line 37: build_definitions.py currently includes entries whose values are
purely grammatical metadata (e.g., "banka": "indefinite
accusative/dative/genitive singular"); update the build step to filter out these
11 keys or, better, add a general garbage-definition check that skips any
definition matching only grammatical-term tokens (regex match for words like
nominative|accusative|dative|genitive|singular|plural|indefinite|feminine|masculine|neuter
and variants) before emitting definitions; also add the same check to the
parser's garbage-definition filter (e.g., is_garbage_definition or the parser
validation routine) so entries like "banka", "bónda", "dropa", "drúgv", "ferju",
"nakað", "nøkur", "oyrar", "sínum", "ymisk"/"ymiss" are excluded from game
vocabulary.
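A sketch of the grammatical-metadata check (the token list and names are illustrative; the real check would live alongside `_is_garbage_definition()`):

```python
import re

# Illustrative token list; extend as needed for other inflection vocab.
GRAMMAR_TOKENS = (
    r"nominative|accusative|dative|genitive|singular|plural|indefinite|definite|"
    r"feminine|masculine|neuter|first-person|second-person|third-person|present|past"
)
# Matches definitions made up ONLY of grammar tokens and separators.
_GRAMMAR_ONLY = re.compile(rf"^(?:(?:{GRAMMAR_TOKENS})[\s/,-]*)+$", re.IGNORECASE)


def is_grammar_only(definition):
    """True when a definition consists solely of grammatical-term tokens."""
    return bool(_GRAMMAR_ONLY.match(definition.strip()))
```

Note that a proper form-of entry like "first-person singular present of regeln" passes, because the base-word reference breaks the all-tokens match.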

---

Outside diff comments:
In `@frontend/src/game.ts`:
- Around line 1326-1348: The fetch handling can throw when resp.json() is called
on an empty (200) response; update the success branch that currently does
(resp.ok ? resp.json() : null) to safely parse only when there is a non-empty
JSON body: call resp.text(), check if the returned text is non-empty, then
JSON.parse it inside a try/catch (or reject to null on SyntaxError) so empty or
non-JSON responses yield null; keep the downstream logic that only uses stats
when non-null (calculateCommunityPercentile, communityPercentile,
communityIsTopScore, communityTotal, communityStatsLink) unchanged.

---

Duplicate comments:
In `@tests/fixtures/wiktionary/ga.json`:
- Around line 1-26: The fixture entries "horás", "forar", "diasa", and "bríde"
in ga.json currently have extract and parsed set to null (same issue as
eu.json); fix by either supplying the expected extracted text and parsed
structure for each word or removing these placeholder entries so tests don't
rely on all-null records—make the change consistent with how eu.json was
resolved (populate with real extraction/parsed values or delete the entries).

In `@tests/fixtures/wiktionary/ie.json`:
- Around line 1-26: The entries "datil", "salva", "jaspe", and "nudar" are
all-null fixtures (extract=null, parsed=null, word_type="unknown"), which
duplicates the issue noted for eu.json; fix by either (a) populating each key
with realistic fixture data (non-null "extract" text and a valid "parsed"
structure and accurate "word_type"), or (b) remove these all-null entries from
tests/fixtures/wiktionary/ie.json if they are not needed, and update any test
expectations that reference these keys to match the change (apply the same
approach used to fix eu.json).

In `@tests/fixtures/wiktionary/ne.json`:
- Around line 1-26: The fixture ne.json contains several entries ("घोपाइ",
"बिसाइ", "सजिवन", "ठोसाउ") where extract and parsed are null and word_type is
"unknown" — same all-null issue noted for eu.json; update the fixture by either
removing these all-null entries or replacing them with the expected
parsed/extract/word_type values produced by the Wiktionary parser (or regenerate
the fixture from the parser output), and ensure the test that loads ne.json
expects the non-null structure produced by the parsing functions so the entries
no longer appear as all-null placeholders.

In `@tests/fixtures/wiktionary/pau.json`:
- Around line 1-26: The fixture entries for keys "chull", "ikrir", "burek", and
"modil" contain all-null values for "extract" and "parsed" and have
"word_type":"unknown" which duplicates the same all-null problem noted for
eu.json; either replace those placeholder entries with real parsed fixture data
(fill "extract" and "parsed" with the actual scraped content and set a correct
"word_type" and "tried_word") or remove the stub entries so tests don't rely on
empty fixtures; ensure the updates mirror the resolution you applied to the
eu.json case and keep the JSON schema fields "extract", "parsed", "word_type",
and "tried_word" consistent.

---

Nitpick comments:
In `@frontend/src/game.ts`:
- Around line 377-382: The current mount path always calls
submitWordStats(this.game_won, attempts) causing a POST to word-stats on every
page load and includes a dead branch checking typeof this.attempts === 'number';
update the logic in the component that runs on mount to avoid re-submitting
completed-game results: either call a new GET endpoint to fetch existing stats
(instead of POST) or check/cache a submission marker or cached percentile in
localStorage and only call submitWordStats when no cached submission exists;
also simplify attempts parsing by removing the unreachable typeof this.attempts
=== 'number' branch and always using parseInt(String(this.attempts), 10) || 0;
touch symbols: submitWordStats, attempts (GameData), and the mount code that
currently calls submitWordStats to implement this behavior.

In `@scripts/capture_wiktionary_fixtures.py`:
- Around line 103-200: The guess_word_type function declares an unused parameter
word which triggers linter warnings and clutters the API; remove the unused
parameter from the function signature (change def guess_word_type(extract,
word): to def guess_word_type(extract):) and update any callers that pass a
second argument to only pass extract, or alternatively use the word parameter
inside guess_word_type if contextual matching is needed; ensure you update
references to the function name guess_word_type accordingly so signatures and
calls remain consistent.

In `@tests/fixtures/wiktionary/cs.json`:
- Around line 1-26: The fixtures show the Czech parser is populating the
"parsed" field with grammatical gender strings instead of semantic definitions;
update the Czech parsing logic (e.g., parseCzechEntry / extractDefinition /
getDefinitionFromExtract) to parse the "extract" text and pull the "==== význam
====" block as the primary value for "parsed" (use the "extract" field to locate
and trim the meaning), fall back to the gender line only if no "význam" section
exists, and ensure outputs still populate "parsed", "word_type", and
"tried_word" consistently for entries like Brňan, palaš, and motor; then update
the fixtures or tests to match the corrected behavior.

In `@tests/fixtures/wiktionary/eu.json`:
- Around line 1-26: The fixtures currently only contain all-null entries for
languages eu, ie, la, pau, ne, nds, and ga which only assert “not found”
behavior; add at least one positive-path entry per fixture by replacing or
adding a word object with non-null "extract" and "parsed" values (a real
Wiktionary extract and the expected parsed structure) so the golden tests
validate successful parsing; ensure the added entry uses the same JSON shape as
existing objects (keys "extract", "parsed", "word_type", "tried_word") and pick
a stable, common word for each language to avoid flaky updates.

In `@tests/fixtures/wiktionary/ru.json`:
- Around line 1-26: The fixture's "parsed" fields for entries like "гнейс",
"холин", and "бланш" contain citation markers (◆), full example sentences and
are truncated mid-word, indicating a capture/serialization length limit; fix by
either (A) regenerating fixtures with a longer capture limit so "parsed"
contains full intended text and citation markers are normalized, or (B) update
the parser/fixture generation to strip citation markers and example sentences
before writing "parsed" (or produce a cleaned "parsed_normalized" field), and
change tests to assert on stable anchors (e.g., startsWith/contains of the
definition text) rather than exact truncated strings; locate the code that
generates/writes the "parsed" JSON for these entries and apply one of these
fixes.
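Option (B) could be as small as splitting on the citation marker (a sketch; `strip_examples` is a hypothetical helper):

```python
import re

# ◆ is the citation marker the Russian fixtures use before example sentences.
_CITATION_SPLIT = re.compile(r"\s*◆")


def strip_examples(parsed):
    """Keep only the definition text before the first ◆ marker."""
    return _CITATION_SPLIT.split(parsed, maxsplit=1)[0].strip()
```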

In `@tests/test_wiktionary_definitions.py`:
- Around line 164-183: The test silently skips when fetch_native_wiktionary
returns None, which can hide regressions; modify test_wiktionary_definition to
increment module-level counters (e.g., WIKTIONARY_HITS and WIKTIONARY_SKIPS) or
record outcomes in a shared list whenever fetch_native_wiktionary(word,
lang_code) returns None vs a dict, then keep the skip call but still record the
skip; add a new test (e.g., test_wiktionary_definition_coverage) that asserts a
minimum hit rate (or maximum skip count) across the collected results (use the
same module-level symbols you added) to fail the suite if too many lookups
return None.
- Line 155: The module currently calls generate_test_cases() at import
(TEST_CASES = generate_test_cases()), causing filesystem hits and silent empty
results; modify this by removing the eager module-level call and either (a)
lazily load cases via a helper like load_test_cases() invoked from pytest
collection hooks (e.g., pytest_generate_tests) or (b) keep the module-level call
but immediately guard it: after TEST_CASES = generate_test_cases(), add a check
if not TEST_CASES: emit a clear warnings.warn(...) and raise RuntimeError("No
test cases found: check language dirs or CI checkout") so missing fixtures/dirs
fail loudly; update any test parametrization to use the new load_test_cases() or
the guarded TEST_CASES variable and reference generate_test_cases(), TEST_CASES,
and load_test_cases() to locate the change.

In `@tests/test_wiktionary_parser.py`:
- Around line 278-288: The loop that categorizes languages unpacks (level, good,
has_ext, details) but only uses level and lang; rename the unused unpacked
variables to _good, _has_ext, and _details in the for statement (the loop that
builds confident_langs, partial_langs, and unreliable_langs) so linters
understand they are intentionally unused and to silence warnings.
- Around line 118-144: The tests are re-reading fixture JSONs multiple times at
import; centralize loading by calling load_all_fixtures() once and reusing its
result in the generators. Modify generate_golden_test_cases(),
generate_quality_test_cases(), and
compute_confidence_map()/generate_confidence_dict() to accept a fixtures dict
parameter (or read from a single module-level FIXTURES variable populated by
load_all_fixtures()), and change GOLDEN_CASES to be built from that single
fixtures object instead of re-globbing; update references to use fixture_data
lookups (e.g., in generate_golden_test_cases use fixtures[lang][word]) so no
function performs its own file I/O.
- Around line 298-310: The loop unpacks four values in "for word, status, wtype,
snippet in details" but never uses snippet; rename it to _snippet to follow
Python convention for unused variables. Update that loop header in
tests/test_wiktionary_parser.py (the block that builds detail_parts for each
entry in confidence) to use _snippet so linters won't flag an unused variable;
no other logic changes needed.
- Around line 318-319: The test currently asserts a hardcoded count (assert
len(confidence) == 65) which is brittle; change both occurrences (the assertion
at len(confidence) and the similar check around line 343) to compute the
expected count from the canonical source of supported languages used by the app
(e.g., a module constant or config such as SUPPORTED_LANGUAGES or
languages_list) and compare with len(confidence) so the test uses
len(SUPPORTED_LANGUAGES) instead of the magic number 65; ensure you import or
reference that canonical variable in tests and update the assertion messages
accordingly.

In `@webapp/app.py`:
- Around line 1469-1484: The in-memory dedup variables are still named for IPs
even though dedup now uses client_id; rename _stats_seen_ips (and all
references) to something like _stats_seen_clients (or _stats_seen_keys) and
rename _STATS_MAX_IPS to _STATS_MAX_CLIENTS (or _STATS_MAX_KEYS), and update
uses in this block (variables client_id, dedup_key, checks like "if dedup_key in
_stats_seen_*", insertions "_stats_seen_*[dedup_key] = True", and any other
references elsewhere in the file) so names accurately reflect client_id-based
deduping while keeping behavior unchanged.

In `@webapp/data/definitions/bg_en.json`:
- Around line 1-40: This JSON contains potentially sensitive/offensive entries
(e.g., "абдал", "аборт", "аллах") and needs metadata for safe-display; add a
per-entry flag or field (e.g., "sensitive": true or "sensitivity":
"offensive|religious|medical") alongside offending keys in bg_en.json and update
the UI rendering path to check that metadata (or a centralized filter) before
showing entries—ensure entries like "абдал", "аборт", and "аллах" are marked
appropriately and that the frontend uses those flags to apply content warnings,
opt-in visibility, or filtering.

In `@webapp/data/definitions/eo_en.json`:
- Around line 1-5: The eo_en.json large dictionary (keys like "abajo", "abela",
"abelo", "abioj") must not be bundled into the initial client payload; instead
either (A) serve it as a compressed static asset (enable gzip/brotli on the
server or upload to the CDN and set proper Content-Encoding/Cache-Control) or
(B) lazy-load it at runtime (remove any static imports and fetch it on demand
via dynamic import or fetch('/assets/eo_en.json') from the client when needed)
so the bundle size stays small and the file is delivered compressed and cached.

In `@webapp/data/definitions/et_en.json`:
- Around line 1-1698: The large static JSON (et_en.json) should be removed from
the initial JS bundle and loaded lazily: create a dedicated async loader (e.g.
loadEtEnDictionary or fetch('/data/definitions/et_en.json')) and replace any
direct imports/usages with await loadEtEnDictionary() in components that need
it; ensure the file lives as a static asset (webapp/data/definitions/et_en.json)
rather than bundled. On the server/static hosting, enable gzip and brotli
compression and add strong caching headers (Cache-Control with long max-age +
immutable and proper ETag/Last-Modified) for that asset so subsequent loads are
cached. Also update build/static config (assetCompression/serveStatic
middleware) to exclude this file from client bundle and verify network waterfall
to confirm it's fetched on demand.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a9b8b63 and d3ac0f7.

📒 Files selected for processing (146)
  • .gitignore
  • frontend/src/__tests__/percentile.test.ts
  • frontend/src/game.ts
  • frontend/src/stats.ts
  • scripts/build_definitions.py
  • scripts/capture_wiktionary_fixtures.py
  • tests/fixtures/wiktionary/ar.json
  • tests/fixtures/wiktionary/az.json
  • tests/fixtures/wiktionary/bg.json
  • tests/fixtures/wiktionary/br.json
  • tests/fixtures/wiktionary/ca.json
  • tests/fixtures/wiktionary/ckb.json
  • tests/fixtures/wiktionary/cs.json
  • tests/fixtures/wiktionary/da.json
  • tests/fixtures/wiktionary/de.json
  • tests/fixtures/wiktionary/el.json
  • tests/fixtures/wiktionary/en.json
  • tests/fixtures/wiktionary/eo.json
  • tests/fixtures/wiktionary/es.json
  • tests/fixtures/wiktionary/et.json
  • tests/fixtures/wiktionary/eu.json
  • tests/fixtures/wiktionary/fa.json
  • tests/fixtures/wiktionary/fi.json
  • tests/fixtures/wiktionary/fo.json
  • tests/fixtures/wiktionary/fr.json
  • tests/fixtures/wiktionary/fur.json
  • tests/fixtures/wiktionary/fy.json
  • tests/fixtures/wiktionary/ga.json
  • tests/fixtures/wiktionary/gd.json
  • tests/fixtures/wiktionary/gl.json
  • tests/fixtures/wiktionary/he.json
  • tests/fixtures/wiktionary/hr.json
  • tests/fixtures/wiktionary/hu.json
  • tests/fixtures/wiktionary/hy.json
  • tests/fixtures/wiktionary/hyw.json
  • tests/fixtures/wiktionary/ia.json
  • tests/fixtures/wiktionary/ie.json
  • tests/fixtures/wiktionary/is.json
  • tests/fixtures/wiktionary/it.json
  • tests/fixtures/wiktionary/ka.json
  • tests/fixtures/wiktionary/ko.json
  • tests/fixtures/wiktionary/la.json
  • tests/fixtures/wiktionary/lb.json
  • tests/fixtures/wiktionary/lt.json
  • tests/fixtures/wiktionary/ltg.json
  • tests/fixtures/wiktionary/lv.json
  • tests/fixtures/wiktionary/mi.json
  • tests/fixtures/wiktionary/mk.json
  • tests/fixtures/wiktionary/mn.json
  • tests/fixtures/wiktionary/nb.json
  • tests/fixtures/wiktionary/nds.json
  • tests/fixtures/wiktionary/ne.json
  • tests/fixtures/wiktionary/nl.json
  • tests/fixtures/wiktionary/nn.json
  • tests/fixtures/wiktionary/oc.json
  • tests/fixtures/wiktionary/pau.json
  • tests/fixtures/wiktionary/pl.json
  • tests/fixtures/wiktionary/pt.json
  • tests/fixtures/wiktionary/qya.json
  • tests/fixtures/wiktionary/ro.json
  • tests/fixtures/wiktionary/ru.json
  • tests/fixtures/wiktionary/rw.json
  • tests/fixtures/wiktionary/sk.json
  • tests/fixtures/wiktionary/sl.json
  • tests/fixtures/wiktionary/sr.json
  • tests/fixtures/wiktionary/sv.json
  • tests/fixtures/wiktionary/tk.json
  • tests/fixtures/wiktionary/tlh.json
  • tests/fixtures/wiktionary/tr.json
  • tests/fixtures/wiktionary/uk.json
  • tests/fixtures/wiktionary/vi.json
  • tests/test_wiktionary_definitions.py
  • tests/test_wiktionary_parser.py
  • webapp/app.py
  • webapp/data/definitions/ar_en.json
  • webapp/data/definitions/az_en.json
  • webapp/data/definitions/bg_en.json
  • webapp/data/definitions/br_en.json
  • webapp/data/definitions/ca_en.json
  • webapp/data/definitions/ckb_en.json
  • webapp/data/definitions/cs.json
  • webapp/data/definitions/cs_en.json
  • webapp/data/definitions/da_en.json
  • webapp/data/definitions/de.json
  • webapp/data/definitions/de_en.json
  • webapp/data/definitions/el.json
  • webapp/data/definitions/el_en.json
  • webapp/data/definitions/en.json
  • webapp/data/definitions/eo_en.json
  • webapp/data/definitions/es.json
  • webapp/data/definitions/es_en.json
  • webapp/data/definitions/et_en.json
  • webapp/data/definitions/eu_en.json
  • webapp/data/definitions/fa_en.json
  • webapp/data/definitions/fi_en.json
  • webapp/data/definitions/fo_en.json
  • webapp/data/definitions/fr.json
  • webapp/data/definitions/fr_en.json
  • webapp/data/definitions/fur_en.json
  • webapp/data/definitions/fy_en.json
  • webapp/data/definitions/ga_en.json
  • webapp/data/definitions/gd_en.json
  • webapp/data/definitions/gl_en.json
  • webapp/data/definitions/he_en.json
  • webapp/data/definitions/hr_en.json
  • webapp/data/definitions/hu_en.json
  • webapp/data/definitions/hy_en.json
  • webapp/data/definitions/ia_en.json
  • webapp/data/definitions/is_en.json
  • webapp/data/definitions/it.json
  • webapp/data/definitions/it_en.json
  • webapp/data/definitions/ka_en.json
  • webapp/data/definitions/la_en.json
  • webapp/data/definitions/lb_en.json
  • webapp/data/definitions/lt_en.json
  • webapp/data/definitions/ltg_en.json
  • webapp/data/definitions/lv_en.json
  • webapp/data/definitions/mi_en.json
  • webapp/data/definitions/mk_en.json
  • webapp/data/definitions/mn_en.json
  • webapp/data/definitions/nb_en.json
  • webapp/data/definitions/nds_en.json
  • webapp/data/definitions/ne_en.json
  • webapp/data/definitions/nl.json
  • webapp/data/definitions/nl_en.json
  • webapp/data/definitions/nn_en.json
  • webapp/data/definitions/oc_en.json
  • webapp/data/definitions/pl.json
  • webapp/data/definitions/pl_en.json
  • webapp/data/definitions/pt.json
  • webapp/data/definitions/pt_en.json
  • webapp/data/definitions/ro_en.json
  • webapp/data/definitions/ru.json
  • webapp/data/definitions/ru_en.json
  • webapp/data/definitions/sk_en.json
  • webapp/data/definitions/sl_en.json
  • webapp/data/definitions/sr_en.json
  • webapp/data/definitions/sv_en.json
  • webapp/data/definitions/tk_en.json
  • webapp/data/definitions/tr.json
  • webapp/data/definitions/tr_en.json
  • webapp/data/definitions/uk_en.json
  • webapp/data/definitions/vi.json
  • webapp/data/definitions/vi_en.json
  • webapp/templates/game.html
  • webapp/wiktionary.py

Comment on lines +84 to +99
try:
req = urlreq.Request(
api_url, headers={"User-Agent": "WordleGlobal/1.0 (fixture-capture)"}
)
with urlreq.urlopen(req, timeout=10) as resp:
data = json.loads(resp.read())
pages = data.get("query", {}).get("pages", {})
for pid, page in pages.items():
if pid == "-1":
continue
extract = page.get("extract", "").strip()
if extract:
return extract, try_word
except Exception as e:
pass

⚠️ Potential issue | 🟡 Minor

Log fetch failures instead of silently swallowing them.

Right now a network or parsing failure produces incomplete fixtures with no signal. Logging the error keeps the script resilient while still surfacing issues.

🔧 Suggested fix
-        except Exception as e:
-            pass
+        except Exception as exc:
+            print(
+                f"  {wikt_lang}:{try_word} fetch failed: {exc}",
+                file=sys.stderr,
+            )
🧰 Tools
🪛 Ruff (0.15.2)

[error] 85-87: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


[error] 88-88: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


[error] 97-98: try-except-pass detected, consider logging the exception

(S110)


[warning] 97-97: Do not catch blind exception: Exception

(BLE001)


[error] 97-97: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/capture_wiktionary_fixtures.py` around lines 84 - 99, The try/except
around the MediaWiki fetch (the block that creates req, calls urlreq.urlopen and
json.loads and iterates pages to return extract/try_word) currently swallows all
exceptions; change it to log the failure instead of silently passing: add or use
a module logger (import logging), and in the except Exception as e: call
logger.exception or logger.error with the URL/try_word and the exception
details, then continue (do not re-raise) so failures are visible while
preserving behavior.

"raver": "raver (person who attends rave parties)",
"regau": "a municipality of Upper Austria, Austria",
"regio": "clipping of Regionalbahn",
"regle": "first-person singular present",

⚠️ Potential issue | 🟡 Minor

Two entries have incomplete form-of definitions (no base-word reference).

Line 292 "regle" → "first-person singular present" and Line 358 "trade" → "first-person singular present" both lack the "of [verb]" reference. Same issue as the content-free entries in fo_en.json noted above; the garbage-filter script there will surface these as well.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/de_en.json` at line 292, The entries "regle" and
"trade" in de_en.json have incomplete form-of definitions — update each value
from "first-person singular present" to include the base-verb reference (e.g.,
"first-person singular present of regeln" for "regle" and "first-person singular
present of trade" or the correct German infinitive for "trade") so the form-of
string matches the pattern used elsewhere and the garbage-filter can recognize
the base word; locate the "regle" and "trade" keys and append the proper "of
<base-word>" phrase to their values.

Lint fixes:
- Narrow exception handling in build_definitions.py (BLE001)
- Remove f-string without placeholder (F541)
- Log errors instead of silently swallowing in capture_wiktionary_fixtures.py
- Remove unused variable `total` in test_wiktionary_parser.py (F841)
- Rename ambiguous `l` to `lang` in list comprehension (E741)

Code quality:
- Extract shared is_quality_definition() + regex constants into
  tests/wiktionary_test_utils.py (was duplicated between two test files)
- Remove unused `lang_code` param from is_quality_definition()
- Fix misleading index comments in pick_test_words()
- Fix word_type "noun" -> "verb" in de.json fixture

Security:
- Truncate client_id to 64 chars to prevent memory abuse via oversized keys

Definition quality:
- Add _is_unhelpful_gloss() filter in build_definitions.py to strip:
  - Bare grammatical forms ("indefinite accusative/dative/genitive singular")
  - Form-of references ("alternative form of", "misspelling of", "clipping of")
- cmd_stats now counts both native + English definitions (was undercounting)
- Reprocessed all 65 languages with new filters
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
webapp/data/definitions/de_en.json (1)

63-63: ⚠️ Potential issue | 🟡 Minor

Three entries still have incomplete form-of definitions (no base-word reference).

"edele" (Line 63), "rauhe" (Line 241), and "süsse" (Line 281) all map to "strong/mixed nominative/accusative feminine singular" — the same class of issue that was previously flagged for "regle" and "trade". The gloss is a bare inflection label with no "of <base word>" reference, making it useless to the end user. The _is_unhelpful_gloss() filter should be updated to catch this pattern (inflection-only glosses without " of " in the string), and these entries should be removed or completed.

Also applies to: 241-241, 281-281

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/de_en.json` at line 63, Three German entries
("edele", "rauhe", "süsse") contain only an inflection label ("strong/mixed
nominative/accusative feminine singular") with no " of <base>" reference; update
the _is_unhelpful_gloss() filter to treat glosses that are purely inflection
tokens and do not contain the substring " of " as unhelpful (e.g., return true
when gloss lacks " of " and matches an inflection-pattern regex like
/^(strong|weak|mixed|declin(e|ation)|nominative|accusative|feminine|masculine|singular|plural|etc\.)/i),
then remove or complete the three offending entries ("edele", "rauhe", "süsse")
in the JSON (either delete them or replace the value with a proper "of <base
word>" gloss).
🧹 Nitpick comments (5)
webapp/data/definitions/eo_en.json (1)

1-2516: Consider filtering low-value inflection-only entries to reduce payload.

A large fraction of this file consists of entries like "accusative singular of X", "plural of X", "present of Y", etc. While grammatically complete (they reference the base word), they provide minimal standalone value to users — the user still needs the root definition. For Esperanto, this substantially inflates the file.

If reducing the ~12 MB total bundle size becomes a concern, filtering entries whose gloss matches ^(accusative|nominative|genitive|dative|plural|present|past|future|conditional|imperative) (singular |plural )?(of|passive participle of) \w+$ would trim this file significantly without losing root-word definitions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/eo_en.json` around lines 1 - 2516, The JSON contains
many low-value inflection-only entries (values like "accusative singular of X",
"plural of X", "present of Y", "past of Y", "imperative of Z") that bloat
eo_en.json; remove or filter keys whose gloss matches the regex
^(accusative|nominative|genitive|dative|plural|present|past|future|conditional|imperative)(
singular| plural)? (of|of the|passive participle of) .+$ (i.e. entries like the
values for keys such as "akron", "agojn", "akvon", "donis", "donos", "donus",
etc.), keeping only root-word definitions so the base lemmas remain and payload
is reduced.
webapp/data/definitions/et_en.json (1)

71-71: Inconsistent capitalisation and terminal punctuation in generated definition strings

Definition values follow two distinct formats with no clear rule:

  • Capitalised + trailing period: "Da-infinitive of elama." (L136), "First-person singular present form of andma." (L71), "Present connegative form of vaatama." (L1537)
  • Lowercase + no period: "first-person singular present indicative of aitama" (L26–28), "third-person singular past indicative of andma" (L64)

Because these strings are rendered directly in the UI, mixed casing/punctuation is visible to users. If the inconsistency originates in the kaikki.org source data (different Wiktionary editors), consider normalising the strings in build_definitions.py (e.g., lowercase the first letter, strip trailing periods from form-of glosses) during post-processing rather than in the raw JSON files.

Also applies to: 136-136, 593-593, 1537-1537

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/et_en.json` at line 71, The JSON contains mixed
capitalization and trailing periods in short "form of" definition strings;
update the post-processing in build_definitions.py (where definitions are
generated/written—e.g., the function that builds or normalises definition
strings) to normalise them consistently: trim any trailing period, and lowercase
the first character of glosses that appear to be "form-of" or short grammar
glosses (detectable by patterns like "of " or starting with parts-of-speech
phrases), while leaving full-sentence definitions intact; apply this
transformation before emitting values into webapp/data/definitions/et_en.json so
all form-of entries are lowercased and punctuation-consistent.
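The normalisation pass this comment suggests could be sketched as follows (the function name and the form-of heuristic are illustrative assumptions, not existing build_definitions.py code):

```python
import re

# Heuristic: a gloss that ends in 'of <lemma>' (optionally with a period)
# or contains 'form of' is treated as a form-of gloss.
_FORM_OF_HINT = re.compile(
    r"\b(form of\b|of [a-zäöüõšž-]+\.?$)", re.IGNORECASE
)


def normalize_form_of_gloss(gloss: str) -> str:
    """Lowercase the first letter and strip the trailing period of short
    form-of glosses; full-sentence definitions pass through unchanged."""
    g = gloss.strip()
    if _FORM_OF_HINT.search(g):
        g = g.rstrip(".")
        if g and g[0].isupper():
            g = g[0].lower() + g[1:]
    return g
```

Applied before the JSON is written, this would collapse "Da-infinitive of elama." and "first-person singular present indicative of aitama" into one consistent shape.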
scripts/capture_wiktionary_fixtures.py (1)

106-108: Drop unused word parameter in guess_word_type.

It’s never read, so removing it reduces confusion and silences lint warnings.

♻️ Proposed refactor
-def guess_word_type(extract, word):
+def guess_word_type(extract):
@@
-                word_type = guess_word_type(extract, tried_word)
+                word_type = guess_word_type(extract)

Also applies to: 231-233

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/capture_wiktionary_fixtures.py` around lines 106 - 108, Remove the
unused second parameter "word" from the function definition of guess_word_type
(and its duplicate at the other occurrence) and update its docstring to reflect
the single-argument signature; then find and update all call sites that
currently pass two arguments so they call guess_word_type(extract) with only the
extract argument. Ensure function signatures and imports/mocks (if any) are
updated consistently so linters no longer report the unused parameter.
tests/test_wiktionary_definitions.py (1)

138-144: Let unexpected exceptions surface in the coverage summary.

fetch_native_wiktionary already swallows network errors; catching broad Exception here can hide real test/code bugs.

♻️ Proposed refactor
-            try:
-                result = fetch_native_wiktionary(word, lang_code)
-            except Exception as e:
-                lang_stats["details"].append((word, "error", str(e)[:80]))
-                results["not_found"] += 1
-                lang_stats["not_found"] += 1
-                continue
+            result = fetch_native_wiktionary(word, lang_code)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_wiktionary_definitions.py` around lines 138 - 144, The test is
currently catching all Exception around the call to fetch_native_wiktionary
which masks real bugs; change the handler so unexpected exceptions surface by
either removing the broad "except Exception as e" entirely or replace it with a
narrow catch of the network/error type(s) that fetch_native_wiktionary
explicitly raises (e.g., requests.exceptions.RequestException or the function's
specific NetworkError), and keep the existing lang_stats/results update only for
those expected network errors; reference fetch_native_wiktionary,
lang_stats["details"], results["not_found"] to locate where to adjust the
try/except.
tests/test_wiktionary_parser.py (1)

198-226: Mark unused tuple elements with underscores.

This avoids B007 warnings and makes intent explicit.

♻️ Proposed refactor
-        for lang, (level, good, has_ext, details) in sorted(confidence.items()):
+        for lang, (level, _good, _has_ext, _details) in sorted(confidence.items()):
@@
-                for word, status, wtype, snippet in details:
+                for word, status, wtype, _snippet in details:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_wiktionary_parser.py` around lines 198 - 226, The first unpacking
of confidence items in the test loop currently binds unused tuple elements
(good, has_ext, details); change that unpack to mark unused elements with
underscores (e.g., unpack as lang, (level, _, _, _) or lang, (level, *_) ) so
only level is used when appending to confident_langs / partial_langs /
unreliable_langs, while leaving the second loop (for lang, (level, good,
has_ext, details) in sorted(confidence.items())) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@webapp/data/definitions/de_en.json`:
- Line 12: The entry "ander" in de_en.json is just a cross-reference ("See
anderer.") instead of a real definition; replace it with a resolved definition
or remove it. Locate the "ander" key in de_en.json and either (a) replace the
value "See anderer." with the actual English definition text for the German
lemma "ander" (pull the definition from the "anderer" entry or authoritative
source), or (b) if you cannot provide a definition, remove the "ander" entry (or
mark it for filtering) so consumer code that looks for real definitions won't
return non-actionable cross-references.

In `@webapp/data/definitions/et_en.json`:
- Line 653: The et_en.json contains harmful/misleading senses (e.g., "lilla"
mapped only to a derogatory slang and "tibla" left unmarked), so update the
generator in build_definitions.py to prefer neutral primary senses and mark or
remove solely-offensive glosses: modify or extend _is_unhelpful_gloss() /
_is_garbage_definition() to perform sense-ranking (prefer non-offensive tags
like neutral/color over tags marked offensive/vulgar/derogatory), add a small
word-level blocklist or whitelist fallback for known problem keys (e.g.,
"lilla", "tibla"), and ensure that if the only captured sense is tagged
offensive you either strip the entry or append a clear "(offensive)" /
"(derogatory)" label to the gloss before writing to et_en.json.
- Line 1277: Update the JSON entry for the key "seiku": change its description
from "partitive plural of tuli" to "partitive plural of seik" so the
cross-reference correctly points to seik (meaning "circumstance, incident,
episode"); locate the "seiku" entry in webapp/data/definitions/et_en.json and
replace the incorrect reference accordingly.

In `@webapp/data/definitions/eu_en.json`:
- Around line 731-733: The JSON entry for the lemma "txema" currently maps to an
inappropriate/corrupted gloss ("MILF"); locate the "txema" key in eu_en.json,
verify the original source/extraction for this lemma, and either replace it with
the correct, neutral English gloss or remove/blank the entry if no trustworthy
gloss exists; ensure the file remains valid JSON and add a short comment or flag
for manual review if the correct gloss is uncertain, and consider adding this
lemma to any content-sanitization/filter list used by the extraction pipeline.


ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d3ac0f7 and c4e34b5.

📒 Files selected for processing (68)
  • scripts/build_definitions.py
  • scripts/capture_wiktionary_fixtures.py
  • tests/fixtures/wiktionary/de.json
  • tests/test_wiktionary_definitions.py
  • tests/test_wiktionary_parser.py
  • tests/wiktionary_test_utils.py
  • webapp/app.py
  • webapp/data/definitions/ar_en.json
  • webapp/data/definitions/az_en.json
  • webapp/data/definitions/bg_en.json
  • webapp/data/definitions/br_en.json
  • webapp/data/definitions/ca_en.json
  • webapp/data/definitions/ckb_en.json
  • webapp/data/definitions/cs_en.json
  • webapp/data/definitions/da_en.json
  • webapp/data/definitions/de.json
  • webapp/data/definitions/de_en.json
  • webapp/data/definitions/el_en.json
  • webapp/data/definitions/en.json
  • webapp/data/definitions/eo_en.json
  • webapp/data/definitions/es_en.json
  • webapp/data/definitions/et_en.json
  • webapp/data/definitions/eu_en.json
  • webapp/data/definitions/fa_en.json
  • webapp/data/definitions/fi_en.json
  • webapp/data/definitions/fo_en.json
  • webapp/data/definitions/fr.json
  • webapp/data/definitions/fr_en.json
  • webapp/data/definitions/fur_en.json
  • webapp/data/definitions/fy_en.json
  • webapp/data/definitions/ga_en.json
  • webapp/data/definitions/gd_en.json
  • webapp/data/definitions/gl_en.json
  • webapp/data/definitions/he_en.json
  • webapp/data/definitions/hr_en.json
  • webapp/data/definitions/hu_en.json
  • webapp/data/definitions/hy_en.json
  • webapp/data/definitions/ia_en.json
  • webapp/data/definitions/is_en.json
  • webapp/data/definitions/it.json
  • webapp/data/definitions/it_en.json
  • webapp/data/definitions/ka_en.json
  • webapp/data/definitions/la_en.json
  • webapp/data/definitions/lb_en.json
  • webapp/data/definitions/lt_en.json
  • webapp/data/definitions/lv_en.json
  • webapp/data/definitions/mi_en.json
  • webapp/data/definitions/mk_en.json
  • webapp/data/definitions/mn_en.json
  • webapp/data/definitions/nb_en.json
  • webapp/data/definitions/nds_en.json
  • webapp/data/definitions/ne_en.json
  • webapp/data/definitions/nl_en.json
  • webapp/data/definitions/nn_en.json
  • webapp/data/definitions/oc_en.json
  • webapp/data/definitions/pl_en.json
  • webapp/data/definitions/pt.json
  • webapp/data/definitions/pt_en.json
  • webapp/data/definitions/ro_en.json
  • webapp/data/definitions/ru_en.json
  • webapp/data/definitions/sk_en.json
  • webapp/data/definitions/sl_en.json
  • webapp/data/definitions/sr_en.json
  • webapp/data/definitions/sv_en.json
  • webapp/data/definitions/tk_en.json
  • webapp/data/definitions/tr_en.json
  • webapp/data/definitions/uk_en.json
  • webapp/data/definitions/vi_en.json
✅ Files skipped from review due to trivial changes (2)
  • webapp/data/definitions/bg_en.json
  • webapp/data/definitions/az_en.json
🚧 Files skipped from review as they are similar to previous changes (8)
  • webapp/data/definitions/gd_en.json
  • webapp/data/definitions/ckb_en.json
  • webapp/data/definitions/fr_en.json
  • webapp/data/definitions/ga_en.json
  • webapp/data/definitions/el_en.json
  • webapp/data/definitions/fur_en.json
  • webapp/data/definitions/fy_en.json
  • webapp/data/definitions/br_en.json

"seent": "partitive singular of seen",
"seepi": "partitive singular of seep",
"seest": "out of, from the inside of (Governs the genitive)",
"seiku": "partitive plural of tuli",

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd webapp/data/definitions && wc -l et_en.json

Repository: Hugo0/wordle

Length of output: 71


🏁 Script executed:

# First, let's look at the specific line and surrounding context
cat -n webapp/data/definitions/et_en.json | sed -n '1270,1285p'

Repository: Hugo0/wordle

Length of output: 870


🏁 Script executed:

# Search for all references to "tuli" in the definitions file
rg 'of tuli' webapp/data/definitions/et_en.json | head -20

Repository: Hugo0/wordle

Length of output: 210


🏁 Script executed:

# Search for entries related to "seik" to understand this word better
rg '"seik' webapp/data/definitions/et_en.json

Repository: Hugo0/wordle

Length of output: 93


🏁 Script executed:

# Check if there are other entries that mention "seiku" as a form
rg 'seiku' webapp/data/definitions/et_en.json

Repository: Hugo0/wordle

Length of output: 93


🌐 Web query:

Estonian word "seiku" morphology partitive plural

💡 Result:

seiku is the partitive plural form of the Estonian noun seik (“matter/thing; circumstance”). In the declension table: seik (nom sg), seiga (gen sg), seika (part sg), seiku (part pl). An alternative partitive plural form seikasid is also attested. [1]

Morphologically, this is the leib-type (ÕS type 22u) pattern where the partitive plural can take -u (e.g., linn → linnu, leib → leibu), matching seik → seiku. [1], [2]

Sources: [1] [2]


🌐 Web query:

Estonian word "seik" meaning definition

💡 Result:

Estonian seik (noun; gen. seiga, part. seika) means “asjaolu, asjalugu, asi” — i.e., a circumstance / fact / detail / incident (episode). [1][2]


🌐 Web query:

Estonian word "tuli" fire partitive plural forms

💡 Result:

Estonian noun tuli (“fire”):

  • Partitive plural (mitmuse osastav): tulesid [1]

Source: Wiktionary declension table for tuli [1].


Fix incorrect grammatical cross-reference for "seiku"

The current entry incorrectly identifies seiku as the partitive plural of tuli (fire). In fact, seiku is the partitive plural of seik ("circumstance, incident, episode"). The partitive plural of tuli (fire) is tulesid.

"seiku": "partitive plural of seik"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/et_en.json` at line 1277, Update the JSON entry for
the key "seiku": change its description from "partitive plural of tuli" to
"partitive plural of seik" so the cross-reference correctly points to seik
(meaning "circumstance, incident, episode"); locate the "seiku" entry in
webapp/data/definitions/et_en.json and replace the incorrect reference
accordingly.

Giscus only supports ~30 languages. Passing unsupported codes (e.g. "hy",
"fo", "ltg") causes it to silently fail. Fall back to 'en' for unsupported
languages.
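A minimal sketch of that fallback (the locale set here is an abridged, illustrative subset; giscus's own documentation has the authoritative list):

```python
# Abridged, illustrative subset of locales giscus ships translations for.
GISCUS_LOCALES = {
    "en", "de", "es", "fr", "it", "nl", "pl", "pt", "ru", "tr", "uk", "vi",
}


def giscus_lang(lang_code: str) -> str:
    """Map an app language code to a giscus data-lang value,
    falling back to English for unsupported codes."""
    return lang_code if lang_code in GISCUS_LOCALES else "en"
```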
Address 4 CR comments from second CodeRabbit review:

- Skip senses tagged derogatory/offensive/slur/vulgar in kaikki data.
  Estonian "lilla" now shows "violet, purple (color)" instead of a slur.
  Estonian "tibla" (ethnic slur) and Basque "txema" (vulgar) filtered out.
- Filter "See X." cross-reference entries (German "ander" etc.)
- Note: Estonian "seiku" has incorrect cross-reference in upstream kaikki
  data ("partitive plural of tuli" should be "of seik") — can't fix.
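The tag-based sense skipping described above can be sketched like this (field names follow the kaikki.org JSONL schema, where each record carries a senses list with tags and glosses; the helper name is illustrative):

```python
BLOCKED_TAGS = {"derogatory", "offensive", "slur", "vulgar"}


def first_acceptable_gloss(entry: dict):
    """Return the first gloss whose sense carries none of the blocked tags,
    or None if every sense is tagged as offensive."""
    for sense in entry.get("senses", []):
        if BLOCKED_TAGS & set(sense.get("tags", [])):
            continue
        glosses = sense.get("glosses")
        if glosses:
            return glosses[0]
    return None
```

With this ordering, an entry like Estonian "lilla" yields its colour sense, while entries whose only senses are tagged as slurs are dropped entirely.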
@Hugo0 Hugo0 merged commit 99ea41c into main Feb 24, 2026
4 checks passed
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

🧹 Nitpick comments (5)
scripts/build_definitions.py (5)

22-28: Add explicit import urllib.error.

urllib.error.HTTPError (line 248) and urllib.error.URLError (line 253) are referenced without urllib.error being explicitly imported. Only urllib.request and urllib.parse are in the import block. While CPython's urllib.request loads urllib.error as an internal side-effect, urllib.error is its own module — "the urllib.error module defines the exception classes for exceptions raised by urllib.request" — meaning relying on the transitive load is fragile and against convention.

🔧 Proposed fix
 import urllib.parse
 import urllib.request
+import urllib.error
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build_definitions.py` around lines 22 - 28, The code references
urllib.error.HTTPError and urllib.error.URLError but only imports urllib.request
and urllib.parse; add an explicit import for urllib.error at the top of the
import block so the HTTPError and URLError symbols are resolved from
urllib.error (update the import list alongside existing imports like
urllib.request and urllib.parse); ensure any uses of HTTPError/URLError remain
fully qualified as urllib.error.HTTPError and urllib.error.URLError.

337-360: Consolidate the re import — move it to the top-level block and use it consistently.

Line 339 does import re inside _clean_gloss(), while line 360 does import re as _re at module level (but after the function). This creates two names for the same module and buries a module-level import mid-file.

♻️ Proposed fix

Move the import to the top block (lines 22-28):

+import re
 import urllib.parse
 import urllib.request

Remove the in-function import:

 def _clean_gloss(gloss):
     """Clean wiki markup artifacts from a gloss string."""
-    import re
-
-    # Remove {{ ... }} template markup
-    gloss = re.sub(r"\{\{[^}]*\}\}", "", gloss)
+    # Remove {{ ... }} template markup
+    gloss = re.sub(r"\{\{[^}]*\}\}", "", gloss)

Remove the mid-file module-level import at line 360 and update the three regex constants to use the top-level re name directly (drop the _re alias):

-import re as _re
-
-# Bare grammatical form labels ...
-_BARE_FORM_RE = _re.compile(
+# Bare grammatical form labels ...
+_BARE_FORM_RE = re.compile(
     ...
 )
-_FORM_OF_RE = _re.compile(...)
+_FORM_OF_RE = re.compile(...)
-_CROSS_REF_RE = _re.compile(...)
+_CROSS_REF_RE = re.compile(...)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build_definitions.py` around lines 337 - 360, The file has duplicate
and misplaced imports of the regex module: remove the in-function import in
_clean_gloss(), delete the mid-file "import re as _re", and add a single
top-level "import re" (with the other top imports block) so the module uses the
same re name everywhere; then update any regex constants or usages that
currently reference _re (the three regex constants/uses referenced after the
function and any re.sub calls inside _clean_gloss) to use the top-level re
identifier consistently.

634-658: --force and --edition are silently no-ops for process and stats.

Both flags are registered on all three subparsers but only consumed by cmd_download. Running process --force or stats --edition native silently does nothing, which is confusing.

♻️ Proposed fix
-    for name, help_text in [
-        ("download", "Download kaikki.org JSONL files"),
-        ("process", "Process downloaded data into definition files"),
-        ("stats", "Show coverage statistics"),
-    ]:
-        sp = sub.add_parser(name, help=help_text)
-        sp.add_argument("--langs", ...)
-        sp.add_argument("--force", ...)
-        sp.add_argument("--edition", ...)
+    for name, help_text in [
+        ("download", "Download kaikki.org JSONL files"),
+        ("process", "Process downloaded data into definition files"),
+        ("stats", "Show coverage statistics"),
+    ]:
+        sp = sub.add_parser(name, help=help_text)
+        sp.add_argument("--langs", type=str, default=None,
+                        help="Comma-separated list of language codes (default: all)")
+
+    # download-only options
+    dl_sp = sub.choices["download"]
+    dl_sp.add_argument("--force", action="store_true",
+                       help="Force re-download even if files exist")
+    dl_sp.add_argument("--edition", choices=["native", "english", "both"], default="both",
+                       help="Which editions to download (default: both)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build_definitions.py` around lines 634 - 658, The --force and
--edition arguments are being added to every subparser in the loop (via
sub.add_parser and sp.add_argument) but only handled by cmd_download, causing
silent no-ops for process and stats; fix by only registering these two args on
the "download" subparser (e.g., check if name == "download" inside the loop
before calling sp.add_argument for --force and --edition) so cmd_process and
cmd_stats don't get unused flags, and ensure cmd_download continues to consume
them as before.

634-658: --force and --edition are silently no-ops for process and stats.

Both flags are registered on all three subparsers (inside the shared for loop) but main() only passes them to cmd_download. A user running process --force or stats --edition native gets no warning that the flags are ignored.

Move these two add_argument calls outside the shared loop and attach them only to the download subparser:

♻️ Proposed fix (sketch)
     for name, help_text in [
         ("download", "Download kaikki.org JSONL files"),
         ("process", "Process downloaded data into definition files"),
         ("stats", "Show coverage statistics"),
     ]:
         sp = sub.add_parser(name, help=help_text)
         sp.add_argument("--langs", type=str, default=None,
                         help="Comma-separated list of language codes (default: all)")
-        sp.add_argument("--force", action="store_true",
-                        help="Force re-download even if files exist (download only)")
-        sp.add_argument("--edition", choices=["native", "english", "both"], default="both",
-                        help="Which editions to download: native, english, or both (default: both)")
+
+    dl = sub.choices["download"]
+    dl.add_argument("--force", action="store_true",
+                    help="Force re-download even if files exist")
+    dl.add_argument("--edition", choices=["native", "english", "both"], default="both",
+                    help="Which editions to download: native, english, or both (default: both)")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build_definitions.py` around lines 634 - 658, The --force and
--edition arguments are being added to every subparser in the for loop (sp =
sub.add_parser(...)) but only used by cmd_download, so they are silently ignored
for the process and stats commands; remove the sp.add_argument(...) calls that
register "--force" and "--edition" from the shared loop and instead add those
two arguments only to the "download" subparser (the parser returned when name ==
"download") so that process and stats no longer accept those flags and download
still receives them when cmd_download is invoked.
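The layout recommended above can be exercised as a standalone sketch (the three subcommand names come from the diff; the rest is illustrative, not the project's actual CLI):

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(prog="build_definitions")
    sub = parser.add_subparsers(dest="command", required=True)

    # Shared flag: every subcommand accepts --langs.
    for name, help_text in [
        ("download", "Download kaikki.org JSONL files"),
        ("process", "Process downloaded data into definition files"),
        ("stats", "Show coverage statistics"),
    ]:
        sp = sub.add_parser(name, help=help_text)
        sp.add_argument("--langs", type=str, default=None,
                        help="Comma-separated list of language codes (default: all)")

    # Download-only flags live on the download subparser alone, so
    # `process --force` now errors out instead of being silently ignored.
    dl = sub.choices["download"]
    dl.add_argument("--force", action="store_true",
                    help="Force re-download even if files exist")
    dl.add_argument("--edition", choices=["native", "english", "both"],
                    default="both",
                    help="Which editions to download (default: both)")
    return parser

args = build_parser().parse_args(["download", "--force", "--edition", "native"])
print(args.command, args.force, args.edition)  # download True native
```

With this layout, `process --force` fails fast with an argparse usage error instead of silently accepting a no-op flag.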

337-360: Consolidate the re import — move it to the top-level block and use it consistently.

_clean_gloss() (line 339) does import re locally, while a module-level import re as _re appears at line 360 — after the function definition and mid-file. This creates two names for the same module and buries a module-level import between function definitions. All three compiled regex constants already use _re; _clean_gloss is the odd one out.

♻️ Proposed fix

Move a single import re to the top-level imports block, remove the in-function import and the misplaced module-level alias, then use re (or _re if you prefer the private alias) consistently throughout:

+import re
 import urllib.parse
 import urllib.request
 def _clean_gloss(gloss):
     """Clean wiki markup artifacts from a gloss string."""
-    import re
-
     # Remove {{ ... }} template markup
     gloss = re.sub(r"\{\{[^}]*\}\}", "", gloss)
-import re as _re
-
 # Bare grammatical form labels ...
-_BARE_FORM_RE = _re.compile(
+_BARE_FORM_RE = re.compile(
     ...
 )
-_FORM_OF_RE = _re.compile(...)
+_FORM_OF_RE = re.compile(...)
-_CROSS_REF_RE = _re.compile(...)
+_CROSS_REF_RE = re.compile(...)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build_definitions.py` around lines 337 - 360, The local import inside
_clean_gloss creates a duplicate name for the regex module and a misplaced
module-level alias; remove the in-function "import re", move a single "import
re" (or "import re as _re") into the top-level imports block, delete the
existing "import re as _re" that appears later, and update _clean_gloss to call
the module consistently (e.g., replace its re.sub calls with _re.sub if you
choose the _re alias) so the file uses one consolidated regex import and name.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@scripts/build_definitions.py`:
- Around line 229-255: The downloader leaves a truncated file at dest_path on
network/OS errors which causes future runs to skip it and process partial data;
update the exception handlers (the except blocks catching urllib.error.URLError
and OSError — and optionally the general exception path that re-raises) to
delete the partial dest_path if it exists (use os.path.exists(dest_path) and
os.remove(dest_path) inside a try/except to ignore removal errors), so
subsequent runs will re-download a fresh file; reference dest_path and
process_jsonl_gz() to validate behavior after the change.
- Around line 229-255: Partial downloads leave a truncated dest_path file that
poisons later runs; modify the exception handlers around the download loop (the
except urllib.error.HTTPError as e and except (urllib.error.URLError, OSError)
as e blocks) to remove the partial file if it exists before returning or
re-raising (use os.path.exists(dest_path) and os.remove(dest_path)); ensure the
404 branch still behaves as intended (print/return False) but also deletes any
partial dest_path, and keep process_jsonl_gz() behavior intact by ensuring no
truncated .gz remains after a failed download.
- Around line 22-28: The code uses urllib.error.HTTPError and
urllib.error.URLError but only imports urllib.request; add an explicit top-level
import for urllib.error alongside the other imports in
scripts/build_definitions.py so the references to HTTPError and URLError (used
later in the file) resolve without relying on urllib.request's internal imports;
ensure the new import appears with the existing imports (near imports like
urllib.parse and urllib.request) so references to urllib.error.HTTPError and
urllib.error.URLError remain valid.

In `@webapp/data/definitions/az_en.json`:
- Around line 1428-1430: The JSON entry with key "zənci" contains a racial slur
and must be removed or replaced with a neutral, non-derogatory gloss; locate the
"zənci" entry in webapp/data/definitions/az_en.json and either delete the entire
key/value pair or replace the value with an appropriate neutral gloss (e.g.,
"Black person" or "person of African descent"), then run any related
localization/tests and update any references to "zənci" elsewhere in the
codebase to the new key/value if renamed.

In `@webapp/data/definitions/de_en.json`:
- Around line 154-155: Remove or sanitize the harmful gloss for the "kanak"
entry (and similarly "nutte" at 219-220) in webapp/data/definitions/de_en.json
by replacing the explicit slur text with a neutral placeholder or an
allowed-content description (e.g., "[removed]" or "derogatory term — redacted");
then add or enable a pipeline filter (e.g., implement or update a
sanitizeGlosses/filterGlosses step called during extraction or import) that
checks gloss text against the abusive-words list and either strips or replaces
offending phrases to prevent them from being written back into de_en.json in
future imports.

In `@webapp/data/definitions/fur_en.json`:
- Around line 52-53: The entry for the Furlan lemma "coion" contains a
derogatory gloss and should not be surfaced as-is; locate the "coion" key in
fur_en.json and either remove the entire key-value pair or replace its gloss
with a neutral placeholder (e.g., "[derogatory]" or an empty string) and add a
sensitivity flag (e.g., "coion_sensitive": true or a "sensitive": true field in
the entry) so the UI/filtering logic can block or flag it accordingly; ensure
any change follows the existing JSON entry structure used for other words.

In `@webapp/data/definitions/gd_en.json`:
- Around line 142-143: The dataset contains explicit vulgar/derogatory glosses
(e.g., JSON keys "caile" and "galla" with vulgar senses) that must be removed or
sanitized before shipping; update gd_en.json to either drop those senses or
replace the text with a neutral placeholder (e.g., "[derogatory term removed]")
for the entries "caile" and any other flagged keys (see around entries at the
equivalent of lines 426-427), and ensure the ingestion/filtering pipeline (the
component that processes gd_en.json before serving) enforces the new rule by
detecting vulgar/derogatory glosses and omitting or sanitizing them so they
cannot surface to users.

---

Nitpick comments:
In `@scripts/build_definitions.py`:
- Around line 22-28: The code references urllib.error.HTTPError and
urllib.error.URLError but only imports urllib.request and urllib.parse; add an
explicit import for urllib.error at the top of the import block so the HTTPError
and URLError symbols are resolved from urllib.error (update the import list
alongside existing imports like urllib.request and urllib.parse); ensure any
uses of HTTPError/URLError remain fully qualified as urllib.error.HTTPError and
urllib.error.URLError.
- Around line 337-360: The file has duplicate and misplaced imports of the regex
module: remove the in-function import in _clean_gloss(), delete the mid-file
"import re as _re", and add a single top-level "import re" (with the other top
imports block) so the module uses the same re name everywhere; then update any
regex constants or usages that currently reference _re (the three regex
constants/uses referenced after the function and any re.sub calls inside
_clean_gloss) to use the top-level re identifier consistently.
- Around line 634-658: The --force and --edition arguments are being added to
every subparser in the loop (via sub.add_parser and sp.add_argument) but only
handled by cmd_download, causing silent no-ops for process and stats; fix by
only registering these two args on the "download" subparser (e.g., check if name
== "download" inside the loop before calling sp.add_argument for --force and
--edition) so cmd_process and cmd_stats don't get unused flags, and ensure
cmd_download continues to consume them as before.
- Around line 634-658: The --force and --edition arguments are being added to
every subparser in the for loop (sp = sub.add_parser(...)) but only used by
cmd_download, so they are silently ignored for the process and stats commands;
remove the sp.add_argument(...) calls that register "--force" and "--edition"
from the shared loop and instead add those two arguments only to the "download"
subparser (the parser returned when name == "download") so that process and
stats no longer accept those flags and download still receives them when
cmd_download is invoked.
- Around line 337-360: The local import inside _clean_gloss creates a duplicate
name for the regex module and a misplaced module-level alias; remove the
in-function "import re", move a single "import re" (or "import re as _re") into
the top-level imports block, delete the existing "import re as _re" that appears
later, and update _clean_gloss to call the module consistently (e.g., replace
its re.sub calls with _re.sub if you choose the _re alias) so the file uses one
consolidated regex import and name.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4e34b5 and 11de96f.

📒 Files selected for processing (60)
  • scripts/build_definitions.py
  • webapp/data/definitions/ar_en.json
  • webapp/data/definitions/az_en.json
  • webapp/data/definitions/bg_en.json
  • webapp/data/definitions/ca_en.json
  • webapp/data/definitions/cs.json
  • webapp/data/definitions/cs_en.json
  • webapp/data/definitions/da_en.json
  • webapp/data/definitions/de.json
  • webapp/data/definitions/de_en.json
  • webapp/data/definitions/el.json
  • webapp/data/definitions/el_en.json
  • webapp/data/definitions/en.json
  • webapp/data/definitions/eo_en.json
  • webapp/data/definitions/es.json
  • webapp/data/definitions/es_en.json
  • webapp/data/definitions/et_en.json
  • webapp/data/definitions/eu_en.json
  • webapp/data/definitions/fa_en.json
  • webapp/data/definitions/fi_en.json
  • webapp/data/definitions/fr.json
  • webapp/data/definitions/fr_en.json
  • webapp/data/definitions/fur_en.json
  • webapp/data/definitions/ga_en.json
  • webapp/data/definitions/gd_en.json
  • webapp/data/definitions/gl_en.json
  • webapp/data/definitions/he_en.json
  • webapp/data/definitions/hr_en.json
  • webapp/data/definitions/hu_en.json
  • webapp/data/definitions/hy_en.json
  • webapp/data/definitions/ia_en.json
  • webapp/data/definitions/is_en.json
  • webapp/data/definitions/it.json
  • webapp/data/definitions/it_en.json
  • webapp/data/definitions/ka_en.json
  • webapp/data/definitions/la_en.json
  • webapp/data/definitions/lb_en.json
  • webapp/data/definitions/lt_en.json
  • webapp/data/definitions/lv_en.json
  • webapp/data/definitions/mk_en.json
  • webapp/data/definitions/nb_en.json
  • webapp/data/definitions/nl.json
  • webapp/data/definitions/nl_en.json
  • webapp/data/definitions/nn_en.json
  • webapp/data/definitions/oc_en.json
  • webapp/data/definitions/pl.json
  • webapp/data/definitions/pl_en.json
  • webapp/data/definitions/pt.json
  • webapp/data/definitions/pt_en.json
  • webapp/data/definitions/ro_en.json
  • webapp/data/definitions/sk_en.json
  • webapp/data/definitions/sl_en.json
  • webapp/data/definitions/sr_en.json
  • webapp/data/definitions/sv_en.json
  • webapp/data/definitions/tr.json
  • webapp/data/definitions/tr_en.json
  • webapp/data/definitions/uk_en.json
  • webapp/data/definitions/vi.json
  • webapp/data/definitions/vi_en.json
  • webapp/templates/word.html
✅ Files skipped from review due to trivial changes (1)
  • webapp/data/definitions/el_en.json
🚧 Files skipped from review as they are similar to previous changes (6)
  • webapp/data/definitions/fr_en.json
  • webapp/data/definitions/ga_en.json
  • webapp/data/definitions/es_en.json
  • webapp/data/definitions/eo_en.json
  • webapp/data/definitions/et_en.json
  • webapp/data/definitions/bg_en.json

Comment on lines +22 to +28
import argparse
import gzip
import json
import os
import sys
import urllib.parse
import urllib.request
⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, check the imports section (lines 22-28)
sed -n '22,28p' scripts/build_definitions.py

Repository: Hugo0/wordle

Length of output: 158


🏁 Script executed:

# Check the lines where urllib.error is supposedly used (248, 253)
sed -n '240,260p' scripts/build_definitions.py

Repository: Hugo0/wordle

Length of output: 885


🏁 Script executed:

# Search the entire file for all references to urllib.error
rg "urllib\.error" scripts/build_definitions.py -n

Repository: Hugo0/wordle

Length of output: 153


🌐 Web query:

Python urllib.error accessible after import urllib.request side effect

💡 Result:

urllib is a package. When you run:

import urllib.request

Python first imports the package urllib, then loads the submodule urllib.request. During its initialization, urllib.request itself imports urllib.error (it defines the exception types that urllib.request raises), so that submodule gets loaded too. Once a submodule is loaded, Python attaches it as an attribute on the package object, which is why this works afterward:

urllib.error

This is an import side-effect of urllib.request importing urllib.error, not a general rule that importing one submodule always makes sibling submodules available. [1][2]

Sources: Python docs on urllib.error [1], explanation of package/submodule attribute population via imports [2].


Add explicit import urllib.error.

urllib.error.HTTPError (line 248) and urllib.error.URLError (line 253) are used but not explicitly imported. While urllib.error is accessible because urllib.request happens to import it internally, that is an implementation detail of CPython's urllib package. An explicit import is clearer and safer against future standard-library changes.

🔧 Proposed fix
 import urllib.parse
 import urllib.request
+import urllib.error
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build_definitions.py` around lines 22 - 28, The code uses
urllib.error.HTTPError and urllib.error.URLError but only imports
urllib.request; add an explicit top-level import for urllib.error alongside the
other imports in scripts/build_definitions.py so the references to HTTPError and
URLError (used later in the file) resolve without relying on urllib.request's
internal imports; ensure the new import appears with the existing imports (near
imports like urllib.parse and urllib.request) so references to
urllib.error.HTTPError and urllib.error.URLError remain valid.

Comment on lines +229 to +255
os.makedirs(os.path.dirname(dest_path), exist_ok=True)
with open(dest_path, "wb") as out:
while True:
chunk = response.read(1024 * 256) # 256KB chunks
if not chunk:
break
out.write(chunk)
downloaded += len(chunk)
if total:
pct = downloaded * 100 // total
mb = downloaded / (1024 * 1024)
total_mb = total / (1024 * 1024)
print(
f"\r {mb:.1f}/{total_mb:.1f} MB ({pct}%)",
end="",
flush=True,
)
print(f"\r Done: {downloaded / (1024 * 1024):.1f} MB" + " " * 20)
return True
except urllib.error.HTTPError as e:
if e.code == 404:
print(" Not found (404) — skipping")
return False
raise
except (urllib.error.URLError, OSError) as e:
print(f" Error downloading: {e}")
return False
⚠️ Potential issue | 🟠 Major

Partial file left on failed download will silently poison future runs.

If the connection drops mid-write, dest_path is left as a truncated .gz. On the next invocation os.path.isfile(dest) and not force returns True, so the file is skipped. process_jsonl_gz() then reads only the partial content and silently produces incomplete definitions.

Remove the incomplete file in the error handler so the next run downloads it fresh:

🔧 Proposed fix
     except (urllib.error.URLError, OSError) as e:
+        if os.path.isfile(dest_path):
+            os.remove(dest_path)
         print(f"  Error downloading: {e}")
         return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build_definitions.py` around lines 229 - 255, The downloader leaves a
truncated file at dest_path on network/OS errors which causes future runs to
skip it and process partial data; update the exception handlers (the except
blocks catching urllib.error.URLError and OSError — and optionally the general
exception path that re-raises) to delete the partial dest_path if it exists (use
os.path.exists(dest_path) and os.remove(dest_path) inside a try/except to ignore
removal errors), so subsequent runs will re-download a fresh file; reference
dest_path and process_jsonl_gz() to validate behavior after the change.

⚠️ Potential issue | 🟠 Major

Partial download file silently poisons subsequent runs.

If the connection drops during the write loop (lines 231–236), dest_path is left as a truncated .gz. On the next invocation, os.path.isfile(dest) and not force at lines 296 and 322 returns True, so the partial file is never re-fetched. process_jsonl_gz() then reads the truncated gzip and silently produces an incomplete definition set with no warning.

Delete the partial file in the error handler so subsequent runs start clean:

🔧 Proposed fix
     except (urllib.error.URLError, OSError) as e:
+        if os.path.isfile(dest_path):
+            os.remove(dest_path)
         print(f"  Error downloading: {e}")
         return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@scripts/build_definitions.py` around lines 229 - 255, Partial downloads leave
a truncated dest_path file that poisons later runs; modify the exception
handlers around the download loop (the except urllib.error.HTTPError as e and
except (urllib.error.URLError, OSError) as e blocks) to remove the partial file
if it exists before returning or re-raising (use os.path.exists(dest_path) and
os.remove(dest_path)); ensure the 404 branch still behaves as intended
(print/return False) but also deletes any partial dest_path, and keep
process_jsonl_gz() behavior intact by ensuring no truncated .gz remains after a
failed download.
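A self-contained sketch of a downloader with the cleanup applied (helper names are illustrative; progress printing from the original is omitted for brevity):

```python
import os
import urllib.error
import urllib.request

def _remove_partial(path):
    """Best-effort removal of a truncated download."""
    try:
        if os.path.isfile(path):
            os.remove(path)
    except OSError:
        pass  # cleanup failure should not mask the original error

def download(url, dest_path):
    """Download url to dest_path; never leave a partial file behind."""
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    try:
        with urllib.request.urlopen(url, timeout=30) as resp, \
             open(dest_path, "wb") as out:
            while True:
                chunk = resp.read(256 * 1024)  # 256KB chunks
                if not chunk:
                    break
                out.write(chunk)
        return True
    except urllib.error.HTTPError as e:
        _remove_partial(dest_path)
        if e.code == 404:
            return False  # missing edition: skip, nothing left on disk
        raise
    except (urllib.error.URLError, OSError):
        _remove_partial(dest_path)
        return False
```

With this shape, a dropped connection leaves no file at `dest_path`, so the `os.path.isfile(dest) and not force` skip check correctly triggers a fresh download on the next run.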

Comment on lines +1428 to +1430
"zəmin": "soil, land, earth",
"zənci": "black person of African descent, Negro",
"zənən": "lady",
⚠️ Potential issue | 🟠 Major

Remove slur-containing glosses (“zənci”).

This entry includes a racial slur and should be filtered or sanitized to align with the PR’s offensive/derogatory filtering intent.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/az_en.json` around lines 1428 - 1430, The JSON entry
with key "zənci" contains a racial slur and must be removed or replaced with a
neutral, non-derogatory gloss; locate the "zənci" entry in
webapp/data/definitions/az_en.json and either delete the entire key/value pair
or replace the value with an appropriate neutral gloss (e.g., "Black person" or
"person of African descent"), then run any related localization/tests and update
any references to "zənci" elsewhere in the codebase to the new key/value if
renamed.

Comment on lines +154 to +155
"kanak": "slang form of Kanake (“wog”)",
"kanns": "contraction of kann + es",
⚠️ Potential issue | 🟠 Major

Remove slur-containing glosses to align with the new filter.

The gloss for "kanak" contains a racial slur, and "nutte" is explicitly derogatory. These should be filtered or sanitized in the extraction pipeline to prevent harmful content from reaching users.

Also applies to: 219-220

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/de_en.json` around lines 154 - 155, Remove or
sanitize the harmful gloss for the "kanak" entry (and similarly "nutte" at
219-220) in webapp/data/definitions/de_en.json by replacing the explicit slur
text with a neutral placeholder or an allowed-content description (e.g.,
"[removed]" or "derogatory term — redacted"); then add or enable a pipeline
filter (e.g., implement or update a sanitizeGlosses/filterGlosses step called
during extraction or import) that checks gloss text against the abusive-words
list and either strips or replaces offending phrases to prevent them from being
written back into de_en.json in future imports.

Comment on lines +52 to +53
"coion": "idiot, fool, sucker",
"colaç": "ring-shaped cake",
⚠️ Potential issue | 🟡 Minor

Consider filtering derogatory glosses (e.g., “coion”).

If the offensive/derogatory filter is intended to catch insults, this entry should likely be removed or flagged to avoid surfacing pejorative content.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/fur_en.json` around lines 52 - 53, The entry for the
Furlan lemma "coion" contains a derogatory gloss and should not be surfaced
as-is; locate the "coion" key in fur_en.json and either remove the entire
key-value pair or replace its gloss with a neutral placeholder (e.g.,
"[derogatory]" or an empty string) and add a sensitivity flag (e.g.,
"coion_sensitive": true or a "sensitive": true field in the entry) so the
UI/filtering logic can block or flag it accordingly; ensure any change follows
the existing JSON entry structure used for other words.

Comment on lines +142 to +143
"caile": "vulgar girl, quean, hussy",
"cainb": "hemp",
⚠️ Potential issue | 🟠 Major

Filter remaining vulgar/derogatory glosses in this dataset.

Entries like "caile" and "galla" still include vulgar/derogatory senses. If the new filter is meant to exclude this class of content, please drop or sanitize these via the filtering pipeline so they don’t surface to users.

Also applies to: 426-427

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@webapp/data/definitions/gd_en.json` around lines 142 - 143, The dataset
contains explicit vulgar/derogatory glosses (e.g., JSON keys "caile" and "galla"
with vulgar senses) that must be removed or sanitized before shipping; update
gd_en.json to either drop those senses or replace the text with a neutral
placeholder (e.g., "[derogatory term removed]") for the entries "caile" and any
other flagged keys (see around entries at the equivalent of lines 426-427), and
ensure the ingestion/filtering pipeline (the component that processes gd_en.json
before serving) enforces the new rule by detecting vulgar/derogatory glosses and
omitting or sanitizing them so they cannot surface to users.
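One possible shape for the requested filter step — everything here is hypothetical (`sanitize_gloss`, the marker regex, and the blocklist are not in the PR; the real pipeline's `_is_garbage_definition()` targets markup, not offensive content), and marker matching alone misses slurs whose glosses carry no label, which is why a curated blocklist is also needed:

```python
import re

# Label words that kaikki/Wiktionary glosses often attach to offensive
# senses. Illustrative only; a real list needs curation per language.
_DEROGATORY_MARKER_RE = re.compile(
    r"\b(slur|derogatory|offensive|vulgar|pejorative)\b", re.IGNORECASE
)

def sanitize_gloss(word, gloss, blocklist=frozenset()):
    """Return the gloss, or None if the entry should be dropped.

    Drops entries whose gloss self-labels as derogatory/vulgar, or whose
    headword appears on an explicit curated blocklist.
    """
    if word in blocklist:
        return None
    if _DEROGATORY_MARKER_RE.search(gloss):
        return None
    return gloss

print(sanitize_gloss("colaç", "ring-shaped cake"))  # ring-shaped cake
```

Running this during `process` (before writing each `*_en.json`) keeps flagged entries from being reintroduced on the next import.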
