feat: Wiktionary definition overhaul, stats redesign, word page improvements #129
Arabic (52% win rate → should improve):
- Add character difficulty filter to improve_word_lists.py (3% threshold)
- Remove 212 words with rare chars (آ إ ؤ ى ظ أ ذ غ), 1,838 daily words remain
- Blocklist 31 proper nouns/place names (باريس, تركيا, روسيا, etc.)

Hebrew (דרוזי/Druze and similar issues):
- Expand blocklist by 1,470 entries: suffix variant dedup (209), words not in wordfreq (442), low-frequency words (215), proper nouns/demonyms (28)
- Reduce daily pool to 1,000 words: 100% wordfreq-verified, 0 suffix groups
- Was: 67% normal words, 89 suffix groups, 442 unknown → now: 100% clean

New tooling:
- scripts/analyze_word_quality.py: char-freq, difficult-words, hebrew-suffixes, hebrew-quality subcommands for analyzing word list quality
- Regression tests for Arabic char difficulty and Hebrew suffix deduplication
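The character difficulty filter described above could look roughly like the sketch below. It is not the code in improve_word_lists.py: the function names are hypothetical, and it assumes the 3% threshold means "the character appears in fewer than 3% of the words in the list", which the commit message does not spell out.

```python
from collections import Counter


def rare_characters(words, threshold=0.03):
    """Return characters that appear in fewer than `threshold` of the words.

    Assumption: the threshold is a share of words, not of total characters.
    """
    if not words:
        return set()
    # Count each character once per word so repeated letters do not inflate it.
    counts = Counter(ch for word in words for ch in set(word))
    return {ch for ch, n in counts.items() if n / len(words) < threshold}


def filter_difficult_words(words, threshold=0.03):
    """Split a word list into (kept, removed) by rare-character membership."""
    rare = rare_characters(words, threshold)
    kept = [w for w in words if not (set(w) & rare)]
    removed = [w for w in words if set(w) & rare]
    return kept, removed
```

Counting each character once per word keeps a letter that is doubled in a few words from looking more common than it is.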
Add Daily/Main/Supplement/Blocklist columns to the language table, color-coded daily word counts, curated language count, and a legend explaining the quality indicators.
Three related features:
1. Percentile in game modal: POST to /api/word-stats now returns
community stats JSON. After winning, the TODAY tab shows
"Better than X% of players" linking to the word page.
2. Tabbed /stats page: "My Stats" tab shows personal stats from
localStorage (inline JS, no Vue needed). "Languages" tab shows
server-rendered community data with simplified columns (Daily
Words + Total Words instead of Main/Supplement/Blocklist).
Homepage stats icon now links to /stats instead of opening modal.
3. Language word archive: New /{lang}/words route with paginated
gallery of all historical daily words. Each card shows word tiles,
date, definition snippet, community stats, and lazy-loaded AI art.
Includes SEO (CollectionPage schema, rel=prev/next, sitemap).
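The "Better than X% of players" figure in feature 1 could be derived from community stats along these lines. This is a sketch under assumptions: the function name and the input shape (a per-attempt count of winners plus a loss count) are hypothetical, and the actual JSON returned by /api/word-stats is not shown in this PR text.

```python
def better_than_percent(distribution, losses, my_attempts):
    """Percent of players who needed more guesses than `my_attempts` or lost.

    `distribution` maps attempt count -> number of winning players (assumed shape).
    Returns None when there is no community data yet.
    """
    total = sum(distribution.values()) + losses
    if total == 0:
        return None
    worse = losses + sum(
        n for attempts, n in distribution.items() if attempts > my_attempts
    )
    return round(100 * worse / total)
```

Players who tied with the current player are not counted as "worse" here, so the figure is a conservative one.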
Add GitHub Discussions-backed comments (via Giscus) on word pages. Zero backend changes — loads lazily as an iframe. Dark mode syncs with the site's localStorage preference. Also update README TODOs and fix pnpm dev to kill stale processes on port 8000.
Phase 1 (looser parser):
- Add POS headers for 14 new languages (Croatian, Serbian, Greek, Hebrew, Ukrainian, Czech, Slovak, Romanian, Georgian, Catalan, Indonesian, Slovenian, Spanish form headers)
- Add fallback heuristic for Wiktionaries without POS subsections (Hebrew)
- Skip gender markers, inflection tables, Dutch neuter 'o'

Phase 2 (broader lemma lookup):
- Add LEMMA_STRIP_RULES for 22 languages (Romance, Germanic, Slavic, Finno-Ugric suffix stripping)
- Refactor candidate generation into _build_candidates() helper
- Follow "form of X" definitions to the base word instead of skipping

Phase 3 (LLM fallback):
- Add gpt-4o-mini as last-resort definition source (45-language allowlist)
- Cached permanently alongside Wiktionary results

Tests: 47 offline + 61 network = 108 total, all passing.
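To illustrate the Phase 2 idea, here is a minimal sketch of suffix-stripping candidate generation. The rule table, the minimum stem length, and the function below are hypothetical stand-ins; the real LEMMA_STRIP_RULES and _build_candidates() in webapp/wiktionary.py may be structured differently.

```python
# Hypothetical rules as (suffix to strip, replacement); not the project's table.
EXAMPLE_STRIP_RULES = {
    "es": [("s", "")],
    "it": [("i", "e"), ("i", "o")],
    "fr": [("x", ""), ("s", "")],
}

# Assumed guard so that very short words are not stripped down to nothing.
MIN_STEM_LENGTH = 3


def build_candidates(word, lang_code, rules=EXAMPLE_STRIP_RULES):
    """Return lookup candidates: the word itself first, then stripped variants."""
    candidates = [word]
    for suffix, replacement in rules.get(lang_code, []):
        if not word.endswith(suffix):
            continue
        stem = word[: len(word) - len(suffix)]
        if len(stem) < MIN_STEM_LENGTH:
            continue
        candidate = stem + replacement
        if candidate not in candidates:
            candidates.append(candidate)
    return candidates
```

With these example rules, `build_candidates("galas", "es")` yields the original word followed by "gala", so an exact match on Wiktionary is always tried before any guessed lemma.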
- Word images after game completion now link to the word page
- Remove inline percentile badge from game modal (available on word page)
- Track communityTotal for future use

- Replace best-active-streak section with guess distribution bar chart
- Add avg attempts and best streak to summary grid
- Show per-language avg attempts in language list
- Compute distribution from all game results across languages
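The aggregation behind the distribution chart and the average-attempts figure can be sketched as follows. The page itself computes this in inline JavaScript from localStorage; the Python below only illustrates the arithmetic, and the result shape and the six-guess limit are assumptions, not taken from the PR.

```python
from collections import Counter


def summarize_results(results, max_attempts=6):
    """Aggregate game results into a guess distribution and average attempts.

    Each result is assumed to look like {"lang": "he", "won": True, "attempts": 4}.
    Only wins contribute to the distribution and to the average.
    """
    wins = [r for r in results if r["won"]]
    counts = Counter(r["attempts"] for r in wins)
    distribution = {n: counts.get(n, 0) for n in range(1, max_attempts + 1)}
    avg_attempts = sum(r["attempts"] for r in wins) / len(wins) if wins else None
    return distribution, avg_attempts
```

Filtering `results` by `lang` before calling the function would give the per-language average shown in the language list.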
@coderabbitai review Please do an in-depth review of this PR. Key areas to scrutinize:
loading="lazy" prevented images from loading until scrolled into view, but combined with on-demand DALL-E generation (15-20s), images would time out or get cancelled on first visit. Removing lazy loading lets all images start loading immediately — most are cached so this is fine.
| Area | Severity | Action Required |
|---|---|---|
| 1. LEMMA_STRIP_RULES | Low | ✅ Optional: Add length guards |
| 2. Fallback extraction | Low | ✅ Optional: Add transliteration detection |
| 3. LLM prompt injection | Medium | |
| 4. Test coverage | Info | ✅ Optional: Add edge case tests |
| 5. stats.html XSS | Low | ✅ Optional: HTML-escape localStorage |
Critical path: Item #3 (LLM security) should be addressed before exposing any API endpoint that accepts user-provided words.
🏁 Script executed:

    #!/bin/bash
    # Check if there are any API endpoints that expose the wiktionary functions to user input
    rg -n "fetch_llm_definition|fetch_definition_cached" --type=py -A5 -B5 -g '!test*.py' -g '!wiktionary.py'

Length of output: 3978

🏁 Script executed:

    #!/bin/bash
    # Find the endpoint route definitions that call fetch_definition_cached
    rg -n "@app.route" --type=py -A20 webapp/app.py | grep -B20 "fetch_definition_cached"

Length of output: 948

🏁 Script executed:

    #!/bin/bash
    # Check what validates the word parameter before it reaches fetch_definition_cached
    rg -n "def.*definition" --type=py -B5 -A15 webapp/app.py | head -80

Length of output: 3085
Excellent! I found the validation. Let me update my security assessment for item #3:
Updated: 3. 🔒 fetch_llm_definition() — Prompt Injection Risk (LOW Severity)
Critical finding: The API endpoint validates words against a whitelist before calling fetch_definition_cached:
    # Lines 1286-1292 in webapp/app.py
    word_lower = word.lower()
    all_words = set(language_codes_5words[lang_code]) | set(
        language_codes_5words_supplements.get(lang_code, [])
    )
    if word_lower not in all_words:
        return jsonify({"error": "unknown word"}), 404
    result = fetch_definition_cached(word_lower, lang_code)  # Only whitelisted words reach here

Revised security assessment:
- Current risk: LOW (words come from curated wordlists, not arbitrary user input)
- Attack vector: Requires compromising the wordlist files themselves
- Best practice: Still recommend input sanitization for defense-in-depth
The validation significantly reduces the risk. However, I'd still add basic sanitization to fetch_llm_definition() as a safety net if the function is ever called from other contexts:
    def fetch_llm_definition(word, lang_code):
        # ... existing code ...
        # Defense-in-depth: reject words with control characters or quotes
        # (should never happen given upstream validation, but safety net for future use)
        if any(c in word for c in ['"', "'", '\n', '\r', '\t']):
            return None
        prompt = (
            f"Define the {lang_name} word '{word}' in one short sentence in English. "
            ...
        )

This is now optional rather than recommended, but good hygiene for future-proofing.
Final Assessment Summary
All concerns are low severity or informational. The code is production-ready with these findings:
✅ Ship-blocking issues: None
📝 Known limitations: Documented in PR objectives (proper nouns, aggressive lemma rules)
The PR demonstrates excellent engineering with comprehensive tests, layered fallbacks, and thoughtful error handling. The security posture is solid given the whitelist validation. Great work! 🎉
On Android PWA, Chrome's "Translate this page?" bar steals ~40px from the viewport. With overflow-hidden and h-[100dvh], the bottom keyboard row gets clipped.

Fixes:
- Add translate="no" and <meta name="google" content="notranslate"> to suppress Chrome's translate prompt (game is already localized)
- Switch from 100dvh to 100svh (small viewport height) which accounts for browser chrome, translate bars, and gesture navigation bars

- Add preload_app=True to gunicorn dev config so data loads before the port opens, preventing hanging requests during worker boot
- Remove OS dark mode fallback in catch block of dark_mode_init to match Vue default (light), preventing flash of dark then light in incognito/private browsing

- Add missing og:type and Twitter Card meta tags to stats.html
- Fix words_hub.html JSON-LD numberOfItems to match page item count
- Add paginated /words hub URLs to sitemap (3600+ new entries)
- Use 4 gunicorn workers in dev config
Summary
Major improvements across three areas: definition coverage, stats page, and word page UX.
1. Wiktionary Definition Overhaul (webapp/wiktionary.py)

Definition coverage was ~26% native, ~16% English fallback, ~56% nothing. Three-layer improvement:

Looser parser: Add POS headers for 14 new languages (Croatian, Serbian, Greek, Hebrew, Ukrainian, Czech, Slovak, Romanian, Georgian, Catalan, Indonesian, Slovenian, Spanish form headers). Add fallback heuristic for Wiktionaries without POS subsections (fixes Hebrew). Skip gender markers, inflection tables, Dutch neuter.

Broader lemma lookup: LEMMA_STRIP_RULES for 22 languages strips inflected forms (e.g. Spanish "galas" → "gala", Italian "cani" → "cane", French "beaux" → "beau"). _follow_form_of() resolves "plural of X" definitions to the actual meaning instead of skipping them.

LLM fallback: gpt-4o-mini as last resort when both Wiktionary sources fail. 45-language allowlist, cached permanently, returns "source": "ai" for UI labeling. Known limitation: doesn't reject proper nouns/demonyms.

2. Stats Page Redesign (webapp/templates/stats.html)

3. Word Page & Game Improvements
4. Tests
Known limitations
Test plan