fix: use word-boundary regex for geo-tagging keyword matching #330

Open
princelevant wants to merge 1184 commits into koala73:main from
princelevant:fix/geo-tagging-substring-matching
Conversation

@princelevant
Summary

  • Replaced String.includes() with word-boundary regex (\b...\b) across the entire geo-tagging pipeline to prevent substring false positives
  • Replaced the ambiguous "hts" keyword (matched "rights", "fights", etc.) with "tahrir al-sham" / "hayat tahrir"
  • Added 20 regression tests covering false positive prevention, true positive preservation, and edge cases

Problem

When zooming into Syria on the map, unrelated articles (e.g. French politics mentioning "ambassador") appeared at Syria's coordinates. The keyword "assad" matched as a substring inside "ambassador", and "hts" matched inside "rights", "fights", "flights", etc.

Root cause: keywords >= 5 characters used titleLower.includes(keyword) instead of word-boundary regex.
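A minimal sketch of the failure mode described above, assuming the helper names shown here (they are hypothetical and not taken from the PR). It contrasts a substring check with a word-boundary regex on a headline that contains both "ambassador" and "rights".

```typescript
// Hypothetical helpers for illustration only; names are not from the PR.

// Escape regex metacharacters so a keyword is matched literally.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Substring check: the behaviour identified as the root cause.
function substringMatch(title: string, keyword: string): boolean {
  return title.toLowerCase().includes(keyword);
}

// Word-boundary check: the approach taken in this PR's first commit.
function wordBoundaryMatch(title: string, keyword: string): boolean {
  return new RegExp(`\\b${escapeRegex(keyword)}\\b`, "i").test(title);
}

const headline = "French ambassador defends human rights record";

console.log(substringMatch(headline, "assad"));    // true  (inside "ambassador")
console.log(substringMatch(headline, "hts"));      // true  (inside "rights")
console.log(wordBoundaryMatch(headline, "assad")); // false
console.log(wordBoundaryMatch(headline, "hts"));   // false
console.log(wordBoundaryMatch("Assad regime under pressure", "assad")); // true
```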

Files changed

| File | Change |
| --- | --- |
| src/services/geo-hub-index.ts | Word-boundary regex for all keyword lengths |
| src/components/DeckGLMap.ts | Hotspot keyword matching uses \b regex |
| src/components/Map.ts | Same fix for mobile map |
| src/App.ts | Flash location matching uses \b regex |
| src/services/entity-index.ts | Entity keyword matching uses \b regex |
| src/services/country-instability.ts | Country keyword matching uses \b regex |
| src/services/story-data.ts | Country keyword matching uses \b regex |
| src/services/related-assets.ts | Asset keyword matching uses \b regex |
| src/utils/analysis-constants.ts | includesKeyword() utility uses \b regex |
| src/config/geo.ts | Replaced "hts" with "tahrir al-sham" / "hayat tahrir" |
| tests/geo-keyword-matching.test.mjs | 20 new test cases |

Test plan

  • vite build passes clean
  • All 111 existing tests pass (0 regressions)
  • 20 new tests verify: "ambassador" no longer matches Syria, "rights" no longer matches Damascus, genuine Syria/HTS articles still match correctly

Fixes #324

-KT

🤖 Generated with Claude Code

koala73 and others added 30 commits February 18, 2026 07:43
t() always returns a string (key itself if missing), so || 'English'
fallbacks were unreachable dead code.
t() always returns a string, so || 'English' fallbacks were
unreachable. Removed all 15 instances.
Main variant: NHK World + Nikkei Asia in asia category.
Finance variant: Nikkei Asia in markets category.
Added asia.nikkei.com to RSS proxy allowlist.
Main variant: NHK World + Nikkei Asia in asia category.
Finance variant: Nikkei Asia in markets category.
Added asia.nikkei.com to RSS proxy allowlist.
…keys

- CommunityWidget: add DOM check to prevent duplicate widgets on repeated loadNews() calls
- RuntimeConfigPanel: compare t() result against key path to suppress missing help translations
…glish + Linux AppImage support (koala73#100)

## Summary
- Full i18n system with 14 locales: en, fr, de, es, it, pl, pt, nl, sv,
ru, ar (RTL), zh, ja — all at 1132-key parity
- Eliminated ~110 hardcoded English strings across 50+ source files,
replaced with `t()` calls
- RTL support for Arabic with proper regional code normalization (ar-SA
→ ar)
- Dead English fallback literals (`t() || 'English'`) removed from all
components
- Community discussion floating widget (localized)
- Linux AppImage desktop build support
- Proper noun heuristic fallback for trending keywords when ML
unavailable

## Key changes
- **New**: `src/services/i18n.ts` — i18next setup with language
detection, RTL, locale switching
- **New**: 13 locale JSON files (1132 keys each) in `src/locales/`
- **New**: `src/styles/rtl-overrides.css` +
`src/styles/lang-switcher.css`
- **Modified**: 50+ components/services to use `t()` instead of
hardcoded strings
- **Modified**: `.github/workflows/build-desktop.yml` — Linux CI matrix
- **Modified**: `scripts/desktop-package.mjs` + `download-node.sh` —
Linux target support

## Test plan
- [ ] Verify language switcher shows all 14 languages
- [ ] Switch to Arabic — confirm `dir="rtl"` on `<html>`, layout mirrors
- [ ] Switch to Japanese — confirm all panel labels, tooltips, popups
render in Japanese
- [ ] Switch to French — confirm no English leaks in panels, modals, map
legend
- [ ] Verify `{{count}}` interpolation works in timeAgo strings
- [ ] Verify `tsc --noEmit` passes (confirmed locally)
- [ ] Test community widget dismiss/localStorage persistence
PR koala73#97 only hid the badge itself but the SignalModal kept auto-opening
on new signals. Gate all 5 automatic signalModal.show() calls behind
findingsBadge.isEnabled() so disabling Intelligence Findings also
suppresses the full-screen popup overlay.

Closes koala73#89

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add lasillavacia.com RSS feed to improve Latin American political
coverage. Independent Colombian investigative outlet covering governance,
armed conflict, and regional power dynamics.

Ref koala73#96

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Adds [La Silla Vacía](https://www.lasillavacia.com) RSS feed (`/rss`)
to the `latam` feed category
- Adds source tier entry (Tier 3 — specialty/investigative)
- Colombian independent outlet covering political power structures,
governance, and armed conflict

Ref koala73#96

## Test plan
- [ ] Verify feed loads in LATAM news panel (content is in Spanish)
- [ ] Confirm no duplicate or broken entries in feed list

🤖 Generated with [Claude Code](https://claude.com/claude-code)
## Summary
- PR koala73#97 hid the badge but the `SignalModal` kept auto-opening on new
signals — this is what the reporter was still seeing
- Gates all 5 automatic `this.signalModal?.show()` calls behind
`this.findingsBadge?.isEnabled()` so disabling Intelligence Findings
also suppresses the full-screen popup overlay and sounds
- Signal history is still recorded (`addToSignalHistory`) even when
popup is suppressed, so re-enabling the toggle shows them

Closes koala73#89

## Test plan
- [x] Disable Intelligence Findings via PANELS toggle or right-click
- [x] Wait for signal refresh cycle — no full-screen popup should appear
- [x] Re-enable → popups resume on next signal detection
- [x] Build succeeds with no type errors

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Initializes @sentry/browser early in main.ts with environment
detection (production/preview/development). Disabled on localhost
and Tauri desktop. Traces sampled at 10%.
Resolve instead of reject when the script fails to load (ad blocker,
network issue). Guard initializePlayer against missing YT.Player.
Prevents noisy unhandled rejection errors in Sentry.
…timeout, WebGL context loss, RSS 403s

- storage.ts: add withTransaction() retry wrapper for IndexedDB InvalidStateError on iOS/Safari tab backgrounding
- usa-spending.ts: add 20s AbortController timeout to prevent Safari "Load failed" on stalled POST
- App.ts: add catch to runGuarded() to prevent unhandled rejections from task runner
- main.ts: add Sentry ignoreErrors for WebGL context loss and ResizeObserver loop
- DeckGLMap.ts: add webglcontextlost/restored handlers for graceful GPU recovery
- feeds.ts: route rsshub.app feeds (NHK, MIIT, MOFCOM) through Railway proxy, switch Nikkei Asia and ECFR to Google News proxy
- finance.ts: switch Nikkei Asia to Google News proxy, remove unused railwayRss helper
… extensions)

- Add NotAllowedError, InvalidAccessError, importScripts to Sentry ignoreErrors
- Add global unhandledrejection handler for YouTube IFrame API autoplay blocks
- Add onError handler to deck.gl MapboxOverlay for internal render-cycle races
- withTransaction now returns undefined instead of throwing when
  InvalidStateError persists after retry (transient browser event)
- Add .catch() to fire-and-forget cleanOldSnapshots() call
- Add beforeSend filter to drop minified 1-3 char library errors (e.g., "vd")
- Filter transient network errors (Load failed, Failed to fetch, cancelled)
- Filter browser extension errors (runtime.sendMessage, Java object is gone)
- Filter non-Error promise rejections and SVG image load failures
- Filter MapLibre imageManager null ref during WebGL context restore
- Reset YouTube API promise on load failure to allow retry on next init
- Move USASpending timeout cleanup to finally block
- Log snapshot cleanup errors instead of silently swallowing
…variants

Browser extensions intercept window.fetch causing "Failed to fetch
(gamma-api.polymarket.com)" to leak as unhandled rejection. Remove
the $ anchor so the pattern matches any suffix.
… noise filters

Prevent getProjection null crash when WebGL context is lost by tracking
webglLost flag and skipping all setProps/layer rebuild calls until restored.
Add ignoreErrors for IndexedDB iOS kills, Twitter WebView injection, and
CSP unsafe-eval from extensions.
…List guards

- toggleFullscreen: use void .catch() for Promise-based requestFullscreen/
  exitFullscreen + webkit prefix fallback for iOS Safari (WORLDMONITOR-11/13)
- Narrow /^TypeError: Failed to fetch/ to exact match (was suppressing real
  API failures). Move module-import-failed to beforeSend with extension/
  webview context check instead of blanket ignore (WORLDMONITOR-15)
- Guard classList?.contains and target.closest?. on event targets that may
  not be Elements (WORLDMONITOR-Z/10)
- Add noise filters: Fullscreen request denied, requestFullscreen,
  vc_text_indicators_context (WORLDMONITOR-12)
…er, IndexedDB write-drop

- webkitRequestFullscreen returns void (not Promise) on Safari — use
  try/catch instead of .catch() to avoid undefined.catch() throw
- Module-import beforeSend filter: only suppress when stack frames
  originate from browser extensions, not by URL domain check
- withTransaction: throw on readwrite InvalidStateError after retry
  instead of silently returning undefined (prevents write-drop)
…ections

- Wrap updateBaseline() in try/catch inside loadNewsCategory and intel
  path so IndexedDB write failures don't delete successfully fetched
  and rendered news data (P1)
- Add .catch() to saveCurrentSnapshot() initial call and setInterval
  callback to prevent unhandled promise rejections from IndexedDB
  readwrite failures (P2)
… WebGL link errors

- LiveNewsPanel: player.mute/unMute may not exist before onReady (WORLDMONITOR-16)
- main.ts: add /Program failed to link/ noise filter (WORLDMONITOR-18)
koala73 and others added 22 commits February 24, 2026 05:36
…tion probes (koala73#296)

Sidecar validation probes were missing User-Agent headers, causing
Cloudflare-fronted APIs (e.g. Wingbits) to return 403 which was
incorrectly treated as an auth rejection. Added CHROME_UA to all 13
probes and isCloudflare403() helper to soft-pass CDN blocks.
)

Tauri WKWebView/WebView2 traps target="_blank" navigation, so news
links and other external URLs silently fail to open. Added a global
capture-phase click interceptor that routes cross-origin links through
the existing open_url Tauri command, falling back to window.open.
…ries (koala73#299)

Models like DeepSeek-R1 and QwQ output chain-of-thought as plain text
even with think:false. This caused summaries like "We need to summarize
the top story..." instead of actual news content.

- Remove message.reasoning fallback that used thinking tokens as summary
- Extend tag stripping to <|thinking|>, <reasoning>, <reflection> formats
- Add hasReasoningPreamble() to reject task narration and prompt echoes
- Gate reasoning detection to brief/analysis modes (translate unaffected)
- Bump CACHE_VERSION v3→v4 to invalidate polluted cached summaries
- Add 28 unit tests covering all edge cases
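A rough sketch of the kind of filtering this commit describes, under stated assumptions: the paired open/close tag form, the preamble patterns, and both function names are hypothetical and are not the project's hasReasoningPreamble() implementation. The `<|thinking|>` format mentioned above is not handled here because its closing form is not given.

```typescript
// Assumption: reasoning is wrapped in paired tags such as <reasoning>...</reasoning>.
const REASONING_BLOCK = /<(think|thinking|reasoning|reflection)>[\s\S]*?<\/\1>/gi;

// Remove tagged reasoning blocks and surrounding whitespace.
function stripReasoningTags(text: string): string {
  return text.replace(REASONING_BLOCK, "").trim();
}

// Hypothetical patterns for task narration that should not be used as a summary.
const PREAMBLE_PATTERNS: RegExp[] = [
  /^we need to\b/i,
  /^let me\b/i,
  /^the user (wants|asks|is asking)\b/i,
];

// True when the text opens with narration about the task instead of news content.
function looksLikeReasoningPreamble(summary: string): boolean {
  const trimmed = summary.trim();
  return PREAMBLE_PATTERNS.some((pattern) => pattern.test(trimmed));
}

console.log(stripReasoningTags("<reasoning>step 1</reasoning>Markets fell sharply."));
console.log(looksLikeReasoningPreamble("We need to summarize the top story..."));
```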
…oala73#285)

* fix: sync YouTube live panel mute state with native player controls

* fix: harden YouTube embed mute sync (postMessage origin, interval cleanup, DRY destroy)

---------

Co-authored-by: Elie Habib <elie.habib@gmail.com>
* test: add Playwright e2e tests for flushStaleRefreshes

4 tests covering: stale services flushed on tab focus (hidden > interval),
no-op when hiddenSince is 0, skips non-stale services (hidden < interval),
and 150ms stagger between re-triggered services.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: convert flushStaleRefreshes to fast unit test, fix timeout leaks and timing flakiness

- Move from Playwright e2e to Node.js unit test (tests/ dir)
- Add source contract tests to detect if App.ts method signature drifts
- Clean up all timeouts in afterEach to prevent leaks
- Assert ordering + minimum gaps instead of absolute time windows (CI-safe)
- Add assertions for refreshTimeoutIds state after flush
- Add test for non-stale service timeout preservation

* test: make flush stale refresh tests deterministic

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Elie Habib <elie.habib@gmail.com>
…koala73#302)

* fix: harden desktop embed messaging and secret validation

* fix: harden embed postMessage origin check and add custom channel validation

Security:
- Block wildcard parentOrigin from query params (server-side sanitizer)
- Validate e.origin on incoming postMessage commands in embed
- Remove misleading asset: protocol from allowed list
- Require 2+ markers for Cloudflare challenge detection (drop overly broad 'cloudflare' marker)
- Add ordering comment on isAuthFailure vs isCloudflareChallenge403
- Strengthen embed test assertions with regex + wildcard rejection test

Channel validation:
- Validate YouTube handle format (@<3-30 chars>) before adding
- Verify channel exists on YouTube via /api/youtube/live before adding
- Show "Verifying…" loading state, red border on invalid, offline tolerance
- Return channelExists flag from /api/youtube/live endpoint
* Simplify RSS freshness update to static import

* Refine vendor chunking for map stack in Vite build

* Patch transitive XML parser vulnerability via npm override

* Shim Node child_process for browser bundle warnings

* Filter known onnxruntime eval warning in Vite build

* test: add loaders XML/WMS parser regression coverage

* chore: align fast-xml-parser override with merged dependency set

---------

Co-authored-by: Elie Habib <elie.habib@gmail.com>
…73#306)

- Add levels, trends, fallback keys to top-level countryBrief in en/el/th/vi
  locales (fixes raw key display in intelligence brief and header badge)
- Add Export PDF option to country brief dropdown using scoped print dialog
- Add exportPdf i18n key to all 17 locale files
…73#308)

- Add levels, trends, fallback keys to top-level countryBrief in en/el/th/vi
  locales (fixes raw key display in intelligence brief and header badge)
- Add Export PDF option to country brief dropdown using scoped print dialog
- Add exportPdf i18n key to all 17 locale files
…lity (koala73#313)

WKWebView (Tauri macOS) doesn't support HTML5 Drag and Drop API.
Replace draggable/dragstart/dragover with mousedown/mousemove/mouseup
across panel grid reorder, live channel tabs, and channel settings.
Uses elementFromPoint with same-row detection for accurate horizontal
and vertical drag positioning.
…ala73#315)

- Add panelDragCleanupHandlers to remove document listeners on destroy
- Suppress channel click/edit after drag-end to prevent accidental actions
…oala73#316)

Adds ignoreErrors patterns for Worker constructor, Facebook in-app
browser, UC Browser, duplicate custom elements, WebGPU device limits,
and stale container. Extends beforeSend to suppress TypeErrors from
deck-stack chunk (same pattern as maplibre map chunk).
)

* feat: add AI analysis settings popup to Insights panel (web-only)

Add a gear icon to the AI Insights panel header that opens a settings
popup giving web users explicit control over the AI analysis pipeline.
Users can now toggle cloud AI (Groq/OpenRouter) and browser local model
independently, with a static CTA for Ollama desktop support.

- New ai-flow-settings.ts state layer with localStorage persistence
- SummarizeOptions param added to generateSummary() (backward-compatible)
- InsightsPanel: gear icon, disabled state, generation token for races
- AiFlowPopup: toggles, 250MB warning, status footer, Ollama CTA
- Remove mlWorker.isAvailable gate in App.ts for cloud-only mode
- CSS: popup, toggles, status indicators, disabled state
- i18n: 16 new keys across all 17 locale files with translations

https://claude.ai/code/session_01AgLDUybKNri83vgZQNC3HF

* fix: reset brief cache on settings change, remove dead code in popup

- Reset cachedBrief and lastBriefUpdate in onAiFlowChanged() so new
  provider settings take effect immediately instead of being blocked
  by the 2-minute cooldown with a stale (possibly null) cached brief
- Remove unused isAnyAiProviderEnabled() import and dead `void any`
  in AiFlowPopup.updateStatus()

https://claude.ai/code/session_01AgLDUybKNri83vgZQNC3HF

* fix: invalidate insights brief cache on AI flow changes

---------

Co-authored-by: Claude <noreply@anthropic.com>
…ala73#317)

Adds islandtimes.org/feed/ to the asia region feeds and allowlists the
domain in the RSS proxy.
…re source regions (koala73#319)

Replace 4 scattered settings UIs (gear popup, panels modal, sources modal,
language dropdown) with a single 3-tab modal (General/Panels/Sources).

Sources tab features region pills that dynamically adapt per variant:
- Full: Worldwide, US, Europe, Middle East, Africa, Latin America, Asia-Pacific, Topical, Intelligence
- Tech: Tech News, AI & ML, Startups & VC, Regional Ecosystems, Developer, Cybersecurity, Policy & Research, Media & Podcasts
- Finance: Markets & Analysis, Fixed Income & FX, Commodities, Crypto & Digital, Central Banks & Economy, Deals & Corporate, Financial Regulation, Gulf & MENA

Also reclassifies full-variant feeds: splits monolithic politics into
politics (worldwide), us, and europe; redistributes misplaced sources.

Additional fixes:
- Variant switcher works on localhost via localStorage (no multiple dev servers)
- mapNewsFlash toggle no longer triggers expensive AI re-analysis
- Remove dead intel-findings toggle from desktop settings window
- LiveNewsPanel uses shared SITE_VARIANT (respects localStorage override)
…3#324)

Keyword matching across the geo-tagging pipeline used String.includes()
(substring matching), causing false positives like "assad" matching
inside "ambassador" and tagging unrelated articles to Syria. Replaced
all instances with word-boundary regex (\b...\b) for accurate matching.

Also replaced the ambiguous 3-char "hts" keyword (matched "rights",
"fights", etc.) with unambiguous "tahrir al-sham" / "hayat tahrir".

Fixes koala73#324

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vercel
vercel bot commented Feb 24, 2026

@princelevant is attempting to deploy a commit to the Elie Team on Vercel.

A member of the Team first needs to authorize it.

@koala73
Owner

koala73 commented Feb 24, 2026

Lovely
Was on my todo
Thank you
Will review

@koala73
Owner

koala73 commented Feb 24, 2026

Plan vs Implementation Review

Thanks for tackling #324! The core goal (fixing substring false positives) is right, but the implementation diverges from the approved plan in ways that introduce new issues. Here's a detailed comparison.

Approach Mismatch

The approved plan uses tokenization-based exact word matching (Set.has()), not \b word-boundary regex. Tokenization was chosen because \b still has edge cases with common English words used as keywords (e.g., 'oil', 'fed', 'house'), while tokenization eliminates ALL substring false positives by design:

"ambassador" → tokens: {"ambassador"} → has("assad")? NO ✓
"Assad regime" → tokens: {"assad","regime"} → has("assad")? YES ✓
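The two lines above can be reproduced with a deliberately simplified tokenizer. This is only a sketch: tokenSet is a hypothetical name, and it splits on every non-alphanumeric character, which is cruder than the tokenizeForMatch() in the plan file.

```typescript
// Simplified, hypothetical tokenizer: lowercase, split on non-alphanumerics,
// and collect the pieces into a Set for exact-word lookups.
function tokenSet(title: string): Set<string> {
  return new Set(title.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean));
}

console.log(tokenSet("French ambassador visits Paris").has("assad")); // false
console.log(tokenSet("Assad regime under pressure").has("assad"));    // true
console.log(tokenSet("Human rights groups protest").has("hts"));      // false
console.log(tokenSet("HTS forces advance").has("hts"));               // true
```

Because a lookup compares whole tokens, a keyword can never match inside a longer word, which is why short keywords such as 'hts' become safe.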

Issues

| # | Severity | Issue | Detail |
| --- | --- | --- | --- |
| 1 | Critical | Wrong approach | Uses \b regex, not tokenization. \b still has edge cases with common words like 'oil', 'fed', 'house' (still in the DC keyword list at geo.ts:84) |
| 2 | Critical | Removes 'hts' keyword | Replaced with 'tahrir al-sham'/'hayat tahrir' only. Headlines saying just "HTS" (very common: "HTS forces advance") no longer match Damascus. With tokenization, keeping 'hts' is safe, since tokens.has('hts') is false for a title whose only candidate token is "rights" |
| 3 | Critical | No regex cache: performance regression | new RegExp() created on EVERY call in the hot loop. DeckGLMap: 100 news × 33 hotspots × 8 keywords = 26,400 RegExp allocations per render cycle (runs every few minutes) |
| 4 | High | Changed shared includesKeyword() | Modified analysis-constants.ts:188, which affects analysis-core.ts:313,347 (correlation/signal generation, not geo-tagging). Plan explicitly creates a separate src/utils/keyword-match.ts to avoid regression in non-geo paths |
| 5 | High | No centralized utility | The escape+regex pattern is copy-pasted 10+ times across 8 files. If matching logic changes, every site needs updating again |
| 6 | Medium | Missing files | tech-hub-index.ts:221 and server-side get-risk-scores.ts not updated, so false positives persist there |
| 7 | Medium | 'us ' and 'house' not fixed | DC hotspot (geo.ts:84) still has 'us ' (trailing space hack for .includes()) and standalone 'house'. With \b regex, \bus \b may behave unexpectedly |
| 8 | Low | Tests lack integration coverage | All 20 tests are unit tests on the regex function. No tests against actual inferGeoHubsFromTitle(), normalizeCountryName(), or hotspot matching |
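To make issue 7 concrete, here is a small sketch of how a keyword with a trailing space behaves once it is wrapped in \b anchors. The headlines are invented for illustration, and the pattern is written out by hand instead of being taken from the PR's code.

```typescript
// The 'us ' keyword (with its trailing space) wrapped in word boundaries.
const usWithTrailingSpace = /\bus \b/i;

// Matches only when "us " is immediately followed by a word character.
console.log(usWithTrailingSpace.test("US troops deploy"));   // true
console.log(usWithTrailingSpace.test("Talks with the US"));  // false (no trailing space)
console.log(usWithTrailingSpace.test("US, allies respond")); // false (comma, not space)
console.log(usWithTrailingSpace.test("US - China talks"));   // false (space then "-": no boundary)

// Dropping the space does not help either: the pronoun "us" is a whole word.
const bareUs = /\bus\b/i;
console.log(bareUs.test("Join us for live coverage"));       // true (false positive)
```

The last case is the kind of common-word edge case the review refers to: a word boundary alone cannot tell "US" the country from "us" the pronoun.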

What the PR Gets Right

  • Correct file coverage for core geo-tagging paths (8 of 10 files)
  • App.ts:findFlashLocation() included
  • Solid test cases for the "ambassador"/"assad" false positive
  • escapeRegex() used consistently for safety
  • Conflict-topic .includes() in DeckGLMap correctly converted

Recommended Changes

Per the approved plan (/plans/dapper-tinkering-engelbart.md):

  1. Create src/utils/keyword-match.ts with tokenizeForMatch() + matchKeyword() — single source of truth, tokenize once per title then O(1) Set lookups
  2. Keep 'hts' in Damascus keywords — tokenization makes it safe (no "rights" false positive)
  3. Tokenize once per title in hot loops, reuse across all hotspot keyword checks (faster than 26K regex allocations)
  4. Don't touch analysis-constants.ts — isolate geo-matching to avoid blast radius in analysis-core
  5. Add tech-hub-index.ts and get-risk-scores.ts to scope
  6. Remove 'us ' and 'house' from DC hotspot keywords
  7. Add integration tests for inferGeoHubsFromTitle() and normalizeCountryName()

The plan file has the full tokenizeForMatch() and matchKeyword() implementation with contiguous phrase matching for multi-word keywords.
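The plan file itself is not reproduced in this thread, so the following is only a guess at what contiguous phrase matching could look like. Tokens, tokenize, and matchKeywordSketch are hypothetical names, and the tokenizer is simplified to split on every non-alphanumeric character.

```typescript
// Hypothetical shape: a Set for single-word lookups, an array for phrase order.
interface Tokens {
  words: Set<string>;
  ordered: string[];
}

// Simplified tokenizer (assumption): lowercase and split on non-alphanumerics.
function tokenize(title: string): Tokens {
  const ordered = title.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  return { words: new Set(ordered), ordered };
}

// Single-word keywords use the Set; multi-word keywords must appear as a
// contiguous run of tokens in the same order.
function matchKeywordSketch(tokens: Tokens, keyword: string): boolean {
  const parts = keyword.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  const first = parts[0];
  if (first === undefined) return false;
  if (parts.length === 1) return tokens.words.has(first);
  for (let start = 0; start + parts.length <= tokens.ordered.length; start++) {
    if (parts.every((part, offset) => tokens.ordered[start + offset] === part)) {
      return true;
    }
  }
  return false;
}

const title = tokenize("Hayat Tahrir al-Sham seizes town"); // tokenize once
console.log(matchKeywordSketch(title, "tahrir al-sham"));   // true
console.log(matchKeywordSketch(title, "hts"));              // false
console.log(matchKeywordSketch(tokenize("White paper on house prices"), "white house")); // false
```

Tokenizing once and reusing the result for every keyword is what removes the per-keyword RegExp allocation from the hot loop.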

…73#324)

Replace word-boundary regex with tokenization + Set lookups per approved plan:
- Create src/utils/keyword-match.ts as single source of truth
- Tokenize titles once, O(1) Set.has() per keyword (no RegExp allocations)
- Restore 'hts' keyword for Damascus (safe with tokenization)
- Revert shared includesKeyword() in analysis-constants.ts
- Remove 'us ' trailing-space hack and bare 'house' from DC keywords
- Add tech-hub-index.ts to scope (was missing)
- Add integration tests for inferGeoHubsFromTitle flow

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@princelevant
Author

Hey @koala73 — this is a great initiative and I'm happy to contribute early on. The impact is huge. Thank you for the quick and prompt responses!

Here's the fix based on your feedback:

Changes in this revision:

  • Tokenization over regex — replaced all \b regex matching with tokenizeForMatch() + Set.has() lookups per the approved plan. Titles are tokenized once, then O(1) keyword checks — zero RegExp allocations in hot loops
  • Centralized utility — new src/utils/keyword-match.ts as single source of truth, all 10 files import from it
  • Restored 'hts' in Damascus keywords — tokenization makes it safe (no more "rights"/"flights" false positives)
  • Reverted analysis-constants.ts: includesKeyword() is back to its original implementation, so geo-matching is now fully isolated
  • Added tech-hub-index.ts to scope (was missing)
  • Removed 'us ' and 'house' from DC hotspot keywords
  • Integration tests added for the full inferGeoHubsFromTitle flow (41 tests, all passing)

Let me know if anything else needs adjusting. Yalla! 🚀

— KT

@koala73
Owner

koala73 commented Feb 24, 2026

Hey @princelevant — great improvement switching to tokenization! The architecture is now aligned with the approved plan: keyword-match.ts as single source of truth, Set.has() for O(1) lookups, contiguous phrase matching for multi-word keywords. Nice work.

One critical issue remaining before we can merge:


🔴 CRITICAL: Possessive forms produce false negatives

The tokenizer splits on /[^a-z0-9'-]+/ — preserving apostrophes within words. This means possessive headlines (extremely common in news) miss genuine matches:

"Assad's forces advance in Idlib"  → token: "assad's" → has("assad") = FALSE ❌
"Iran's nuclear program expands"   → token: "iran's"  → has("iran")  = FALSE ❌
"Putin's war enters new phase"     → token: "putin's" → has("putin") = FALSE ❌
"Trump's tariff plan draws criticism" → token: "trump's" → has("trump") = FALSE ❌

The approved plan specified compound + sub-part decomposition to handle this. After adding each cleaned token, split on /[^a-z0-9]+/ and add the sub-parts:

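// Return shape inferred from the function below: a Set for O(1) word
// lookups plus the cleaned tokens in their original order.
export interface TokenizedTitle {
  words: Set<string>;
  ordered: string[];
}

export function tokenizeForMatch(title: string): TokenizedTitle {
  const lower = title.toLowerCase();
  const words = new Set<string>();
  const ordered: string[] = [];
  for (const raw of lower.split(/\s+/)) {
    // Trim leading/trailing punctuation but keep inner apostrophes and hyphens
    const cleaned = raw.replace(/^[^a-z0-9]+|[^a-z0-9]+$/g, '');
    if (!cleaned) continue;
    words.add(cleaned);           // "assad's" as compound
    ordered.push(cleaned);
    for (const part of cleaned.split(/[^a-z0-9]+/)) {
      if (part) words.add(part);  // "assad", "s" as sub-parts
    }
  }
  return { words, ordered };
}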

This gives tokens.has("assad") === true even when the headline says "Assad's". Same fix covers hyphenated forms like "al-Shabaab" → sub-parts include "shabaab".

Please also add test cases for possessives — that's how this slipped through:

it('"assad" matches "Assad\'s forces advance"', () => {
  assert.equal(matchesAnyKeyword("Assad's forces advance in Idlib", ['assad']), true);
});

🟡 Minor items

  1. entity-index.ts still uses new RegExp(\b...\b) — acceptable since it needs match position, but worth a comment explaining the deviation.
  2. PR title & description still reference "word-boundary regex" from commit 1. Update to reflect the tokenization approach.
  3. server/.../get-risk-scores.ts was in the plan (needs inline copy of tokenization) — can be a follow-up PR if you prefer.
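For the entity-index.ts case in item 1, where a regex is kept because the match position is needed, one way to avoid repeated allocations is a small cache keyed by keyword. This is a sketch under assumptions: regexCache, getWordBoundaryRegex, and findKeywordPosition are hypothetical names, not code from the PR.

```typescript
// Hypothetical cache: one compiled RegExp per keyword, reused across calls.
const regexCache = new Map<string, RegExp>();

// Escape regex metacharacters so a keyword is matched literally.
function escapeRegex(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

function getWordBoundaryRegex(keyword: string): RegExp {
  let cached = regexCache.get(keyword);
  if (!cached) {
    // No "g" flag, so exec() carries no lastIndex state between calls.
    cached = new RegExp(`\\b${escapeRegex(keyword)}\\b`, "i");
    regexCache.set(keyword, cached);
  }
  return cached;
}

// Returns the index of the first whole-word match, or -1 when there is none.
function findKeywordPosition(text: string, keyword: string): number {
  const match = getWordBoundaryRegex(keyword).exec(text);
  return match ? match.index : -1;
}

console.log(findKeywordPosition("The Assad regime", "assad"));  // 4
console.log(findKeywordPosition("French ambassador", "assad")); // -1
console.log(regexCache.size);                                   // 1
```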

Everything else looks solid — the keyword data fixes in geo.ts, the analysis-constants.ts isolation, tokenize-once-reuse-across-hotspots pattern, and the test coverage. Just need that possessive fix and we're good to go. 👍


Development

Successfully merging this pull request may close these issues.

Geo-tagging uses substring matching, causing articles to be placed in wrong map regions