fix(merge-tracker): filter seniority+location stopwords and require o… by darshan3131 · Pull Request #356 · santifer/career-ops

darshan3131 · 2026-04-17T15:51:01Z

…verlap ratio in roleFuzzyMatch (#329)

What does this PR do?

Related issue

Type of change

Bug fix
New feature
Documentation / translation
Refactor (no behavior change)

Checklist

I have read CONTRIBUTING.md
I linked a related issue above (required for features and architecture changes)
My PR does not include personal data (CV, email, real names)
I ran node test-all.mjs and all tests pass
My changes respect the Data Contract (no modifications to user-layer files)
My changes align with the project roadmap

Summary by CodeRabbit

Improvements
- Improved role matching and deduplication by filtering out non-essential terms: geographic locations, seniority levels, work arrangements, and generic job descriptors. Role comparisons now require stronger token overlap, which reduces false positives in role similarity detection.

coderabbitai · 2026-04-17T15:51:16Z

📝 Walkthrough

Refines role string comparison logic in merge-tracker.mjs by introducing a ROLE_STOPWORDS set and roleTokens() helper function to normalize and filter role strings, then updates roleFuzzyMatch() to base matching solely on filtered tokens with stricter criteria (minimum 2 overlapping tokens and overlap ratio of at least 0.6).

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Role Matching Logic `merge-tracker.mjs` | Added `ROLE_STOPWORDS` set containing seniority levels, work-mode terms, generic job words, and locations. Introduced `roleTokens()` helper to normalize role strings via lowercasing, punctuation removal, word splitting, and stopword filtering. Modified `roleFuzzyMatch()` to use exact token matching with `Set` overlap calculation, enforcing minimum 2 overlapping tokens and 0.6 overlap-to-length ratio instead of prior substring/partial-word matching. |
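
To make the behavior change in the summary concrete, here is a minimal sketch comparing the prior substring-based matcher with the token-based one. It is illustrative only: the stopword set is abridged (the PR's `ROLE_STOPWORDS` is larger), and `oldRoleFuzzyMatch` is a hypothetical name for the pre-PR logic.

```javascript
// Abridged stopword set for illustration only; the PR's list is longer.
const ROLE_STOPWORDS = new Set(['senior', 'junior', 'remote', 'hybrid', 'bangalore', 'london']);

// Pre-PR behavior (hypothetical name): substring matching on words longer than 3 characters.
function oldRoleFuzzyMatch(a, b) {
  const wordsA = a.toLowerCase().split(/\s+/).filter(w => w.length > 3);
  const wordsB = b.toLowerCase().split(/\s+/).filter(w => w.length > 3);
  const overlap = wordsA.filter(w => wordsB.some(wb => wb.includes(w) || w.includes(wb)));
  return overlap.length >= 2;
}

// Normalize a role string into content tokens, dropping stopwords.
function roleTokens(s) {
  return s
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
}

// Post-PR behavior: exact token overlap plus a ratio against the smaller side.
function roleFuzzyMatch(a, b) {
  const wordsA = roleTokens(a);
  const wordsB = roleTokens(b);
  if (wordsA.length === 0 || wordsB.length === 0) return false;
  const setB = new Set(wordsB);
  const overlap = wordsA.filter(w => setB.has(w)).length;
  const ratio = overlap / Math.min(wordsA.length, wordsB.length);
  return overlap >= 2 && ratio >= 0.6;
}

const a = 'Senior Backend Engineer Remote Bangalore';
const b = 'Senior Frontend Engineer Remote Bangalore';
console.log(oldRoleFuzzyMatch(a, b)); // true  (senior, engineer, remote, bangalore all overlap)
console.log(roleFuzzyMatch(a, b));    // false (only "engineer" survives as a shared token)
```

In this sketch the two different roles match under the old logic purely because they share seniority, work mode, and location, which is the false positive the PR targets.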

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main changes: filtering seniority and location stopwords and requiring overlap ratio in roleFuzzyMatch. |


coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@merge-tracker.mjs`:
- Around line 105-121: The roleFuzzyMatch function is too strict with its
current hard requirement overlap >= 2; modify roleFuzzyMatch (and use
roleTokens) so that the overlap threshold is conditional on the smaller token
count: compute wordsA, wordsB, minLen and ratio as before, but require overlap
>= 2 only when minLen >= 3; otherwise accept overlap >= 1 provided the ratio >=
0.6 (or when minLen === 1 require ratio === 1 to avoid weak single-token
matches). Update the return logic to reflect these branches and add unit tests
for the example pairs ("Backend Engineer" vs "Backend Developer", "Frontend
Engineer" vs "Frontend Developer", "Senior Engineer" vs "Engineer", "Software
Engineer" vs "Software Developer") to ensure deduplication behaves as expected.
- Around line 97-103: roleTokens currently drops tokens shorter than 4
characters which removes valid role signals (e.g. qa, ios, api) and causes
roleFuzzyMatch to undercount overlap; change the filter in roleTokens (function
roleTokens) to allow 2–3 character tokens (e.g. use length >= 2) or implement a
whitelist of short role tokens to retain, ensuring the stopword set remains
applied so noisy short words are still removed; update any tests or callers
relying on roleTokens (and re-evaluate roleFuzzyMatch overlap logic if needed)
to prevent duplicate entries in applications.md.
- Around line 76-95: ROLE_STOPWORDS contains a duplicate 'intern' entry; remove
the redundant string so the set literal is deduplicated (edit the ROLE_STOPWORDS
array used to construct the Set), or replace the literal with a deduplicated
source (e.g., build the array and run Array.from(new Set(...)) before passing
into the Set) to ensure each stopword appears only once; refer to the
ROLE_STOPWORDS constant in this module when making the change.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 71abb765-ac43-4835-a987-e2613f0e2e0b

📥 Commits

Reviewing files that changed from the base of the PR and between 2051beb and dc147bc.

📒 Files selected for processing (1)

merge-tracker.mjs

coderabbitai · 2026-04-17T15:53:15Z

+const ROLE_STOPWORDS = new Set([
+  // seniority / level
+  'junior', 'mid', 'middle', 'senior', 'staff', 'principal', 'lead', 'head',
+  'chief', 'associate', 'intern', 'entry', 'level',
+  // contract / mode
+  'remote', 'hybrid', 'onsite', 'contract', 'contractor', 'freelance',
+  'fulltime', 'parttime', 'permanent', 'temporary', 'intern', 'internship',
+  // generic job words
+  'role', 'position', 'opportunity', 'team', 'based',
+  // very common locations (extend in portals.yml later if needed)
+  'bangalore', 'bengaluru', 'mumbai', 'delhi', 'hyderabad', 'pune', 'chennai',
+  'london', 'berlin', 'paris', 'madrid', 'barcelona', 'amsterdam', 'dublin',
+  'york', 'francisco', 'seattle', 'boston', 'austin', 'chicago', 'toronto',
+  'tokyo', 'singapore', 'sydney', 'melbourne', 'lisbon', 'warsaw',
+  // regions / countries
+  'europe', 'emea', 'apac', 'latam', 'americas', 'india', 'spain', 'germany',
+  'france', 'italy', 'canada', 'brazil', 'mexico', 'japan',
+  // prepositions leaking through length filter
+  'with', 'from', 'into', 'over', 'this', 'that',
+]);


⚠️ Potential issue | 🟡 Minor

Duplicate 'intern' in ROLE_STOPWORDS.

'intern' appears on both line 79 (seniority block) and line 82 (contract block). Harmless in a Set but indicates the list wasn't deduplicated.
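
As a quick illustration of why the duplicate is harmless at runtime, the sketch below shows a `Set` collapsing repeated literals. `findDuplicates` is a hypothetical helper, not part of the PR, that could flag repeats in a source array before it is passed to the `Set` constructor.

```javascript
// A Set stores each value once, so repeated literals collapse silently.
const words = new Set(['intern', 'entry', 'intern', 'internship']);
console.log(words.size);          // 3
console.log(words.has('intern')); // true

// Hypothetical helper: list values that appear more than once in a source array.
function findDuplicates(items) {
  const seen = new Set();
  const dupes = new Set();
  for (const item of items) {
    if (seen.has(item)) dupes.add(item);
    seen.add(item);
  }
  return [...dupes];
}

console.log(findDuplicates(['intern', 'entry', 'intern', 'internship'])); // [ 'intern' ]
```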

🧹 Proposed fix

   // contract / mode
   'remote', 'hybrid', 'onsite', 'contract', 'contractor', 'freelance',
-  'fulltime', 'parttime', 'permanent', 'temporary', 'intern', 'internship',
+  'fulltime', 'parttime', 'permanent', 'temporary', 'internship',

coderabbitai · 2026-04-17T15:53:16Z

+function roleTokens(s) {
+  return s
+    .toLowerCase()
+    .replace(/[^a-z0-9\s]/g, ' ')
+    .split(/\s+/)
+    .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
+}


⚠️ Potential issue | 🟠 Major

length > 3 filter discards meaningful short role tokens.

Common role signals are 2–3 characters and get silently stripped: qa, ml, ai, ux, ui, ios, sre, dev, web, api, aws, gcp, sql, php, go. Combined with the new overlap >= 2 requirement in roleFuzzyMatch, roles like "QA Engineer" vs "QA Engineer" reduce to {engineer} on both sides (overlap = 1) and fail to match — producing duplicate entries in applications.md.

Consider lowering the threshold to length >= 2 (the stopword set already filters the noisy short words like mid, lead, etc.), or whitelisting a set of known short role tokens.

♻️ Proposed fix

-    .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
+    .filter(w => w.length >= 2 && !ROLE_STOPWORDS.has(w));
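
The effect of the two thresholds can be compared side by side. The sketch below is an assumption-laden illustration: `roleTokensWithMin` is a hypothetical parameterized variant of `roleTokens`, and the stopword set is abridged.

```javascript
// Abridged stopword set for illustration; the PR's ROLE_STOPWORDS is larger.
const ROLE_STOPWORDS = new Set(['senior', 'mid', 'lead', 'remote']);

// Hypothetical variant of roleTokens that takes the minimum token length as a parameter.
function roleTokensWithMin(s, minLength) {
  return s
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(w => w.length >= minLength && !ROLE_STOPWORDS.has(w));
}

// minLength 4 is equivalent to the current `length > 3`; 2 is the proposed threshold.
for (const role of ['QA Engineer', 'iOS Developer', 'Mid ML Engineer']) {
  console.log(role, roleTokensWithMin(role, 4), roleTokensWithMin(role, 2));
}
// QA Engineer     [ 'engineer' ]  [ 'qa', 'engineer' ]
// iOS Developer   [ 'developer' ] [ 'ios', 'developer' ]
// Mid ML Engineer [ 'engineer' ]  [ 'ml', 'engineer' ]
```

One caveat with the lower threshold: short function words such as "of", "in", "and", or "for" are not in the stopword list shown in this PR, so they would pass a `length >= 2` filter unless they are added.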

coderabbitai · 2026-04-17T15:53:16Z

 function roleFuzzyMatch(a, b) {
-  const wordsA = a.toLowerCase().split(/\s+/).filter(w => w.length > 3);
-  const wordsB = b.toLowerCase().split(/\s+/).filter(w => w.length > 3);
-  const overlap = wordsA.filter(w => wordsB.some(wb => wb.includes(w) || w.includes(wb)));
-  return overlap.length >= 2;
+  const wordsA = roleTokens(a);
+  const wordsB = roleTokens(b);
+  if (wordsA.length === 0 || wordsB.length === 0) return false;
+
+  const setB = new Set(wordsB);
+  const overlap = wordsA.filter(w => setB.has(w)).length;
+  if (overlap === 0) return false;
+
+  // Jaccard-style ratio on content tokens. Two roles are "the same" only
+  // when the overlap dominates the smaller side — not when they just share
+  // a location + "engineer".
+  const minLen = Math.min(wordsA.length, wordsB.length);
+  const ratio = overlap / minLen;
+
+  return overlap >= 2 && ratio >= 0.6;
 }


⚠️ Potential issue | 🟠 Major

overlap >= 2 is too strict for short role strings and will under-dedupe.

Many legitimate duplicates share at most one signal token after stopword filtering, so they can never reach overlap >= 2:

"Senior Software Engineer" vs "Software Engineer" → {software, engineer} both sides → overlap = 2 ✅ (ok)

"Backend Engineer" vs "Backend Developer" → {backend, engineer} vs {backend, developer} → overlap = 1 ❌

"Frontend Engineer" vs "Frontend Developer" → overlap = 1 ❌

"Senior Engineer" vs "Engineer" → {engineer} vs {engineer} → overlap = 1 ❌

"Software Engineer" vs "Software Developer" → overlap = 1 ❌

The dedup path at line 307 will now miss these and write duplicate rows into applications.md. Options:

Require overlap >= 2 only when min(|wordsA|, |wordsB|) >= 3; otherwise accept overlap >= 1 with ratio >= 0.6 (or ratio == 1 on the smaller side).

Treat engineer/developer (and similar developer|engineer|programmer) as synonyms before overlap.

♻️ Suggested adjustment

-  const minLen = Math.min(wordsA.length, wordsB.length);
-  const ratio = overlap / minLen;
-
-  return overlap >= 2 && ratio >= 0.6;
+  const minLen = Math.min(wordsA.length, wordsB.length);
+  const ratio = overlap / minLen;
+
+  // For very short token sets (1–2 signal tokens per side), a single full
+  // overlap is meaningful; only require >=2 when both sides are richer.
+  const minOverlap = minLen >= 3 ? 2 : 1;
+  return overlap >= minOverlap && ratio >= 0.6;

Worth adding a couple of unit tests covering the pairs above to lock this behavior in.
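
As a starting point for such tests, the sketch below runs the example pairs through the adjusted logic. It assumes an abridged stopword set and the current `length > 3` token filter, so results against the real module may differ.

```javascript
// Abridged stopword set for illustration; the PR's ROLE_STOPWORDS is larger.
const ROLE_STOPWORDS = new Set(['senior', 'junior', 'staff', 'remote']);

function roleTokens(s) {
  return s
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
}

// roleFuzzyMatch with the suggested conditional minimum overlap.
function roleFuzzyMatch(a, b) {
  const wordsA = roleTokens(a);
  const wordsB = roleTokens(b);
  if (wordsA.length === 0 || wordsB.length === 0) return false;
  const setB = new Set(wordsB);
  const overlap = wordsA.filter(w => setB.has(w)).length;
  if (overlap === 0) return false;
  const minLen = Math.min(wordsA.length, wordsB.length);
  const ratio = overlap / minLen;
  const minOverlap = minLen >= 3 ? 2 : 1;
  return overlap >= minOverlap && ratio >= 0.6;
}

const pairs = [
  ['Senior Software Engineer', 'Software Engineer'],
  ['Senior Engineer', 'Engineer'],
  ['Backend Engineer', 'Backend Developer'],
  ['Software Engineer', 'Software Developer'],
];

const results = pairs.map(([a, b]) => roleFuzzyMatch(a, b));
pairs.forEach(([a, b], i) => console.log(`${a} vs ${b}: ${results[i]}`));
// Senior Software Engineer vs Software Engineer: true
// Senior Engineer vs Engineer: true
// Backend Engineer vs Backend Developer: false
// Software Engineer vs Software Developer: false
```

Under these assumptions, the conditional threshold fixes the "Senior Engineer" vs "Engineer" case, but the engineer/developer pairs still fail the `ratio >= 0.6` check (1 shared token out of 2 gives 0.5). Matching those would likely need the synonym option or a lower ratio for two-token roles.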

fix(merge-tracker): filter seniority+location stopwords and require o…

dc147bc

…verlap ratio in roleFuzzyMatch (santifer#329)

github-actions Bot added the 🔧 scripts label Apr 17, 2026

coderabbitai Bot reviewed Apr 17, 2026

Merge branch 'main' into fix/329-rolefuzzy-stopwords

3267aba

darshan3131 wants to merge 2 commits into santifer:main from darshan3131:fix/329-rolefuzzy-stopwords
