docs(data-contract): list fr/ja/pt/ru modes alongside de (#338) by darshan3131 · Pull Request #355 · santifer/career-ops

darshan3131 · 2026-04-17T15:50:42Z

What does this PR do?

Related issue

Type of change

Bug fix
New feature
Documentation / translation
Refactor (no behavior change)

Checklist

I have read CONTRIBUTING.md
I linked a related issue above (required for features and architecture changes)
My PR does not include personal data (CV, email, real names)
I ran node test-all.mjs and all tests pass
My changes respect the Data Contract (no modifications to user-layer files)
My changes align with the project roadmap

Questions? Join the Discord for faster feedback.

Summary by CodeRabbit

Chores
- Updated system configuration documentation to include additional language-mode file classifications.

coderabbitai · 2026-04-17T15:50:56Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0cdb19f2-1c89-4b4f-8db0-cf834e56ffb5

📥 Commits

Reviewing files that changed from the base of the PR and between f2b1ffa and 99e6fd8.

📒 Files selected for processing (1)

DATA_CONTRACT.md

📝 Walkthrough

Walkthrough

The pull request adds four new wildcard entries to the System Layer file mapping in DATA_CONTRACT.md, classifying additional language mode directories (French, Japanese, Portuguese, Russian) as system files, extending an existing pattern.

Changes

Cohort / File(s)	Summary
System Layer File Mapping `DATA_CONTRACT.md`	Added four wildcard entries (`modes/fr/`, `modes/ja/`, `modes/pt/`, `modes/ru/`) to classify additional language-mode directories as system files in the System Layer, following the existing `modes/de/*` pattern.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding French, Japanese, Portuguese, and Russian mode entries to the data contract documentation alongside the existing German entry.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@DATA_CONTRACT.md`:
- Around line 48-51: The DATA_CONTRACT now marks modes/fr/*, modes/ja/*,
modes/pt/*, and modes/ru/* as auto-updatable but update-system.mjs still only
includes modes/de/ in the SYSTEM_PATHS array; update SYSTEM_PATHS in
update-system.mjs to include the new language path patterns (e.g., "modes/fr/",
"modes/ja/", "modes/pt/", "modes/ru/" or glob equivalents) so these system-layer
files participate in update and rollback flows, ensuring you reference the
SYSTEM_PATHS constant and any code that iterates over it (e.g., the
updater/rollback logic) so the new mode directories are treated as system-layer,
non-user content.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a976019f-d9c4-4edb-8deb-dc4b9e28edc1

📥 Commits

Reviewing files that changed from the base of the PR and between 2051beb and de7b557.

📒 Files selected for processing (1)

DATA_CONTRACT.md

darshan3131 · 2026-04-18T07:06:12Z

Flag from CodeRabbit is addressed by #337 in this same PR series. #337 adds modes/fr, modes/ja, modes/pt, modes/ru (plus the missing scripts) to SYSTEM_PATHS in update-system.mjs, closing the contract drift.
Recommended merge order: #337 → #338 → #339. #337 lands the fix, #338 documents the contract, #339 clarifies scanner scope. Happy to squash or rebase if preferred.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

update-system.mjs (1)

47-70: ⚠️ Potential issue | 🟠 Major

Rollback will not remove newly introduced system files.

These newly added paths can be created by apply(), but rollback() only does git checkout <backup> -- <path> and ignores failures. If a path did not exist in the backup branch, checkout fails and the new file/directory remains tracked after rollback, so rollback is incomplete.

🛠️ Proposed rollback fix

     // Checkout system files from backup branch
     for (const path of SYSTEM_PATHS) {
       try {
         git('checkout', latest, '--', path);
       } catch {
-        // File may not have existed in backup
+        // File may not have existed in backup; remove newly introduced system files.
+        try {
+          git('rm', '-r', '--ignore-unmatch', '--', path);
+        } catch {
+          // Nothing to remove
+        }
       }
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@update-system.mjs` around lines 47 - 70, The rollback() in update-system.mjs
doesn't remove files that were newly created by apply(), leaving them tracked if
the backup branch lacked those paths; update rollback() to, for each path listed
(the same list used by apply()), attempt the existing git checkout <backup> --
<path>, and if that fails or the path is absent in the backup, explicitly remove
the newly introduced file(s) from the working tree/index (e.g., via git rm
--cached/--force for files and recursive removal for directories, or by running
a targeted git clean for untracked paths) so rollback fully reverts changes
created by apply(); modify the rollback() routine to check the checkout result
and handle cleanup for missing-backup paths accordingly.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@merge-tracker.mjs`:
- Around line 97-120: The tokenization in roleTokens is ASCII-only and yields
empty arrays for non-Latin roles, causing roleFuzzyMatch to bail early; fix
roleTokens by using a Unicode-aware regexp and normalization: replace the
character class /[^a-z0-9\s]/g with /[^\p{L}\p{N}\s]/gu and call
toLocaleLowerCase() instead of toLowerCase() so tokens like "ソフトウェアエンジニア" and
"Инженер данных" are preserved; keep the existing split(/\s+/) and
ROLE_STOPWORDS check but ensure ROLE_STOPWORDS entries are normalized to the
same locale-lowercase form when populated so comparisons align.

In `@update-system.mjs`:
- Line 49: SYSTEM_PATHS currently includes 'modes/interview-prep.md' but
DATA_CONTRACT.md does not list that mode; fix by either adding
"modes/interview-prep.md" to the system-layer section of DATA_CONTRACT.md so it
matches SYSTEM_PATHS, or remove 'modes/interview-prep.md' from SYSTEM_PATHS in
update-system.mjs to keep it a user-managed file; update whichever file you
change (DATA_CONTRACT.md or SYSTEM_PATHS constant in update-system.mjs) so both
sources align and the contract no longer drifts.

---

Outside diff comments:
In `@update-system.mjs`:
- Around line 47-70: The rollback() in update-system.mjs doesn't remove files
that were newly created by apply(), leaving them tracked if the backup branch
lacked those paths; update rollback() to, for each path listed (the same list
used by apply()), attempt the existing git checkout <backup> -- <path>, and if
that fails or the path is absent in the backup, explicitly remove the newly
introduced file(s) from the working tree/index (e.g., via git rm
--cached/--force for files and recursive removal for directories, or by running
a targeted git clean for untracked paths) so rollback fully reverts changes
created by apply(); modify the rollback() routine to check the checkout result
and handle cleanup for missing-backup paths accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 65adeb82-05cc-497b-8e39-73154dcbca76

📥 Commits

Reviewing files that changed from the base of the PR and between de7b557 and f2b1ffa.

📒 Files selected for processing (6)

DATA_CONTRACT.md
VERSION
merge-tracker.mjs
modes/scan.md
package.json
update-system.mjs

coderabbitai · 2026-04-18T07:14:29Z

+function roleTokens(s) {
+  return s
+    .toLowerCase()
+    .replace(/[^a-z0-9\s]/g, ' ')
+    .split(/\s+/)
+    .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
+}
+
 function roleFuzzyMatch(a, b) {
-  const wordsA = a.toLowerCase().split(/\s+/).filter(w => w.length > 3);
-  const wordsB = b.toLowerCase().split(/\s+/).filter(w => w.length > 3);
-  const overlap = wordsA.filter(w => wordsB.some(wb => wb.includes(w) || w.includes(wb)));
-  return overlap.length >= 2;
+  const wordsA = roleTokens(a);
+  const wordsB = roleTokens(b);
+  if (wordsA.length === 0 || wordsB.length === 0) return false;
+
+  const setB = new Set(wordsB);
+  const overlap = wordsA.filter(w => setB.has(w)).length;
+  if (overlap === 0) return false;
+
+  // Jaccard-style ratio on content tokens. Two roles are "the same" only
+  // when the overlap dominates the smaller side — not when they just share
+  // a location + "engineer".
+  const minLen = Math.min(wordsA.length, wordsB.length);
+  const ratio = overlap / minLen;
+
+  return overlap >= 2 && ratio >= 0.6;


⚠️ Potential issue | 🟠 Major

❓ Verification inconclusive

Script executed:

#!/bin/bash # Description: Demonstrate that current ASCII-only tokenization drops non-English role signal. # Expected: Japanese/Russian examples show empty or near-empty current tokens. node --input-type=module <<'NODE' const ROLE_STOPWORDS = new Set(['senior', 'remote', 'role', 'position', 'team', 'based']); function currentRoleTokens(s) { return s .toLowerCase() .replace(/[^a-z0-9\s]/g, ' ') .split(/\s+/) .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w)); } function unicodeRoleTokens(s) { return s .normalize('NFKC') .toLocaleLowerCase() .replace(/[^\p{L}\p{N}\s]/gu, ' ') .split(/\s+/u) .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w)); } for (const title of [ 'Développeur logiciel senior', 'Engenheiro de Dados', 'ソフトウェアエンジニア', 'Инженер данных', ]) { console.log(title); console.log(' current:', currentRoleTokens(title)); console.log(' unicode:', unicodeRoleTokens(title)); } NODE

Repository: santifer/career-ops

Repository: santifer/career-ops
Exit code: 0

stdout:

Développeur logiciel senior current: [ 'veloppeur', 'logiciel' ] unicode: [ 'développeur', 'logiciel' ] Engenheiro de Dados current: [ 'engenheiro', 'dados' ] unicode: [ 'engenheiro', 'dados' ] ソフトウェアエンジニア current: [] unicode: [ 'ソフトウェアエンジニア' ] Инженер данных current: [] unicode: [ 'инженер', 'данных' ]

Unicode-aware tokenization required for non-Latin role matching.

Current ASCII-only tokenization (/[^a-z0-9\s]/g) produces empty token arrays for Japanese and Russian role titles, breaking deduplication. Example outputs:

Japanese "ソフトウェアエンジニア" → current: [] | proposed: ['ソフトウェアエンジニア']

Russian "Инженер данных" → current: [] | proposed: ['инженер', 'данных']

Line 108 returns false when either token set is empty, so duplicate entries in language-mode trackers go undetected.

Proposed fix

function roleTokens(s) { return s - .toLowerCase() - .replace(/[^a-z0-9\s]/g, ' ') - .split(/\s+/) + .normalize('NFKC') + .toLocaleLowerCase() + .replace(/[^\p{L}\p{N}\s]/gu, ' ') + .split(/\s+/u) .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w)); }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

function roleTokens(s) {

return s

.toLowerCase()

.replace(/[^a-z0-9\s]/g, ' ')

.split(/\s+/)

.filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));

}

function roleFuzzyMatch(a, b) {

const wordsA = a.toLowerCase().split(/\s+/).filter(w => w.length > 3);

const wordsB = b.toLowerCase().split(/\s+/).filter(w => w.length > 3);

const overlap = wordsA.filter(w => wordsB.some(wb => wb.includes(w) || w.includes(wb)));

return overlap.length >= 2;

const wordsA = roleTokens(a);

const wordsB = roleTokens(b);

if (wordsA.length === 0 || wordsB.length === 0) return false;

const setB = new Set(wordsB);

const overlap = wordsA.filter(w => setB.has(w)).length;

if (overlap === 0) return false;

// Jaccard-style ratio on content tokens. Two roles are "the same" only

// when the overlap dominates the smaller side — not when they just share

// a location + "engineer".

const minLen = Math.min(wordsA.length, wordsB.length);

const ratio = overlap / minLen;

return overlap >= 2 && ratio >= 0.6;

function roleTokens(s) {

return s

.normalize('NFKC')

.toLocaleLowerCase()

.replace(/[^\p{L}\p{N}\s]/gu, ' ')

.split(/\s+/u)

.filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));

}

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@merge-tracker.mjs` around lines 97 - 120, The tokenization in roleTokens is ASCII-only and yields empty arrays for non-Latin roles, causing roleFuzzyMatch to bail early; fix roleTokens by using a Unicode-aware regexp and normalization: replace the character class /[^a-z0-9\s]/g with /[^\p{L}\p{N}\s]/gu and call toLocaleLowerCase() instead of toLowerCase() so tokens like "ソフトウェアエンジニア" and "Инженер данных" are preserved; keep the existing split(/\s+/) and ROLE_STOPWORDS check but ensure ROLE_STOPWORDS entries are normalized to the same locale-lowercase form when populated so comparisons align.

coderabbitai · 2026-04-18T07:14:30Z

  'modes/training.md',
+  'modes/patterns.md',
+  'modes/followup.md',
+  'modes/interview-prep.md',


⚠️ Potential issue | 🟡 Minor

Document modes/interview-prep.md in the data contract.

SYSTEM_PATHS now auto-updates modes/interview-prep.md, but DATA_CONTRACT.md does not list that mode alongside the other system-layer mode files. Add it there too, or remove it from updater scope to avoid contract drift. As per coding guidelines, “DATA_CONTRACT.md: This file defines system vs user file boundaries.”

📄 Proposed contract alignment

| `modes/patterns.md` | Pattern analysis instructions | | `modes/followup.md` | Follow-up cadence instructions | +| `modes/interview-prep.md` | Interview preparation instructions | | `modes/de/*` | German language modes |

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@update-system.mjs` at line 49, SYSTEM_PATHS currently includes 'modes/interview-prep.md' but DATA_CONTRACT.md does not list that mode; fix by either adding "modes/interview-prep.md" to the system-layer section of DATA_CONTRACT.md so it matches SYSTEM_PATHS, or remove 'modes/interview-prep.md' from SYSTEM_PATHS in update-system.mjs to keep it a user-managed file; update whichever file you change (DATA_CONTRACT.md or SYSTEM_PATHS constant in update-system.mjs) so both sources align and the contract no longer drifts.

github-actions Bot added the 🔴 core-architecture label Apr 17, 2026

coderabbitai Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread DATA_CONTRACT.md

darshan3131 force-pushed the fix/338-data-contract-langs branch from de7b557 to f2b1ffa Compare April 18, 2026 07:10

github-actions Bot added ⚠️ agent-behavior 🔧 scripts 📦 dependencies labels Apr 18, 2026

darshan3131 force-pushed the fix/338-data-contract-langs branch from f2b1ffa to de7b557 Compare April 18, 2026 07:12

coderabbitai Bot reviewed Apr 18, 2026

View reviewed changes

docs(data-contract): list fr/ja/pt/ru modes alongside de (santifer#338)

99e6fd8

darshan3131 force-pushed the fix/338-data-contract-langs branch from de7b557 to 99e6fd8 Compare April 18, 2026 07:21

darshan3131 added 2 commits April 18, 2026 12:54

Merge branch 'main' into fix/338-data-contract-langs

6d362a0

Merge branch 'main' into fix/338-data-contract-langs

7f551f1

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs(data-contract): list fr/ja/pt/ru modes alongside de (#338)#355

docs(data-contract): list fr/ja/pt/ru modes alongside de (#338)#355
darshan3131 wants to merge 3 commits intosantifer:mainfrom
darshan3131:fix/338-data-contract-langs

darshan3131 commented Apr 17, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

darshan3131 commented Apr 18, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Uh oh!

coderabbitai Bot Apr 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

darshan3131 commented Apr 17, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Related issue

Type of change

Checklist

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

darshan3131 commented Apr 18, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Apr 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

darshan3131 commented Apr 17, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Apr 17, 2026 •

edited

Loading