docs(data-contract): list fr/ja/pt/ru modes alongside de (#338) #355

Open
darshan3131 wants to merge 3 commits into santifer:main from darshan3131:fix/338-data-contract-langs

Conversation

@darshan3131
Contributor

@darshan3131 darshan3131 commented Apr 17, 2026

What does this PR do?

Related issue

Type of change

  • Bug fix
  • New feature
  • Documentation / translation
  • Refactor (no behavior change)

Checklist

  • I have read CONTRIBUTING.md
  • I linked a related issue above (required for features and architecture changes)
  • My PR does not include personal data (CV, email, real names)
  • I ran node test-all.mjs and all tests pass
  • My changes respect the Data Contract (no modifications to user-layer files)
  • My changes align with the project roadmap

Questions? Join the Discord for faster feedback.

Summary by CodeRabbit

  • Chores
    • Updated system configuration documentation to include additional language-mode file classifications.

@coderabbitai

coderabbitai Bot commented Apr 17, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 0cdb19f2-1c89-4b4f-8db0-cf834e56ffb5

📥 Commits

Reviewing files that changed from the base of the PR and between f2b1ffa and 99e6fd8.

📒 Files selected for processing (1)
  • DATA_CONTRACT.md

📝 Walkthrough

Walkthrough

The pull request adds four new wildcard entries to the System Layer file mapping in DATA_CONTRACT.md, classifying additional language mode directories (French, Japanese, Portuguese, Russian) as system files, extending an existing pattern.

Changes

| Cohort / File(s) | Summary |
|---|---|
| System Layer File Mapping: `DATA_CONTRACT.md` | Added four wildcard entries (`modes/fr/*`, `modes/ja/*`, `modes/pt/*`, `modes/ru/*`) to classify additional language-mode directories as system files in the System Layer, following the existing `modes/de/*` pattern. |
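
To make the classification concrete, here is a minimal sketch of how a wildcard entry such as `modes/fr/*` could be applied to a file path. The names `SYSTEM_PATTERNS` and `isSystemPath` are hypothetical and do not come from the repository, and the matching rule (a trailing `/*` covers everything under that directory) is an assumption about how the contract is meant to be read.

```javascript
// Hypothetical pattern list, built from the entries named in the walkthrough.
const SYSTEM_PATTERNS = ['modes/de/*', 'modes/fr/*', 'modes/ja/*', 'modes/pt/*', 'modes/ru/*'];

// Assumption: a trailing "/*" covers every file under that directory;
// any other pattern has to match the path exactly.
function isSystemPath(filePath, patterns = SYSTEM_PATTERNS) {
  return patterns.some(pattern =>
    pattern.endsWith('/*')
      ? filePath.startsWith(pattern.slice(0, -1)) // "modes/fr/*" -> prefix "modes/fr/"
      : filePath === pattern
  );
}

console.log(isSystemPath('modes/fr/scan.md')); // → true
console.log(isSystemPath('cv.md'));            // → false
```

Keeping the trailing slash in the prefix avoids treating a sibling directory such as `modes/frx/` as part of `modes/fr/`.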

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding French, Japanese, Portuguese, and Russian mode entries to the data contract documentation alongside the existing German entry. |


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@DATA_CONTRACT.md`:
- Around line 48-51: The DATA_CONTRACT now marks modes/fr/*, modes/ja/*,
modes/pt/*, and modes/ru/* as auto-updatable but update-system.mjs still only
includes modes/de/ in the SYSTEM_PATHS array; update SYSTEM_PATHS in
update-system.mjs to include the new language path patterns (e.g., "modes/fr/",
"modes/ja/", "modes/pt/", "modes/ru/" or glob equivalents) so these system-layer
files participate in update and rollback flows, ensuring you reference the
SYSTEM_PATHS constant and any code that iterates over it (e.g., the
updater/rollback logic) so the new mode directories are treated as system-layer,
non-user content.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: a976019f-d9c4-4edb-8deb-dc4b9e28edc1

📥 Commits

Reviewing files that changed from the base of the PR and between 2051beb and de7b557.

📒 Files selected for processing (1)
  • DATA_CONTRACT.md

Comment thread DATA_CONTRACT.md
@darshan3131
Contributor Author

Flag from CodeRabbit is addressed by #337 in this same PR series. #337 adds modes/fr, modes/ja, modes/pt, modes/ru (plus the missing scripts) to SYSTEM_PATHS in update-system.mjs, closing the contract drift.
Recommended merge order: #337, then #338, then #339. #337 lands the fix, #338 documents the contract, #339 clarifies scanner scope. Happy to squash or rebase if preferred.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
update-system.mjs (1)

47-70: ⚠️ Potential issue | 🟠 Major

Rollback will not remove newly introduced system files.

These newly added paths can be created by apply(), but rollback() only does git checkout <backup> -- <path> and ignores failures. If a path did not exist in the backup branch, checkout fails and the new file/directory remains tracked after rollback, so rollback is incomplete.

🛠️ Proposed rollback fix
     // Checkout system files from backup branch
     for (const path of SYSTEM_PATHS) {
       try {
         git('checkout', latest, '--', path);
       } catch {
-        // File may not have existed in backup
+        // File may not have existed in backup; remove newly introduced system files.
+        try {
+          git('rm', '-r', '--ignore-unmatch', '--', path);
+        } catch {
+          // Nothing to remove
+        }
       }
     }
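
The proposed fix can be exercised without a real repository. The sketch below is not the code in update-system.mjs: `rollbackSystemPaths` is a hypothetical name, and `git` is assumed to be an injected function that runs a git subcommand and throws on a non-zero exit. Like the proposal, it treats any checkout failure as "path absent from the backup", which may be too broad if checkout can fail for other reasons.

```javascript
// Sketch only: `git` is an injected function (hypothetical signature) that
// runs a git subcommand and throws when the command fails.
function rollbackSystemPaths(git, backupRef, systemPaths) {
  const restored = [];
  const removed = [];
  for (const path of systemPaths) {
    try {
      git('checkout', backupRef, '--', path);
      restored.push(path);
    } catch {
      // Assumed cause: the path did not exist in the backup, so the update
      // introduced it. Remove it from the index and the working tree.
      git('rm', '-r', '--ignore-unmatch', '--', path);
      removed.push(path);
    }
  }
  return { restored, removed };
}

// Usage with a fake git whose backup contains only modes/de/.
const backupContents = new Set(['modes/de/']);
const calls = [];
function fakeGit(...args) {
  calls.push(args.join(' '));
  if (args[0] === 'checkout' && !backupContents.has(args[3])) {
    throw new Error(`pathspec '${args[3]}' did not match any file(s)`);
  }
}

const result = rollbackSystemPaths(fakeGit, 'backup-branch', ['modes/de/', 'modes/fr/']);
console.log(result.restored, result.removed); // modes/de/ restored, modes/fr/ removed
```

Injecting `git` keeps the control flow testable; the real updater would pass its own wrapper around the git CLI.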
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@update-system.mjs` around lines 47 - 70, The rollback() in update-system.mjs
doesn't remove files that were newly created by apply(), leaving them tracked if
the backup branch lacked those paths; update rollback() to, for each path listed
(the same list used by apply()), attempt the existing git checkout <backup> --
<path>, and if that fails or the path is absent in the backup, explicitly remove
the newly introduced file(s) from the working tree/index (e.g., via git rm
--cached/--force for files and recursive removal for directories, or by running
a targeted git clean for untracked paths) so rollback fully reverts changes
created by apply(); modify the rollback() routine to check the checkout result
and handle cleanup for missing-backup paths accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@merge-tracker.mjs`:
- Around line 97-120: The tokenization in roleTokens is ASCII-only and yields
empty arrays for non-Latin roles, causing roleFuzzyMatch to bail early; fix
roleTokens by using a Unicode-aware regexp and normalization: replace the
character class /[^a-z0-9\s]/g with /[^\p{L}\p{N}\s]/gu and call
toLocaleLowerCase() instead of toLowerCase() so tokens like "ソフトウェアエンジニア" and
"Инженер данных" are preserved; keep the existing split(/\s+/) and
ROLE_STOPWORDS check but ensure ROLE_STOPWORDS entries are normalized to the
same locale-lowercase form when populated so comparisons align.

In `@update-system.mjs`:
- Line 49: SYSTEM_PATHS currently includes 'modes/interview-prep.md' but
DATA_CONTRACT.md does not list that mode; fix by either adding
"modes/interview-prep.md" to the system-layer section of DATA_CONTRACT.md so it
matches SYSTEM_PATHS, or remove 'modes/interview-prep.md' from SYSTEM_PATHS in
update-system.mjs to keep it a user-managed file; update whichever file you
change (DATA_CONTRACT.md or SYSTEM_PATHS constant in update-system.mjs) so both
sources align and the contract no longer drifts.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 65adeb82-05cc-497b-8e39-73154dcbca76

📥 Commits

Reviewing files that changed from the base of the PR and between de7b557 and f2b1ffa.

📒 Files selected for processing (6)
  • DATA_CONTRACT.md
  • VERSION
  • merge-tracker.mjs
  • modes/scan.md
  • package.json
  • update-system.mjs

Comment thread merge-tracker.mjs Outdated
Comment on lines +97 to +120
+function roleTokens(s) {
+  return s
+    .toLowerCase()
+    .replace(/[^a-z0-9\s]/g, ' ')
+    .split(/\s+/)
+    .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
+}
+
 function roleFuzzyMatch(a, b) {
-  const wordsA = a.toLowerCase().split(/\s+/).filter(w => w.length > 3);
-  const wordsB = b.toLowerCase().split(/\s+/).filter(w => w.length > 3);
-  const overlap = wordsA.filter(w => wordsB.some(wb => wb.includes(w) || w.includes(wb)));
-  return overlap.length >= 2;
+  const wordsA = roleTokens(a);
+  const wordsB = roleTokens(b);
+  if (wordsA.length === 0 || wordsB.length === 0) return false;
+
+  const setB = new Set(wordsB);
+  const overlap = wordsA.filter(w => setB.has(w)).length;
+  if (overlap === 0) return false;
+
+  // Jaccard-style ratio on content tokens. Two roles are "the same" only
+  // when the overlap dominates the smaller side, not when they just share
+  // a location + "engineer".
+  const minLen = Math.min(wordsA.length, wordsB.length);
+  const ratio = overlap / minLen;
+
+  return overlap >= 2 && ratio >= 0.6;
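
A short worked example may help when reading the threshold. The ratio above divides by the smaller token count (often called the overlap coefficient) rather than by the union, as the Jaccard index would. The helper `overlapScore` below is hypothetical and takes token arrays directly, so it sidesteps tokenization; the sample tokens are made up for illustration.

```javascript
// Hypothetical helper: scores two already-tokenized role titles using the
// same rule as the hunk above (overlap >= 2 and overlap / minLen >= 0.6).
function overlapScore(tokensA, tokensB) {
  const setB = new Set(tokensB);
  const overlap = tokensA.filter(t => setB.has(t)).length;
  const minLen = Math.min(tokensA.length, tokensB.length);
  const ratio = minLen === 0 ? 0 : overlap / minLen;
  return { overlap, ratio, match: overlap >= 2 && ratio >= 0.6 };
}

// Two different roles that share only "engineer" and a location.
console.log(overlapScore(
  ['staff', 'backend', 'engineer', 'berlin'],
  ['frontend', 'platform', 'engineer', 'berlin']
)); // → { overlap: 2, ratio: 0.5, match: false }

// The shorter title is fully contained in the longer one.
console.log(overlapScore(
  ['backend', 'engineer', 'berlin'],
  ['staff', 'backend', 'engineer', 'berlin']
)); // → { overlap: 3, ratio: 1, match: true }
```

With three-token titles, two shared tokens already give a ratio of about 0.67, so the rule is stricter for longer titles than for short ones.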

⚠️ Potential issue | 🟠 Major

❓ Verification inconclusive

Script executed:

#!/bin/bash
# Description: Demonstrate that current ASCII-only tokenization drops non-English role signal.
# Expected: Japanese/Russian examples show empty or near-empty current tokens.

node --input-type=module <<'NODE'
const ROLE_STOPWORDS = new Set(['senior', 'remote', 'role', 'position', 'team', 'based']);

function currentRoleTokens(s) {
  return s
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
}

function unicodeRoleTokens(s) {
  return s
    .normalize('NFKC')
    .toLocaleLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, ' ')
    .split(/\s+/u)
    .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
}

for (const title of [
  'Développeur logiciel senior',
  'Engenheiro de Dados',
  'ソフトウェアエンジニア',
  'Инженер данных',
]) {
  console.log(title);
  console.log(' current:', currentRoleTokens(title));
  console.log(' unicode:', unicodeRoleTokens(title));
}
NODE

Repository: santifer/career-ops


Exit code: 0

stdout:

Développeur logiciel senior
 current: [ 'veloppeur', 'logiciel' ]
 unicode: [ 'développeur', 'logiciel' ]
Engenheiro de Dados
 current: [ 'engenheiro', 'dados' ]
 unicode: [ 'engenheiro', 'dados' ]
ソフトウェアエンジニア
 current: []
 unicode: [ 'ソフトウェアエンジニア' ]
Инженер данных
 current: []
 unicode: [ 'инженер', 'данных' ]

Unicode-aware tokenization required for non-Latin role matching.

Current ASCII-only tokenization (/[^a-z0-9\s]/g) produces empty token arrays for Japanese and Russian role titles, breaking deduplication. Example outputs:

  • Japanese "ソフトウェアエンジニア" → current: [] | proposed: ['ソフトウェアエンジニア']
  • Russian "Инженер данных" → current: [] | proposed: ['инженер', 'данных']

Line 108 returns false when either token set is empty, so duplicate entries in language-mode trackers go undetected.

Proposed fix
 function roleTokens(s) {
   return s
-    .toLowerCase()
-    .replace(/[^a-z0-9\s]/g, ' ')
-    .split(/\s+/)
+    .normalize('NFKC')
+    .toLocaleLowerCase()
+    .replace(/[^\p{L}\p{N}\s]/gu, ' ')
+    .split(/\s+/u)
     .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
 }
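
To see how the proposed tokenizer interacts with the matching rule, the sketch below combines the two. It is an illustration, not the code in merge-tracker.mjs: `sameRole` is a hypothetical stand-in for `roleFuzzyMatch`, and the stopword list is copied from the verification script above, so the real list may differ.

```javascript
// Stopwords copied from the verification script; lowercased the same way as
// the tokens so that comparisons line up.
const ROLE_STOPWORDS = new Set(
  ['senior', 'remote', 'role', 'position', 'team', 'based'].map(w => w.toLocaleLowerCase())
);

function unicodeRoleTokens(s) {
  return s
    .normalize('NFKC')
    .toLocaleLowerCase()
    .replace(/[^\p{L}\p{N}\s]/gu, ' ') // keep letters and digits of any script
    .split(/\s+/u)
    .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
}

// Hypothetical stand-in for roleFuzzyMatch, using the same thresholds.
function sameRole(a, b) {
  const wordsA = unicodeRoleTokens(a);
  const wordsB = unicodeRoleTokens(b);
  if (wordsA.length === 0 || wordsB.length === 0) return false;
  const setB = new Set(wordsB);
  const overlap = wordsA.filter(w => setB.has(w)).length;
  const ratio = overlap / Math.min(wordsA.length, wordsB.length);
  return overlap >= 2 && ratio >= 0.6;
}

console.log(sameRole('Инженер данных', 'Старший инженер данных')); // → true
console.log(sameRole('ソフトウェアエンジニア', 'ソフトウェアエンジニア')); // → false
```

The second result shows a limit of whitespace tokenization: a title written without spaces yields a single token, so the `overlap >= 2` rule cannot match it even when both titles are identical. Handling that case would need a different segmentation step, which the review does not propose.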
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@merge-tracker.mjs` around lines 97 - 120, The tokenization in roleTokens is
ASCII-only and yields empty arrays for non-Latin roles, causing roleFuzzyMatch
to bail early; fix roleTokens by using a Unicode-aware regexp and normalization:
replace the character class /[^a-z0-9\s]/g with /[^\p{L}\p{N}\s]/gu and call
toLocaleLowerCase() instead of toLowerCase() so tokens like "ソフトウェアエンジニア" and
"Инженер данных" are preserved; keep the existing split(/\s+/) and
ROLE_STOPWORDS check but ensure ROLE_STOPWORDS entries are normalized to the
same locale-lowercase form when populated so comparisons align.

Comment thread update-system.mjs Outdated
'modes/training.md',
'modes/patterns.md',
'modes/followup.md',
'modes/interview-prep.md',

⚠️ Potential issue | 🟡 Minor

Document modes/interview-prep.md in the data contract.

SYSTEM_PATHS now auto-updates modes/interview-prep.md, but DATA_CONTRACT.md does not list that mode alongside the other system-layer mode files. Add it there too, or remove it from updater scope to avoid contract drift. As per coding guidelines, “DATA_CONTRACT.md: This file defines system vs user file boundaries.”

📄 Proposed contract alignment
 | `modes/patterns.md` | Pattern analysis instructions |
 | `modes/followup.md` | Follow-up cadence instructions |
+| `modes/interview-prep.md` | Interview preparation instructions |
 | `modes/de/*` | German language modes |
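
Drift of this kind could be caught mechanically. The following is a minimal sketch of such a check; `findContractDrift` is a hypothetical helper that does not exist in the repository. It assumes that system-layer entries appear as the first backticked cell of a Markdown table row and that a contract entry `dir/*` corresponds to `dir/` in `SYSTEM_PATHS`.

```javascript
// Hypothetical helper: compares the paths documented in the contract's
// Markdown tables with the updater's path list.
function findContractDrift(contractMarkdown, systemPaths) {
  // Capture the first backticked cell of each row, e.g. | `modes/de/*` | ... |
  const rowPattern = /^\s*\|\s*`([^`]+)`\s*\|/gm;
  const documented = new Set();
  for (const match of contractMarkdown.matchAll(rowPattern)) {
    // Assumption: "dir/*" in the contract maps to "dir/" in SYSTEM_PATHS.
    documented.add(match[1].replace(/\*$/, ''));
  }
  const listed = new Set(systemPaths);
  return {
    missingFromContract: systemPaths.filter(p => !documented.has(p)),
    missingFromUpdater: [...documented].filter(p => !listed.has(p)),
  };
}

const contract = [
  '| `modes/patterns.md` | Pattern analysis instructions |',
  '| `modes/followup.md` | Follow-up cadence instructions |',
  '| `modes/de/*` | German language modes |',
].join('\n');

const drift = findContractDrift(contract, [
  'modes/patterns.md',
  'modes/followup.md',
  'modes/interview-prep.md',
  'modes/de/',
]);
console.log(drift.missingFromContract); // → [ 'modes/interview-prep.md' ]
console.log(drift.missingFromUpdater);  // → []
```

In a real check, the contract text would be read from DATA_CONTRACT.md and the path list imported from the updater; both are passed in here so the sketch stays self-contained.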
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@update-system.mjs` at line 49, SYSTEM_PATHS currently includes
'modes/interview-prep.md' but DATA_CONTRACT.md does not list that mode; fix by
either adding "modes/interview-prep.md" to the system-layer section of
DATA_CONTRACT.md so it matches SYSTEM_PATHS, or remove 'modes/interview-prep.md'
from SYSTEM_PATHS in update-system.mjs to keep it a user-managed file; update
whichever file you change (DATA_CONTRACT.md or SYSTEM_PATHS constant in
update-system.mjs) so both sources align and the contract no longer drifts.

@darshan3131 darshan3131 force-pushed the fix/338-data-contract-langs branch from de7b557 to 99e6fd8 Compare April 18, 2026 07:21