docs(data-contract): list fr/ja/pt/ru modes alongside de (#338)#355
docs(data-contract): list fr/ja/pt/ru modes alongside de (#338)#355darshan3131 wants to merge 3 commits intosantifer:mainfrom
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe pull request adds four new wildcard entries to the System Layer file mapping in DATA_CONTRACT.md, classifying additional language mode directories (French, Japanese, Portuguese, Russian) as system files, extending an existing pattern. Changes
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~2 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@DATA_CONTRACT.md`:
- Around line 48-51: The DATA_CONTRACT now marks modes/fr/*, modes/ja/*,
modes/pt/*, and modes/ru/* as auto-updatable but update-system.mjs still only
includes modes/de/ in the SYSTEM_PATHS array; update SYSTEM_PATHS in
update-system.mjs to include the new language path patterns (e.g., "modes/fr/",
"modes/ja/", "modes/pt/", "modes/ru/" or glob equivalents) so these system-layer
files participate in update and rollback flows, ensuring you reference the
SYSTEM_PATHS constant and any code that iterates over it (e.g., the
updater/rollback logic) so the new mode directories are treated as system-layer,
non-user content.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: a976019f-d9c4-4edb-8deb-dc4b9e28edc1
📒 Files selected for processing (1)
DATA_CONTRACT.md
|
Flag from CodeRabbit is addressed by #337 in this same PR series. #337 adds modes/fr, modes/ja, modes/pt, modes/ru (plus the missing scripts) to SYSTEM_PATHS in update-system.mjs, closing the contract drift. |
de7b557 to
f2b1ffa
Compare
f2b1ffa to
de7b557
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
update-system.mjs (1)
47-70:⚠️ Potential issue | 🟠 MajorRollback will not remove newly introduced system files.
These newly added paths can be created by
apply(), butrollback()only doesgit checkout <backup> -- <path>and ignores failures. If a path did not exist in the backup branch, checkout fails and the new file/directory remains tracked after rollback, so rollback is incomplete.🛠️ Proposed rollback fix
// Checkout system files from backup branch for (const path of SYSTEM_PATHS) { try { git('checkout', latest, '--', path); } catch { - // File may not have existed in backup + // File may not have existed in backup; remove newly introduced system files. + try { + git('rm', '-r', '--ignore-unmatch', '--', path); + } catch { + // Nothing to remove + } } }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@update-system.mjs` around lines 47 - 70, The rollback() in update-system.mjs doesn't remove files that were newly created by apply(), leaving them tracked if the backup branch lacked those paths; update rollback() to, for each path listed (the same list used by apply()), attempt the existing git checkout <backup> -- <path>, and if that fails or the path is absent in the backup, explicitly remove the newly introduced file(s) from the working tree/index (e.g., via git rm --cached/--force for files and recursive removal for directories, or by running a targeted git clean for untracked paths) so rollback fully reverts changes created by apply(); modify the rollback() routine to check the checkout result and handle cleanup for missing-backup paths accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@merge-tracker.mjs`:
- Around line 97-120: The tokenization in roleTokens is ASCII-only and yields
empty arrays for non-Latin roles, causing roleFuzzyMatch to bail early; fix
roleTokens by using a Unicode-aware regexp and normalization: replace the
character class /[^a-z0-9\s]/g with /[^\p{L}\p{N}\s]/gu and call
toLocaleLowerCase() instead of toLowerCase() so tokens like "ソフトウェアエンジニア" and
"Инженер данных" are preserved; keep the existing split(/\s+/) and
ROLE_STOPWORDS check but ensure ROLE_STOPWORDS entries are normalized to the
same locale-lowercase form when populated so comparisons align.
In `@update-system.mjs`:
- Line 49: SYSTEM_PATHS currently includes 'modes/interview-prep.md' but
DATA_CONTRACT.md does not list that mode; fix by either adding
"modes/interview-prep.md" to the system-layer section of DATA_CONTRACT.md so it
matches SYSTEM_PATHS, or remove 'modes/interview-prep.md' from SYSTEM_PATHS in
update-system.mjs to keep it a user-managed file; update whichever file you
change (DATA_CONTRACT.md or SYSTEM_PATHS constant in update-system.mjs) so both
sources align and the contract no longer drifts.
---
Outside diff comments:
In `@update-system.mjs`:
- Around line 47-70: The rollback() in update-system.mjs doesn't remove files
that were newly created by apply(), leaving them tracked if the backup branch
lacked those paths; update rollback() to, for each path listed (the same list
used by apply()), attempt the existing git checkout <backup> -- <path>, and if
that fails or the path is absent in the backup, explicitly remove the newly
introduced file(s) from the working tree/index (e.g., via git rm
--cached/--force for files and recursive removal for directories, or by running
a targeted git clean for untracked paths) so rollback fully reverts changes
created by apply(); modify the rollback() routine to check the checkout result
and handle cleanup for missing-backup paths accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 65adeb82-05cc-497b-8e39-73154dcbca76
📒 Files selected for processing (6)
DATA_CONTRACT.mdVERSIONmerge-tracker.mjsmodes/scan.mdpackage.jsonupdate-system.mjs
| function roleTokens(s) { | ||
| return s | ||
| .toLowerCase() | ||
| .replace(/[^a-z0-9\s]/g, ' ') | ||
| .split(/\s+/) | ||
| .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w)); | ||
| } | ||
|
|
||
| function roleFuzzyMatch(a, b) { | ||
| const wordsA = a.toLowerCase().split(/\s+/).filter(w => w.length > 3); | ||
| const wordsB = b.toLowerCase().split(/\s+/).filter(w => w.length > 3); | ||
| const overlap = wordsA.filter(w => wordsB.some(wb => wb.includes(w) || w.includes(wb))); | ||
| return overlap.length >= 2; | ||
| const wordsA = roleTokens(a); | ||
| const wordsB = roleTokens(b); | ||
| if (wordsA.length === 0 || wordsB.length === 0) return false; | ||
|
|
||
| const setB = new Set(wordsB); | ||
| const overlap = wordsA.filter(w => setB.has(w)).length; | ||
| if (overlap === 0) return false; | ||
|
|
||
| // Jaccard-style ratio on content tokens. Two roles are "the same" only | ||
| // when the overlap dominates the smaller side — not when they just share | ||
| // a location + "engineer". | ||
| const minLen = Math.min(wordsA.length, wordsB.length); | ||
| const ratio = overlap / minLen; | ||
|
|
||
| return overlap >= 2 && ratio >= 0.6; |
There was a problem hiding this comment.
❓ Verification inconclusive
Script executed:
#!/bin/bash
# Description: Demonstrate that current ASCII-only tokenization drops non-English role signal.
# Expected: Japanese/Russian examples show empty or near-empty current tokens.
node --input-type=module <<'NODE'
const ROLE_STOPWORDS = new Set(['senior', 'remote', 'role', 'position', 'team', 'based']);
function currentRoleTokens(s) {
return s
.toLowerCase()
.replace(/[^a-z0-9\s]/g, ' ')
.split(/\s+/)
.filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
}
function unicodeRoleTokens(s) {
return s
.normalize('NFKC')
.toLocaleLowerCase()
.replace(/[^\p{L}\p{N}\s]/gu, ' ')
.split(/\s+/u)
.filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
}
for (const title of [
'Développeur logiciel senior',
'Engenheiro de Dados',
'ソフトウェアエンジニア',
'Инженер данных',
]) {
console.log(title);
console.log(' current:', currentRoleTokens(title));
console.log(' unicode:', unicodeRoleTokens(title));
}
NODERepository: santifer/career-ops
Repository: santifer/career-ops
Exit code: 0
stdout:
Développeur logiciel senior
current: [ 'veloppeur', 'logiciel' ]
unicode: [ 'développeur', 'logiciel' ]
Engenheiro de Dados
current: [ 'engenheiro', 'dados' ]
unicode: [ 'engenheiro', 'dados' ]
ソフトウェアエンジニア
current: []
unicode: [ 'ソフトウェアエンジニア' ]
Инженер данных
current: []
unicode: [ 'инженер', 'данных' ]
Unicode-aware tokenization required for non-Latin role matching.
Current ASCII-only tokenization (/[^a-z0-9\s]/g) produces empty token arrays for Japanese and Russian role titles, breaking deduplication. Example outputs:
- Japanese "ソフトウェアエンジニア" → current:
[]| proposed:['ソフトウェアエンジニア'] - Russian "Инженер данных" → current:
[]| proposed:['инженер', 'данных']
Line 108 returns false when either token set is empty, so duplicate entries in language-mode trackers go undetected.
Proposed fix
function roleTokens(s) {
return s
- .toLowerCase()
- .replace(/[^a-z0-9\s]/g, ' ')
- .split(/\s+/)
+ .normalize('NFKC')
+ .toLocaleLowerCase()
+ .replace(/[^\p{L}\p{N}\s]/gu, ' ')
+ .split(/\s+/u)
.filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w));
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| function roleTokens(s) { | |
| return s | |
| .toLowerCase() | |
| .replace(/[^a-z0-9\s]/g, ' ') | |
| .split(/\s+/) | |
| .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w)); | |
| } | |
| function roleFuzzyMatch(a, b) { | |
| const wordsA = a.toLowerCase().split(/\s+/).filter(w => w.length > 3); | |
| const wordsB = b.toLowerCase().split(/\s+/).filter(w => w.length > 3); | |
| const overlap = wordsA.filter(w => wordsB.some(wb => wb.includes(w) || w.includes(wb))); | |
| return overlap.length >= 2; | |
| const wordsA = roleTokens(a); | |
| const wordsB = roleTokens(b); | |
| if (wordsA.length === 0 || wordsB.length === 0) return false; | |
| const setB = new Set(wordsB); | |
| const overlap = wordsA.filter(w => setB.has(w)).length; | |
| if (overlap === 0) return false; | |
| // Jaccard-style ratio on content tokens. Two roles are "the same" only | |
| // when the overlap dominates the smaller side — not when they just share | |
| // a location + "engineer". | |
| const minLen = Math.min(wordsA.length, wordsB.length); | |
| const ratio = overlap / minLen; | |
| return overlap >= 2 && ratio >= 0.6; | |
| function roleTokens(s) { | |
| return s | |
| .normalize('NFKC') | |
| .toLocaleLowerCase() | |
| .replace(/[^\p{L}\p{N}\s]/gu, ' ') | |
| .split(/\s+/u) | |
| .filter(w => w.length > 3 && !ROLE_STOPWORDS.has(w)); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@merge-tracker.mjs` around lines 97 - 120, The tokenization in roleTokens is
ASCII-only and yields empty arrays for non-Latin roles, causing roleFuzzyMatch
to bail early; fix roleTokens by using a Unicode-aware regexp and normalization:
replace the character class /[^a-z0-9\s]/g with /[^\p{L}\p{N}\s]/gu and call
toLocaleLowerCase() instead of toLowerCase() so tokens like "ソフトウェアエンジニア" and
"Инженер данных" are preserved; keep the existing split(/\s+/) and
ROLE_STOPWORDS check but ensure ROLE_STOPWORDS entries are normalized to the
same locale-lowercase form when populated so comparisons align.
| 'modes/training.md', | ||
| 'modes/patterns.md', | ||
| 'modes/followup.md', | ||
| 'modes/interview-prep.md', |
There was a problem hiding this comment.
Document modes/interview-prep.md in the data contract.
SYSTEM_PATHS now auto-updates modes/interview-prep.md, but DATA_CONTRACT.md does not list that mode alongside the other system-layer mode files. Add it there too, or remove it from updater scope to avoid contract drift. As per coding guidelines, “DATA_CONTRACT.md: This file defines system vs user file boundaries.”
📄 Proposed contract alignment
| `modes/patterns.md` | Pattern analysis instructions |
| `modes/followup.md` | Follow-up cadence instructions |
+| `modes/interview-prep.md` | Interview preparation instructions |
| `modes/de/*` | German language modes |🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@update-system.mjs` at line 49, SYSTEM_PATHS currently includes
'modes/interview-prep.md' but DATA_CONTRACT.md does not list that mode; fix by
either adding "modes/interview-prep.md" to the system-layer section of
DATA_CONTRACT.md so it matches SYSTEM_PATHS, or remove 'modes/interview-prep.md'
from SYSTEM_PATHS in update-system.mjs to keep it a user-managed file; update
whichever file you change (DATA_CONTRACT.md or SYSTEM_PATHS constant in
update-system.mjs) so both sources align and the contract no longer drifts.
de7b557 to
99e6fd8
Compare
What does this PR do?
Related issue
Type of change
Checklist
node test-all.mjsand all tests passQuestions? Join the Discord for faster feedback.
Summary by CodeRabbit