fix: strptime day-name removal fails for non-ASCII byte encodings #106

Draft
Koan-Bot wants to merge 1 commit into cpan-authors:main from Koan-Bot:koan.atoomic/fix-strptime-nonascii-dayname

Koan-Bot commented Apr 6, 2026

What

Fix the strptime parser failing to strip day names in non-ASCII byte encodings (KOI8-R, CP1251, GB2312).

Why

On byte strings with no locale or Unicode rules in effect, Perl's \b word boundary uses ASCII-only \w/\W classification. High bytes (>0x7F) are all \W, so \b never matches between a non-ASCII day name and the following space. This broke ctime/str2time round-trips for the Russian, Russian_cp1251, Russian_koi8r, and Chinese_GB language modules.
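The boundary behavior described above can be reproduced in isolation. The following is a minimal sketch, not code from Parse.pm: the byte sequence `"\xF0\xCE"` is an assumed stand-in for a day abbreviation in a single-byte Cyrillic encoding, and the script assumes neither `use locale` nor the `unicode_strings` feature is active.

```perl
use strict;
use warnings;

# Assumed inputs: "\xF0\xCE" stands in for a day abbreviation whose bytes
# are all above 0x7F, as in a single-byte Cyrillic encoding.
my $ascii = "Mon 06 Apr 2026";
my $bytes = "\xF0\xCE 06 Apr 2026";

# \b needs a \w character on exactly one side. With ASCII-only \w, a high
# byte followed by a space is \W next to \W, so there is no boundary.
my $ascii_b = ($ascii =~ /^Mon\b/)       ? 1 : 0;
my $bytes_b = ($bytes =~ /^\xF0\xCE\b/)  ? 1 : 0;
my $bytes_s = ($bytes =~ /^\xF0\xCE\s+/) ? 1 : 0;

print "ASCII day name, \\b:      $ascii_b\n";
print "high-byte day name, \\b:  $bytes_b\n";
print "high-byte day name, \\s+: $bytes_s\n";
```

Under those assumptions the three flags come out as 1, 0, 1: \b matches after the ASCII name, fails after the high-byte name, and \s+ matches regardless of what the bytes are.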

How

Replace \s*...\b with \s+ in the day-name removal regex (line 87 of Parse.pm). Day names in date strings are always followed by whitespace, so requiring at least one trailing whitespace character still prevents partial matches (e.g. French "mar" inside "mars") and works for all encodings.
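To illustrate the trade-off, here is a rough sketch comparing the two approaches. Everything in it is hypothetical: `$daypat`, `strip_day`, and the sample strings are made up for the example, and `$old`/`$new` are simplified stand-ins that leave out the elided part of the real pattern. They are not the actual line from Parse.pm.

```perl
use strict;
use warnings;

# Hypothetical day-name alternation: French "mar" plus a high-byte token.
my $daypat = join '|', 'mar', "\xF0\xCE";

# Simplified stand-ins for the old and new patterns (not the Parse.pm line).
my $old = qr/(?:$daypat)\s*\b/;
my $new = qr/(?:$daypat)\s+/;

# Replace the first day-name match with a single space.
sub strip_day {
    my ($re, $str) = @_;
    (my $out = $str) =~ s/$re/ /;
    return $out;
}

my $french_month = "5 mars 2026";
my $french_day   = "mar 5 mars 2026";
# Day token followed by another high-byte token, as a month name would be.
my $high_bytes   = "\xF0\xCE \xE1\xD0\xD2 6 2026";

printf "new leaves 'mars' alone:    %s\n",
    strip_day($new, $french_month) eq $french_month ? 'yes' : 'no';
printf "new strips leading 'mar':   %s\n",
    strip_day($new, $french_day) eq " 5 mars 2026" ? 'yes' : 'no';
printf "old strips high-byte day:   %s\n",
    strip_day($old, $high_bytes) ne $high_bytes ? 'yes' : 'no';
printf "new strips high-byte day:   %s\n",
    strip_day($new, $high_bytes) ne $high_bytes ? 'yes' : 'no';
```

In this simplified form the old pattern fails only when the token after the day name also starts with a high byte, as in a ctime-style string where a localized month name follows. If a digit follows, `\s*` can consume the space and \b still matches before the digit.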

Testing

  • All 1238 tests pass (28 test files)
  • Extended t/lang.t from 18 to all 36 language modules for round-trip coverage
  • Verified French "mars" still parses correctly (the key edge case that \b was protecting)

🤖 Generated with Claude Code


Quality Report

Changes: 2 files changed, 8 insertions(+), 6 deletions(-)

Code scan: clean

Tests: skipped

Branch hygiene: clean

Generated by Kōan post-mission quality pipeline

The \b word boundary anchor in the strptime parser's day-name removal
regex does not work with non-ASCII byte encodings (KOI8-R, CP1251,
GB2312). Perl's \b checks \w/\W boundaries using ASCII rules, so high
bytes (>0x7F) are all classified as \W, and \b never matches between
a non-ASCII day name and the following space.

This broke ctime/str2time round-trips for Russian, Russian_cp1251,
Russian_koi8r, and Chinese_GB language modules.

Fix: replace \s*...\b with \s+ (require at least one space after the
day name). Day names in date strings are always separated by whitespace,
so requiring \s+ correctly prevents partial matches (e.g. "mar" inside
French "mars") while working for all encodings.

Also extends t/lang.t to test all 36 language modules for round-trip
correctness (was 18, now 36).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>