fix(loader): match ignore patterns on repo-relative paths#13
fix(loader): match ignore patterns on repo-relative paths#13Bbowlby22 wants to merge 2 commits intoHKUDS:mainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes ignore-directory matching in RepositoryLoader.scan_files() by ensuring gitwildmatch ignore patterns are evaluated against repo-relative (normalized) paths, improving correctness when skipping directories like output/ or .venv/.
Changes:
- Convert walked directory paths to normalized repo-relative paths before applying ignore patterns.
- Check both
rel_pathandrel_path + "/"to reliably match directory-style patterns. - Normalize
relative_pathfor file-level ignore checks.
Comments suppressed due to low confidence (1)
fastcode/loader.py:276
relative_pathis already normalized when assigned, but it’s normalized again when building the metadata dict ("relative_path": normalize_path(relative_path)). This second normalization is redundant; consider storingrelative_pathdirectly to avoid extra work in tight loops.
relative_path = normalize_path(
os.path.relpath(file_path, self.repo_path)
)
# Check if should ignore
if should_ignore_path(relative_path, effective_ignore):
continue
# Check if supported extension
if not is_supported_file(file_path, self.supported_extensions):
continue
# Check file size
try:
file_size = os.path.getsize(file_path)
if file_size > max_file_size_bytes:
self.logger.warning(
f"Skipping large file: {relative_path} "
f"({file_size / 1024 / 1024:.2f} MB)"
)
continue
files.append({
"path": normalize_path(file_path),
"relative_path": normalize_path(relative_path),
"size": file_size,
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| for root, dirs, filenames in os.walk(self.repo_path): | ||
| # Filter out ignored directories | ||
| dirs[:] = [d for d in dirs if not should_ignore_path( | ||
| os.path.join(root, d), effective_ignore | ||
| )] | ||
| # Filter ignored directories using repo-relative paths so gitwildmatch | ||
| # patterns like "output/" or ".venv/" match consistently. | ||
| filtered_dirs = [] | ||
| for d in dirs: | ||
| abs_dir_path = os.path.join(root, d) | ||
| rel_dir_path = normalize_path( | ||
| os.path.relpath(abs_dir_path, self.repo_path) | ||
| ) | ||
| rel_dir_with_trailing = f"{rel_dir_path}/" | ||
| if should_ignore_path( | ||
| rel_dir_path, effective_ignore | ||
| ) or should_ignore_path(rel_dir_with_trailing, effective_ignore): | ||
| continue | ||
| filtered_dirs.append(d) | ||
| dirs[:] = filtered_dirs |
There was a problem hiding this comment.
should_ignore_path() recompiles a PathSpec on every call (see fastcode/utils.py). In this loop it’s now invoked multiple times per directory and once per file, which can be a significant CPU cost on large repos. Consider compiling the ignore spec once in scan_files() (e.g., build a PathSpec from effective_ignore) and then calling spec.match_file(...) for both directory and file checks.
|
Correction to previous comment (shell interpolation noise): Added follow-up perf hardening requested by review:
Validation:
New commit on this PR branch:
|
Summary
RepositoryLoader.scan_files()to use repo-relative paths.gitignoremerge behavior (effective_ignore) intactoutput/,.venv/) are honored reliablyRoot cause
should_ignore_path()evaluates gitwildmatch patterns against paths relative to the repo, but the current directory filter passed absolute paths fromos.walk(). This caused false misses and over-indexing in large repos.Change
self.repo_pathrel_pathandrel_path + "/"for directory-style patternsrelative_pathScope
fastcode/loader.py)Validation
python -m py_compile fastcode/loader.py