feat: Implement the documented "add_repo_metadata" functionality #2304
asbach wants to merge 14 commits into The-PR-Agent:main
Conversation
Review Summary by Qodo

Implement add_repo_metadata feature to inject repository metadata files into tool instructions.

Description
- Adds get_repo_file() method to all git providers for reading arbitrary repository files
- Implements add_repo_metadata feature to load metadata files (AGENTS.md, QODO.md, CLAUDE.md) from the PR's head branch
- Injects loaded metadata content into extra_instructions for all supported tools
- Includes comprehensive unit tests for metadata loading functionality

Diagram

```mermaid
flowchart LR
    A["PR's Head Branch"] -->|get_repo_file| B["Git Providers"]
    B -->|Read Metadata Files| C["AGENTS.md, QODO.md, CLAUDE.md"]
    C -->|Combine Content| D["Metadata Content"]
    D -->|Append to| E["Tool extra_instructions"]
    E -->|Applied to| F["All PR Tools"]
```
File Changes
1. pr_agent/git_providers/git_provider.py
Code Review by Qodo
1. add_repo_metadata_file_list not validated
Persistent review updated to latest commit 10994bc
Sanitize file paths from add_repo_metadata_file_list before passing them to any git provider, preventing directory traversal attacks (e.g. "../etc/passwd"). Validation rejects absolute paths, ".." components, and backslashes. Removes redundant defensive check from GerritProvider in favor of the single central gate. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Persistent review updated to latest commit 772dbaa
Catch GitlabGetError specifically for expected not-found cases and log unexpected errors at debug level instead of silently suppressing all exceptions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
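The error-handling change described above can be sketched as follows. This is a minimal illustration, not the provider's actual code: `GitlabGetError` is stubbed locally so the snippet is self-contained (the real class lives in `gitlab.exceptions`), and `get_repo_file`'s signature here is hypothetical.

```python
class GitlabGetError(Exception):
    """Stand-in for gitlab.exceptions.GitlabGetError (illustrative stub)."""
    def __init__(self, response_code=None):
        super().__init__(f"GitLab error (HTTP {response_code})")
        self.response_code = response_code


def get_repo_file(project, file_path, ref, log=print):
    """Catch the expected not-found case specifically; log anything else
    instead of silently suppressing all exceptions."""
    try:
        # python-gitlab: files.get() returns a ProjectFile; .decode() yields bytes
        return project.files.get(file_path=file_path, ref=ref).decode().decode("utf-8")
    except GitlabGetError as e:
        if e.response_code == 404:
            return ""  # expected: the file simply doesn't exist on this branch
        log(f"Unexpected GitLab error reading {file_path}: {e}")
        return ""
```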
Persistent review updated to latest commit 38af09e
```python
def get_repo_file(self, file_path: str) -> str:
    try:
        head_sha = self.pr.last_merge_commit
        version = GitVersionDescriptor(
            version=head_sha.commit_id, version_type="commit"
        ) if head_sha else None
        contents = self.azure_devops_client.get_item_content(
            repository_id=self.repo_slug,
            project=self.workspace_slug,
            download=False,
            include_content_metadata=False,
            include_content=True,
            path=file_path,
            version_descriptor=version,
        )
        content = list(contents)[0]
        return content.decode("utf-8") if isinstance(content, bytes) else content
        # (remainder of the method truncated in this diff excerpt)
```
1. Azure reads merged commit 🐞 Bug ≡ Correctness
AzureDevopsProvider.get_repo_file() reads files at self.pr.last_merge_commit (the merge result) instead of the PR source/head commit, so metadata may reflect the merged preview rather than the branch under review. This is inconsistent with the other providers in this PR, which explicitly read from the PR/MR source branch or head SHA.
Agent Prompt
### Issue description
`AzureDevopsProvider.get_repo_file()` fetches repository files using `self.pr.last_merge_commit`, which corresponds to the merge-result/merge-preview commit, not the PR source/head. Repository metadata should reflect the branch under review.
### Issue Context
Other providers in this PR intentionally read from the PR/MR source branch or head SHA.
### Fix Focus Areas
- pr_agent/git_providers/azuredevops_provider.py[177-193]
### Fix approach
Fetch the file using the PR source/head commit instead of `last_merge_commit`.
- Prefer Azure PR fields that represent the source commit (e.g., `last_merge_source_commit` if available on the PR model).
- If the model doesn’t expose it directly, derive the correct commit via the PR’s source ref and latest commit and use that commit id in `GitVersionDescriptor`.
- Ensure the ref used here matches what `get_pr_branch()` identifies as the PR source branch.
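The fix approach above can be sketched as a small resolver. This is a hedged illustration: attribute names follow the azure-devops Python SDK's GitPullRequest model (`last_merge_source_commit`, `last_merge_commit`), and the helper name is invented for this example.

```python
def resolve_source_commit_id(pr):
    """Return the commit id of the PR's source/head branch, preferring the
    true source commit over the merge-preview commit."""
    source = getattr(pr, "last_merge_source_commit", None)
    if source is not None and getattr(source, "commit_id", None):
        return source.commit_id  # actual head of the source branch
    # Fallback only: last_merge_commit is the merge-preview result,
    # which is exactly what the review comment warns against.
    merged = getattr(pr, "last_merge_commit", None)
    return getattr(merged, "commit_id", None)
```

The resolved id would then feed the `GitVersionDescriptor(version=..., version_type="commit")` used when fetching the file.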
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Persistent review updated to latest commit 7b9f60e
Regarding the above: I agree that this may be problematic. My implementation follows the existing documentation, and I think the benefit of not needing two PRs when the instructions change while working on a feature outweighs the risk of adding bad or malicious context to the reviewer instructions.
…ettings() calls: Guard metadata injection with a context flag (Starlette request context in server mode, settings flag as fallback in CLI mode) so that repeated calls within the same request do not re-append metadata to extra_instructions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Persistent review updated to latest commit 9a85654
…cation: The settings-based _repo_metadata_applied flag persisted across PRs in polling/CLI mode, suppressing metadata loading after the first PR. Replace with content-based deduplication (check if combined_metadata is already present in extra_instructions) and keep the Starlette context flag for server mode only. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
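The content-based deduplication described in that commit can be sketched as a tiny helper. The function name is illustrative, not the actual code in the PR:

```python
def append_metadata_once(extra_instructions, combined_metadata):
    """Append combined metadata to extra_instructions only if it isn't
    already present, so repeated calls are idempotent without a flag."""
    if not combined_metadata:
        return extra_instructions or ""
    current = extra_instructions or ""
    if combined_metadata in current:
        return current  # already applied for this PR; don't re-append
    return f"{current}\n{combined_metadata}".strip()
```

Because the check is on content rather than a persistent flag, a new PR with different metadata is never suppressed, which was the bug in the flag-based approach.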
Persistent review updated to latest commit 5bdc1a9
Persistent review updated to latest commit 7454325
/describe

PR Description updated to latest commit (7454325)
The _reset_settings fixture now snapshots and restores add_repo_metadata and add_repo_metadata_file_list in addition to extra_instructions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Persistent review updated to latest commit 87672ea
GerritProvider.get_repo_file() previously used open() on the working directory, allowing access to untracked files like .git/config. Now reads from self.repo.head.commit.tree to match other providers' behavior of only serving committed content. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
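The change described above (serving only committed content) can be sketched with a tree lookup. This is a hedged illustration: `tree` stands for a GitPython commit tree such as `repo.head.commit.tree`, and the helper name is invented here.

```python
def read_committed_file(tree, file_path: str) -> str:
    """Read a file from a committed tree rather than the working directory,
    so untracked files like .git/config can never be served."""
    try:
        blob = tree / file_path  # GitPython tree lookup; raises KeyError if absent
    except KeyError:
        return ""
    data = blob.data_stream.read()
    return data.decode("utf-8") if isinstance(data, bytes) else data
```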
Persistent review updated to latest commit 524ef53
…ile_path: URL-decode input before validation so encoded sequences like %2e%2e/ cannot bypass traversal checks. Also reject paths with residual % (double-encoding) and validate segments individually rather than relying solely on prefix/substring matching. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
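A minimal sketch of the decode-then-validate approach that commit describes (illustrative, not the PR's exact code):

```python
from urllib.parse import unquote


def is_safe_repo_file_path(file_path: str) -> bool:
    """Decode once, reject double-encoding, then validate each segment."""
    if not file_path or not file_path.strip():
        return False
    decoded = unquote(file_path)
    if "%" in decoded:  # residual % after one decode: likely double-encoded
        return False
    if "\\" in decoded or decoded.startswith("/"):
        return False
    # Validate segments individually instead of substring matching
    return all(seg not in ("", ".", "..") for seg in decoded.split("/"))
```

Checking segments individually avoids false positives like rejecting a legitimate file named `notes..md`, while still catching `%2e%2e/` and `%252e%252e/` bypass attempts.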
Persistent review updated to latest commit 29b6a1d
```python
if not repo_metadata_applied and get_settings().config.get("add_repo_metadata", False):
    try:
        tool_sections = [
            "pr_reviewer",
            "pr_description",
            "pr_code_suggestions",
            "pr_add_docs",
            "pr_update_changelog",
            "pr_test",
            "pr_improve_component",
        ]

        # In non-context runtimes (CLI, polling), restore extra_instructions to their
        # pre-metadata baseline so metadata from a previous PR doesn't persist.
        global _extra_instructions_baseline
        is_context_mode = False
        try:
            is_context_mode = context.exists()
        except Exception:
            pass
        if not is_context_mode:
            if _extra_instructions_baseline:
                # Restore baseline before applying this PR's metadata
                for section, baseline_value in _extra_instructions_baseline.items():
                    get_settings().set(f"{section}.extra_instructions", baseline_value)
            else:
                # First run: capture the current values as the baseline
                for section in tool_sections:
                    section_obj = get_settings().get(section, None)
                    if section_obj is not None and hasattr(section_obj, "extra_instructions"):
                        _extra_instructions_baseline[section] = section_obj.extra_instructions or ""

        metadata_files = get_settings().config.get("add_repo_metadata_file_list",
                                                   ["AGENTS.md", "QODO.md", "CLAUDE.md"])

        # Collect contents of all metadata files that exist in the repo
        metadata_content_parts = []
        for file_name in metadata_files:
            # (loop body truncated in this diff excerpt)
```
1. add_repo_metadata_file_list not validated 📘 Rule violation ≡ Correctness
add_repo_metadata_file_list is used directly without validating/normalizing its type/contents, which can lead to incorrect behavior (e.g., iterating over characters if a string is provided) or runtime errors. The checklist requires normalizing and validating user-provided settings before using them in logic.
Agent Prompt
## Issue description
`config.add_repo_metadata_file_list` is consumed without validation. If the setting is mis-typed (e.g., a string, `None`, or a list containing non-strings), metadata loading can behave incorrectly or crash.
## Issue Context
This setting is user-provided via TOML/env/config overrides and must be normalized to a predictable shape before iterating.
## Fix Focus Areas
- pr_agent/git_providers/utils.py[142-186]
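One way to address this is a normalization helper applied before iterating. This is a hedged sketch with invented names (`normalize_metadata_file_list`, `DEFAULT_METADATA_FILES`); the accepted shapes are assumptions about how the setting might be mis-typed:

```python
DEFAULT_METADATA_FILES = ["AGENTS.md", "QODO.md", "CLAUDE.md"]


def normalize_metadata_file_list(raw):
    """Normalize the user-provided add_repo_metadata_file_list setting.
    Accepts a list of strings or a single comma-separated string; anything
    else falls back to the documented defaults."""
    if isinstance(raw, str):
        raw = [part.strip() for part in raw.split(",")]
    if not isinstance(raw, (list, tuple)):
        return list(DEFAULT_METADATA_FILES)
    cleaned = [f.strip() for f in raw if isinstance(f, str) and f.strip()]
    return cleaned or list(DEFAULT_METADATA_FILES)
```

With this in place, a string setting no longer iterates character by character, and `None` or non-string entries degrade to the defaults instead of raising.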
```python
def _is_safe_repo_file_path(file_path: str) -> bool:
    """
    Validate that a file path is safe to read from a repository root.
    Rejects absolute paths, paths with '..' traversal components, backslashes,
    and percent-encoded bypass attempts.
    """
    if not file_path or not file_path.strip():
        return False
        # (remaining checks truncated in this diff excerpt)
```
2. _is_safe_repo_file_path() lacks type guard 📘 Rule violation ☼ Reliability
_is_safe_repo_file_path() calls .strip() on file_path without confirming it is a string, which can raise AttributeError if non-string values reach it (e.g., from configuration). The checklist requires defensive type checks before calling methods on external inputs.
Agent Prompt
## Issue description
`_is_safe_repo_file_path()` calls `file_path.strip()` without first ensuring `file_path` is a string. This can raise at runtime if configuration contains non-string entries.
## Issue Context
Even though the type annotation is `str`, configuration values are external inputs and can be malformed.
## Fix Focus Areas
- pr_agent/git_providers/utils.py[20-49]
```python
head_sha = self.pr.fromRef['latestCommit']
content = self.get_file(file_path, head_sha)
```
3. fromRef['latestCommit'] unguarded access 📘 Rule violation ☼ Reliability
bitbucket_server_provider.get_repo_file() directly indexes self.pr.fromRef['latestCommit'] without checking the key exists or the shape is as expected. This can raise KeyError/TypeError on unexpected webhook/provider payload shapes.
Agent Prompt
## Issue description
`self.pr.fromRef['latestCommit']` is accessed via direct indexing, which can crash if `fromRef` is missing, not a dict, or the key is absent.
## Issue Context
Provider payloads are external inputs and may vary across Bitbucket Server versions / API responses.
## Fix Focus Areas
- pr_agent/git_providers/bitbucket_server_provider.py[119-123]
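Guarded access can be sketched as follows. The helper name is illustrative; the point is to use shape checks and `.get()` so unexpected payloads yield `None` rather than `KeyError`/`TypeError`:

```python
def resolve_from_ref_commit(pr):
    """Safely extract the source-branch head SHA from a Bitbucket Server
    PR payload, tolerating missing or malformed fromRef shapes."""
    from_ref = getattr(pr, "fromRef", None)
    if not isinstance(from_ref, dict):
        return None
    commit = from_ref.get("latestCommit")
    return commit if isinstance(commit, str) else None
```

The caller can then treat `None` as "metadata unavailable" instead of crashing on an unexpected webhook shape.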
```python
def get_repo_file(self, file_path: str) -> str:
    try:
        # Read from the PR's source branch so metadata files reflect the branch under review
        url = (f"https://api.bitbucket.org/2.0/repositories/{self.workspace_slug}/{self.repo_slug}/src/"
               f"{self.pr.source_branch}/{file_path}")
        response = requests.request("GET", url, headers=self.headers)
        if response.status_code == 404:
            return ""
        response.raise_for_status()
        return response.text
        # (remainder of the method truncated in this diff excerpt)
```
4. Bitbucket file fetch can hang 🐞 Bug ☼ Reliability
BitbucketProvider.get_repo_file performs an HTTP GET without a timeout, so enabling add_repo_metadata can block apply_repo_settings indefinitely on stalled connections. Because apply_repo_settings is called before request handling logic, this can stall the whole PR-agent flow.
Agent Prompt
### Issue description
`BitbucketProvider.get_repo_file()` makes an outbound HTTP request without a timeout. In network stalls, this can hang indefinitely and block `apply_repo_settings()`.
### Issue Context
`apply_repo_settings()` runs before PR processing and calls `get_repo_file()` for each metadata filename, so this path is on the critical startup path when `add_repo_metadata=true`.
### Fix Focus Areas
- Add an explicit `timeout=` to the Bitbucket `requests.request("GET", ...)` call.
- Prefer a configurable timeout (or a sensible constant) and handle timeout exceptions similarly to other request errors.
- pr_agent/git_providers/bitbucket_provider.py[92-108]
- pr_agent/git_providers/utils.py[174-186]
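The fix amounts to passing `timeout=` on the HTTP call and treating timeouts like other request errors. A hedged sketch: the function name, the `METADATA_FETCH_TIMEOUT` constant, and the injectable `http` parameter are inventions for this example; in the provider the fix is simply adding `timeout=...` to the existing `requests.request("GET", ...)` call.

```python
METADATA_FETCH_TIMEOUT = 10  # seconds; illustrative constant, not an existing setting


def fetch_repo_file(url: str, headers: dict, http=None) -> str:
    """Fetch a repo file with an explicit timeout so a stalled connection
    cannot block apply_repo_settings() indefinitely. `http` is any object
    with a requests-style .get(); it defaults to the real requests module."""
    if http is None:
        import requests  # assumed available, as in the provider code
        http = requests
    try:
        response = http.get(url, headers=headers, timeout=METADATA_FETCH_TIMEOUT)
    except Exception:
        return ""  # treat timeouts/network errors like a missing file
    if response.status_code == 404:
        return ""
    response.raise_for_status()
    return response.text
```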
User description
Adds the already documented `add_repo_metadata` and `add_repo_metadata_file_list` features.
Supports reading additional instructions from AGENTS.md, CLAUDE.md, etc., or any other custom instructions file.
Fixes #2174
If the instructions become too long, the current pr-agent implementation would start clipping the diff rather than the metadata. I can add some kind of sensible max length if required.
PR Type
Enhancement
Description
- Implement documented add_repo_metadata feature to load custom instructions
- Add get_repo_file() method across all git providers for file retrieval
- Validate file paths to prevent directory traversal attacks
- Support content-based deduplication so metadata is not appended twice
- Add comprehensive unit tests for metadata loading functionality
Diagram Walkthrough
File Walkthrough
10 files:
- Add metadata loading and path validation logic
- Add abstract get_repo_file method to base class
- Implement get_repo_file for GitHub provider
- Implement get_repo_file for GitLab provider
- Implement get_repo_file for Bitbucket provider
- Implement get_repo_file for Bitbucket Server provider
- Implement get_repo_file for Azure DevOps provider
- Implement get_repo_file for Gitea provider
- Implement get_repo_file for CodeCommit provider
- Implement get_repo_file for Gerrit provider

1 file:
- Add configuration options for metadata feature

1 file:
- Add comprehensive unit tests for metadata feature