Skip to content

Conversation

@dtsong
Copy link
Owner

@dtsong dtsong commented Jan 24, 2026

Summary

  • Detect and neutralize prompt injection attempts in PR diffs before LLM processing
  • Covers system prompt overrides, role switches, instruction injection, delimiter manipulation, response injection, and Unicode attacks
  • Integrated into the LLM reviewer pipeline with logging of detected attempts

Test plan

  • 23 tests covering adversarial injection patterns
  • False positive tests for legitimate code patterns
  • Sanitization verification (injections replaced, code preserved)
  • Full test suite passes (377 tests, 95.55% coverage)

Closes #37

🤖 Generated with Claude Code

- Detect system prompt overrides, role switches, instruction injection,
  delimiter manipulation, response injection, and Unicode attacks
- Sanitize detected patterns before sending to LLM
- Integrate sanitizer into LLM reviewer pipeline
- 23 tests covering adversarial examples and false positive avoidance

Implements #37

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@github-actions
Copy link

AI Code Review

Summary

This PR adds prompt injection detection and sanitization to protect against malicious attempts to manipulate the LLM reviewer. The implementation is comprehensive but has several critical issues in the test file that prevent meaningful validation.

Confidence: 0.00 (low)
This PR may require human review

Strengths

  • Comprehensive injection pattern detection covering multiple attack vectors
  • Good separation of concerns with dedicated sanitizer module
  • Proper integration into the LLM reviewer pipeline with logging
  • Unicode attack detection is thorough and covers various control characters
  • Preserves legitimate code while neutralizing malicious patterns
  • Well-structured dataclasses for representing results

Issues Found

Critical (12)

tests/test_sanitizer.py:11 - testing

Test is creating a diff that already contains sanitized content, making it impossible to test actual injection detection. The test will always pass because it's looking for detection of already-sanitized markers.
Suggestion: Replace with actual injection patterns to test detection before sanitization

tests/test_sanitizer.py:18 - testing

Similar issue - testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real injection pattern like 'you are now an assistant'

tests/test_sanitizer.py:25 - testing

Testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real injection pattern to test detection

tests/test_sanitizer.py:33 - testing

Testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real injection pattern

tests/test_sanitizer.py:40 - testing

Testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real delimiter manipulation pattern

tests/test_sanitizer.py:49 - testing

Testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real response injection pattern

tests/test_sanitizer.py:57 - testing

Testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real assistant role injection pattern

tests/test_sanitizer.py:65 - testing

Testing pre-sanitized content within Unicode injection test
Suggestion: Use real injection pattern with Unicode characters

tests/test_sanitizer.py:131 - testing

Testing sanitization of already-sanitized content
Suggestion: Use real injection pattern to test the sanitization process

tests/test_sanitizer.py:150 - testing

Testing multiple pre-sanitized injections instead of real patterns
Suggestion: Use actual injection patterns to test multiple detections

tests/test_sanitizer.py:175 - testing

Testing pre-sanitized content instead of actual injection
Suggestion: Use real injection pattern

tests/test_sanitizer.py:188 - testing

Testing pre-sanitized content instead of actual injection
Suggestion: Use real injection pattern

Major (2)

src/pr_review_agent/review/sanitizer.py:59 - logic

The regex replacement text references itself in a circular manner. The description says 'Attempt to [SANITIZED:instruction_injection]' which would create confusing output.
Suggestion: Change the description to something more descriptive like 'Attempt to override previous instructions'

src/pr_review_agent/review/sanitizer.py:181 - performance

The nested loop structure could lead to performance issues with large diffs. Each line is checked against all patterns sequentially, and string replacements create new strings each time.
Suggestion: Consider optimizing by combining patterns into a single regex with named groups, or breaking early when patterns are found if multiple replacements on the same line aren't needed

Minor (2)

src/pr_review_agent/review/sanitizer.py:181 - logic

Multiple injection patterns on the same line will result in multiple replacements and multiple InjectionAttempt records. This might not be the intended behavior.
Suggestion: Consider whether to break after first match or continue processing all patterns. Document the intended behavior.

src/pr_review_agent/review/sanitizer.py:145 - logic

The function only processes lines starting with '+' but not lines starting with '-'. While this makes sense for injection detection (you want to scan added content), it might miss injection attempts in removed lines that could still be visible in the diff context.
Suggestion: Consider if removed lines should also be scanned, or document why only added lines are processed

Concerns

  • The test suite appears to be fundamentally flawed - most tests are checking detection of already-sanitized content rather than actual injection patterns
  • Without proper tests, it's impossible to verify that the injection patterns actually work as intended
  • Performance could be a concern with large diffs due to nested pattern matching
  • The circular reference in one of the pattern descriptions could cause confusion

Questions

  • Have you actually tested this with real injection attempts, given that most tests use pre-sanitized content?
  • Should the sanitizer also scan removed lines (-) or context lines for injection attempts?
  • What's the expected behavior when multiple injection patterns are found on the same line?
  • Have you considered the performance impact on very large diffs?

Model: claude-sonnet-4-20250514 | Cost: $0.0539 |
Tokens: 5888 in / 2416 out

@dtsong dtsong merged commit 33d7015 into main Jan 24, 2026
2 checks passed
@dtsong dtsong deleted the feat/37-prompt-injection-defense branch January 24, 2026 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Prompt injection defense for PR diffs

2 participants