feat: add prompt injection defense for PR diffs #47

dtsong · 2026-01-24T20:00:54Z

Summary

Detect and neutralize prompt injection attempts in PR diffs before LLM processing
Covers system prompt overrides, role switches, instruction injection, delimiter manipulation, response injection, and Unicode attacks
Integrated into the LLM reviewer pipeline with logging of detected attempts

Test plan

23 tests covering adversarial injection patterns
False positive tests for legitimate code patterns
Sanitization verification (injections replaced, code preserved)
Full test suite passes (377 tests, 95.55% coverage)

Closes #37

🤖 Generated with Claude Code

- Detect system prompt overrides, role switches, instruction injection, delimiter manipulation, response injection, and Unicode attacks - Sanitize detected patterns before sending to LLM - Integrate sanitizer into LLM reviewer pipeline - 23 tests covering adversarial examples and false positive avoidance Implements #37 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

github-actions · 2026-01-24T20:02:19Z

AI Code Review

Summary

This PR adds prompt injection detection and sanitization to protect against malicious attempts to manipulate the LLM reviewer. The implementation is comprehensive but has several critical issues in the test file that prevent meaningful validation.

Confidence: 0.00 (low)
This PR may require human review

Strengths

Comprehensive injection pattern detection covering multiple attack vectors
Good separation of concerns with dedicated sanitizer module
Proper integration into the LLM reviewer pipeline with logging
Unicode attack detection is thorough and covers various control characters
Preserves legitimate code while neutralizing malicious patterns
Well-structured dataclasses for representing results

Issues Found

Critical (12)

tests/test_sanitizer.py:11 - testing

Test is creating a diff that already contains sanitized content, making it impossible to test actual injection detection. The test will always pass because it's looking for detection of already-sanitized markers.
Suggestion: Replace with actual injection patterns to test detection before sanitization

tests/test_sanitizer.py:18 - testing

Similar issue - testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real injection pattern like 'you are now an assistant'

tests/test_sanitizer.py:25 - testing

Testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real injection pattern to test detection

tests/test_sanitizer.py:33 - testing

Testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real injection pattern

tests/test_sanitizer.py:40 - testing

Testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real delimiter manipulation pattern

tests/test_sanitizer.py:49 - testing

Testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real response injection pattern

tests/test_sanitizer.py:57 - testing

Testing pre-sanitized content instead of actual injection patterns
Suggestion: Use real assistant role injection pattern

tests/test_sanitizer.py:65 - testing

Testing pre-sanitized content within Unicode injection test
Suggestion: Use real injection pattern with Unicode characters

tests/test_sanitizer.py:131 - testing

Testing sanitization of already-sanitized content
Suggestion: Use real injection pattern to test the sanitization process

tests/test_sanitizer.py:150 - testing

Testing multiple pre-sanitized injections instead of real patterns
Suggestion: Use actual injection patterns to test multiple detections

tests/test_sanitizer.py:175 - testing

Testing pre-sanitized content instead of actual injection
Suggestion: Use real injection pattern

tests/test_sanitizer.py:188 - testing

Testing pre-sanitized content instead of actual injection
Suggestion: Use real injection pattern

Major (2)

src/pr_review_agent/review/sanitizer.py:59 - logic

The regex replacement text references itself in a circular manner. The description says 'Attempt to [SANITIZED:instruction_injection]' which would create confusing output.
Suggestion: Change the description to something more descriptive like 'Attempt to override previous instructions'

src/pr_review_agent/review/sanitizer.py:181 - performance

The nested loop structure could lead to performance issues with large diffs. Each line is checked against all patterns sequentially, and string replacements create new strings each time.
Suggestion: Consider optimizing by combining patterns into a single regex with named groups, or breaking early when patterns are found if multiple replacements on the same line aren't needed

Minor (2)

src/pr_review_agent/review/sanitizer.py:181 - logic

Multiple injection patterns on the same line will result in multiple replacements and multiple InjectionAttempt records. This might not be the intended behavior.
Suggestion: Consider whether to break after first match or continue processing all patterns. Document the intended behavior.

src/pr_review_agent/review/sanitizer.py:145 - logic

The function only processes lines starting with '+' but not lines starting with '-'. While this makes sense for injection detection (you want to scan added content), it might miss injection attempts in removed lines that could still be visible in the diff context.
Suggestion: Consider if removed lines should also be scanned, or document why only added lines are processed

Concerns

The test suite appears to be fundamentally flawed - most tests are checking detection of already-sanitized content rather than actual injection patterns
Without proper tests, it's impossible to verify that the injection patterns actually work as intended
Performance could be a concern with large diffs due to nested pattern matching
The circular reference in one of the pattern descriptions could cause confusion

Questions

Have you actually tested this with real injection attempts, given that most tests use pre-sanitized content?
Should the sanitizer also scan removed lines (-) or context lines for injection attempts?
What's the expected behavior when multiple injection patterns are found on the same line?
Have you considered the performance impact on very large diffs?

_{Model: claude-sonnet-4-20250514 | Cost: $0.0539 |

Tokens: 5888 in / 2416 out}

dtsong merged commit 33d7015 into main Jan 24, 2026
2 checks passed

dtsong deleted the feat/37-prompt-injection-defense branch January 24, 2026 20:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add prompt injection defense for PR diffs #47

feat: add prompt injection defense for PR diffs #47

Uh oh!

dtsong commented Jan 24, 2026

Uh oh!

github-actions bot commented Jan 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: add prompt injection defense for PR diffs #47

feat: add prompt injection defense for PR diffs #47

Uh oh!

Conversation

dtsong commented Jan 24, 2026

Summary

Test plan

Uh oh!

github-actions bot commented Jan 24, 2026

AI Code Review

Summary

Strengths

Issues Found

Critical (12)

Major (2)

Minor (2)

Concerns

Questions

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants