Add Echo Rule watermark analysis for LLM learning article #8
Conversation
johnzfitch commented on Nov 22, 2025
- Add manual_analysis.py script for watermark detection when spaCy model unavailable
- Include sample article text (LLM learning research) for analysis
- Generate detailed JSON analysis output with 46 clause pairs analyzed
- Result: LIKELY_HUMAN verdict with 0.209 final score (below 0.45 threshold)
Pull request overview
This PR adds a manual Echo Rule watermark analysis capability for detecting AI-generated text when the full spaCy NLP model is unavailable. The implementation analyzes phonetic, structural, and semantic "echoes" at clause boundaries to determine if text exhibits watermark patterns characteristic of LLM-generated content.
- Implements a standalone watermark detection script with fallback dependencies (cmudict, Levenshtein)
- Includes sample analysis of an LLM learning research article with 46 clause pairs analyzed
- Provides detailed JSON output showing LIKELY_HUMAN verdict (0.209 score below 0.45 threshold)
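The fallback dependencies hint at how per-pair echo scores might be computed. The sketch below is illustrative only, not the PR's implementation: `structural_echo` and `phonetic_echo` are hypothetical names, and stdlib `difflib` stands in for the Levenshtein dependency (the cmudict phoneme lookup is omitted entirely).

```python
import difflib

def structural_echo(a: str, b: str) -> float:
    """Hypothetical structural echo: ratio of the two clauses' token counts."""
    la, lb = len(a.split()), len(b.split())
    return min(la, lb) / max(la, lb)

def phonetic_echo(a: str, b: str) -> float:
    """Stand-in for a Levenshtein-style phonetic check: compare the clauses'
    final words using stdlib difflib instead of the Levenshtein package."""
    wa, wb = a.split()[-1].lower(), b.split()[-1].lower()
    return difflib.SequenceMatcher(None, wa, wb).ratio()

pair = ("models memorize surface patterns", "they generalize deeper structures")
print(round(structural_echo(*pair), 3))  # 1.0 (both clauses are four tokens)
```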
Reviewed changes
Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| scripts/manual_analysis.py | Core analysis script implementing phonetic, structural, and semantic echo detection with pattern matching algorithms |
| data/analysis_output.json | Generated analysis results with detailed scoring for 46 clause pairs and final classification verdict |
| data/analysis_input.txt | Sample input text (research article on LLM learning) used for watermark analysis demonstration |
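The final classification step described above reduces to a threshold comparison. A minimal sketch: the 0.45 threshold and the LIKELY_HUMAN label come from the PR description, but the label used for above-threshold scores is an assumption, since only the human verdict appears in this output.

```python
def classify(final_score: float, threshold: float = 0.45) -> str:
    """Map a final echo score to a verdict. The above-threshold label is
    assumed; only LIKELY_HUMAN is attested in the PR's sample output."""
    return "LIKELY_HUMAN" if final_score < threshold else "LIKELY_WATERMARKED"

print(classify(0.209))  # LIKELY_HUMAN, matching the reported verdict
```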
Comments suppressed due to low confidence (1)
scripts/manual_analysis.py:39: Import of 'np' is not used.

```python
import numpy as np
```
```python
input_file = Path('/home/user/specHO/data/analysis_input.txt')

if not input_file.exists():
    print("Error: Input file not found")
    sys.exit(1)

text = input_file.read_text()
print(f"Analyzing text ({len(text):,} characters)...")

# Run analysis
report = analyze_text(text)

# Print report
print_report(report, verbose=True)

# Save JSON report
output_file = Path('/home/user/specHO/data/analysis_output.json')
```
Copilot AI commented on Nov 22, 2025
Hardcoded absolute paths make the script non-portable. Consider using relative paths or command-line arguments to allow users to specify input/output file locations.
Example fix:

```python
import argparse

def main():
    parser = argparse.ArgumentParser(description='Manual Echo Rule Watermark Analysis')
    parser.add_argument('--input', default='data/analysis_input.txt', help='Input text file')
    parser.add_argument('--output', default='data/analysis_output.json', help='Output JSON file')
    args = parser.parse_args()

    input_file = Path(args.input)
    output_file = Path(args.output)
    # ...
```

```python
try:
    import numpy as np
    HAS_NUMPY = True
except ImportError:
    HAS_NUMPY = False
```
Copilot AI commented on Nov 22, 2025
The numpy import is unused throughout the script. The HAS_NUMPY flag is set but never checked or used in any conditional logic. Consider removing this unused import to reduce dependencies.
Suggested change: delete the entire `try`/`except` block shown above.
```python
from pathlib import Path
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass, field
from collections import Counter
```
Copilot AI commented on Nov 22, 2025
The Counter import from collections is unused in the script. Consider removing this import.
Suggested change: remove the `from collections import Counter` line.
```python
for sentence in sentences:
    # Rule A: Punctuation-linked clauses (semicolon, em-dash, colon)
    for sep in [';', ' – ', ': ', ' — ']:
        if sep in sentence:
            parts = sentence.split(sep)
            for i in range(len(parts) - 1):
                if len(parts[i].strip()) > 10 and len(parts[i+1].strip()) > 10:
                    pairs.append(create_clause_pair(parts[i], parts[i+1], sep))

    # Rule B: Conjunction-linked clauses
    conj_pattern = r',?\s*(but|and|or|yet)\s+'
    matches = list(re.finditer(conj_pattern, sentence, re.IGNORECASE))
    for match in matches:
        before = sentence[:match.start()]
        after = sentence[match.end():]
        if len(before.strip()) > 10 and len(after.strip()) > 10:
            pairs.append(create_clause_pair(before, after, match.group(1)))
```
Copilot AI commented on Nov 22, 2025
The clause splitting logic can create duplicate pairs when a sentence contains multiple separators. For example, if a sentence has both a semicolon and an "and" conjunction, it may generate overlapping clause pairs. Consider deduplicating pairs or tracking already-processed text spans to avoid analyzing the same text multiple times.
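One way to implement the deduplication the comment asks for is to key each pair on its normalized clause text. A minimal sketch, assuming a hypothetical `add_pair` helper in place of the script's direct `pairs.append(...)` calls:

```python
def add_pair(pairs, seen, before, after, link):
    """Record a clause pair only if its normalized text hasn't been seen,
    so overlapping separator/conjunction splits aren't analyzed twice."""
    key = (before.strip().lower(), after.strip().lower())
    if key in seen:
        return
    seen.add(key)
    pairs.append((before.strip(), after.strip(), link))

pairs, seen = [], set()
add_pair(pairs, seen, "models memorize patterns", "they generalize", ";")
add_pair(pairs, seen, "Models memorize patterns ", "they generalize", "and")  # same text, skipped
print(len(pairs))  # 1
```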
```python
with open(output_file, 'w') as f:
    json.dump(report_dict, f, indent=2, default=str)

print(f"\nJSON report saved to: {output_file}")
```
Copilot AI commented on Nov 22, 2025
Missing error handling for file I/O operations. If the output directory doesn't exist or if there are permission issues, the script will crash with an unhandled exception. Consider adding proper error handling:

```python
try:
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, 'w') as f:
        json.dump(report_dict, f, indent=2, default=str)
    print(f"\nJSON report saved to: {output_file}")
except (IOError, PermissionError) as e:
    print(f"Error saving output file: {e}", file=sys.stderr)
    sys.exit(1)
```