
Conversation

@johnzfitch
Owner

  • Add manual_analysis.py script for watermark detection when spaCy model unavailable
  • Include sample article text (LLM learning research) for analysis
  • Generate detailed JSON analysis output with 46 clause pairs analyzed
  • Result: LIKELY_HUMAN verdict with 0.209 final score (below 0.45 threshold)

Copilot AI review requested due to automatic review settings November 22, 2025 12:34
@johnzfitch merged commit 728dbdc into main Nov 22, 2025
3 checks passed

Copilot AI left a comment


Pull request overview

This PR adds a manual Echo Rule watermark analysis capability for detecting AI-generated text when the full spaCy NLP model is unavailable. The implementation analyzes phonetic, structural, and semantic "echoes" at clause boundaries to determine if text exhibits watermark patterns characteristic of LLM-generated content.

  • Implements a standalone watermark detection script with fallback dependencies (cmudict, Levenshtein)
  • Includes sample analysis of an LLM learning research article with 46 clause pairs analyzed
  • Provides detailed JSON output showing LIKELY_HUMAN verdict (0.209 score below 0.45 threshold)
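For orientation, a minimal sketch of how per-pair echo scores might roll up into that final verdict is shown below. The class and function names, the equal component weighting, and the LIKELY_WATERMARKED / NO_DATA labels are assumptions made for illustration; only the 0.45 threshold, the LIKELY_HUMAN label, and the 0.209 score over 46 pairs come from the PR itself.

from dataclasses import dataclass
from typing import List

THRESHOLD = 0.45  # decision threshold reported in the PR description


@dataclass
class ClausePairScore:
    """Hypothetical per-pair component scores for one clause boundary."""
    phonetic: float
    structural: float
    semantic: float

    def combined(self) -> float:
        # Equal weighting is an assumption; the actual script may weight components differently.
        return (self.phonetic + self.structural + self.semantic) / 3.0


def classify(pairs: List[ClausePairScore]) -> str:
    """Average the per-pair scores and compare against the 0.45 threshold."""
    if not pairs:
        return "NO_DATA"  # hypothetical label for empty input
    final_score = sum(p.combined() for p in pairs) / len(pairs)
    # The PR reports a final score of 0.209 over 46 pairs, below 0.45 -> LIKELY_HUMAN.
    return "LIKELY_HUMAN" if final_score < THRESHOLD else "LIKELY_WATERMARKED"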

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 5 comments.

File descriptions:
  • scripts/manual_analysis.py: Core analysis script implementing phonetic, structural, and semantic echo detection with pattern-matching algorithms
  • data/analysis_output.json: Generated analysis results with detailed scoring for 46 clause pairs and the final classification verdict
  • data/analysis_input.txt: Sample input text (research article on LLM learning) used to demonstrate the watermark analysis
Comments suppressed due to low confidence (1)

scripts/manual_analysis.py:39

  • Import of 'np' is not used.
    import numpy as np


Comment on lines +604 to +620
input_file = Path('/home/user/specHO/data/analysis_input.txt')

if not input_file.exists():
    print("Error: Input file not found")
    sys.exit(1)

text = input_file.read_text()
print(f"Analyzing text ({len(text):,} characters)...")

# Run analysis
report = analyze_text(text)

# Print report
print_report(report, verbose=True)

# Save JSON report
output_file = Path('/home/user/specHO/data/analysis_output.json')

Copilot AI Nov 22, 2025


Hardcoded absolute paths make the script non-portable. Consider using relative paths or command-line arguments to allow users to specify input/output file locations.

Example fix:

import argparse

def main():
    parser = argparse.ArgumentParser(description='Manual Echo Rule Watermark Analysis')
    parser.add_argument('--input', default='data/analysis_input.txt', help='Input text file')
    parser.add_argument('--output', default='data/analysis_output.json', help='Output JSON file')
    args = parser.parse_args()
    
    input_file = Path(args.input)
    output_file = Path(args.output)
    # ...
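A zero-argument invocation could also be preserved by resolving the defaults relative to the script's location instead (for example, Path(__file__).resolve().parent.parent / 'data' / 'analysis_input.txt'), which avoids the hardcoded /home/user prefix without requiring flags; treating the data directory as a sibling of scripts/ is an assumption about the repository layout.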

Comment on lines +38 to +42
try:
    import numpy as np
    HAS_NUMPY = True
except ImportError:
    HAS_NUMPY = False

Copilot AI Nov 22, 2025


The numpy import is unused throughout the script. The HAS_NUMPY flag is set but never checked or used in any conditional logic. Consider removing this unused import to reduce dependencies.

Suggested change
Remove:
try:
    import numpy as np
    HAS_NUMPY = True
except ImportError:
    HAS_NUMPY = False

from pathlib import Path
from typing import List, Dict, Tuple, Optional
from dataclasses import dataclass, field
from collections import Counter

Copilot AI Nov 22, 2025


The Counter import from collections is unused in the script. Consider removing this import.

Suggested change
Remove:
from collections import Counter

Comment on lines +147 to +163
for sentence in sentences:
    # Rule A: Punctuation-linked clauses (semicolon, em-dash, colon)
    for sep in [';', ' – ', ': ', ' — ']:
        if sep in sentence:
            parts = sentence.split(sep)
            for i in range(len(parts) - 1):
                if len(parts[i].strip()) > 10 and len(parts[i+1].strip()) > 10:
                    pairs.append(create_clause_pair(parts[i], parts[i+1], sep))

    # Rule B: Conjunction-linked clauses
    conj_pattern = r',?\s*(but|and|or|yet)\s+'
    matches = list(re.finditer(conj_pattern, sentence, re.IGNORECASE))
    for match in matches:
        before = sentence[:match.start()]
        after = sentence[match.end():]
        if len(before.strip()) > 10 and len(after.strip()) > 10:
            pairs.append(create_clause_pair(before, after, match.group(1)))

Copilot AI Nov 22, 2025


The clause splitting logic can create duplicate pairs when a sentence contains multiple separators. For example, if a sentence has both a semicolon and an "and" conjunction, it may generate overlapping clause pairs. Consider deduplicating pairs or tracking already-processed text spans to avoid analyzing the same text multiple times.
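One possible shape for that deduplication, reusing the Rule A / Rule B separators from the snippet above, is sketched below. It is a sketch only: it yields plain (left, right, separator) tuples instead of calling the script's create_clause_pair, and deduplicating on normalized clause text is an assumption; tracking character spans would be a stricter alternative.

import re

RULE_A_SEPARATORS = [';', ' – ', ': ', ' — ']
CONJ_PATTERN = r',?\s*(but|and|or|yet)\s+'


def clause_pair_candidates(sentence: str):
    """Yield (left, right, separator) candidates from both splitting rules."""
    for sep in RULE_A_SEPARATORS:
        if sep in sentence:
            parts = sentence.split(sep)
            for i in range(len(parts) - 1):
                yield parts[i], parts[i + 1], sep
    for match in re.finditer(CONJ_PATTERN, sentence, re.IGNORECASE):
        yield sentence[:match.start()], sentence[match.end():], match.group(1)


def deduplicated_clause_pairs(sentence: str, min_len: int = 10):
    """Drop candidates whose normalized clause text was already paired,
    so overlapping Rule A / Rule B matches are analyzed only once."""
    seen = set()
    for left, right, sep in clause_pair_candidates(sentence):
        left, right = left.strip(), right.strip()
        if len(left) <= min_len or len(right) <= min_len:
            continue
        key = (left.lower(), right.lower())
        if key in seen:
            continue
        seen.add(key)
        yield left, right, sep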

Comment on lines +645 to +649
with open(output_file, 'w') as f:
    json.dump(report_dict, f, indent=2, default=str)

print(f"\nJSON report saved to: {output_file}")

Copilot AI Nov 22, 2025


Missing error handling for file I/O operations. If the output directory doesn't exist or if there are permission issues, the script will crash with an unhandled exception. Consider adding proper error handling:

try:
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, 'w') as f:
        json.dump(report_dict, f, indent=2, default=str)
    print(f"\nJSON report saved to: {output_file}")
except (IOError, PermissionError) as e:
    print(f"Error saving output file: {e}", file=sys.stderr)
    sys.exit(1)
Suggested change
Replace:
with open(output_file, 'w') as f:
    json.dump(report_dict, f, indent=2, default=str)
print(f"\nJSON report saved to: {output_file}")
With:
try:
    output_file.parent.mkdir(parents=True, exist_ok=True)
    with open(output_file, 'w') as f:
        json.dump(report_dict, f, indent=2, default=str)
    print(f"\nJSON report saved to: {output_file}")
except (IOError, PermissionError) as e:
    print(f"Error saving output file: {e}", file=sys.stderr)
    sys.exit(1)
