A comprehensive Python tool for detecting and extracting repeating patterns, keywords, and structural similarities in text data. This analyzer provides advanced text mining capabilities with support for multiple input formats and configurable analysis parameters.
- N-gram Analysis: Configurable bigrams, trigrams, and custom n-grams
- Phrase Pattern Recognition: Identifies repeating phrases of varying lengths
- Sentence Structure Analysis: Detects recurring sentence patterns and structures
- Statistical Anomaly Detection: Finds words with unusual frequency distributions using z-score analysis
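The z-score approach behind the anomaly detection can be sketched in a few lines (an independent illustration, not the analyzer's internal code): count word frequencies, then flag words whose frequency lies more than a threshold number of standard deviations from the mean frequency.

```python
from collections import Counter
import statistics

def zscore_anomalies(words, threshold=2.0):
    """Flag words whose frequency deviates strongly from the mean frequency."""
    freqs = Counter(words)
    values = list(freqs.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:  # all frequencies identical: nothing is anomalous
        return {}
    return {
        word: {"frequency": f, "z_score": (f - mean) / stdev}
        for word, f in freqs.items()
        if abs(f - mean) / stdev > threshold
    }

words = ("the " * 20 + "cat sat on a mat near a dog").split()
print(zscore_anomalies(words))  # only "the" stands out
```

Raising the threshold makes detection stricter; lowering it (as in the sensitivity example later in this document) surfaces milder outliers.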
- Plain Text Files: Direct analysis of .txt files
- CSV Files: Extract and analyze text from specific columns
- String Input: Direct string processing for programmatic use
- Large File Handling: Efficient memory usage for processing large documents
- JSON: Structured data export with full analysis details
- CSV: Tabular format for spreadsheet analysis
- Plain Text: Human-readable reports with formatted output
- Highlighted Text: Visual pattern highlighting within original content
- Case Sensitivity: Preserve or normalize text case
- Punctuation Handling: Include or exclude punctuation in analysis
- Pattern Length Control: Set minimum and maximum n-gram sizes
- Frequency Thresholds: Filter patterns by occurrence frequency
- Anomaly Sensitivity: Adjust statistical anomaly detection thresholds
- Python 3.7 or higher
- No external dependencies required (uses only Python standard library)
from text_pattern_analyzer import TextPatternAnalyzer
# Initialize the analyzer
analyzer = TextPatternAnalyzer()
# Analyze text directly
text = "Your text content here..."
results = analyzer.analyze_text(text)
# Print results
print(analyzer.export_results(results, 'txt'))

# Analyze a text file
results = analyzer.analyze_file('document.txt')
# Analyze CSV file (specify text column)
results = analyzer.analyze_file('data.csv', text_column='content')
# Save results to file
analyzer.save_results(results, 'analysis_report.json', 'json')

# Create analyzer with custom settings
analyzer = TextPatternAnalyzer(
preserve_case=True, # Keep original case
preserve_punctuation=False # Remove punctuation
)
# Perform detailed analysis
results = analyzer.analyze_text(
text,
min_ngram=2, # Start with bigrams
max_ngram=6, # Up to 6-word phrases
min_frequency=3, # Minimum 3 occurrences
detect_anomalies=True, # Find statistical outliers
anomaly_threshold=2.5, # Z-score threshold
track_locations=True # Track pattern positions
)

Main analyzer class that orchestrates the analysis process.
TextPatternAnalyzer(preserve_case=False, preserve_punctuation=False)

analyze_text(): performs comprehensive text analysis on an input string.
Parameters:
- text (str): Input text to analyze
- min_ngram (int): Minimum n-gram length (default: 2)
- max_ngram (int): Maximum n-gram length (default: 5)
- min_frequency (int): Minimum pattern frequency (default: 2)
- detect_anomalies (bool): Enable anomaly detection (default: True)
- anomaly_threshold (float): Z-score threshold for anomalies (default: 2.0)
- track_locations (bool): Track pattern locations (default: True)
Returns: Dictionary containing analysis results
analyze_file(): analyzes text from a file (supports .txt and .csv).
Parameters:
- file_path (str): Path to input file
- text_column (str): CSV column name for text data (CSV files only)
- Additional parameters same as analyze_text()
highlight_text(): highlights specified patterns within text.
Parameters:
- text (str): Original text
- patterns (List[str]): List of patterns to highlight
- highlight_format (str): Format string for highlighting
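A stand-alone sketch of what such highlighting amounts to (an illustration using `re.sub`, not the library's actual implementation): each case-insensitive occurrence of a pattern is wrapped in the format string, with longer patterns applied first so they are not split by shorter ones.

```python
import re

def highlight(text, patterns, highlight_format="[{}]"):
    """Wrap each case-insensitive occurrence of a pattern in the format string.
    Longer patterns are applied first so they are not split by shorter ones."""
    for pattern in sorted(patterns, key=len, reverse=True):
        text = re.sub(
            re.escape(pattern),
            lambda m: highlight_format.format(m.group(0)),  # preserve original casing
            text,
            flags=re.IGNORECASE,
        )
    return text

print(highlight("Machine learning powers modern machine learning tools.",
                ["machine learning"]))
# → [Machine learning] powers modern [machine learning] tools.
```

Using a replacement function (rather than a fixed string) keeps the original casing of each match intact.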
export_results(): exports results in the specified format.
Parameters:
- results (dict): Analysis results
- output_format (str): 'json', 'csv', or 'txt'
save_results(): saves results to a file. A typical results dictionary looks like this:
{
"text_stats": {
"total_characters": 1500,
"total_words": 250,
"total_sentences": 15
},
"phrase_patterns": {
"machine learning": {
"frequency": 5,
"length": 2,
"words": ["machine", "learning"],
"locations": [
{
"line": 3,
"char_position": 45,
"context": "Machine learning algorithms are powerful...",
"pattern_start": 0,
"pattern_end": 16
}
]
}
},
"sentence_structures": {
"WORD-WORD-WORD-WORD": [
{
"sentence": "Data science is important",
"position": 2,
"word_count": 4
}
]
},
"anomalies": {
"algorithm": {
"frequency": 15,
"z_score": 3.2,
"type": "high_frequency"
}
}
}

from text_pattern_analyzer import TextPatternAnalyzer
# Read document
with open('research_paper.txt', 'r') as f:
document = f.read()
# Analyze
analyzer = TextPatternAnalyzer()
results = analyzer.analyze_text(
document,
min_ngram=3,
max_ngram=5,
min_frequency=2
)
# Export to different formats
json_report = analyzer.export_results(results, 'json')
text_report = analyzer.export_results(results, 'txt')
csv_report = analyzer.export_results(results, 'csv')
print(text_report)

# Analyze customer feedback from CSV
results = analyzer.analyze_file(
'customer_feedback.csv',
text_column='feedback_text',
min_frequency=5,
detect_anomalies=True
)
# Find most common patterns
top_patterns = sorted(
results['phrase_patterns'].items(),
key=lambda x: x[1]['frequency'],
reverse=True
)[:10]
for pattern, data in top_patterns:
print(f"'{pattern}': {data['frequency']} occurrences")

# Find patterns and highlight them
results = analyzer.analyze_text(document)
common_patterns = [
pattern for pattern, data in results['phrase_patterns'].items()
if data['frequency'] >= 3
]
# Create highlighted version
highlighted = analyzer.highlight_text(
document,
common_patterns,
highlight_format="[{}]" # Use brackets for highlighting
)
print(highlighted)

# Focus on statistical anomalies
results = analyzer.analyze_text(
text,
detect_anomalies=True,
anomaly_threshold=1.5 # More sensitive detection
)
# Print anomalous words
for word, data in results['anomalies'].items():
print(f"{word}: {data['frequency']} occurrences "
f"(z-score: {data['z_score']:.2f})")

- Identify recurring themes in literature
- Analyze citation patterns and methodology descriptions
- Detect plagiarism indicators through pattern matching
- Extract common phrases from customer feedback
- Identify trending topics in support tickets
- Analyze marketing copy for consistent messaging
- Find recurring themes in news articles
- Analyze speech patterns in transcripts
- Detect template usage in generated content
- Identify duplicate or semi-duplicate content
- Find systematic errors in text data
- Detect anomalous entries in large datasets
- Uses generators for large file processing
- Streams data where possible to minimize memory footprint
- Configurable analysis depth to balance speed vs. comprehensiveness
- Optimized for large documents (tested on files up to 100MB)
- Efficient n-gram generation using sliding windows
- Parallel processing ready (can be extended with multiprocessing)
- For very large files, consider processing in chunks
- Use higher min_frequency thresholds to reduce noise
- Disable location tracking for faster processing when not needed
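The chunked, sliding-window approach suggested above can be sketched as follows (a simplified illustration; the analyzer's actual internals may differ): consume the input piece by piece, carry an (n-1)-word overlap between pieces so n-grams spanning a boundary are not lost, and accumulate counts in a running Counter.

```python
from collections import Counter

def ngrams(words, n):
    """Sliding-window n-grams over a word list."""
    return zip(*(words[i:] for i in range(n)))

def count_ngrams_chunked(chunks, n=2, min_frequency=2):
    """Count n-grams from an iterable of text chunks, carrying an
    (n-1)-word overlap so phrases spanning two chunks are still counted."""
    counts = Counter()
    carry = []
    for chunk in chunks:
        words = carry + chunk.lower().split()
        counts.update(" ".join(g) for g in ngrams(words, n))
        carry = words[-(n - 1):] if n > 1 else []
    return {g: c for g, c in counts.items() if c >= min_frequency}

chunks = ["machine learning is fun", "and machine", "learning is useful"]
print(count_ngrams_chunked(chunks, n=2))
# → {'machine learning': 2, 'learning is': 2}
```

Because the carry holds only n-1 words, no n-gram fits entirely inside it, so nothing is double-counted at chunk boundaries.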
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure code follows PEP 8 standards
- Submit a pull request
The modular design makes it easy to add new features:
- Add new pattern detection algorithms to PatternDetector
- Implement additional input formats in InputHandler
- Create custom output formatters in OutputFormatter
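As a hypothetical example of the formatter extension point (the function name and the results layout are illustrative, following the example output structure shown earlier rather than a documented interface), a Markdown formatter might look like this:

```python
def format_markdown(results):
    """Render phrase patterns from a results dict as a Markdown table.
    Assumes the 'phrase_patterns' layout shown in the example output."""
    lines = ["| Pattern | Frequency |", "| --- | --- |"]
    patterns = sorted(results["phrase_patterns"].items(),
                      key=lambda item: item[1]["frequency"], reverse=True)
    for pattern, data in patterns:
        lines.append(f"| {pattern} | {data['frequency']} |")
    return "\n".join(lines)

results = {"phrase_patterns": {"machine learning": {"frequency": 5},
                               "data science": {"frequency": 3}}}
print(format_markdown(results))
```

A formatter like this only needs read access to the results dictionary, which is what keeps the output layer independent of the detection logic.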
This project is open source. Feel free to use, modify, and distribute according to your needs.
For questions, bug reports, or feature requests, please create an issue in the project repository or contact the development team.
Version: 1.0.0
Python Compatibility: 3.7+
Dependencies: None (pure Python standard library)