A comprehensive Python tool for detecting and extracting repeating patterns, keywords, and structural similarities in text data. This analyzer provides advanced text mining capabilities with support for multiple input formats and configurable analysis parameters.
- N-gram Analysis: Configurable bigrams, trigrams, and custom n-grams
- Phrase Pattern Recognition: Identifies repeating phrases of varying lengths
- Sentence Structure Analysis: Detects recurring sentence patterns and structures
- Statistical Anomaly Detection: Finds words with unusual frequency distributions using z-score analysis
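The z-score approach behind the anomaly detection can be sketched in a few lines (an independent illustration, not the analyzer's internal code): count word frequencies, then flag words whose frequency lies more than a threshold number of standard deviations from the mean frequency.

```python
from collections import Counter
import statistics

def zscore_anomalies(words, threshold=2.0):
    """Flag words whose frequency deviates strongly from the mean frequency."""
    freqs = Counter(words)
    values = list(freqs.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:  # all frequencies identical: nothing is anomalous
        return {}
    return {
        word: {"frequency": f, "z_score": (f - mean) / stdev}
        for word, f in freqs.items()
        if abs(f - mean) / stdev > threshold
    }

words = ("the " * 20 + "cat sat on a mat near a dog").split()
print(zscore_anomalies(words))  # only "the" stands out
```

Raising the threshold makes detection stricter; lowering it (as in the sensitivity example later in this document) surfaces milder outliers.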
- Plain Text Files: Direct analysis of .txt files
- CSV Files: Extract and analyze text from specific columns
- String Input: Direct string processing for programmatic use
- Large File Handling: Efficient memory usage for processing large documents
- JSON: Structured data export with full analysis details
- CSV: Tabular format for spreadsheet analysis
- Plain Text: Human-readable reports with formatted output
- Highlighted Text: Visual pattern highlighting within original content
- Case Sensitivity: Preserve or normalize text case
- Punctuation Handling: Include or exclude punctuation in analysis
- Pattern Length Control: Set minimum and maximum n-gram sizes
- Frequency Thresholds: Filter patterns by occurrence frequency
- Anomaly Sensitivity: Adjust statistical anomaly detection thresholds
- Python 3.7 or higher
- No external dependencies required (uses only Python standard library)
from text_pattern_analyzer import TextPatternAnalyzer
# Initialize the analyzer
analyzer = TextPatternAnalyzer()
# Analyze text directly
text = "Your text content here..."
results = analyzer.analyze_text(text)
# Print results
print(analyzer.export_results(results, 'txt'))

# Analyze a text file
results = analyzer.analyze_file('document.txt')
# Analyze CSV file (specify text column)
results = analyzer.analyze_file('data.csv', text_column='content')
# Save results to file
analyzer.save_results(results, 'analysis_report.json', 'json')

# Create analyzer with custom settings
analyzer = TextPatternAnalyzer(
preserve_case=True, # Keep original case
preserve_punctuation=False # Remove punctuation
)
# Perform detailed analysis
results = analyzer.analyze_text(
text,
min_ngram=2, # Start with bigrams
max_ngram=6, # Up to 6-word phrases
min_frequency=3, # Minimum 3 occurrences
detect_anomalies=True, # Find statistical outliers
anomaly_threshold=2.5, # Z-score threshold
track_locations=True # Track pattern positions
)

Main analyzer class that orchestrates the analysis process.
TextPatternAnalyzer(preserve_case=False, preserve_punctuation=False)

analyze_text(): performs comprehensive text analysis on an input string.
Parameters:
- text (str): Input text to analyze
- min_ngram (int): Minimum n-gram length (default: 2)
- max_ngram (int): Maximum n-gram length (default: 5)
- min_frequency (int): Minimum pattern frequency (default: 2)
- detect_anomalies (bool): Enable anomaly detection (default: True)
- anomaly_threshold (float): Z-score threshold for anomalies (default: 2.0)
- track_locations (bool): Track pattern locations (default: True)
Returns: Dictionary containing analysis results
analyze_file(): analyzes text from a file (supports .txt and .csv).
Parameters:
- file_path (str): Path to input file
- text_column (str): CSV column name for text data (CSV files only)
- Additional parameters same as analyze_text()
highlight_text(): highlights specified patterns within text.
Parameters:
- text (str): Original text
- patterns (List[str]): List of patterns to highlight
- highlight_format (str): Format string for highlighting
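A stand-alone sketch of what such highlighting amounts to (an illustration using `re.sub`, not the library's actual implementation): each case-insensitive occurrence of a pattern is wrapped in the format string, with longer patterns applied first so they are not split by shorter ones.

```python
import re

def highlight(text, patterns, highlight_format="[{}]"):
    """Wrap each case-insensitive occurrence of a pattern in the format string.
    Longer patterns are applied first so they are not split by shorter ones."""
    for pattern in sorted(patterns, key=len, reverse=True):
        text = re.sub(
            re.escape(pattern),
            lambda m: highlight_format.format(m.group(0)),  # preserve original casing
            text,
            flags=re.IGNORECASE,
        )
    return text

print(highlight("Machine learning powers modern machine learning tools.",
                ["machine learning"]))
# → [Machine learning] powers modern [machine learning] tools.
```

Using a replacement function (rather than a fixed string) keeps the original casing of each match intact.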
export_results(): exports results in the specified format.
Parameters:
- results (dict): Analysis results
- output_format (str): 'json', 'csv', or 'txt'
save_results(): saves results to a file. A typical results dictionary looks like this:
{
"text_stats": {
"total_characters": 1500,
"total_words": 250,
"total_sentences": 15
},
"phrase_patterns": {
"machine learning": {
"frequency": 5,
"length": 2,
"words": ["machine", "learning"],
"locations": [
{
"line": 3,
"char_position": 45,
"context": "Machine learning algorithms are powerful...",
"pattern_start": 0,
"pattern_end": 16
}
]
}
},
"sentence_structures": {
"WORD-WORD-WORD-WORD": [
{
"sentence": "Data science is important",
"position": 2,
"word_count": 4
}
]
},
"anomalies": {
"algorithm": {
"frequency": 15,
"z_score": 3.2,
"type": "high_frequency"
}
}
}

from text_pattern_analyzer import TextPatternAnalyzer
# Read document
with open('research_paper.txt', 'r') as f:
document = f.read()
# Analyze
analyzer = TextPatternAnalyzer()
results = analyzer.analyze_text(
document,
min_ngram=3,
max_ngram=5,
min_frequency=2
)
# Export to different formats
json_report = analyzer.export_results(results, 'json')
text_report = analyzer.export_results(results, 'txt')
csv_report = analyzer.export_results(results, 'csv')
print(text_report)

# Analyze customer feedback from CSV
results = analyzer.analyze_file(
'customer_feedback.csv',
text_column='feedback_text',
min_frequency=5,
detect_anomalies=True
)
# Find most common patterns
top_patterns = sorted(
results['phrase_patterns'].items(),
key=lambda x: x[1]['frequency'],
reverse=True
)[:10]
for pattern, data in top_patterns:
print(f"'{pattern}': {data['frequency']} occurrences")

# Find patterns and highlight them
results = analyzer.analyze_text(document)
common_patterns = [
pattern for pattern, data in results['phrase_patterns'].items()
if data['frequency'] >= 3
]
# Create highlighted version
highlighted = analyzer.highlight_text(
document,
common_patterns,
highlight_format="[{}]" # Use brackets for highlighting
)
print(highlighted)

# Focus on statistical anomalies
results = analyzer.analyze_text(
text,
detect_anomalies=True,
anomaly_threshold=1.5 # More sensitive detection
)
# Print anomalous words
for word, data in results['anomalies'].items():
print(f"{word}: {data['frequency']} occurrences "
f"(z-score: {data['z_score']:.2f})")

- Identify recurring themes in literature
- Analyze citation patterns and methodology descriptions
- Detect plagiarism indicators through pattern matching
- Extract common phrases from customer feedback
- Identify trending topics in support tickets
- Analyze marketing copy for consistent messaging
- Find recurring themes in news articles
- Analyze speech patterns in transcripts
- Detect template usage in generated content
- Identify duplicate or semi-duplicate content
- Find systematic errors in text data
- Detect anomalous entries in large datasets
- Uses generators for large file processing
- Streams data where possible to minimize memory footprint
- Configurable analysis depth to balance speed vs. comprehensiveness
- Optimized for large documents (tested on files up to 100MB)
- Efficient n-gram generation using sliding windows
- Parallel processing ready (can be extended with multiprocessing)
- For very large files, consider processing in chunks
- Use higher min_frequency thresholds to reduce noise
- Disable location tracking for faster processing when not needed
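The chunked, sliding-window approach suggested above can be sketched as follows (a simplified illustration; the analyzer's actual internals may differ): consume the input piece by piece, carry an (n-1)-word overlap between pieces so n-grams spanning a boundary are not lost, and accumulate counts in a running Counter.

```python
from collections import Counter

def ngrams(words, n):
    """Sliding-window n-grams over a word list."""
    return zip(*(words[i:] for i in range(n)))

def count_ngrams_chunked(chunks, n=2, min_frequency=2):
    """Count n-grams from an iterable of text chunks, carrying an
    (n-1)-word overlap so phrases spanning two chunks are still counted."""
    counts = Counter()
    carry = []
    for chunk in chunks:
        words = carry + chunk.lower().split()
        counts.update(" ".join(g) for g in ngrams(words, n))
        carry = words[-(n - 1):] if n > 1 else []
    return {g: c for g, c in counts.items() if c >= min_frequency}

chunks = ["machine learning is fun", "and machine", "learning is useful"]
print(count_ngrams_chunked(chunks, n=2))
# → {'machine learning': 2, 'learning is': 2}
```

Because the carry holds only n-1 words, no n-gram fits entirely inside it, so nothing is double-counted at chunk boundaries.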
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure code follows PEP 8 standards
- Submit a pull request
The modular design makes it easy to add new features:
- Add new pattern detection algorithms to PatternDetector
- Implement additional input formats in InputHandler
- Create custom output formatters in OutputFormatter
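As a hypothetical example of the formatter extension point (the function name and the results layout are illustrative, following the example output structure shown earlier rather than a documented interface), a Markdown formatter might look like this:

```python
def format_markdown(results):
    """Render phrase patterns from a results dict as a Markdown table.
    Assumes the 'phrase_patterns' layout shown in the example output."""
    lines = ["| Pattern | Frequency |", "| --- | --- |"]
    patterns = sorted(results["phrase_patterns"].items(),
                      key=lambda item: item[1]["frequency"], reverse=True)
    for pattern, data in patterns:
        lines.append(f"| {pattern} | {data['frequency']} |")
    return "\n".join(lines)

results = {"phrase_patterns": {"machine learning": {"frequency": 5},
                               "data science": {"frequency": 3}}}
print(format_markdown(results))
```

A formatter like this only needs read access to the results dictionary, which is what keeps the output layer independent of the detection logic.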
This project is open source. Feel free to use, modify, and distribute according to your needs.
For questions, bug reports, or feature requests, please create an issue in the project repository or contact the development team.
Version: 1.0.0
Python Compatibility: 3.7+
Dependencies: None (pure Python standard library)