Text Pattern Analyzer

A comprehensive Python tool for detecting and extracting repeating patterns, keywords, and structural similarities in text data. This analyzer provides advanced text mining capabilities with support for multiple input formats and configurable analysis parameters.

Features

πŸ” Pattern Detection

  • N-gram Analysis: Configurable bigrams, trigrams, and custom n-grams
  • Phrase Pattern Recognition: Identifies repeating phrases of varying lengths
  • Sentence Structure Analysis: Detects recurring sentence patterns and structures
  • Statistical Anomaly Detection: Finds words with unusual frequency distributions using z-score analysis
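
The z-score anomaly detection above works on word-frequency statistics: a word is flagged when its count sits unusually far from the mean count. Below is a minimal, self-contained sketch of that idea (not the library's internal code); the 2.0 cutoff mirrors the default anomaly_threshold documented in the API reference.

import statistics
from collections import Counter

def frequency_anomalies(words, threshold=2.0):
    """Return words whose frequency is a statistical outlier by z-score."""
    counts = Counter(words)
    freqs = list(counts.values())
    mean = statistics.mean(freqs)
    stdev = statistics.pstdev(freqs) or 1.0   # guard against zero spread
    return {
        word: {"frequency": n, "z_score": (n - mean) / stdev}
        for word, n in counts.items()
        if abs(n - mean) / stdev >= threshold
    }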

πŸ“ Input Support

  • Plain Text Files: Direct analysis of .txt files
  • CSV Files: Extract and analyze text from specific columns
  • String Input: Direct string processing for programmatic use
  • Large File Handling: Efficient memory usage for processing large documents

📊 Output Formats

  • JSON: Structured data export with full analysis details
  • CSV: Tabular format for spreadsheet analysis
  • Plain Text: Human-readable reports with formatted output
  • Highlighted Text: Visual pattern highlighting within original content

βš™οΈ Configuration Options

  • Case Sensitivity: Preserve or normalize text case
  • Punctuation Handling: Include or exclude punctuation in analysis
  • Pattern Length Control: Set minimum and maximum n-gram sizes
  • Frequency Thresholds: Filter patterns by occurrence frequency
  • Anomaly Sensitivity: Adjust statistical anomaly detection thresholds

Installation

Prerequisites

  • Python 3.7 or higher
  • No external dependencies required (uses only Python standard library)

Quick Start

Basic Usage

from text_pattern_analyzer import TextPatternAnalyzer

# Initialize the analyzer
analyzer = TextPatternAnalyzer()

# Analyze text directly
text = "Your text content here..."
results = analyzer.analyze_text(text)

# Print results
print(analyzer.export_results(results, 'txt'))

Analyze Files

# Analyze a text file
results = analyzer.analyze_file('document.txt')

# Analyze CSV file (specify text column)
results = analyzer.analyze_file('data.csv', text_column='content')

# Save results to file
analyzer.save_results(results, 'analysis_report.json', 'json')

Advanced Configuration

# Create analyzer with custom settings
analyzer = TextPatternAnalyzer(
    preserve_case=True,          # Keep original case
    preserve_punctuation=False   # Remove punctuation
)

# Perform detailed analysis
results = analyzer.analyze_text(
    text,
    min_ngram=2,              # Start with bigrams
    max_ngram=6,              # Up to 6-word phrases
    min_frequency=3,          # Minimum 3 occurrences
    detect_anomalies=True,    # Find statistical outliers
    anomaly_threshold=2.5,    # Z-score threshold
    track_locations=True      # Track pattern positions
)

API Reference

TextPatternAnalyzer

Main analyzer class that orchestrates the analysis process.

Constructor

TextPatternAnalyzer(preserve_case=False, preserve_punctuation=False)

Key Methods

analyze_text(text, **kwargs)

Performs a comprehensive text analysis on an input string.

Parameters:

  • text (str): Input text to analyze
  • min_ngram (int): Minimum n-gram length (default: 2)
  • max_ngram (int): Maximum n-gram length (default: 5)
  • min_frequency (int): Minimum pattern frequency (default: 2)
  • detect_anomalies (bool): Enable anomaly detection (default: True)
  • anomaly_threshold (float): Z-score threshold for anomalies (default: 2.0)
  • track_locations (bool): Track pattern locations (default: True)

Returns: Dictionary containing analysis results

analyze_file(file_path, **kwargs)

Analyzes text from a file (supports .txt and .csv).

Parameters:

  • file_path (str): Path to input file
  • text_column (str): CSV column name for text data (CSV files only)
  • Additional parameters same as analyze_text()

highlight_text(text, patterns, highlight_format="**{}**")

Highlights specified patterns within text.

Parameters:

  • text (str): Original text
  • patterns (List[str]): List of patterns to highlight
  • highlight_format (str): Format string for highlighting

export_results(results, output_format='json')

Exports results in the specified format.

Parameters:

  • results (dict): Analysis results
  • output_format (str): 'json', 'csv', or 'txt'

save_results(results, output_file, output_format='json')

Saves results to a file in the specified format.

Output Structure

Analysis Results Dictionary

{
    "text_stats": {
        "total_characters": 1500,
        "total_words": 250,
        "total_sentences": 15
    },
    "phrase_patterns": {
        "machine learning": {
            "frequency": 5,
            "length": 2,
            "words": ["machine", "learning"],
            "locations": [
                {
                    "line": 3,
                    "char_position": 45,
                    "context": "Machine learning algorithms are powerful...",
                    "pattern_start": 0,
                    "pattern_end": 16
                }
            ]
        }
    },
    "sentence_structures": {
        "WORD-WORD-WORD-WORD": [
            {
                "sentence": "Data science is important",
                "position": 2,
                "word_count": 4
            }
        ]
    },
    "anomalies": {
        "algorithm": {
            "frequency": 15,
            "z_score": 3.2,
            "type": "high_frequency"
        }
    }
}
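
The keys in this sample can be traversed directly; here is a short example (assuming the structure shown above):

# Summarize an analysis results dictionary (keys as shown above)
stats = results.get('text_stats', {})
print(f"{stats.get('total_words', 0)} words, "
      f"{stats.get('total_sentences', 0)} sentences")

# Phrase patterns, most frequent first
for phrase, info in sorted(results.get('phrase_patterns', {}).items(),
                           key=lambda item: item[1]['frequency'],
                           reverse=True):
    first = info['locations'][0] if info.get('locations') else None
    where = f" (first seen on line {first['line']})" if first else ""
    print(f"'{phrase}': {info['frequency']} occurrences{where}")

# Statistical outliers
for word, info in results.get('anomalies', {}).items():
    print(f"anomaly '{word}': z-score {info['z_score']:.1f} ({info['type']})")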

Examples

Example 1: Basic Document Analysis

from text_pattern_analyzer import TextPatternAnalyzer

# Read document
with open('research_paper.txt', 'r') as f:
    document = f.read()

# Analyze
analyzer = TextPatternAnalyzer()
results = analyzer.analyze_text(
    document,
    min_ngram=3,
    max_ngram=5,
    min_frequency=2
)

# Export to different formats
json_report = analyzer.export_results(results, 'json')
text_report = analyzer.export_results(results, 'txt')
csv_report = analyzer.export_results(results, 'csv')

print(text_report)

Example 2: CSV Data Analysis

# Analyze customer feedback from CSV
results = analyzer.analyze_file(
    'customer_feedback.csv',
    text_column='feedback_text',
    min_frequency=5,
    detect_anomalies=True
)

# Find most common patterns
top_patterns = sorted(
    results['phrase_patterns'].items(),
    key=lambda x: x[1]['frequency'],
    reverse=True
)[:10]

for pattern, data in top_patterns:
    print(f"'{pattern}': {data['frequency']} occurrences")

Example 3: Highlighting Patterns

# Find patterns and highlight them
results = analyzer.analyze_text(document)
common_patterns = [
    pattern for pattern, data in results['phrase_patterns'].items()
    if data['frequency'] >= 3
]

# Create highlighted version
highlighted = analyzer.highlight_text(
    document, 
    common_patterns,
    highlight_format="[{}]"  # Use brackets for highlighting
)

print(highlighted)

Example 4: Statistical Analysis

# Focus on statistical anomalies
results = analyzer.analyze_text(
    text,
    detect_anomalies=True,
    anomaly_threshold=1.5  # More sensitive detection
)

# Print anomalous words
for word, data in results['anomalies'].items():
    print(f"{word}: {data['frequency']} occurrences "
          f"(z-score: {data['z_score']:.2f})")

Use Cases

📚 Academic Research

  • Identify recurring themes in literature
  • Analyze citation patterns and methodology descriptions
  • Detect plagiarism indicators through pattern matching

💼 Business Intelligence

  • Extract common phrases from customer feedback
  • Identify trending topics in support tickets
  • Analyze marketing copy for consistent messaging

📰 Content Analysis

  • Find recurring themes in news articles
  • Analyze speech patterns in transcripts
  • Detect template usage in generated content

πŸ” Data Quality

  • Identify duplicate or semi-duplicate content
  • Find systematic errors in text data
  • Detect anomalous entries in large datasets

Performance Considerations

Memory Efficiency

  • Uses generators for large file processing
  • Streams data where possible to minimize memory footprint
  • Configurable analysis depth to balance speed vs. comprehensiveness

Processing Speed

  • Optimized for large documents (tested on files up to 100MB)
  • Efficient n-gram generation using sliding windows
  • Parallel processing ready (can be extended with multiprocessing)

Scalability Tips

  • For very large files, consider processing in chunks (see the sketch after this list)
  • Use higher min_frequency thresholds to reduce noise
  • Disable location tracking for faster processing when not needed
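
Here is a rough sketch of the chunked approach from the first tip (the chunking helper and file name are ours for illustration; only analyze_text is assumed, as documented above). Note that phrases spanning a chunk boundary may be missed:

from collections import Counter

def iter_chunks(path, chunk_chars=200_000):
    """Yield a large text file in roughly chunk_chars-sized pieces."""
    with open(path, 'r', encoding='utf-8') as f:
        while True:
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            yield chunk

combined = Counter()
for chunk in iter_chunks('very_large_corpus.txt'):   # hypothetical file name
    partial = analyzer.analyze_text(
        chunk,
        min_frequency=5,         # higher threshold reduces noise
        track_locations=False    # faster when positions are not needed
    )
    for phrase, info in partial['phrase_patterns'].items():
        combined[phrase] += info['frequency']

print(combined.most_common(10))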

Contributing

Development Setup

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure code follows PEP 8 standards
  5. Submit a pull request

Extending the Analyzer

The modular design makes it easy to add new features:

  • Add new pattern detection algorithms to PatternDetector (see the sketch below)
  • Implement additional input formats in InputHandler
  • Create custom output formatters in OutputFormatter
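
The internal interfaces of PatternDetector, InputHandler, and OutputFormatter are not documented here, so the sketch below is purely hypothetical: a standalone detection routine (repeated all-caps acronyms) written first as a plain function, which could then be wired into PatternDetector once you know its interface.

import re
from collections import Counter

def detect_repeated_acronyms(text, min_frequency=2):
    """Standalone detector: all-caps tokens (2-6 letters) seen repeatedly."""
    acronyms = re.findall(r'\b[A-Z]{2,6}\b', text)
    counts = Counter(acronyms)
    return {acronym: n for acronym, n in counts.items() if n >= min_frequency}

# Quick check on raw text; integration with PatternDetector is left open.
print(detect_repeated_acronyms("NASA and ESA both run programs. NASA publishes ESA data."))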

License

This project is open source. Feel free to use, modify, and distribute according to your needs.

Support

For questions, bug reports, or feature requests, please create an issue in the project repository or contact the development team.


Version: 1.0.0
Python Compatibility: 3.7+
Dependencies: None (pure Python standard library)
