This project implements an automated data analysis system using multiple specialized agents. The system can download datasets from Hugging Face, perform comprehensive data analysis, and generate detailed reports automatically.
- Automated dataset download from Hugging Face repositories
- Comprehensive data quality assessment
- Statistical and qualitative analysis
- Automated visualization generation
- PDF report generation
- Multi-agent architecture for specialized tasks
- Python 3.x
- Git LFS (for downloading datasets)
- Required Python packages:
- autogen
- pandas
- numpy
- matplotlib
- seaborn
- fpdf
├── agents.py # Agent definitions and initialization
├── config.py # Configuration settings
├── main.py # Main application entry point
├── system_messages.py # Agent system messages/instructions
└── utils.py # Utility functions for data processing
- Data Source Manager: Handles dataset downloading and file management
- Data Quality Agent: Assesses dataset quality and completeness
- Statistical Analysis Agent: Performs numerical analysis and statistics
- Qualitative Analysis Agent: Analyzes data structure and patterns
- Visualization Agent: Creates data visualizations
- Report Generation Agent: Compiles findings into PDF reports
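The agents above hand work off in a fixed order. As a plain-Python sketch of that flow (the real project orchestrates it through an autogen group chat; the stage functions below are simplified stand-ins, not the project's actual implementations):

```python
from typing import Any, Dict

# Simplified stand-ins for the specialized agents; each returns the
# name of the artifact it is responsible for producing.
def assess_quality(data: Any) -> str:
    return "quality_assessment.txt"

def run_statistics(data: Any) -> str:
    return "insights.txt"

def make_visualizations(data: Any) -> str:
    return "correlation_heatmap.png"

def generate_report(data: Any) -> str:
    return "analysis_report.pdf"

# Stages run strictly in this order, each building on the last.
PIPELINE = [assess_quality, run_statistics, make_visualizations, generate_report]

def run_pipeline(data: Any) -> Dict[str, str]:
    """Run each stage in order, mapping stage name to its artifact."""
    return {stage.__name__: stage(data) for stage in PIPELINE}
```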
- Clone the repository:

  ```bash
  git clone [repository-url]
  cd [repository-name]
  ```

- Install dependencies:

  ```bash
  pip install autogen pandas numpy matplotlib seaborn fpdf
  ```

- Configure the API key:
  - Open `config.py`
  - Replace `"ENTER YOUR API KEY"` with your actual API key
- Basic usage:

  ```python
  from main import process_dataset

  # Process a dataset from Hugging Face
  process_dataset("https://huggingface.co/datasets/scikit-learn/iris")
  ```

- Output structure:
datasets/
├── quality_assessment/
│ └── quality_assessment.txt
├── insights/
│ └── insights.txt
├── visualizations/
│ ├── correlation_heatmap.png
│ └── feature_distributions.png
└── output/
└── analysis_report.pdf
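To confirm a run produced everything, you can check for the files in the tree above. A small helper (paths assumed relative to the working directory; this is not part of the project's own code):

```python
from pathlib import Path
from typing import List

# Expected artifacts, mirroring the output structure shown above.
EXPECTED = [
    "datasets/quality_assessment/quality_assessment.txt",
    "datasets/insights/insights.txt",
    "datasets/visualizations/correlation_heatmap.png",
    "datasets/visualizations/feature_distributions.png",
    "datasets/output/analysis_report.pdf",
]

def missing_artifacts(root: str = ".") -> List[str]:
    """Return any expected output files that were not produced."""
    return [p for p in EXPECTED if not (Path(root) / p).exists()]
```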
**agents.py**
- Defines and initializes all agent types
- Configures agent behaviors and capabilities
- Registers execution functions for each agent
**config.py**
- Contains the LLM configuration
- API key settings
- Model specifications
**main.py**
- Sets up the group chat between agents
- Manages the orchestration of the analysis workflow
- Provides the main entry point for processing datasets
**system_messages.py**
- Defines the role and responsibilities of each agent
- Contains system prompts for agent behavior
- Establishes workflow protocols
**utils.py**
- Implements core functionality for:
  - Dataset downloading
  - Quality assessment
  - Statistical analysis
  - Visualization generation
  - Report creation
The system evaluates datasets on four key dimensions:
- Completeness (25 points)
- Consistency (25 points)
- Accuracy (25 points)
- Uniqueness (25 points)
Total quality score is calculated out of 100 points.
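The project's exact metrics live in `utils.py`; the sketch below shows one plausible way the completeness and uniqueness components could be computed with pandas, with consistency and accuracy left as dataset-specific stubs:

```python
import pandas as pd

def quality_score(df: pd.DataFrame) -> float:
    """Illustrative 0-100 quality score; not the project's exact formula."""
    # Completeness: fraction of non-null cells (max 25 points)
    completeness = 25 * (1 - df.isna().mean().mean())
    # Uniqueness: fraction of non-duplicate rows (max 25 points)
    uniqueness = 25 * (1 - df.duplicated().mean())
    # Consistency and accuracy require dataset-specific rules;
    # assume full marks here purely for illustration.
    consistency, accuracy = 25.0, 25.0
    return completeness + consistency + accuracy + uniqueness
```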
The system includes comprehensive error handling for:
- Dataset download failures
- File access issues
- Data processing errors
- Report generation problems
Each function wraps its work in try/except blocks with detailed error messages.
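As one hedged illustration of that pattern (the project's real handlers live in `utils.py`; this standalone helper only shows the try/except shape applied to a download failure):

```python
from typing import Optional
import urllib.request
import urllib.error

def safe_download(url: str) -> Optional[bytes]:
    """Fetch raw bytes, returning None with a detailed message on failure."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.read()
    except (urllib.error.URLError, ValueError, OSError) as exc:
        # Report the failing URL and the underlying cause instead of crashing.
        print(f"Dataset download failed for {url}: {exc}")
        return None
```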
- Fork the repository
- Create a feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request