A comprehensive machine learning analysis dashboard for the 20 Newsgroups dataset, featuring advanced hyperparameter tuning, model comparison, and interactive visualizations built with Next.js and Material-UI.
๐ Access the Dashboard: https://machine-learning-project-theta.vercel.app
This project demonstrates a complete machine learning pipeline for text classification using the classic 20 Newsgroups dataset. It includes:
- 10+ Machine Learning Algorithms with comprehensive evaluation
- Advanced Hyperparameter Tuning using Grid Search CV
- Interactive Visualizations with real-time model comparison
- Responsive Dashboard optimized for all devices
- Static Data Architecture for instant loading
- Multi-algorithm comparison (Logistic Regression, Random Forest, SVM, XGBoost, LightGBM, etc.)
- Hyperparameter optimization with cross-validation
- Performance metrics (Accuracy, Precision, Recall, F1-Score, AUC)
- Confusion matrices and ROC curves for detailed analysis
- Real-time model comparison charts
- Interactive confusion matrices with smart label handling
- Multi-class ROC curves with individual class performance
- Performance metrics tables with sorting and filtering
- Dataset statistics with comprehensive breakdowns
- Instant loading with pre-computed static data
- Responsive design for mobile and desktop
- No computational delays during user interaction
- Consistent performance across all sessions
The 20 Newsgroups dataset contains 18,846 documents across 20 categories:
| Category | Documents | Category | Documents |
|---|---|---|---|
| alt.atheism | 942 | comp.graphics | 942 |
| comp.os.ms-windows.misc | 942 | comp.sys.ibm.pc.hardware | 942 |
| comp.sys.mac.hardware | 942 | comp.windows.x | 942 |
| misc.forsale | 942 | rec.autos | 942 |
| rec.motorcycles | 942 | rec.sport.baseball | 942 |
| rec.sport.hockey | 942 | sci.crypt | 942 |
| sci.electronics | 942 | sci.med | 942 |
| sci.space | 942 | soc.religion.christian | 942 |
| talk.politics.guns | 942 | talk.politics.mideast | 942 |
| talk.politics.misc | 942 | talk.religion.misc | 942 |
- Average text length: 221.3 words
- Vocabulary size: 4,876 unique words
- Feature engineering: TF-IDF with unigrams and bigrams
- Data split: 64% train, 16% validation, 20% test
| Rank | Model | Accuracy | Precision | Recall | F1-Score | Training Time |
|---|---|---|---|---|---|---|
| ๐ฅ | LightGBM | 92.1% | 92.3% | 92.1% | 92.2% | 2.3s |
| ๐ฅ | XGBoost | 91.5% | 91.7% | 91.5% | 91.6% | 3.1s |
| ๐ฅ | Gradient Boosting | 90.3% | 90.5% | 90.3% | 90.4% | 4.2s |
| 4 | Random Forest | 89.2% | 89.4% | 89.2% | 89.3% | 1.8s |
| 5 | Logistic Regression | 84.7% | 84.9% | 84.7% | 84.8% | 0.9s |
graph TD
A[20 Newsgroups Dataset] --> B[Text Preprocessing]
B --> C[TF-IDF Vectorization]
C --> D[Model Training]
D --> E[Hyperparameter Tuning]
E --> F[Performance Evaluation]
F --> G[Visualization Dashboard]
- Next.js 14 - React framework with App Router
- React 18 - UI library with hooks
- TypeScript - Type-safe development
- Material-UI 5 - Component library and theming
- Chart.js - Interactive charts and visualizations
- Python 3.8+ - Machine learning pipeline
- Scikit-learn - ML algorithms and preprocessing
- XGBoost/LightGBM - Advanced gradient boosting
- Pandas/NumPy - Data manipulation and analysis
- ESLint - Code linting
- Prettier - Code formatting
- Git - Version control
machine-learning-project/
โโโ ๐ app/ # Next.js app directory
โ โโโ ๐ page.tsx # Main dashboard page
โ โโโ ๐ layout.tsx # App layout and metadata
โ โโโ ๐ globals.css # Global styles
โโโ ๐ components/ # React components
โ โโโ ๐ DatasetInfo.tsx # Dataset statistics card
โ โโโ ๐ HyperparameterTuning.tsx # Hyperparameter results
โ โโโ ๐ ModelComparisonChart.tsx # Model comparison visualization
โ โโโ ๐ MetricsTable.tsx # Performance metrics table
โ โโโ ๐ ConfusionMatrix.tsx # Confusion matrix visualization
โ โโโ ๐ ROCCurve.tsx # ROC curve visualization
โ โโโ ๐ ThemeRegistry.tsx # Material-UI theme provider
โโโ ๐ data/ # Static data files
โ โโโ ๐ ml_results.json # Pre-computed ML results
โโโ ๐ scripts/ # Data generation scripts
โ โโโ ๐ ml_processor.py # Full ML pipeline
โ โโโ ๐ generate_sample_data.py # Sample data generator
โ โโโ ๐ precompute_results.py # Results precomputation
โโโ ๐ pages/api/ # API endpoints
โ โโโ ๐ ml-results.ts # Serves static ML data
โโโ ๐ public/ # Static assets
โโโ ๐ package.json # Node.js dependencies
โโโ ๐ requirements.txt # Python dependencies
โโโ ๐ README.md # Project documentation
- Node.js 18.0.0 or higher
- Python 3.8 or higher
- npm or yarn package manager
-
Clone the repository
git clone https://github.com/davidagustin/machine-learning-project.git cd machine-learning-project -
Install Node.js dependencies
npm install
-
Install Python dependencies
pip install -r requirements.txt
-
Generate sample data (optional)
cd scripts python generate_sample_data.py cd ..
-
Start the development server
npm run dev
-
Open your browser Navigate to http://localhost:3000
The application comes with pre-computed results for instant loading. To regenerate the data:
# Option 1: Using the API endpoint
curl -X POST http://localhost:3000/api/clear-cache
# Option 2: Running the script directly
cd scripts
python ml_processor.py
cd ..To use your own dataset, modify the scripts/ml_processor.py file and update the data loading function.
- Interactive charts showing performance metrics
- Sortable tables with detailed statistics
- Real-time filtering by algorithm type
- Export functionality for results
- Grid Search CV optimization for all models
- Best parameters display with confidence intervals
- Parameter importance analysis
- Cross-validation results visualization
- Confusion matrices with smart label truncation
- Multi-class ROC curves with AUC scores
- Feature importance rankings
- Training time comparisons
- Mobile-optimized layouts
- Touch-friendly interactions
- Adaptive charts for different screen sizes
- Progressive enhancement for older browsers
Create a .env.local file in the root directory:
# Optional: Custom API endpoints
NEXT_PUBLIC_API_URL=http://localhost:3000/api
# Optional: Analytics (if using)
NEXT_PUBLIC_GA_ID=your-google-analytics-id- Theme: Modify
components/ThemeRegistry.tsxfor custom colors - Charts: Update chart configurations in individual components
- Data: Modify
scripts/ml_processor.pyfor different datasets
- First Contentful Paint: < 1.5s
- Largest Contentful Paint: < 2.5s
- Time to Interactive: < 3s
- Bundle Size: < 500KB (gzipped)
- Chart Rendering: < 100ms
- Data Filtering: < 50ms
- Model Switching: < 200ms
- Memory Usage: < 50MB
We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch
git checkout -b feature/amazing-feature
- Make your changes
- Add tests (if applicable)
- Commit your changes
git commit -m 'Add amazing feature' - Push to the branch
git push origin feature/amazing-feature
- Open a Pull Request
- Follow TypeScript best practices
- Use Material-UI components consistently
- Maintain responsive design principles
- Write clear commit messages
- Add documentation for new features
Q: Charts not rendering properly A: Ensure Chart.js is properly imported and the data format matches the expected structure.
Q: Python scripts failing
A: Check that all required packages are installed: pip install -r requirements.txt
Q: Build errors
A: Clear the Next.js cache: rm -rf .next && npm run build
Q: Performance issues A: The application uses static data for optimal performance. Regenerate data if needed.
Enable debug logging by setting the environment variable:
DEBUG=* npm run devReturns the pre-computed machine learning results.
Response:
{
"dataset_info": { ... },
"model_results": { ... },
"hyperparameter_tuning": { ... },
"data_split_info": { ... }
}Regenerates the sample data (development only).
- Real-time model training interface
- Custom dataset upload functionality
- Advanced visualizations (SHAP plots, feature importance)
- Export capabilities (PDF reports, CSV data)
- User authentication and saved analyses
- API endpoints for external integrations
- Performance optimization for large datasets
- Additional algorithms (CatBoost, Neural Networks)
- Interactive model comparison tools
- Automated hyperparameter tuning with Optuna
- Model deployment capabilities
This project is licensed under the MIT License - see the LICENSE file for details.
- 20 Newsgroups Dataset - Classic text classification benchmark
- Scikit-learn - Comprehensive machine learning library
- Next.js - React framework for production
- Material-UI - Beautiful React components
- Chart.js - Flexible charting library
- Issues: GitHub Issues
- Discussions: GitHub Discussions
Made with โค๏ธ by David Agustin