- Overview
- Features
- How It Works
- Tech Stack
- Installation
- Usage
- Feature Engineering
- Model Performance
- API Documentation
- Contributing
- License
A machine learning-based cybersecurity system that detects and classifies malicious URLs by analyzing structural and statistical features without inspecting webpage content. This approach provides:
- Fast Detection - Real-time URL analysis
- High Accuracy - 92.8% to 97.1% accuracy across the evaluated models
- Privacy-Focused - No content inspection required
- Feature-Rich - 30+ extracted URL features
# Quick start
streamlit run app.py
# Access at http://localhost:8501

- Advanced Feature Extraction - 30+ URL-based features
- Multiple ML Models - Random Forest, XGBoost, SVM
- Real-time Classification - Instant URL safety assessment
- Interactive Dashboard - Streamlit-powered web interface
- Confidence Scoring - Probability-based predictions
- Batch Processing - Analyze multiple URLs at once
- API Ready - RESTful API for integration
- Visualization - Feature importance and decision trees
def extract_url_features(url):
    # check_ip_address() and extract_domain() are helpers that are not defined in this snippet
    features = {
        'url_length': len(url),
        'num_dots': url.count('.'),
        'num_hyphens': url.count('-'),
        'num_underscores': url.count('_'),
        'num_slashes': url.count('/'),
        'num_questionmarks': url.count('?'),
        'num_equals': url.count('='),
        'num_ats': url.count('@'),
        'num_digits': sum(c.isdigit() for c in url),
        'has_ip': check_ip_address(url),
        # Match the full scheme so that a host such as "https-login.example" is not counted
        'has_https': url.lower().startswith('https://'),
        'domain_length': len(extract_domain(url)),
        # ... 20+ more features
    }
    return features

# Ensemble of models
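from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

models = {
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'XGBoost': XGBClassifier(max_depth=6),
    'SVM': SVC(kernel='rbf', probability=True)
}
# Predict with confidence
# `model` is one fitted entry of `models`; `features` is a 2D array with one row per URL
probabilities = model.predict_proba(features)[0]  # class probabilities for the first URL
prediction = model.classes_[probabilities.argmax()]
confidence = probabilities.max()

| Component | Technology |
|---|---|
| Machine Learning | scikit-learn, XGBoost |
| Data Processing | |
| Web Interface | Streamlit |
| Visualization | |
| API | Flask |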
# Clone repository
git clone https://github.com/ares-coding/malicious-url-detection-using-ml.git
cd malicious-url-detection-using-ml
# Install dependencies
pip install -r requirements.txt
# Run the app
streamlit run app.py

# Build image
docker build -t url-detector .
# Run container
docker run -p 8501:8501 url-detector

streamlit run app.py

Visit http://localhost:8501 and enter a URL to analyze.
from url_detector import URLDetector
# Initialize detector
detector = URLDetector(model='xgboost')
# Analyze single URL
result = detector.predict('https://suspicious-site.com')
print(f"Malicious: {result['is_malicious']}")
print(f"Confidence: {result['confidence']:.2%}")
# Batch analysis
urls = ['url1.com', 'url2.com', 'url3.com']
results = detector.predict_batch(urls)

# Start API server
python api.py
# Make request
curl -X POST http://localhost:5000/predict \
-H "Content-Type: application/json" \
-d '{"url": "https://example.com"}'

| Category | Features |
|---|---|
| Length-based | URL length, domain length, path length |
| Character-based | Dots, hyphens, slashes, special chars |
| Domain | Has IP, subdomain count, TLD type |
| Path | Directory depth, file extension |
| Query | Parameter count, suspicious patterns |
| Security | HTTPS, certificate validity |
| Entropy | Character distribution randomness |
| Blacklist | Domain age, reputation scores |
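
Several rows of the table above can be computed from the URL string alone. The following is a minimal sketch using only the Python standard library, not the project's actual implementation. `check_ip_address` and `extract_domain` reuse the names called in `extract_url_features`, but their bodies here are assumptions. `shannon_entropy`, `path_depth`, and `num_subdomains` are hypothetical helper names chosen to match the feature list.

```python
import ipaddress
import math
from collections import Counter
from urllib.parse import urlsplit


def _split(url):
    """Parse a URL, tolerating inputs that lack a scheme (e.g. 'example.com/a')."""
    parts = urlsplit(url)
    if not parts.netloc:
        # Without '//' urlsplit puts the host into the path, so retry with it.
        parts = urlsplit("//" + url)
    return parts


def extract_domain(url):
    """Return the lowercase hostname, or '' when none can be parsed (assumed behavior)."""
    return (_split(url).hostname or "").lower()


def check_ip_address(url):
    """True when the host part is a literal IPv4 or IPv6 address."""
    try:
        ipaddress.ip_address(extract_domain(url))
        return True
    except ValueError:
        return False


def shannon_entropy(text):
    """Shannon entropy in bits per character; higher means more random-looking."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def path_depth(url):
    """Number of non-empty path segments."""
    return len([segment for segment in _split(url).path.split("/") if segment])


def num_subdomains(url):
    """Labels in front of the last two; a simplification that ignores
    multi-part public suffixes such as 'co.uk'."""
    host = extract_domain(url)
    if not host or check_ip_address(url):
        return 0
    return max(len(host.split(".")) - 2, 0)
```

For example, `check_ip_address("http://192.168.0.1/login")` is true and `path_depth("https://example.com/a/b/c.html")` counts three segments. A production extractor would likely need a public suffix list to count subdomains reliably.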
Top 10 Features:
1. url_length (0.142)
2. has_ip_address (0.128)
3. num_subdomains (0.095)
4. domain_length (0.087)
5. num_dots (0.076)
6. has_https (0.068)
7. entropy (0.062)
8. num_hyphens (0.055)
9. path_depth (0.051)
10. num_digits (0.048)
| Model | Accuracy | Precision | Recall | F1-Score | AUC-ROC |
|---|---|---|---|---|---|
| Random Forest | 94.2% | 93.8% | 94.6% | 94.2% | 0.97 |
| XGBoost | 96.5% | 96.2% | 96.8% | 96.5% | 0.98 |
| SVM (RBF) | 92.8% | 92.3% | 93.2% | 92.7% | 0.96 |
| Ensemble | 97.1% | 96.9% | 97.3% | 97.1% | 0.99 |
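
The README does not say how the Ensemble row combines the three models. One common option is soft voting, which averages each model's class probabilities and picks the class with the highest mean. The sketch below illustrates that idea in plain Python under that assumption. `soft_vote` and the example probabilities are hypothetical and are not taken from the project.

```python
from typing import Dict, Sequence, Tuple


def soft_vote(probabilities: Dict[str, Sequence[float]]) -> Tuple[int, float]:
    """Average per-model class probabilities and return (class index, mean probability).

    `probabilities` maps a model name to [P(benign), P(malicious)] for one URL.
    """
    if not probabilities:
        raise ValueError("at least one model is required")
    vectors = list(probabilities.values())
    n_classes = len(vectors[0])
    if any(len(v) != n_classes for v in vectors):
        raise ValueError("all models must report the same number of classes")
    averaged = [sum(v[i] for v in vectors) / len(vectors) for i in range(n_classes)]
    label = max(range(n_classes), key=averaged.__getitem__)
    return label, averaged[label]


# Illustrative numbers only: class 0 = benign, class 1 = malicious.
example = {
    "Random Forest": [0.30, 0.70],
    "XGBoost": [0.10, 0.90],
    "SVM": [0.40, 0.60],
}
label, confidence = soft_vote(example)
print(label, round(confidence, 3))
```

With fitted scikit-learn estimators, each vector would come from `predict_proba`. Weighted averaging or stacking are alternatives if one model is clearly stronger.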
| | Predicted Benign | Predicted Malicious |
|---|---|---|
| Actual Benign | 4,823 | 152 |
| Actual Malicious | 118 | 4,907 |
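
To make the relationship between the confusion matrix and the reported metrics concrete, the sketch below recomputes accuracy, precision, recall, and F1 from the four counts above, treating "malicious" as the positive class. `binary_metrics` is a hypothetical helper name. The matrix does not state which model it belongs to, so the result is not tied to any particular row of the performance table.

```python
def binary_metrics(tn, fp, fn, tp):
    """Standard binary classification metrics from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Counts from the confusion matrix:
# benign predicted benign (TN), benign predicted malicious (FP),
# malicious predicted benign (FN), malicious predicted malicious (TP).
metrics = binary_metrics(tn=4823, fp=152, fn=118, tp=4907)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

The 152 false positives are benign URLs that would be flagged, and the 118 false negatives are malicious URLs that would be missed. Which of the two matters more depends on where the detector is deployed.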
Analyze a single URL.
Request:
{
"url": "https://example.com/path?param=value"
}

Response:
{
"url": "https://example.com/path?param=value",
"is_malicious": false,
"confidence": 0.923,
"risk_score": "low",
"features": {
"url_length": 36,
"has_https": true,
"num_dots": 1
},
"timestamp": "2025-02-13T10:30:00Z"
}

Analyze multiple URLs.
Request:
{
"urls": [
"https://google.com",
"http://suspicious-site.tk"
]
}

malicious-url-detection/
├── data/
│   ├── raw/                # Original datasets
│   ├── processed/          # Cleaned data
│   └── models/             # Trained models
├── src/
│   ├── feature_extraction.py
│   ├── model_training.py
│   ├── prediction.py
│   └── utils.py
├── notebooks/
│   ├── 01_data_analysis.ipynb
│   ├── 02_feature_engineering.ipynb
│   └── 03_model_evaluation.ipynb
├── api/
│   ├── app.py              # Flask API
│   └── schemas.py
├── app.py                  # Streamlit app
├── train.py                # Training script
├── requirements.txt
└── README.md
We welcome contributions! Please see our Contributing Guidelines.
MIT License - see LICENSE for details.
Au Amores
If this project helped you, please star it!
Made by Ares
