Machine Learning system for detecting DDoS attacks achieving 99.95% accuracy on the CIC-DDoS 2019 dataset with 598,440+ network flows.
- Project Overview
- Key Results
- Dataset
- Methodology
- Tech Stack
- Project Structure
- Getting Started
- Visualizations
- Related Work
- Limitations & Future Work
- Academic Context
- License
- Contact
This project implements and compares three Machine Learning algorithms for network-based DDoS attack detection:
- Random Forest with threshold optimization (0.5 → 0.3 → 0.1)
- K-Nearest Neighbors (KNN) with PCA dimensionality reduction
- AdaBoost achieving the best overall performance
The models were trained on the CIC-DDoS 2019 dataset, a comprehensive collection of network traffic data containing both benign flows and multiple DDoS attack types (Syn, LDAP, UDP, MSSQL, NetBIOS, UDPLag).
| Model | Accuracy | Precision | Recall | F1-Score | AUC |
|---|---|---|---|---|---|
| Random Forest (threshold 0.1) | 99.3% | 0.993 | 0.999 | 0.996 | 0.9976 |
| KNN + PCA (21 components) | 99.91% | 0.999 | 0.999 | 0.999 | 0.998 |
| **AdaBoost** (best) | 99.95% | 0.9999 | 0.9996 | 0.9997 | 1.000 |
- Total Flows: 598,440 (488,041 train + 110,399 test)
- Features: 86 initial → 35 after feature engineering
- Class Distribution: 90.3% DDoS / 9.7% Benign (highly imbalanced)
- Attack Types: Syn, LDAP, UDP, MSSQL, NetBIOS, UDPLag
CIC-DDoS 2019 is a contemporary dataset for DDoS attack detection containing realistic network traffic captures.
- Visit: CIC-DDoS 2019 Dataset
- Download both files:
  - `training_dataset_CIC_DDoS_2019.csv` (488K flows)
  - `testing_dataset_CIC_DDoS_2019.csv` (110K flows)
- Place them in the `data/` directory
Note: Dataset files are not included in this repository due to size constraints (ignored by `.gitignore`).
Data Quality Assessment:
- ✅ No missing values detected
- ✅ No duplicate entries
- ⚠️ Severe class imbalance: 90% DDoS / 10% Benign
Feature Analysis:
- 86 total features (45 float, 35 int, 6 categorical)
- 12 constant columns identified and removed
- 50 highly correlated pairs (r > 0.95) detected
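As a sketch (a hypothetical helper, not the notebook's actual code), the constant-column and correlation checks above could look like this in pandas:

```python
import pandas as pd

def find_redundant_features(df: pd.DataFrame, corr_threshold: float = 0.95):
    """Flag zero-variance columns and highly correlated numeric pairs."""
    numeric = df.select_dtypes(include="number")
    # Constant columns carry no information for the classifier
    constant = [c for c in numeric.columns if numeric[c].nunique() <= 1]
    corr = numeric.drop(columns=constant).corr().abs()
    # Upper-triangle scan for pairs above the correlation threshold
    pairs = [
        (a, b, corr.loc[a, b])
        for i, a in enumerate(corr.columns)
        for b in corr.columns[i + 1:]
        if corr.loc[a, b] > corr_threshold
    ]
    return constant, pairs
```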
Removed 57 redundant features:
| Category | Count | Reason |
|---|---|---|
| Constant columns | 12 | Zero variance (e.g., Bwd PSH Flags, FIN Flag Count) |
| Session identifiers | 18 | Data leakage risk (Flow ID, IPs, Timestamps) |
| Highly correlated | 27 | Multicollinearity (r > 0.95) |
Final feature set: 35 features → 21 principal components (PCA for KNN)
Top predictive features:
- Flow IAT Mean
- Fwd Packet Length Mean
- Bwd Packet Length Std
- Active Mean
- Idle Mean
The severe 90/10 class imbalance required multiple mitigation strategies:
```python
train_test_split(X, y, stratify=y, test_size=0.3)
```

- Maintains the 90/10 ratio across train/validation/test splits
- Prevents distribution shift between sets
- Essential for reliable performance metrics
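A minimal, self-contained illustration of why `stratify=y` matters, with synthetic 90/10 labels standing in for the real flows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mimicking the dataset's 90/10 imbalance
y = np.array([1] * 900 + [0] * 100)
X = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)
# Both splits preserve the 90% attack ratio exactly
print(y_tr.mean(), y_te.mean())
```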
Tested in GridSearchCV:

```python
param_grid = {
    'class_weight': [None, 'balanced']
}
```

- `balanced`: automatically adjusts weights inversely proportional to class frequencies
- Result: no significant improvement over threshold optimization
- Kept `class_weight=None` in the final model
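For reference, scikit-learn's `balanced` mode computes each weight as `n_samples / (n_classes * n_c)`; a quick sketch with a synthetic 90/10 label vector:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# With a 90/10 imbalance, 'balanced' up-weights the minority class ~9x
y = np.array([1] * 90 + [0] * 10)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)
# weight_c = n_samples / (n_classes * n_c):
#   class 0 (benign, 10 samples): 100 / (2 * 10) = 5.0
#   class 1 (DDoS,  90 samples):  100 / (2 * 90) ≈ 0.56
```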
Systematically tested decision thresholds:
| Threshold | Accuracy | Recall | Impact |
|---|---|---|---|
| 0.5 (default) | 74.1% | 71.1% | ❌ Misses 29% of attacks |
| 0.3 | 78.5% | 76.1% | |
| 0.1 | 99.3% | 99.9% | ✅ Detects almost all attacks |
Key Insight: In cybersecurity, false positives (false alarms) are acceptable, but false negatives (missed attacks) are critical failures. Lowering the threshold to 0.1 maximizes attack detection (99.9% recall) while maintaining high precision.
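The threshold sweep can be sketched as follows (synthetic imbalanced data and an untuned Random Forest standing in for the real pipeline):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 90/10 data standing in for the network flows
X, y = make_classification(n_samples=2000, weights=[0.1, 0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # P(attack) per flow

# Lowering the decision threshold trades precision for recall
for threshold in (0.5, 0.3, 0.1):
    preds = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: recall={recall_score(y_te, preds):.3f}")
```

Recall can only increase (or stay flat) as the threshold drops, since the set of flows flagged as attacks grows.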
- AdaBoost: Naturally handles imbalance through adaptive boosting
- Automatically increases weight on misclassified examples
- Robust performance without manual class weighting
For Random Forest:
- StandardScaler normalization
- No PCA (trees handle high dimensions well)
- 54 features retained
For K-Nearest Neighbors:
- StandardScaler normalization (critical for distance metrics)
- PCA: 95% variance → 21 components
- Reduces curse of dimensionality
For AdaBoost:
- StandardScaler normalization
- No PCA (preserves feature interpretability)
- 35 features after redundancy removal
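The KNN preprocessing above maps naturally onto a scikit-learn `Pipeline`; note that `n_neighbors=5` here is an illustrative choice, not the project's tuned value:

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scaling must precede PCA: both PCA and k-NN are scale-sensitive
knn_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.95)),  # keep 95% of the variance
    ("knn", KNeighborsClassifier(n_neighbors=5)),
])
```

Passing a float to `n_components` makes PCA keep however many components are needed to explain that fraction of variance (21, on this dataset).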
GridSearchCV configuration:

```python
param_grid = {
    'C': [0.1, 1, 10, 100],
    'penalty': ['l2'],
    'solver': ['lbfgs', 'saga'],
    'class_weight': [None, 'balanced'],
    'max_iter': [1000]
}
```

Best parameters found:

- `C=10` (regularization strength)
- `penalty='l2'` (Ridge regularization)
- `solver='lbfgs'` (optimization algorithm)
- `class_weight=None` (threshold optimization more effective)
- Python 3.8+ - Programming language
- Scikit-learn 1.3.0 - ML algorithms and preprocessing
- Pandas 2.0+ - Data manipulation and analysis
- NumPy 1.24+ - Numerical computations
- Matplotlib 3.7+ - Data visualization
- Seaborn 0.12+ - Statistical visualizations
- Jupyter Notebook - Interactive development environment
- Random Forest Classifier - Ensemble decision trees
- K-Nearest Neighbors - Instance-based learning
- AdaBoost Classifier - Adaptive boosting
- PCA - Principal Component Analysis for dimensionality reduction
- StandardScaler - Feature normalization
```
DDoS-ML-Detector/
├── notebooks/
│   └── main.ipynb        # Complete ML pipeline and analysis
├── data/
│   └── README.md         # Dataset download instructions
├── images/
│   └── figures/          # Confusion matrices, ROC curves, etc.
├── models/               # Trained models (not tracked by git)
├── .gitignore            # Ignore data files and models
├── LICENSE               # MIT License
├── requirements.txt      # Python dependencies
└── README.md             # This file
```
- Python 3.8 or higher
- pip package manager
- 4GB+ RAM (for dataset processing)
- Clone the repository

  ```bash
  git clone https://github.com/YOUR_USERNAME/DDoS-ML-Detector.git
  cd DDoS-ML-Detector
  ```

- Install dependencies

  ```bash
  pip install -r requirements.txt
  ```

- Download the dataset

  Follow the instructions in `data/README.md` to download the CIC-DDoS 2019 dataset.

```bash
# Launch Jupyter Notebook
jupyter notebook notebooks/main.ipynb
# Run all cells to reproduce results
```

- Data loading: ~30 seconds
- Preprocessing: ~2 minutes
- Model training:
- Random Forest: ~5 minutes
- KNN: ~10 minutes
- AdaBoost: ~3 minutes
- Total: ~20-25 minutes
The dataset exhibits severe class imbalance, with DDoS attacks representing 90% of the training data:
Correlation heatmap revealing 50+ highly correlated feature pairs (r > 0.95) that were removed during preprocessing:
AdaBoost Test Results:
- True Positives: 98,899 (99.96%)
- False Negatives: 43 (0.04%)
- True Negatives: 11,443 (99.88%)
- False Positives: 14 (0.12%)
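These counts are internally consistent with the headline metrics; a quick arithmetic check:

```python
# Sanity-check the reported AdaBoost metrics from its confusion-matrix counts
tp, fn, tn, fp = 98_899, 43, 11_443, 14

total = tp + fn + tn + fp        # 110,399 test flows
accuracy = (tp + tn) / total     # ≈ 0.9995
precision = tp / (tp + fp)       # ≈ 0.9999
recall = tp / (tp + fn)          # ≈ 0.9996
```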
KNN Test Results:
- True Positives: 98,899 (99.96%)
- False Negatives: 43 (0.04%)
- True Negatives: 11,402 (99.52%)
- False Positives: 55 (0.48%)
Threshold 0.5 (Default):
- Issue: 690 false positives (6% benign traffic misclassified)
- Accuracy: 94%
Threshold 0.3 (Optimized):
- Improvement: Better balance, reduced false negatives
- Accuracy: 96%
Threshold 0.1 (Maximum Recall):
- Optimal for Security: 99.9% attack detection
- Acceptable false positive rate for cybersecurity context
ROC curves demonstrate near-perfect classification performance across all models:
- AdaBoost: AUC = 1.0000 (perfect classification)
- KNN + PCA: AUC = 0.9982 (test) | 0.9998 (validation)
- Random Forest: AUC = 0.9976 (test) | 1.0000 (validation)
- Severe Class Imbalance: 90% DDoS vs 10% Benign requires specialized handling
- High Feature Correlation: 50+ pairs removed to reduce multicollinearity
- Threshold Impact: Random Forest performance improved from 94% → 99.3% with threshold tuning
- Near-Perfect ROC: All models achieve AUC > 0.99, indicating excellent discrimination
- AdaBoost Superiority: Best confusion matrix with minimal false negatives/positives
Our model performance is comparable to commercial DDoS detection systems:
- Cloudflare Magic Transit
- Akamai Prolexic
- AWS Shield Advanced
- Arbor Sightline
These systems typically combine ML-based detection with volumetric filtering and scrubbing centers.
Based on Cloudflare Q4 2024 DDoS Report:
- HTTP/2 Rapid Reset attacks (600% increase)
- QUIC DDoS exploitation
- TCP Middlebox Reflection attacks
- IoT botnet proliferation
Future work could extend this model to detect these emerging patterns.
While this project achieves excellent performance on the CIC-DDoS 2019 benchmark dataset (99.95% accuracy), several limitations must be addressed before production deployment:
Current approach: Each network flow is analyzed independently at a single point in time.
Limitation: Real-world DDoS attacks exhibit temporal patterns (gradual ramp-up, sustained duration, coordinated waves) that cannot be detected when analyzing isolated flows.
Example scenario:

```
Normal traffic: 1,000 req/min → 1,200 req/min                    ✅
DDoS attack:    1,000 req/min → 5,000 req/min → 50,000 req/min   🔴

Current model: analyzes request #5,001 individually → may classify as "Benign"
Missing:       no sliding window to detect the 5x traffic spike trend
```
Impact on production: Slow-building attacks (slow HTTP, application-layer DDoS) would likely evade detection until reaching critical mass.
Current approach: Trained and validated on CIC-DDoS 2019, a controlled laboratory dataset.
Limitation: The model has only seen 6 attack types (Syn, LDAP, UDP, MSSQL, NetBIOS, UDPLag) captured in simulated environments. Real-world traffic includes:
- Legitimate traffic variations (CDN, load balancers, NAT, VPNs)
- Emerging attack vectors not present in 2019 data (HTTP/2 Rapid Reset, QUIC DDoS, IoT botnets)
- Mixed benign/malicious traffic patterns
- Geographic and protocol diversity
Expected production accuracy: 80-90% (vs 99.95% on benchmark) due to distribution shift between lab and real-world data.
Current approach: Standard supervised learning without adversarial training.
Limitation: Sophisticated attackers can craft attacks specifically designed to evade ML-based detection by:
- Randomizing packet sizes and timing to mimic legitimate distributions
- Gradually ramping up attack intensity to avoid detection thresholds
- Mixing attack traffic with legitimate requests
- Exploiting knowledge of common ML features (e.g., IAT, packet length distributions)
Vulnerability: The model has not been tested against adaptive adversaries who actively try to evade detection.
Current approach: Batch processing on static CSV files (offline analysis).
Limitation: Production DDoS detection requires:
- Real-time streaming: Processing flows as they arrive (<10ms latency)
- Scalability: Handling 10,000+ requests/second on commodity hardware
- Stateful analysis: Maintaining sliding windows and traffic baselines in memory
- Integration: API for SIEM systems, automated mitigation triggers, alerting workflows
Current pipeline: processes 598K flows in ~7 minutes of batch mode. A real attack would cause complete service disruption before detection completed.
Current approach: Static model trained once on 2019 data.
Limitation: Network traffic patterns and attack techniques evolve constantly. Without continuous retraining:
- New legitimate applications (e.g., emerging video codecs, IoT protocols) may trigger false positives
- Novel attack variants will go undetected
- Model accuracy degrades over time (concept drift)
Expected degradation: 5-10% accuracy drop per year without retraining on recent data.
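One lightweight way to monitor such concept drift is the Population Stability Index (PSI) between training-time and live feature distributions; a sketch, not part of this project:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time feature distribution and live traffic.
    PSI > 0.2 is a common rule-of-thumb signal of significant drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Clip to avoid division by zero / log(0) on empty bins
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A retraining pipeline could compute this weekly per feature and trigger retraining when the index crosses the drift threshold.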
To bridge the gap between academic prototype and production-ready system, the following enhancements are proposed:
- Implement sliding window analysis (5s, 30s, 5min windows)
- Add temporal features: traffic rate acceleration, spike detection, baseline deviation
- Integrate time-series anomaly detection (LSTM, Prophet, Isolation Forest)
- Develop multi-scale temporal aggregation (per-IP, per-subnet, global)
Expected impact: +10-15% recall on slow-building attacks
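The sliding-window idea in the first bullet can be sketched with a pandas rolling baseline (hypothetical per-minute counts and a hypothetical 3x alert threshold):

```python
import pandas as pd

# Hypothetical per-minute request counts: steady traffic, then a ramp-up
counts = pd.Series([1_000] * 10 + [5_000] * 5 + [50_000] * 5)

# Rolling baseline over the previous 5 samples (a simple sliding window)
baseline = counts.rolling(window=5).mean().shift(1)
spike_ratio = counts / baseline
alerts = spike_ratio > 3  # alert when traffic exceeds 3x the recent baseline
```

Per-flow features miss this spike entirely; the window makes the 5x jump explicit as soon as it happens.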
- Migrate from batch (Pandas) to streaming architecture (Apache Kafka, Flink)
- Optimize feature extraction pipeline for <5ms latency
- Implement distributed processing for 100K+ req/s throughput
- Deploy model inference with TensorFlow Serving / ONNX Runtime
- Build monitoring dashboard (Prometheus, Grafana)
Expected impact: <10ms end-to-end detection latency
- Collect real-world traffic data for validation (with privacy compliance)
- Implement adversarial training to defend against evasion attacks
- Build ensemble model combining ML + rule-based + anomaly detection
- Add Graph Neural Network for IP relationship analysis
- Integrate threat intelligence feeds (known malicious IPs, ASNs)
Expected impact: +5-10% production accuracy, adversarial robustness
- Implement MLOps workflow (DVC, MLflow, Kubeflow)
- Build automated drift detection monitoring
- Create weekly retraining pipeline with A/B testing
- Develop human-in-the-loop labeling workflow for edge cases
- Version control for models and datasets
Expected impact: Sustained 90%+ accuracy over time
- Build REST API for SIEM integration
- Implement automated mitigation triggers (rate limiting, IP blocking)
- Add explainability layer (SHAP, LIME) for incident investigation
- Develop false positive review workflow for SOC analysts
- Create compliance documentation (audit trails, GDPR)
Expected impact: Production-grade deployment capability
| Feature | This Project | Cloudflare / Akamai | Gap |
|---|---|---|---|
| Benchmark Accuracy | 99.95% | ~99% | ✅ Comparable |
| Production Accuracy | ~85% (estimated) | 95-98% | |
| Detection Latency | N/A (batch) | <3 seconds | 🔴 Infrastructure gap |
| Temporal Analysis | ❌ | ✅ Sliding windows | 🔴 Feature gap |
| Throughput | 598K flows (offline) | 46M req/s | 🔴 Scalability gap |
| Adversarial Defense | ❌ Untested | ✅ Red team validated | 🔴 Robustness gap |
| Continuous Learning | ❌ Static model | ✅ Weekly retraining | 🔴 MLOps gap |
Key insight: Academic performance is excellent, but the "last mile" to production requires significant infrastructure, operational, and robustness work. This is typical of ML research projects and highlights the difference between proof-of-concept and deployable systems.
Despite limitations, this model is viable for:
✅ Security Operations Center (SOC) - Tier 1 Triage
- Pre-filter high-confidence DDoS alerts before human analysis
- Acceptable false positive rate with analyst oversight
✅ Small/Medium Business (SMB) Networks
- Basic DDoS detection for organizations with <10,000 req/s traffic
- Cost-effective alternative to enterprise solutions
✅ Research & Education
- Benchmark for comparing ML algorithms on DDoS detection
- Teaching platform for cybersecurity + ML intersection
❌ Not suitable for:
- Large-scale CDN/Edge networks (requires streaming + scalability)
- Mission-critical infrastructure without human oversight
- Adversarial environments without robustness enhancements
This project demonstrates that achieving high accuracy on benchmark datasets is the first 20% of the work; the remaining 80% involves:
- Building production-grade infrastructure (streaming, monitoring, APIs)
- Ensuring robustness against adaptive adversaries
- Maintaining performance over time through continuous learning
- Integrating with existing security workflows and compliance requirements
Understanding these gaps is crucial for transitioning from academic research to operational cybersecurity systems.
Course: Machine Learning for Cybersecurity
Institution: Télécom Paris, Institut Polytechnique de Paris
Academic Year: 2024-2025
Project Type: Practical Lab Assignment (TP)
- ✅ Handle severely imbalanced datasets
- ✅ Apply feature engineering for network data
- ✅ Compare multiple ML algorithms systematically
- ✅ Optimize decision thresholds for security contexts
- ✅ Interpret model performance in business terms
This project is licensed under the MIT License - see the LICENSE file for details.
Thierry Armel TCHOMO KOMBOU

- Cybersecurity Engineering Student @ Télécom Paris
- Specialization: Cybersecurity and AI
- Email: tchomokombou@telecom-paris.fr
- GitHub: 0xTchomo
- Dataset: Canadian Institute for Cybersecurity (CIC), University of New Brunswick
- Course Instructors: Télécom Paris Cybersecurity Department
- Tools: Scikit-learn, Pandas, Jupyter communities
⭐ If you find this project useful, please consider giving it a star!

Feedback and contributions are welcome - feel free to open an issue or pull request.
Last Updated: January 2026