Advanced malware detection system using Deep Learning, Machine Learning, and interpretability techniques
Features โข Installation โข Usage โข Results โข Application
- Overview
- Problem Statement
- Architectures
- Comparative Analysis
- Results
- Installation
- Dataset
- Training Models
- Application
- Project Structure
- Perspectives
- References
This project implements a state-of-the-art malware detection system for Windows PE (Portable Executable) files using multiple complementary approaches:
- Deep Learning: MalConv CNN operating on raw bytes
- Classical ML: N-Grams with SGD, Tabular features with Gradient Boosting
- Interpretability: Sparse-CAM for all three models
- Production-ready: Modern GUI application with real-time analysis
- ๐ง Multi-model ensemble (MalConv, N-Grams, Tabular)
- ๐ 93.5% accuracy on 201K samples (SOREL-20M dataset)
- ๐ Interpretability via Sparse-CAM for all models
- โก Fast inference (<200ms per file)
- ๐ป Modern GUI with model selection and visualization
- ๐ฆ Easy deployment with pre-trained models
Malware detection is a critical challenge in cybersecurity:
- 450,000+ new malware samples discovered daily according to the AV-TEST Institute
- Traditional signature-based antiviruses are ineffective against:
- Obfuscation techniques
- Polymorphic and metamorphic malware
- Zero-day threats
- Packing and encryption
Intelligent, automated detection using Deep Learning to:
- Detect complex patterns invisible to human analysts
- Robust against obfuscation and packing
- Scalable with increasing data volumes
- Interpretable results for security analysts
We developed three complementary approaches, each with unique strengths:
โโโโโโโโโโโ โโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโ โโโโโโโโโโโโโโโ
โ Input โโโโ>โ Embedding โโโโ>โ Gated Conv1D โโโโ>โ Global Max โโโโ>โ FC+Sigmoid โ
โ 2 MiB โ โ 8D โ โ 128 filters โ โ Pool โ โ Malware? โ
โโโโโโโโโโโ โโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโ โโโโโโโโโโโโโโโ
Why MalConv?
- โ End-to-end learning (no manual feature engineering)
- โ Robust to obfuscation (learns deep patterns)
- โ State-of-the-art performance (93.5% accuracy)
- โ Scales with data (+2.3% improvement on larger dataset)
Comparison with Practice:
- Used by Endgame (now Elastic Security) in production
- Similar to Cylance's approach (acquired by BlackBerry)
- Outperforms traditional signature-based methods by 8-10%
Training:
python -u scripts/train_malconv.py \
--train_csv data/splits/train.csv \
--val_csv data/splits/val.csv \
--out_dir outputs/malconv_2MiB \
--max_len 2097152 \
--epochs 10 \
--batch_size 8โโโโโโโโโโโโ โโโโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโ
โ File โโโโ>โ Extract โโโโ>โ Hashing โโโโ>โ SGD โโโโ>โ Malware/ โ
โ PE โ โ bi/trigramsโ โ 2ยฒโฐ โ โ Classifier โ โ Goodware โ
โโโโโโโโโโโโ โโโโโโโโโโโโโโ โโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโ
Why N-Grams?
- โ Fast (45.8 ms per file)
- โ Streaming learning (updates without full retraining)
- โ Compact model (small memory footprint)
- โ Good performance (90.4% accuracy)
Comparison with Practice:
- Inspired by Kaspersky's sequential feature approach
- Similar to Sophos's n-gram analysis
- 5-10x faster than deep learning models
- Ideal for real-time scanning scenarios
Training:
python -u scripts/train_ngrams.py \
--train_csv data/splits/train.csv \
--val_csv data/splits/val.csv \
--out_dir outputs/ngrams_2MiB \
--ngram_size 4 \
--max_len 262144โโโโโโโโโโโโโโโโโโโโ
โ 301 Features โ
โโโโโโโโโโโโโโโโโโโโค
โ โข 7 Stats โ โโโโโโโ
โ โข 260 Histogram โ โ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโ
โ โข 34 PE Headers โ โโโโโโโผโโโ>โ Hist. Gradient โโโโ>โ Malware/ โ
โ โข Entropie โ โ โ Boosting โ โ Goodware โ
โ โข Log transforms โ โโโโโโโ โโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโ
โโโโโโโโโโโโโโโโโโโโ
Why Tabular?
- โ Interpretable (feature importances)
- โ Expert knowledge (domain-specific features)
- โ Fast training (CPU-only)
- โ Solid baseline (84.5% accuracy)
Comparison with Practice:
- Based on EMBER dataset methodology (Endgame/Elastic)
- Similar to VirusTotal's static analysis features
- Used by many enterprise antiviruses as complementary layer
- Feature engineering allows domain expert input
Training:
python -u scripts/train_tabular.py \
--model hgb \
--out_dir outputs/tabular_hgb_full| Model | Accuracy | Inference Time | Memory | Best For |
|---|---|---|---|---|
| MalConv | 93.5% | ~180ms | High (GPU) | Maximum accuracy |
| N-Grams | 90.4% | 45ms | Low | Real-time scanning |
| Tabular | 84.5% | ~60ms | Low | Interpretability |
๐ด MalConv
- Deep analysis of unknown files
- Batch processing with GPU available
- Maximum detection rate required
- Research and forensics
๐ก N-Grams
- Real-time endpoint protection
- High-throughput scanning
- Resource-constrained environments
- Streaming/online learning needed
๐ข Tabular
- Explainable decisions for analysts
- Integration with existing SIEM systems
- When domain features are critical
- Compliance/audit requirements
โโโโโโโโโโโโโฆโโโโโโโโโโโฆโโโโโโโโโโโโฆโโโโโโโโโโฆโโโโโโโโโโโฆโโโโโโโโโโ
โ Model โ Accuracy โ Precision โ Recall โ F1-Score โ AUC โ
โ โโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโโฌโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโฃ
โ MalConv โ 93.5% โ 92.2% โ 91.3% โ 91.8% โ 98.0% โ
โ N-Grams โ 90.4% โ 86.4% โ 89.7% โ 88.0% โ 96.8% โ
โ Tabular โ 84.5% โ 89.2% โ 67.7% โ 77.0% โ 89.7% โ
โโโโโโโโโโโโโฉโโโโโโโโโโโฉโโโโโโโโโโโโฉโโโโโโโโโโฉโโโโโโโโโโโฉโโโโโโโโโโ
- ๐ MalConv achieves best overall performance (93.5%)
- โก N-Grams offers best speed-accuracy trade-off (90.4% at 45ms)
- ๐ All models benefit from larger datasets (average +4.2% improvement)
- ๐ฏ Ensemble voting could push accuracy to 95%+
MalConv:
Confusion Matrix: Predicted
Goodware Malware
Actual Goodware 8,211 433
Malware 487 5,122
ROC-AUC: 98.0% | Precision: 92.2% | Recall: 91.3%
N-Grams:
Confusion Matrix: Predicted
Goodware Malware
Actual Goodware 7,853 791
Malware 576 5,033
ROC-AUC: 96.8% | Precision: 86.4% | Recall: 89.7%
Tabular (HGB):
Confusion Matrix: Predicted
Goodware Malware
Actual Goodware 8,185 459
Malware 1,811 3,798
ROC-AUC: 89.7% | Precision: 89.2% | Recall: 67.7%
Analysis:
- MalConv achieves the best balance across all metrics
- N-Grams has high recall (89.7%) but more false positives
- Tabular has high precision (89.2%) but lower recall (67.7%) - conservative model
- Python 3.8+
- CUDA 11.0+ (optional, for GPU acceleration)
- 16GB+ RAM recommended
git clone https://github.com/YOUR_USERNAME/pe-rawbytes-malware.git
cd pe-rawbytes-malware# Using venv
python -m venv venv
source venv/bin/activate # Linux/Mac
# OR
venv\Scripts\activate # Windows
# Using conda (recommended)
conda create -n malware python=3.8
conda activate malware# Core dependencies
pip install -r requirements.txt
# For GPU support (optional)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
# For application GUI
pip install tkinter pillow matplotlibrequirements.txt:
torch>=2.0.0
numpy>=1.21.0
pandas>=1.3.0
scikit-learn>=1.0.0
joblib>=1.1.0
tqdm>=4.62.0
matplotlib>=3.4.0
pefile>=2021.9.3 # Optional, for PE header parsing
We use the SOREL-20M dataset (Sophos-ReversingLabs):
- Total samples: 201,549 PE files
- Malware: 114,737 (57%)
- Goodware: 86,812 (43%)
- Size: 43.8 GB compressed, 117 GB uncompressed
# Download from official source
wget https://sorel-20m.s3.amazonaws.com/09-DEC-2020/binaries/sorel-20m-binaries.tar.gz
# Extract
tar -xzf sorel-20m-binaries.tar.gz -C data/raw/
# Organize files
python scripts/prepare_dataset.py \
--input data/raw/ \
--output data/organized/ \
--subset 201549 # Use subset of 201KIf you have the pre-processed CSV splits:
data/
โโโ splits/
โ โโโ train.csv # 114,018 samples (80%)
โ โโโ val.csv # 14,252 samples (10%)
โ โโโ test.csv # 14,253 samples (10%)Each CSV contains:
path,label,sha256,size
data/organized/malware/file1.exe,1,abc123...,1234567
data/organized/goodware/file2.exe,0,def456...,2345678# Verify dataset
python scripts/dataset_stats.py --splits data/splits/
# Output:
# Train: 114,018 samples (57.5% malware)
# Validation: 14,252 samples (56.8% malware)
# Test: 14,253 samples (57.2% malware)
# Total: 201,549 samplesBasic Training:
python -u scripts/train_malconv.py \
--train_csv data/splits/train.csv \
--val_csv data/splits/val.csv \
--out_dir outputs/malconv_2MiB \
--max_len 2097152 \
--epochs 10 \
--batch_size 8 \
--lr 0.001 \
|& tee logs/train_malconv.logAdvanced Options:
python -u scripts/train_malconv.py \
--train_csv data/splits/train.csv \
--val_csv data/splits/val.csv \
--out_dir outputs/malconv_2MiB_advanced \
--max_len 2097152 \
--epochs 15 \
--batch_size 16 \
--lr 0.0005 \
--weight_decay 1e-5 \
--embedding_dim 8 \
--num_filters 128 \
--kernel_size 512 \
--early_stopping 3 \
--device cuda:0Training Time:
- GPU (Tesla K80): ~8 hours for 10 epochs
- GPU (RTX 3090): ~3 hours for 10 epochs
- CPU: Not recommended (days)
Basic Training:
python -u scripts/train_ngrams.py \
--train_csv data/splits/train.csv \
--val_csv data/splits/val.csv \
--out_dir outputs/ngrams_256KiB \
--ngram_size 4 \
--max_len 262144 \
|& tee logs/train_ngrams.logOptions:
--ngram_size: 2 (bigrams), 3 (trigrams), 4 (recommended)--max_len: Maximum bytes to read (256KB recommended)--hash_size: Feature space size (default: 2^20)
Training Time:
- CPU: ~30 minutes for 114K samples
- Memory: ~4GB RAM
Histogram Gradient Boosting (Recommended):
python -u scripts/train_tabular.py \
--model hgb \
--out_dir outputs/tabular_hgb_full \
|& tee logs/train_tabular_hgb.logOther Models:
# Random Forest
python scripts/train_tabular.py --model rf --out_dir outputs/tabular_rf
# LightGBM
python scripts/train_tabular.py --model lgbm --out_dir outputs/tabular_lgbm
# XGBoost
python scripts/train_tabular.py --model xgb --out_dir outputs/tabular_xgbTraining Time:
- CPU: ~15 minutes
- Memory: ~8GB RAM
MalConv:
python -u scripts/eval.py \
--split_csv data/splits/test.csv \
--scores_npy reports/malconv_scores_test_2MiB.npy \
--out_json reports/malconv_eval_2MiB.jsonN-Grams:
python -u scripts/eval.py \
--split_csv data/splits/test.csv \
--scores_npy reports/ngrams_scores_test_256KiB.npy \
--out_json reports/ngrams_eval_256KiB.json# MalConv
python scripts/predict_malconv.py \
--model_path outputs/malconv_2MiB/best.pt \
--test_csv data/splits/test.csv \
--out_npy reports/malconv_scores_test.npy
# N-Grams
python scripts/predict_ngrams.py \
--model_path outputs/ngrams_256KiB/best.joblib \
--test_csv data/splits/test.csv \
--out_npy reports/ngrams_scores_test.npyWe provide a modern GUI application with:
- โ Multi-model analysis (MalConv, N-Grams, Tabular)
- โ Model selection (analyze with specific model)
- โ Sparse-CAM visualization for all models
- โ Export results to JSON
- โ Threading for stability
- โ Professional interface
python malware_detector_app.py1. File Analysis
- Browse and select PE file
- Analyze with one or all models
- View prediction confidence
- See inference time
2. Interpretability (Sparse-CAM)
- MalConv: Chunk-level importance
- N-Grams: N-gram perturbation analysis
- Tabular: Feature importances (top 20)
3. Export Results
- JSON format with:
- Predictions (malware/goodware)
- Confidence scores
- Model details
- Timestamp
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ ๐ก๏ธ Dรฉtecteur de Malware - Deep Learning โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
โ โ
โ ๐ Fichier ร Analyser ๐ฏ Modรจle pour Analyse โ
โ โโโโโโโโโโโโโโโโโโโโโโโ โช Tous les modรจles โ
โ โ Aucun fichier โ โช MalConv โ
โ โโโโโโโโโโโโโโโโโโโโโโโ โช N-Grams โ
โ [๐ Parcourir] โช Tabular โ
โ [๐ Analyser] โ
โ โ
โ ๐ง Modรจles Chargรฉs ๐ Rรฉsultats | ๐ Sparse-CAM โ
โ โ
MalConv (CNN) โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โ โ
N-Grams โ โ โ
โ โ
Tabular (HGB) โ Verdict: ๐ฆ MALWARE โ โ
โ โ Confiance: 94.2% โ โ
โ โ๏ธ Actions โ โ โ
โ [๐พ Exporter JSON] โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
pe-rawbytes-malware/
โโโ data/
โ โโโ raw/ # Raw SOREL-20M files
โ โโโ organized/ # Organized by label
โ โ โโโ malware/
โ โ โโโ goodware/
โ โโโ splits/ # Train/val/test splits
โ โโโ train.csv
โ โโโ val.csv
โ โโโ test.csv
โ
โโโ scripts/
โ โโโ train_malconv.py # Train MalConv
โ โโโ train_ngrams.py # Train N-Grams
โ โโโ train_tabular.py # Train Tabular
โ โโโ eval.py # Evaluation script
โ โโโ predict_malconv.py # MalConv predictions
โ โโโ predict_ngrams.py # N-Grams predictions
โ โโโ dataset_stats.py # Dataset statistics
โ
โโโ models/
โ โโโ malconv.py # MalConv architecture
โ โโโ ngrams.py # N-Grams model
โ โโโ tabular.py # Tabular features
โ
โโโ outputs/
โ โโโ malconv_2MiB/
โ โ โโโ best.pt # Trained MalConv model
โ โโโ ngrams_256KiB/
โ โ โโโ best.joblib # Trained N-Grams model
โ โโโ tabular_hgb_full/
โ โโโ tabular_model.joblib # Trained Tabular model
โ
โโโ reports/
โ โโโ malconv_eval_2MiB.json # Evaluation results
โ โโโ ngrams_eval_256KiB.json
โ โโโ figures/ # Generated plots
โ
โโโ logs/
โ โโโ train_malconv.log # Training logs
โ โโโ train_ngrams.log
โ โโโ train_tabular_hgb.log
โ
โโโ malware_detector_app.py # GUI Application
โโโ requirements.txt # Python dependencies
โโโ README.md # This file
โโโ LICENSE # MIT License
-
๐ฏ Ensemble Model
- Combine predictions from all three models
- Weighted voting based on confidence
- Expected improvement: +1-2% accuracy
-
๐ก๏ธ Adversarial Robustness
- Train with adversarial examples
- Gradient masking techniques
- Robust loss functions
-
โก Model Distillation
- Compress MalConv for faster inference
- Knowledge distillation to smaller CNN
- Target: <50ms inference time
-
๐ Incremental Learning
- Online learning for new malware families
- Avoid full retraining
- Continuous model updates
-
๐ Graph Neural Networks
- Model PE imports/exports as graph
- Function call graphs
- Control flow graph analysis
-
๐ค Transformer Architectures
- Self-attention on byte sequences
- BERT-style pre-training
- Transfer learning from large corpora
-
๐ Production Deployment
- REST API for model serving
- Docker containerization
- Kubernetes orchestration
- Real-time monitoring dashboard
-
๐ Explainable AI
- SHAP values for feature attribution
- LIME for local explanations
- Counterfactual explanations
-
๐ Privacy-Preserving ML
- Federated learning across organizations
- Differential privacy guarantees
- Secure multi-party computation
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
-
MalConv: Raff et al. (2018) - "Malware Detection by Eating a Whole EXE"
-
EMBER: Anderson & Roth (2018) - "EMBER: An Open Dataset for Training Static PE Malware ML Models"
-
Sparse-CAM: Anderson & Raff (2019) - "Explaining Malware Detection with Gradient-based Visualizations"
-
SOREL-20M: Harang & Rudd (2020) - "SOREL-20M: A Large Scale Benchmark Dataset for Malicious PE Detection"
- SOREL-20M: https://github.com/sophos-ai/SOREL-20M
- EMBER: https://github.com/endgameinc/ember
- PyTorch: https://pytorch.org/
- scikit-learn: https://scikit-learn.org/
- pefile: https://github.com/erocarrera/pefile
This project is licensed under the MIT License - see the LICENSE file for details.
- SOREL-20M dataset by Sophos AI and ReversingLabs
- EMBER project by Endgame (now Elastic Security)
- MalConv architecture by Edward Raff et al.
- Open-source community for amazing tools and libraries
For questions, issues, or collaboration:
- ๐ง Email: [tchomokombou@telecom-paris.fr]
โญ If you find this project useful, please consider giving it a star! โญ
Made with โค๏ธ for the cybersecurity community