The Forensic Media Detection (FMD) Tool is a comprehensive AI & ML-driven system designed to detect deepfake artifacts across multiple media modalities. The tool employs state-of-the-art deep learning techniques to analyze images, videos, and audio files for signs of manipulation or synthetic generation.
The FMD tool follows a modular architecture with the following key components:
```
fmd_tool/
├── image_forensics/
│   └── image_detector.py
├── video_forensics/
│   └── video_detector.py
├── audio_forensics/
│   └── audio_detector.py
├── multimodal_analysis/
│   └── multimodal_detector.py
├── cli/
│   └── fmd_cli.py
├── tests/
│   ├── test_image_detector.py
│   ├── test_video_detector.py
│   ├── test_audio_detector.py
│   └── test_multimodal_detector.py
├── data/
│   └── (test files and datasets)
└── documentation/
    ├── user_manual.md
    └── technical_documentation.md
```
The Image Forensics module implements three primary detection approaches, supported by additional analysis techniques:
- Vision Transformer (ViT) Classifier: A PyTorch-based ViT model for state-of-the-art generalization.
- XceptionNet-based Classifier: A TensorFlow-based Xception model fine-tuned for deepfake detection.
- Autoencoder-based Anomaly Detection: An autoencoder that learns to reconstruct authentic images and flags images it reconstructs poorly as anomalies.
- Pixel-level Inconsistency Detection: Uses gradient analysis to identify anomalous pixel patterns.
- Lighting Consistency Analysis: Analyzes the lighting distribution across image regions.
- Preprocessing Pipeline: Standardized image preprocessing for consistent model input.
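The preprocessing step can be sketched concretely. The snippet below is a minimal illustration rather than the tool's actual pipeline: it assumes the 256×256 RGB input expected by the XceptionNet configuration and scales pixels to [0, 1]; the `preprocess_image` helper name is hypothetical.

```python
import numpy as np
from PIL import Image

def preprocess_image(path: str, size: tuple = (256, 256)) -> np.ndarray:
    """Load an image, resize it to the model input size, and scale to [0, 1]."""
    img = Image.open(path).convert("RGB")       # force 3 channels
    img = img.resize(size)                      # match the (256, 256, 3) input shape
    arr = np.asarray(img, dtype=np.float32) / 255.0
    return arr[np.newaxis, ...]                 # add batch dimension -> (1, 256, 256, 3)
```

The batch dimension is added so a single image can be fed directly to a Keras-style `model.predict` call.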
XceptionNet Configuration:
- Input Shape: (256, 256, 3)
- Base Model: Xception (pre-trained on ImageNet)
- Fine-tuning: Last 20 layers unfrozen
- Output: Single sigmoid neuron for binary classification
Autoencoder Configuration:
- Encoder: 4 Conv2D layers with max pooling
- Decoder: 4 Conv2DTranspose layers with upsampling
- Loss Function: Mean Squared Error (MSE)
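The anomaly-detection logic can be illustrated independently of the network itself. The sketch below assumes a reconstruction already produced by a trained autoencoder; the function names and the threshold convention (e.g., a high percentile of errors on authentic training images) are illustrative assumptions, not the tool's actual API.

```python
import numpy as np

def anomaly_score(image: np.ndarray, reconstruction: np.ndarray) -> float:
    """Mean squared error between an image and its autoencoder reconstruction."""
    return float(np.mean((image - reconstruction) ** 2))

def is_anomalous(image: np.ndarray, reconstruction: np.ndarray,
                 threshold: float) -> bool:
    """Flag the image when its reconstruction error exceeds a threshold
    calibrated on authentic images (e.g., the 95th percentile of their errors)."""
    return anomaly_score(image, reconstruction) > threshold
```

Because the autoencoder is trained only on authentic images, manipulated or synthetic inputs tend to reconstruct poorly and therefore score above the calibrated threshold.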
The Video Forensics module employs a CNN-LSTM hybrid architecture for temporal anomaly detection:
- Frame Extraction: Systematic sampling of video frames
- CNN Feature Extraction: Spatial feature extraction from individual frames
- LSTM Temporal Modeling: Analysis of temporal patterns across frames
- Temporal Anomaly Detection: Identifies inconsistencies in frame sequences
- Lip-Sync Analysis: Detects audio-visual synchronization mismatches
- Frame-by-Frame Analysis: Detailed examination of individual frames
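The systematic frame sampling described above might, for example, select evenly spaced frame indices across the clip so the LSTM sees the whole video rather than just its start. A sketch (the helper name is hypothetical):

```python
import numpy as np

def sample_frame_indices(total_frames: int, sequence_length: int) -> list:
    """Pick `sequence_length` evenly spaced frame indices from a video."""
    if total_frames <= sequence_length:
        return list(range(total_frames))        # short clip: keep every frame
    # linspace gives systematic coverage from the first to the last frame
    return np.linspace(0, total_frames - 1, sequence_length, dtype=int).tolist()
```

The returned indices would then be used to seek and decode only the selected frames (e.g., via OpenCV) before CNN feature extraction.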
CNN-LSTM Configuration:
- Input Shape: (sequence_length, 256, 256, 3)
- CNN Layers: 3 Conv2D layers (64, 128, 256 filters)
- LSTM Layers: 2 LSTM layers (128, 64 units)
- Output: Single sigmoid neuron for binary classification
The Audio Forensics module implements an x-vector-inspired architecture for speaker verification and synthetic speech detection:
- Feature Extraction: MFCC (Mel-Frequency Cepstral Coefficients) extraction
- Frame-level Processing: Conv1D layers for local feature extraction
- Statistics Pooling: Mean and standard deviation pooling across time
- Segment-level Processing: Dense layers for final classification
- Spectral Anomaly Detection: Analysis of frequency domain characteristics
- Voice Synthesis Detection: Identification of synthetic speech artifacts
- Speaker Verification: Comparison against known speaker profiles
X-Vector Configuration:
- Input Shape: (None, 20) - Variable length MFCC sequences
- Frame-level: 3 Conv1D layers (512 filters each)
- Statistics Pooling: Mean and standard deviation concatenation
- Segment-level: 2 Dense layers (512 units each)
- Output: Single sigmoid neuron for binary classification
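The statistics-pooling step is what lets the network accept variable-length MFCC sequences: it collapses a (time, features) matrix into one fixed-size segment vector by concatenating the per-dimension mean and standard deviation over time. A NumPy sketch of that operation:

```python
import numpy as np

def statistics_pooling(frame_features: np.ndarray) -> np.ndarray:
    """Collapse a (time, features) matrix into a fixed (2 * features,) vector
    by concatenating the per-dimension mean and standard deviation over time."""
    mean = frame_features.mean(axis=0)
    std = frame_features.std(axis=0)
    return np.concatenate([mean, std])
```

Regardless of how many frames the utterance contains, the pooled vector always has the same length, so the segment-level dense layers can operate on it directly.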
The Multimodal Analysis module implements late fusion for combining predictions from individual modality detectors:
- Individual Modality Analysis: Separate analysis using specialized detectors
- Cross-Modal Consistency Check: Analysis of prediction consistency across modalities
- Ensemble Prediction: Weighted combination of individual predictions
- Late Fusion: Combines high-level predictions from individual modalities
- Consistency Analysis: Identifies outlier predictions that may indicate manipulation
- Confidence Scoring: Provides confidence estimates based on cross-modal agreement
Late Fusion Configuration:
- Inputs: 3 prediction scores (image, video, audio)
- Hidden Layers: 2 Dense layers (64, 32 units) with dropout
- Output: Single sigmoid neuron for final prediction
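Before the learned fusion network is trained, the late-fusion and consistency-check ideas can be sketched as a weighted average plus an outlier test over the per-modality scores. The equal default weights and the 0.3 tolerance below are illustrative assumptions, not the tool's calibrated values:

```python
import numpy as np

def fuse_predictions(scores: dict, weights: dict = None) -> float:
    """Weighted late fusion of per-modality deepfake scores in [0, 1]."""
    weights = weights or {m: 1.0 for m in scores}   # default: equal weights
    total = sum(weights[m] for m in scores)
    return sum(scores[m] * weights[m] for m in scores) / total

def consistency_flags(scores: dict, tolerance: float = 0.3) -> list:
    """Flag modalities whose score deviates strongly from the mean --
    a possible sign that only one modality was manipulated."""
    mean = float(np.mean(list(scores.values())))
    return [m for m, s in scores.items() if abs(s - mean) > tolerance]
```

For example, if image and video both score high but audio scores low, the audio modality is flagged as an outlier, which feeds into the cross-modal confidence estimate.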
The CLI is built using the Click library and provides a user-friendly interface for accessing all FMD functionalities:
- Modular Commands: Separate commands for each analysis type
- Flexible Options: Configurable models and output formats
- Error Handling: Comprehensive error messages and validation
- Click Framework: Provides command parsing and help generation
- JSON Output: Structured output format for programmatic access
- Progress Indicators: User feedback during analysis
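The shape of such a Click-based CLI can be sketched as follows. This is not the actual `fmd_cli.py`: the version string is a placeholder and the result dictionary is a stand-in for the real analysis call.

```python
import json
import click

@click.group()
@click.version_option("0.0.0")          # placeholder version, not the tool's
def fmd():
    """Forensic Media Detection tool."""

@fmd.command()
@click.option("--check", required=True, type=click.Path(exists=True),
              help="Path to the image file to analyze.")
@click.option("--model", default="xception",
              type=click.Choice(["xception", "autoencoder"]))
@click.option("--output", default=None, help="Write JSON results to this file.")
def image(check, model, output):
    """Analyze an image file for deepfake artifacts."""
    # Placeholder result; the real command would invoke the image detector here.
    result = {"file": check, "model": model, "deepfake_probability": None}
    if output:
        with open(output, "w") as f:
            json.dump(result, f, indent=2)
    click.echo(json.dumps(result))
```

In the packaged tool, the `fmd` group would be exposed as a console-script entry point; Click then provides `--help`, argument validation, and error messages for free.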
Each module includes comprehensive unit tests covering:
- Model initialization and building
- Feature extraction and preprocessing
- Prediction functionality
- Error handling
End-to-end testing of the complete pipeline:
- CLI command execution
- File I/O operations
- Cross-module interactions
Performance optimizations:
- Model Quantization: Reduced precision for faster inference
- Batch Processing: Efficient handling of multiple files
- Memory Management: Careful handling of large media files

Extensibility features:
- Modular Design: Easy addition of new detection methods
- Configurable Models: Support for different model architectures
- Extensible Framework: Plugin architecture for custom detectors
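One common way to realize such a plugin architecture is a registering base class: every detector subclass announces itself to a shared registry, so new detection methods can be added without touching the dispatch code. The class and method names below are hypothetical, not the tool's actual interface.

```python
from abc import ABC, abstractmethod

class BaseDetector(ABC):
    """Hypothetical plugin interface: each detection method implements this."""

    registry = {}  # maps detector name -> detector class

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        BaseDetector.registry[cls.name] = cls   # auto-register on definition

    name: str = "base"

    @abstractmethod
    def analyze(self, path: str) -> dict:
        """Return a result dict with at least a 'deepfake_probability' key."""

class DummyDetector(BaseDetector):
    """Trivial example plugin used only to illustrate registration."""
    name = "dummy"

    def analyze(self, path: str) -> dict:
        return {"file": path, "deepfake_probability": 0.5}
```

A CLI or fusion layer can then look up detectors by name (`BaseDetector.registry["dummy"]`) without importing each implementation explicitly.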
You can use the following pretrained models with the FMD tool for each modality:
- Vision Transformer (ViT) (recommended for generalization):
  - Recent research shows ViT models offer superior cross-dataset generalization.
  - Uses a pre-trained ViT-B/16 backbone (PyTorch).
- XceptionNet:
  - Pretrained on ImageNet; widely used for deepfake detection (e.g., FaceForensics++, DeepFakeDetection).
  - Download from: Keras Applications or the DeepFake Detection Challenge.
- Autoencoder:
  - Any convolutional autoencoder trained on authentic images from your domain.
- CNN-LSTM (video):
  - Use a CNN (e.g., Xception, ResNet50) pretrained on ImageNet for frame feature extraction.
  - LSTM weights can be trained on video deepfake datasets (e.g., FaceForensics++, DFDC).
  - Example sources: DeepFakeDetection or the Kaggle DFDC challenge.
- X-Vector:
  - Pretrained x-vector models for speaker verification (e.g., Kaldi x-vector, SpeechBrain).
  - Download from: SpeechBrain Pretrained Models or Kaldi VoxCeleb recipes.
- CNN-LSTM (audio):
  - A CNN-LSTM trained for audio deepfake detection (e.g., ASVspoof challenge models).
- Late Fusion:
  - The fusion model can be trained using outputs from the above pretrained models on a multimodal dataset.
The tool relies on the following core libraries:
- TensorFlow: Deep learning framework for model implementation (used for Xception and the autoencoder)
- PyTorch: Deep learning framework for model implementation (used for Vision Transformer)
- OpenCV: Computer vision operations for image and video processing
- Librosa: Audio processing and feature extraction
- Scikit-learn: Machine learning utilities and preprocessing
- NumPy/Pandas: Numerical computing and data manipulation
- Click: Command-line interface creation
- Pillow: Image processing utilities
Planned future enhancements:
- Early Fusion: Feature-level fusion for improved accuracy
- Real-time Processing: Streaming analysis capabilities
- Model Training Pipeline: Automated training on new datasets
- Web Interface: Browser-based user interface
- Adversarial Robustness: Defense against adversarial attacks
- Explainable AI: Interpretable detection results
- Cross-dataset Generalization: Improved performance across different datasets (partially addressed by ViT model)
The Forensic Media Detection (FMD) Tool is an AI & ML-driven command-line interface (CLI) application designed to detect deepfake artifacts in various media types, including images, videos, and audio files. This tool leverages advanced machine learning models to analyze media for inconsistencies and anomalies that may indicate manipulation.
This repository has been updated to focus on the lightweight Image Forensics (XceptionNet) model. The full set of dependencies has been simplified to ensure a cleaner, more focused development environment.
To use the Image Forensics module, you need to have Python 3.8+ installed on your system. It is recommended to use a virtual environment to manage dependencies.
```bash
# Clone the repository
git clone https://github.com/RajaMuhammadAwais/Forensic-Media-Detection
cd Forensic-Media-Detection

# Install core dependencies for XceptionNet training
pip install -r requirements_light.txt
```

The `train_xception.py` script is configured to use a dataset with a specific folder structure. We recommend using the Deepfake and Real Images dataset from Kaggle.
- Install the Kaggle API:

```bash
pip install kaggle
```

- Set up Kaggle credentials:
  - Go to your Kaggle account settings and create a new API token.
  - Download the `kaggle.json` file and place it in the `~/.kaggle/` directory.
- Download the dataset:

```bash
kaggle datasets download -d manjilkarki/deepfake-and-real-images
```

- Unzip and structure the dataset:
  - Unzip the downloaded file.
  - The `train_xception.py` script expects the data to be structured as follows inside the `data/Dataset/` directory:

```
data/Dataset/
├── Train/
│   ├── Real/
│   └── Fake/
└── Validation/
    ├── Real/
    └── Fake/
```

  - You will need to manually create the `data/Dataset` directory and move the unzipped image folders into the correct `Train` and `Validation` subdirectories.
Once the dataset is prepared, you can run the training script. The script is configured for a small, proof-of-concept run.
```bash
python train_xception.py
```

Note: The trained model weights (`xception_deepfake_weights.h5`) are excluded from the repository via `.gitignore` due to file size limits. You will need to train the model locally to use the detection features.
To use the FMD Tool, you need to have Python 3.8+ installed on your system. It is recommended to use a virtual environment to manage dependencies.
```bash
# Clone the repository (if applicable)
# git clone <repository_url>
# cd fmd_tool

# Create a virtual environment
python3.11 -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`

# Install dependencies
pip install -r requirements_light.txt
```

The FMD Tool is accessed via the `fmd` command, which provides subcommands for analyzing different media types. Below are the available commands and their options.
- `--version`: Show the version and exit.
This command analyzes image files for deepfake artifacts and manipulation.
Usage:

```bash
fmd image --check <path_to_image_file> [OPTIONS]
```

Options:

- `--check <path_to_image_file>`: (Required) Path to the image file to analyze.
- `--model <model_name>`: Model to use for analysis. Supported models: `xception` (default), `autoencoder`.
- `--detect <detection_type>`: Type of detection to perform. Default: `deepfake`.
- `--output <output_file>`: Path to save the analysis results in JSON format.
Example:

```bash
fmd image --check my_image.jpg --model xception --output results.json
```

This command analyzes video files for deepfake artifacts and manipulation.
Usage:

```bash
fmd video --check <path_to_video_file> [OPTIONS]
```

Options:

- `--check <path_to_video_file>`: (Required) Path to the video file to analyze.
- `--model <model_name>`: Model to use for analysis. Supported models: `cnn_lstm` (default).
- `--detect <detection_type>`: Type of detection to perform. Default: `deepfake`.
- `--output <output_file>`: Path to save the analysis results in JSON format.
Example:

```bash
fmd video --check my_video.mp4 --model cnn_lstm
```

This command analyzes audio files for synthetic voice and manipulation.
Usage:

```bash
fmd audio --check <path_to_audio_file> [OPTIONS]
```

Options:

- `--check <path_to_audio_file>`: (Required) Path to the audio file to analyze.
- `--model <model_name>`: Model to use for analysis. Supported models: `xvector` (default), `cnn_lstm`.
- `--detect <detection_type>`: Type of detection to perform. Default: `deepfake`.
- `--output <output_file>`: Path to save the analysis results in JSON format.
Example:

```bash
fmd audio --check my_audio.wav --model xvector
```

This command performs a comprehensive analysis across multiple modalities (image, video, audio) if available in the media file.
Usage:

```bash
fmd multimodal --check <path_to_media_file> [OPTIONS]
```

Options:

- `--check <path_to_media_file>`: (Required) Path to the media file to analyze.
- `--model <model_name>`: Model to use for fusion. Supported models: `fusion_model` (default).
- `--output <output_file>`: Path to save the analysis results in JSON format.
Example:

```bash
fmd multimodal --check my_media.mp4
```

This command displays information about the FMD tool, including supported analysis types and usage examples.
Usage:

```bash
fmd info
```

Example:

```bash
fmd info
```

The FMD tool provides detailed analysis results, typically in JSON format when the `--output` option is used. The output includes:
- Deepfake Probability: A score indicating the likelihood of the media being a deepfake.
- Specific Anomalies: Details about detected inconsistencies in pixels, lighting, frame-by-frame analysis, lip-sync, voice mismatch, and synthetic speech artifacts.
- Overall Assessment: A summary recommendation (e.g., "DEEPFAKE DETECTED", "SUSPICIOUS", "LIKELY AUTHENTIC").
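A saved results file can be consumed programmatically with the standard library. The exact JSON schema is not documented here, so the field names below (`deepfake_probability`, `overall_assessment`) are illustrative assumptions matching the items above; check your tool version's actual output.

```python
import json

def summarize(results_path: str) -> str:
    """Read a saved FMD results file and return a one-line summary.
    Field names are illustrative -- verify them against your tool's output."""
    with open(results_path) as f:
        results = json.load(f)
    prob = results.get("deepfake_probability", 0.0)
    verdict = results.get("overall_assessment", "UNKNOWN")
    return f"{verdict} (deepfake probability: {prob:.2f})"
```

This pattern is useful for batch pipelines that run `fmd ... --output results.json` and then aggregate verdicts across many files.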
- "File not found" error: Ensure the path to your media file is correct and the file exists.
- Dependency issues: If you encounter errors related to missing libraries, try reinstalling them using `pip install -r requirements.txt` (if a `requirements.txt` is provided) or the individual `pip install` commands listed in the Installation section.
- Model loading errors: Ensure you have sufficient memory and the correct TensorFlow/PyTorch versions installed. Some models might require pre-trained weights, which may need to be downloaded separately.
For further assistance, please refer to the project's documentation or contact support.


