"In a digital world where a single click can compromise an entire network, PhishNet stands as the first line of defense."
- 📍 About The Project
- 🔍 How It Works
- 🚀 Tech Stack
- 🏗️ Architecture
- 💻 Getting Started
- 🎓 MIT Emerging Talents
PhishNet is a machine learning solution designed to detect malicious phishing websites in real time. Developed as part of the MIT Emerging Talents Experiential Learning Opportunity (ELO), this project bridges the gap between cybersecurity and artificial intelligence.
Phishing attacks are becoming increasingly subtle and often bypass traditional filters. PhishNet analyzes 30 distinct features of a URL, ranging from domain characteristics to web traffic patterns, to classify it as legitimate or malicious with high precision.
Cybersecurity threats are evolving. By leveraging historical data and behavioral patterns, PhishNet provides a proactive approach to identifying threats before they cause harm.
The core of PhishNet is a model trained on a dataset of 11,055 records, each representing a website described by 30 distinct features. These features are extracted to capture the behavioral and structural fingerprints of phishing attempts.
The model looks at three main categories of features (see the full list in `data_schema/schema.yaml`):
- Address Bar Based Features:
  - `having_IP_Address`: Is the domain an IP address (e.g., `123.45.67.89`)?
  - `URL_Length`: Is the URL suspiciously long?
  - `Shortining_Service`: Does it use bit.ly, goo.gl, etc.?
  - `having_At_Symbol`: Does it contain `@` to confuse browsers?
- Abnormal Based Features:
  - `Request_URL`: What % of external objects (images, videos) are loaded from other domains?
  - `URL_of_Anchor`: Do `<a>` tags point to different domains?
  - `Links_in_tags`: Do `<meta>`, `<script>`, `<link>` tags point to the same domain?
- Domain Based Features:
  - `SSLfinal_State`: Does it have a valid HTTPS certificate?
  - `Domain_registeration_length`: Is the domain registered for only a short period? (Phishing sites are often short-lived.)
  - `age_of_domain`: How long has the domain existed?
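
As a rough illustration of the address bar category, the sketch below derives raw signals for four of the features from a URL string. It is not the project's extraction code: the function name `address_bar_features`, the `SHORTENER_HOSTS` set, and the choice to return raw booleans and a length (rather than the encoded values stored in the dataset) are all assumptions made for this example.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical list of shortener hosts; the project's real list is not shown here.
SHORTENER_HOSTS = {"bit.ly", "goo.gl", "tinyurl.com", "t.co"}


def address_bar_features(url: str) -> dict:
    """Return raw (unencoded) address-bar signals for a URL. Illustrative only."""
    host = urlparse(url).hostname or ""

    # having_IP_Address: the host parses as an IPv4/IPv6 literal instead of a name.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "having_IP_Address": host_is_ip,
        "URL_Length": len(url),  # raw length; any threshold is left to the caller
        "Shortining_Service": host in SHORTENER_HOSTS,
        "having_At_Symbol": "@" in url,
    }


if __name__ == "__main__":
    print(address_bar_features("http://123.45.67.89/login"))
    print(address_bar_features("https://bit.ly/abc"))
```

The remaining categories need page content or registrar data (HTML tags, certificates, domain age), so they cannot be computed from the URL string alone.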
Explore the detailed documentation for each module:
| Module | Description |
|---|---|
| 📦 networksecurity/ | The core package containing all source code. |
| 🧩 components/ | Detailed breakdown of Ingestion, Validation, Transformation, and Training. |
| 🚀 pipeline/ | How the training pipeline is orchestrated. |
| 📜 data_schema/ | The data contract and schema definitions. |
PhishNet is built on a Recurrent Pipeline Architecture. This ensures that every stage of the machine learning lifecycle is modular, reproducible, and traceable.
The system uses a strict Config -> Component -> Artifact pattern:
- Configuration (`entity/config_entity.py`): Each component (e.g., Data Ingestion) has a specific configuration defining inputs, outputs, and parameters.
- Component Execution: The component reads its config, performs its task (e.g., splitting data), and produces an Artifact.
- Artifact Generation (`entity/artifact_entity.py`): The output (e.g., `train.csv`, `test.csv`) is stored as an object.
- Recurrence: The Artifact of one component becomes the Input for the next.
  - Example: `DataIngestionArtifact` -> feeds into -> `DataValidationComponent`.
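
The Config -> Component -> Artifact pattern can be sketched with plain dataclasses. This is a minimal illustration under stated assumptions, not the code in `entity/`: the field names (`test_ratio`, `train_rows`, `test_rows`, `is_valid`), the class names `DataIngestion` and `DataValidation`, and the choice to hold rows in memory instead of file paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataIngestionConfig:
    test_ratio: float  # fraction of records held out for testing (assumed field)


@dataclass(frozen=True)
class DataIngestionArtifact:
    train_rows: list  # the real artifact likely stores file paths instead
    test_rows: list


@dataclass(frozen=True)
class DataValidationArtifact:
    is_valid: bool


class DataIngestion:
    """Component: reads its config, does its task, returns an artifact."""

    def __init__(self, config: DataIngestionConfig):
        self.config = config

    def run(self, rows: list) -> DataIngestionArtifact:
        n_test = int(len(rows) * self.config.test_ratio)
        cut = len(rows) - n_test
        return DataIngestionArtifact(train_rows=rows[:cut], test_rows=rows[cut:])


class DataValidation:
    """Component: consumes the previous component's artifact as its input."""

    def __init__(self, ingestion_artifact: DataIngestionArtifact, expected_columns: set):
        self.artifact = ingestion_artifact
        self.expected_columns = expected_columns

    def run(self) -> DataValidationArtifact:
        rows = self.artifact.train_rows + self.artifact.test_rows
        ok = all(set(row) == self.expected_columns for row in rows)
        return DataValidationArtifact(is_valid=ok)


if __name__ == "__main__":
    rows = [{"URL_Length": 1, "Result": 1} for _ in range(10)]
    ingested = DataIngestion(DataIngestionConfig(test_ratio=0.2)).run(rows)
    validated = DataValidation(ingested, {"URL_Length", "Result"}).run()
    print(len(ingested.train_rows), len(ingested.test_rows), validated.is_valid)
```

Keeping artifacts immutable (`frozen=True`) is a design choice of this sketch; it makes each stage's output a fixed record that later stages can read but not alter.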
- Data Ingestion: Connects to MongoDB, fetches the 11k+ records, and splits them into Train/Test sets.
- Data Validation: Validates the schema against `schema.yaml` to ensure no data drift.
- Data Transformation: Handles missing values, scales numerical features, and prepares the data for the model.
- Model Training: Trains multiple models (Random Forest, Decision Tree, Gradient Boosting, AdaBoost) and selects the best one based on accuracy.
- Model Evaluation: Logs performance metrics to MLflow and DagsHub.
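
The "select the best one based on accuracy" step in Model Training can be pictured roughly as follows. This sketch uses stand-in predictor functions rather than the real Random Forest, Decision Tree, Gradient Boosting, and AdaBoost models, and the names `accuracy` and `select_best_model` are hypothetical; it only shows the selection logic.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def select_best_model(candidates, X_val, y_val):
    """Score each candidate on held-out data and return (best_name, scores).

    `candidates` maps a model name to a predict function taking one row.
    """
    scores = {
        name: accuracy(y_val, [predict(row) for row in X_val])
        for name, predict in candidates.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores


if __name__ == "__main__":
    # Toy validation set: one feature per row, label 1 = safe, 0 = phishing.
    X_val = [[1], [1], [-1], [-1]]
    y_val = [1, 1, 0, 0]

    # Stand-in predictors; a real run would wrap trained models here.
    candidates = {
        "always_safe": lambda row: 1,
        "sign_rule": lambda row: 1 if row[0] > 0 else 0,
    }
    print(select_best_model(candidates, X_val, y_val))
```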
The project exposes a high-performance REST API built with FastAPI.
The /predict endpoint allows users to upload a CSV file containing URL features.
- Input: A CSV file (use `valid_data/test.csv` for testing).
- Process: The system loads the saved model and preprocessor artifacts.
- Output: An HTML table displaying the original data alongside a new `predicted_column` (0 for Phishing, 1 for Safe).
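
The input/process/output flow of `/predict` can be approximated without the web layer. The sketch below is an assumption-laden stand-in, not the FastAPI handler: `predict_to_html` and the `predict_row` callback are hypothetical names, and the callback takes the place of the saved model and preprocessor.

```python
import csv
import html
import io


def predict_to_html(csv_text: str, predict_row) -> str:
    """Append a predicted_column to each CSV row and render an HTML table.

    `predict_row` is a stand-in for the loaded model: it receives one row
    (a dict of column name -> string value) and returns 0 or 1.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = list(reader.fieldnames or []) + ["predicted_column"]

    body_rows = []
    for row in reader:
        row["predicted_column"] = predict_row(row)
        cells = "".join(f"<td>{html.escape(str(row[c]))}</td>" for c in columns)
        body_rows.append(f"<tr>{cells}</tr>")

    header = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    rows_html = "".join(body_rows)
    return f"<table><tr>{header}</tr>{rows_html}</table>"


if __name__ == "__main__":
    sample = "having_IP_Address,URL_Length\n1,-1\n"
    # Dummy rule for demonstration only: flag every row as phishing (0).
    print(predict_to_html(sample, lambda row: 0))
```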
The /train endpoint allows administrators to trigger the entire training pipeline with a single click. This is crucial for updating the model as new phishing patterns emerge.
The project utilizes a robust Continuous Integration and Continuous Deployment (CI/CD) pipeline powered by GitHub Actions and AWS.
- Push to GitHub: A developer pushes code changes to the `main` branch.
- Continuous Integration (CI):
  - GitHub Actions triggers the workflow.
  - Linting: Checks code quality.
  - Unit Tests: Runs tests to ensure functionality.
- Continuous Delivery (CD):
  - Build: A new Docker image is built from the `Dockerfile`.
  - Push to Registry: The image is tagged and pushed to AWS Elastic Container Registry (ECR).
- Continuous Deployment:
  - Self-Hosted Runner: An AWS EC2 instance (acting as a self-hosted runner) pulls the latest image from ECR.
  - Deployment: The old container is removed, and the new version is deployed with `restart: always` policies.
  - Live URL: The application is immediately available at http://18.202.196.247/docs.
- Docker
- Python 3.10+
- MongoDB Connection String
- Start the Server: Follow the installation steps below to get the Docker container running.
- Open the UI: Navigate to `http://localhost:8080/docs` or the deployed URL.
- Locate `/predict`: Click on the `POST /predict` endpoint.
- Upload Data:
  - Click "Try it out".
  - Upload the file located at `valid_data/test.csv` in this repository.
- Execute: Click "Execute".
- View Results: The API will return a rendered HTML table showing which URLs were detected as malicious.
- Clone the repository: Run `git clone https://github.com/Bikaze/elo-project.git`, then `cd elo-project`.
- Set up Environment Variables: Create a `.env` file with one entry per line:
  - `MONGO_DB_URL="your_mongodb_connection_string"`
  - `AWS_ACCESS_KEY_ID="your_aws_key"`
  - `AWS_SECRET_ACCESS_KEY="your_aws_secret"`
  - `AWS_REGION="us-east-1"`
- Run with Docker: Run `docker build -t phishnet .`, then `docker run -p 8080:8080 --env-file .env phishnet`.
- Access the API: Open your browser to `http://localhost:8080/docs` to test the prediction endpoint.
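
Before building the image, it can help to confirm that the `.env` file defines every variable listed in the setup step. The helper below is an optional, hypothetical pre-flight check and is not part of the repository; `missing_env_keys` and `REQUIRED_KEYS` are names invented for this sketch, and the parsing handles only simple `KEY=value` lines.

```python
# Keys taken from the .env example in the setup steps.
REQUIRED_KEYS = (
    "MONGO_DB_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
)


def missing_env_keys(env_text: str) -> list:
    """Return the required keys that are absent or empty in .env-style text."""
    found = set()
    for line in env_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not KEY=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if value.strip().strip('"'):
            found.add(key.strip())
    return [key for key in REQUIRED_KEYS if key not in found]


if __name__ == "__main__":
    with open(".env", encoding="utf-8") as handle:
        missing = missing_env_keys(handle.read())
    print("Missing keys:", missing if missing else "none")
```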
This project was developed as a capstone for the MIT Emerging Talents program. The program empowers talented individuals from marginalized backgrounds with the skills and network to become leaders in the tech industry.
- Program: Experiential Learning Opportunity (ELO)
- Focus: Data Science & Machine Learning Engineering
- Goal: To apply theoretical knowledge to solve real-world problems using industry-standard tools and practices.
Made with ❤️ by Bikaze



