This project monitors live Certificate Transparency logs using a local CertStream server and analyzes newly issued TLS certificates to identify potentially suspicious or phishing-related domains. Detected threats are logged to a CSV file for further analysis.
The system:
- connects to a locally running CertStream server
- extracts domain names from certificates
- identifies potential phishing domains using heuristics (Levenshtein distance, keyword matching, TLD and entropy)
- stores flagged domains for analysis
- provides a script to generate statistics and plots
- Python 3.8+
- Docker
pip install -r requirements.txt
git clone https://github.com/olivblvck/CT-Logs.git
cd CT-Logs
docker pull 0rickyy0/certstream-server-go
docker run -d -p 8080:8080 0rickyy0/certstream-server-go
This spins up a local WebSocket server compatible with the CertStream protocol on ws://127.0.0.1:8080.
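Each message the server sends is a JSON object. As a rough sketch of how domain names could be pulled out of one message, the snippet below assumes the standard CertStream `certificate_update` payload layout (`data.leaf_cert.all_domains`). The function name `extract_domains` and the sample message are hypothetical and are not taken from `certstream/listener.py`.

```python
import json
from typing import List


def extract_domains(raw_message: str) -> List[str]:
    """Return the unique, lower-cased domains from one CertStream message."""
    message = json.loads(raw_message)
    # Heartbeats and other message types carry no certificate data.
    if message.get("message_type") != "certificate_update":
        return []
    all_domains = message.get("data", {}).get("leaf_cert", {}).get("all_domains", [])
    seen = set()
    domains = []
    for name in all_domains:
        name = name.lower()
        # Strip a leading wildcard label so "*.example.com" becomes "example.com".
        if name.startswith("*."):
            name = name[2:]
        if name and name not in seen:
            seen.add(name)
            domains.append(name)
    return domains


# Made-up sample message for illustration only.
sample = json.dumps({
    "message_type": "certificate_update",
    "data": {"leaf_cert": {"all_domains": ["*.Example.com", "example.com", "login.example.com"]}},
})
print(extract_domains(sample))
```

Normalizing case and wildcards before deduplicating keeps one certificate from producing several entries for the same name.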
python3 -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
python certstream/listener.py # or python -m certstream.listener
CT-Logs/
├── analysis/
│ ├── phishing_detect.py
│ └── stats.py
├── certstream/
│ └── listener.py
├── data/
│ └── websites.txt
├── output/
│ ├── suspected_phishing.csv
│ └── plots/
│ ├── domain_length.png
│ ├── registration_age_log.png
│ ├── score_distribution.png
│ ├── score_vs_age.png
│ ├── score_vs_brand_match.png
│ ├── score_vs_entropy.png
│ ├── score_vs_issuer.png
│ ├── score_vs_keyword.png
│ ├── tld_vs_issuer.png
│ └── top_tlds.png
├── utils/
│ ├── dns_twister.py
│ └── who_is.py
├── requirements.txt
├── Report.pdf
└── README.md
- The list of monitored brands is stored in data/websites.txt
- Detection logic is based on heuristic signals
- Accuracy depends on tuning thresholds and keyword/TLD lists
- DNS permutations are limited to 30 per domain
- WHOIS queries are cached and only executed for suspicious domains
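The last two notes can be illustrated with a small sketch. It assumes a cache built on `functools.lru_cache` and a stubbed lookup in place of a real WHOIS query; the names `cached_registration_days`, `limit_permutations`, and `registration_days_if_suspicious` are hypothetical and do not come from `utils/who_is.py` or `utils/dns_twister.py`.

```python
from functools import lru_cache
from itertools import islice
from typing import Iterable, List

MAX_PERMUTATIONS = 30  # limit stated in the notes above

whois_calls = 0  # counts how often the stubbed lookup really runs


@lru_cache(maxsize=1024)
def cached_registration_days(domain: str) -> int:
    """Hypothetical stand-in for a WHOIS lookup; always returns -1 (unknown)."""
    global whois_calls
    whois_calls += 1
    return -1


def limit_permutations(permutations: Iterable[str]) -> List[str]:
    """Keep at most MAX_PERMUTATIONS candidate domains."""
    return list(islice(permutations, MAX_PERMUTATIONS))


def registration_days_if_suspicious(domain: str, suspicious: bool) -> int:
    """Only query WHOIS (through the cache) when the domain is suspicious."""
    return cached_registration_days(domain) if suspicious else -1


candidates = limit_permutations(f"examp1e-{i}.com" for i in range(100))
registration_days_if_suspicious("examp1e-0.com", suspicious=True)
registration_days_if_suspicious("examp1e-0.com", suspicious=True)  # served from cache
registration_days_if_suspicious("example.org", suspicious=False)   # never looked up
print(len(candidates), whois_calls)
```

Gating the lookup on the suspicious flag and caching by domain both reduce the number of outbound WHOIS requests, which matters because certificate streams repeat the same names often.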
For each domain found in new TLS certificates, the following features are extracted:
- TLD: Top-Level Domain (e.g., .com, .xyz)
- TLD Suspicious: Whether the TLD is from a list of commonly abused TLDs
- Keyword Match: Checks if the domain contains suspicious keywords like login, secure, verify
- Entropy: Shannon entropy of the domain name – higher values may indicate algorithmically generated domains
- WHOIS Age: Number of days since domain registration, or -1 if WHOIS data is unavailable
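For the entropy feature, a minimal sketch of a character-level Shannon entropy calculation is shown below. Whether the project computes it over the full domain or only part of it is not stated here, so treat the input choice as an assumption; `shannon_entropy` is a hypothetical name.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy of `text` in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Sum p * log2(1/p) over the distinct characters.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy("aaaa"))  # a single repeated character carries no information
print(shannon_entropy("abcd"))  # four equally likely characters
```

Strings drawn from many distinct characters with roughly equal frequency score higher, which is why randomly generated labels tend to stand out.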
Each domain is assigned a score between 0 and 10 (final scores are capped at a maximum of 10 points), reflecting the likelihood of phishing. The higher the score, the more suspicious the domain.
The score is calculated based on the following features:
| Feature | Condition | Points |
|---|---|---|
| Entropy | ≥ 2.8 → +0.5, ≥ 3.2 → +1, ≥ 3.6 → +1.5 | +0.5 to +1.5 |
| Suspicious Keyword | Presence of phishing-related words (e.g. login, bank, verify) | +1 |
| Suspicious TLD | .xyz, .icu, .top, .buzz, .shop etc. | +1 |
| Issuer Risk | Let's Encrypt/ZeroSSL/Actalis AND (age < 14d OR suspicious_tld OR keyword) | +1 |
| CN Mismatch | Certificate Common Name ≠ domain | +1 |
| OCSP Missing | No Online Certificate Status Protocol | +1 |
| Short-Lived Cert | Certificate validity ≤ 14 days | +1 |
| Brand in Subdomain | Legitimate brand name in subdomain (e.g. paypal.host.com) | +1 |
| Domain Age | 0-30 days → +3, < 90 days → +2, < 360 days → +1 | +1 to +3 |
| Brand Similarity | ratio ≥ 0.8 → +1, ≥ 0.85 → +1.5, ≥ 0.9 → +2.0 | +1 to +2 |
Domains can then be flagged by threshold: a score ≥ 2 marks a domain as medium-risk and a score ≥ 4 as high-risk.
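The scoring rules can be sketched roughly as follows. This is an illustration built only from the table and thresholds in this README, not the code in `analysis/phishing_detect.py`; the function names, the `issuer_risk` key, and the "low" label for scores below 2 are assumptions.

```python
from typing import Any, Mapping

MAX_SCORE = 10.0

# Features that each add one point when true (key names are assumptions).
BOOLEAN_FLAGS = (
    "has_keyword", "tld_suspicious", "issuer_risk", "cn_mismatch",
    "ocsp_missing", "short_lived", "brand_in_subdomain",
)


def entropy_points(entropy: float) -> float:
    if entropy >= 3.6:
        return 1.5
    if entropy >= 3.2:
        return 1.0
    if entropy >= 2.8:
        return 0.5
    return 0.0


def age_points(registration_days: int) -> float:
    # -1 means the WHOIS creation date is unknown: no age-based points.
    if registration_days < 0:
        return 0.0
    if registration_days <= 30:
        return 3.0
    if registration_days < 90:
        return 2.0
    if registration_days < 360:
        return 1.0
    return 0.0


def similarity_points(ratio: float) -> float:
    if ratio >= 0.9:
        return 2.0
    if ratio >= 0.85:
        return 1.5
    if ratio >= 0.8:
        return 1.0
    return 0.0


def score_domain(features: Mapping[str, Any]) -> float:
    """Add up the per-feature points and cap the total at MAX_SCORE."""
    score = entropy_points(float(features.get("entropy", 0.0)))
    score += age_points(int(features.get("registration_days", -1)))
    score += similarity_points(float(features.get("similarity_score", 0.0)))
    score += sum(1.0 for flag in BOOLEAN_FLAGS if features.get(flag))
    return min(score, MAX_SCORE)


def risk_level(score: float) -> str:
    if score >= 4:
        return "high"
    if score >= 2:
        return "medium"
    return "low"


# Made-up feature values for illustration only.
example = {"entropy": 3.3, "registration_days": 10,
           "similarity_score": 0.86, "has_keyword": True}
print(score_domain(example), risk_level(score_domain(example)))
```

Because the uncapped maximum under these rules is 13.5, the cap at 10 is what keeps a domain that trips every signal inside the documented range.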
The script saves results to output/suspected_phishing.csv, with the following columns:
- timestamp
- domain
- brand_match
- similarity_score
- issuer
- tld
- tld_suspicious
- has_keyword
- entropy
- registration_days
- cn_mismatch
- ocsp_missing
- short_lived
- brand_in_subdomain
- score
Duplicate detections with identical features (except timestamp) are automatically deduplicated before analysis.
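One way such deduplication could work is sketched below using only the standard library. It keeps the first row for each distinct combination of non-timestamp fields; the function name `deduplicate_rows` and the sample rows are made up for illustration, and the actual analysis script may do this differently.

```python
import csv
import io
from typing import Dict, Iterable, List


def deduplicate_rows(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first row for each distinct combination of non-timestamp fields."""
    seen = set()
    unique = []
    for row in rows:
        # Build a hashable key from every column except the timestamp.
        key = tuple(sorted((k, v) for k, v in row.items() if k != "timestamp"))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Made-up sample data with a reduced set of columns.
sample_csv = io.StringIO(
    "timestamp,domain,score\n"
    "2024-01-01T00:00:00,examp1e.com,4.5\n"
    "2024-01-01T00:05:00,examp1e.com,4.5\n"
    "2024-01-01T00:06:00,examp1e.net,2.0\n"
)
rows = deduplicate_rows(csv.DictReader(sample_csv))
print([row["domain"] for row in rows])
```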
To analyze the output data:
python analysis/stats.py
This script provides:
- Distribution of TLDs and issuers
- Entropy statistics
- Domains containing phishing-like keywords
- Most common matched brands
- Distribution of phishing scores
- Score vs entropy and domain age
- Score vs issuer and brand match
- Score by presence of suspicious keyword
- Age distribution (log scale)
- Frequency heatmap: TLD vs Issuer
- Permutation checks are limited (max 30), and WHOIS is only called for domains flagged as suspicious
- Uses in-memory caches (TTLCache and lru_cache) to prevent redundant DNS and WHOIS queries
- Semaphore Limits: 30 concurrent DNS Twister API calls, 10 parallel processing workers
- Domains with missing WHOIS creation date are marked with -1 and excluded from age-based scoring
- Analysis script deduplicates rows to avoid skewing results from repeated entries
- Domains like s3-eu-west-1.amazonaws.com often appear similar to brand names but are legitimate infrastructure domains.
- WHOIS lookups may occasionally fail due to connection resets or missing domain records (Domain not found, No match for ..., [Errno 54] Connection reset by peer).
- CT logs include a large number of benign domains; filtering is heuristic-based and not perfect.
- Add machine learning-based phishing classifier
- Support for other log sources beyond CertStream
- Cross-check flagged domains against Google Safe Browsing, VirusTotal, and other blacklists to see whether they have already been reported as malicious.