By Priyanjana Bengani (@acookiecrumbles) and Jon Keegan (@jonkeegan)
Priyanjana Bengani, Data Reporter, Bloomberg. Bluesky: @acookiecrumbles.bsky.social, Mastodon: @acookiecrumbles@indieweb.social
Jon Keegan, Tech Reporter / Data Journalist, Sherwood News. Bluesky: @jonkeegan.com
Originally presented at IRE NICAR Conference - March 4, 2022 - Updated March 2025 Slides: English | Russian (earlier version)
Thank you to Svetlana Borodina at Harriman Institute for the Russian translation!
This checklist is a reporting tool to help journalists and researchers find out who published a website. Use it in conjunction with offline reporting techniques.
Following this checklist does not guarantee that you can unmask the owner of a website that does not want to be found, but it can help surface crucial clues and connections that can act as leads for further reporting.
🌟 Strong recommendation: while running through this checklist, create a data diary. It can be a TextEdit doc, a Google Doc, the Notes app, whatever. What matters is being able to retrace your steps.
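If a plain text file suits you, the data diary can be kept with a few lines of Python. This is only a minimal sketch: the file name `data_diary.txt`, the `|` separator, and the function names are assumptions made for illustration, not part of the checklist.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def format_entry(note: str, url: str = "", when: Optional[datetime] = None) -> str:
    """Build one diary line: UTC timestamp, optional URL, free-text note."""
    when = when or datetime.now(timezone.utc)
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    parts = [stamp]
    if url:
        parts.append(url)
    parts.append(note.strip())
    return " | ".join(parts)


def log(note: str, url: str = "", path: str = "data_diary.txt") -> str:
    """Append an entry to the diary file (hypothetical path) and return it."""
    line = format_entry(note, url)
    with Path(path).open("a", encoding="utf-8") as fh:
        fh.write(line + "\n")
    return line
```

Calling `log("Checked historical WHOIS, registrant changed", "example.com")` after each step leaves a timestamped trail you can later turn into a timeline.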
Updated March 2025
- Google Cache no longer exists (but Bing Cache still does).
- Microsoft shut down RiskIQ.
- Google dorks / search operators are inconsistent.
- Google Analytics 4 disallows finding relationships between websites based on ID.
- Platforms are restricting transparency tools, retreating from content moderation.
- CrowdTangle no longer exists; journalists can’t get access to the Meta Content Library.
- The X (formerly Twitter) API is prohibitively expensive.
- More platforms to monitor: Mastodon, Bluesky, Threads, Discord, Telegram.
- WHOIS is less useful for new domains due to GDPR.
- Squarespace, WordPress, GoDaddy static websites share IP addresses with thousands of sites.
- CDNs make IP address matching harder.
- Easy to incorporate a company with false identities.
Who? Why? When? How?
- Who’s featured on the website?
- Are there authors, email addresses, profile pictures?
- Are there payment options (crypto, PayPal, donations, subscriptions)? Who’s receiving the money?
- Are authors common across multiple sites or exclusive to this one?
- Is the owner trying to stay hidden?
- Was the site set up to make money (scams, ads, content farms)?
- Is it part of influence operations?
- Promoting political candidates or social advocacy?
- Deceiving audiences by impersonating another website?
- Poisoning LLMs?
- When was the domain first registered?
- How long has the site existed in its current form?
- Was it offline for any significant time?
- Did the ownership change? (Check historical WHOIS)
- Did the site’s design or content change drastically?
- What is the tech stack? Where is it hosted?
- Is it a WordPress site? (Check authors, templates, plugins)
- How is it monetized? Affiliate links, advertising?
- Is the content generated by AI?
- Where does it link to, and who links to it?
- Maintain a data diary with detailed notes.
- Create a timeline of the website’s evolution.
- Use Hunchly or screen recordings.
- Archive sites consistently using archive.org and archive.is.
- Screenshots are essential for non-archivable content.
- Download videos before they are taken down (yt-dlp).
- Capture full browser windows with timestamps (GoFullPage, ArchiveWeb).
- Set up alerts with Klaxon Cloud or VisualPing.
- Automate screenshots over time with GitHub Actions and ShotScraper.
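For the timeline, it helps to know which archive.org snapshot sits closest to a given date. The sketch below uses the Wayback Machine availability endpoint and parses its JSON reply. The helper names are illustrative, and the response shape is assumed to match what the endpoint returned at the time of writing, so treat this as a starting point rather than a finished tool.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def availability_url(site: str, timestamp: str = "") -> str:
    """Build the lookup URL; timestamp is YYYYMMDD (or longer), optional."""
    params = {"url": site}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload: dict):
    """Return (timestamp, snapshot_url) from a parsed reply, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["timestamp"], closest["url"]


def lookup(site: str, timestamp: str = ""):
    """Query archive.org over the network and return the closest snapshot."""
    with urlopen(availability_url(site, timestamp), timeout=30) as resp:
        return closest_snapshot(json.load(resp))
```

For example, `lookup("example.com", "20220304")` asks for the capture nearest to March 4, 2022. Record the returned snapshot URL in your data diary.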
- Investigative techniques > tools.
- Most investigations require multiple tools.
- Tools can be expensive, overpromise, underdeliver, and collect data unethically.
- Platforms rise and fall, APIs disappear.
- Don’t get too dependent on one tool.
- Use a VPN or Tor (note: some VPNs track activity).
- Use a separate browser in incognito/private mode.
- Use a different email address for newsletter signups.
- Block remote content loading to prevent tracking.
- Use the + trick (e.g., johnsmith+newsletter@gmail.com).
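The + trick is easy to apply consistently if you generate one tagged address per site you sign up to. A minimal sketch, assuming your mail provider supports plus addressing (Gmail does; others may not), with a hypothetical helper name:

```python
import re


def tagged_address(address: str, tag: str) -> str:
    """Return address with +tag inserted before the @, replacing any old tag."""
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not an email address: " + address)
    base = local.split("+", 1)[0]  # drop an existing +tag, if present
    # Keep the tag to lowercase letters, digits, dots, and hyphens.
    clean = re.sub(r"[^a-z0-9.-]+", "-", tag.lower()).strip("-.")
    if not clean:
        raise ValueError("tag is empty after cleaning")
    return base + "+" + clean + "@" + domain
```

Using the site's domain as the tag, such as `tagged_address("johnsmith@gmail.com", "example.com")`, makes it obvious later which signup an incoming email traces back to.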
- Some tools collect data from shady sources (data brokers).
- Investigations should avoid doxxing individuals.
- Scraping publicly available content is generally fine, but be aware of the Computer Fraud and Abuse Act.
- Accessing unauthorized credentials is a legal risk.
- The Missouri SSN exposure case shows that even “viewing source” can be misinterpreted as hacking.
- Check for duplicate text using exact string searches.
- Look for names, emails, phone numbers, social media handles, company names.
- Use reverse image search (Google, Bing, Yandex).
- Check for stock images or repeated profile pictures.
- Use facial recognition (PimEyes, Search4Faces).
- Extract metadata from images (EXIF Data).
- Use forensic tools to detect AI-generated content (WeVerify, Forensically).
- Use Google dorking (filetype:pdf site:<domain>.com).
- Check PDF metadata. Use Dangerzone for safe viewing.
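The exact-string and dorking checks above can be prepared in bulk. The sketch below picks a few long sentences from a page's text and turns them into quoted search queries, and also builds the PDF query. Function names and thresholds are illustrative assumptions, and search engines apply these operators inconsistently, so treat the output as queries to try by hand.

```python
import re


def candidate_sentences(text: str, min_words: int = 8, limit: int = 3) -> list:
    """Pick the longest sentences; long ones are less likely to match by chance."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    long_enough = [s.strip() for s in sentences if len(s.split()) >= min_words]
    return sorted(long_enough, key=len, reverse=True)[:limit]


def exact_phrase_query(sentence: str, exclude_domain: str = "") -> str:
    """Wrap a sentence in quotes; optionally exclude the site it came from."""
    phrase = sentence.strip().rstrip(".!?").replace('"', "")
    query = '"' + phrase + '"'
    if exclude_domain:
        query += " -site:" + exclude_domain
    return query


def pdf_query(domain: str) -> str:
    """Build the filetype:pdf site:<domain> query from the checklist."""
    return "filetype:pdf site:" + domain
```

Excluding the original domain with `-site:` surfaces other sites carrying the same text, which is often the lead you are after.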
- Look up WHOIS data (Whoxy, DomainTools).
- Find related domains (DNSTwist).
- Find domains sharing the same IP (SecurityTrails, BuiltWith).
- Identify shared analytics identifiers (BuiltWith, DomainTools’ Iris Investigate).
- Use FouAnalytics X-Ray to analyze network requests.
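When you have saved the HTML of several suspect sites, you can check for shared identifiers yourself before turning to a paid tool. The sketch below is a rough illustration: the regular expressions are assumptions about how Universal Analytics, Google Tag Manager, and AdSense publisher IDs typically look, and they may miss or over-match some cases.

```python
import re
from collections import defaultdict

# Assumed ID shapes; adjust the patterns if you see other formats in the wild.
ID_PATTERNS = {
    "google_analytics_ua": re.compile(r"\bUA-\d{4,10}-\d{1,4}\b"),
    "google_tag_manager": re.compile(r"\bGTM-[A-Z0-9]{4,10}\b"),
    "adsense_publisher": re.compile(r"\bca-pub-\d{10,20}\b"),
}


def extract_ids(html: str) -> set:
    """Return a set of (kind, identifier) pairs found in one page's HTML."""
    found = set()
    for kind, pattern in ID_PATTERNS.items():
        for match in pattern.findall(html):
            found.add((kind, match))
    return found


def shared_ids(pages: dict) -> dict:
    """Map each identifier seen on two or more sites to those sites (sorted)."""
    seen = defaultdict(set)
    for site, html in pages.items():
        for ident in extract_ids(html):
            seen[ident].add(site)
    return {ident: sorted(sites) for ident, sites in seen.items() if len(sites) > 1}
```

Here `pages` maps a site name to its saved HTML. A shared ID is a lead, not proof of common ownership: agencies, templates, and copied code can all put the same identifier on unrelated sites.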
- Find accounts with WhatsMyName, Sherlock, Blackbird.
- Check Facebook Page Transparency.
- Check ad spending on platforms.
- Monitor engagement levels.
- Russian-aligned websites targeting Ukraine and Europe.
- Hundreds of domains registered post-2022 invasion.
- Tracked using WHOIS, analytics, and infrastructure tools.
- Verification Handbook (Craig Silverman)
- Open Source Intelligence Techniques (Michael Bazzell)
- Hacks, Leaks, and Revelations (Micah Lee)