jonkeegan/behind-this-website

Who’s behind this website? A Checklist.

By Priyanjana Bengani (@acookiecrumbles) and Jon Keegan (@jonkeegan)

Priyanjana Bengani, Data Reporter, Bloomberg (Bluesky: @acookiecrumbles.bsky.social, Mastodon: @acookiecrumbles@indieweb.social)

Jon Keegan, Tech Reporter / Data Journalist, Sherwood News (Bluesky: @jonkeegan.com)

Originally presented at the IRE NICAR Conference, March 4, 2022. Updated March 2025. Slides: English | Russian (earlier version)

Thank you to Svetlana Borodina at Harriman Institute for the Russian translation!

What is this?

This checklist is a reporting tool to help journalists and researchers find out who published a website. Use it in conjunction with offline reporting techniques.

Following this checklist does not guarantee that you can unmask the owner of a website that does not want to be found, but it can help surface crucial clues and connections that can act as leads for further reporting.

🌟 Strong recommendation: while running through this checklist, create a data diary — it can be a TextEdit doc, a Google Doc, just the Notes app, whatever. It is important to be able to retrace your steps.
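A data diary can be as simple as a text file with one timestamped line per step. The sketch below is one possible way to keep such a file from Python; the function name `log_entry`, the field order, and the `|` separator are illustrative choices, not part of the checklist.

```python
from datetime import datetime, timezone
from pathlib import Path


def log_entry(diary_path, action, url="", note=""):
    """Append one timestamped line to a plain-text data diary.

    All names and the line layout here are hypothetical; adapt freely.
    """
    # UTC timestamps avoid ambiguity when retracing steps later.
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
    line = " | ".join(field.strip() for field in (timestamp, action, url, note))
    # Append mode keeps earlier entries intact.
    with Path(diary_path).open("a", encoding="utf-8") as diary:
        diary.write(line + "\n")
    return line
```

Calling `log_entry("diary.txt", "whois lookup", "example.com", "privacy-protected")` after each step leaves a chronological record that can be pasted into a Google Doc or notes app later.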


The New Normal: More Opacity, Less Transparency

Existing Tools Are Breaking

  • Google Cache no longer exists (but Bing Cache still does).
  • Microsoft shut down RiskIQ.
  • Google dorks / search operators are inconsistent.

World of Platforms and Walled Gardens

  • Google Analytics 4 disallows finding relationships between websites based on ID.
  • Platforms are restricting transparency tools, retreating from content moderation.
  • CrowdTangle no longer exists; journalists can’t get access to the Meta Content Library.
  • The X (formerly Twitter) API is prohibitively expensive.
  • More platforms to monitor: Mastodon, Bluesky, Threads, Discord, Telegram.

Concealing Identities Is Easier

  • WHOIS is less useful for new domains due to GDPR.
  • Sites built on Squarespace, WordPress, or GoDaddy share IP addresses with thousands of unrelated sites.
  • CDNs make IP address matching harder.
  • Easy to incorporate a company with false identities.
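To see why shared hosting and CDNs weaken IP matching, it can help to group a list of domains by the address they resolve to. The following is a minimal sketch using only the standard library; `group_by_ip` and `shared_ips` are hypothetical helper names, and the `resolve` parameter exists only so the lookup can be swapped out.

```python
import socket
from collections import defaultdict


def group_by_ip(domains, resolve=socket.gethostbyname):
    """Map each resolved IPv4 address to the domains that point at it."""
    groups = defaultdict(list)
    for domain in domains:
        try:
            ip = resolve(domain)
        except OSError:
            # DNS failures (socket.gaierror is an OSError) are kept, not dropped.
            ip = "unresolved"
        groups[ip].append(domain)
    return dict(groups)


def shared_ips(groups):
    """Keep only the IPs that serve more than one of the listed domains."""
    return {ip: names for ip, names in groups.items()
            if ip != "unresolved" and len(names) > 1}
```

A shared IP is only a lead: if the address belongs to a CDN or a large website builder, thousands of unrelated sites will share it, so it should be weighed against other evidence.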

Somebody Set This Website Up!

Who? Why? When? How?

Who?

  • Who’s featured on the website?
  • Are there authors, email addresses, profile pictures?
  • Are there payment options (crypto, PayPal, donations, subscriptions)? Who’s receiving the money?
  • Are authors common across multiple sites or exclusive to this one?
  • Is the owner trying to stay hidden?

Why?

  • Was the site set up to make money (scams, ads, content farms)?
  • Is it part of influence operations?
  • Promoting political candidates or social advocacy?
  • Deceiving audiences by impersonating another website?
  • Poisoning LLMs?

When?

  • When was the domain first registered?
  • How long has the site existed in its current form?
  • Was it offline for any significant time?
  • Did the ownership change? (Check historical WHOIS)
  • Did the site’s design or content change drastically?
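The registration questions above usually start with a WHOIS lookup. As a rough sketch, the code below sends a raw WHOIS query over TCP port 43 (the protocol described in RFC 3912) and pulls out common date fields. It assumes a .com or .net domain, for which `whois.verisign-grs.com` is the registry server; other TLDs use different servers and field labels, and historical WHOIS requires a separate service. The function names are illustrative.

```python
import re
import socket


def whois_query(domain, server="whois.verisign-grs.com", timeout=10):
    """Send a raw WHOIS query and return the text response.

    The default server is an assumption that only fits .com/.net domains.
    """
    with socket.create_connection((server, 43), timeout=timeout) as conn:
        conn.sendall(domain.encode("ascii") + b"\r\n")
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # server closes the connection when done
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def whois_dates(record):
    """Extract date fields as they are labeled in Verisign-style records."""
    pattern = re.compile(
        r"^[ \t]*(Creation Date|Updated Date|Registry Expiry Date):[ \t]*(\S+)",
        re.MULTILINE | re.IGNORECASE,
    )
    return {label.title(): value for label, value in pattern.findall(record)}
```

Running `whois_dates(whois_query("example.com"))` and recording the result in the data diary gives a starting point for the timeline; a creation date only shows when the domain was first registered, not who has controlled it since.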

How?

  • What is the tech stack? Where is it hosted?
  • Is it a WordPress site? (Check authors, templates, plugins)
  • How is it monetized? Affiliate links, advertising?
  • Is the content generated by AI?
  • Where does it link to, and who links to it?
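One quick way to check for WordPress is to look at the page source for a generator meta tag and for `/wp-content/` or `/wp-includes/` paths, which also reveal theme and plugin names. The sketch below works on HTML you have already saved; `wordpress_clues` is a hypothetical helper, and the heuristics are assumptions that a site can defeat by hiding or rewriting these markers.

```python
import re
from html.parser import HTMLParser


class _GeneratorParser(HTMLParser):
    """Collect the content of <meta name="generator"> tags."""

    def __init__(self):
        super().__init__()
        self.generators = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "generator":
            self.generators.append(attributes.get("content") or "")


def wordpress_clues(html):
    """Return heuristic WordPress indicators found in a page's HTML."""
    parser = _GeneratorParser()
    parser.feed(html)
    # Theme and plugin folder names appear in asset URLs.
    themes = sorted(set(re.findall(r"/wp-content/themes/([\w.-]+)", html)))
    plugins = sorted(set(re.findall(r"/wp-content/plugins/([\w.-]+)", html)))
    is_wordpress = bool(
        themes
        or plugins
        or "/wp-includes/" in html
        or any("wordpress" in g.lower() for g in parser.generators)
    )
    return {
        "is_wordpress": is_wordpress,
        "generators": parser.generators,
        "themes": themes,
        "plugins": plugins,
    }
```

An unusual theme or plugin name found this way can be searched across other sites to look for common ownership.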

Documenting, Archiving, Monitoring

📝 Documenting

  • Maintain a data diary with detailed notes.
  • Create a timeline of the website’s evolution.
  • Use Hunchly or screen recordings.

📚 Archiving

  • Archive sites consistently using archive.org and archive.is.
  • Screenshots are essential for non-archivable content.
  • Download videos before they are taken down (yt-dlp).
  • Capture full browser windows with timestamps (GoFullPage, ArchiveWeb).
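For consistent archiving, it can be convenient to generate the relevant Wayback Machine links for every URL you examine and paste them into the data diary. This sketch only builds URLs and makes no network requests; the helper names are illustrative. It assumes the Wayback Machine's public "Save Page Now" path, its availability API, and its calendar view.

```python
from urllib.parse import urlencode


def wayback_save_url(url):
    """URL that asks the Wayback Machine to capture a page when visited."""
    return "https://web.archive.org/save/" + url


def wayback_availability_url(url, timestamp=""):
    """Availability API URL; timestamp is an optional YYYYMMDD[hhmmss] string."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


def wayback_calendar_url(url):
    """URL of the calendar view listing all captures of a page."""
    return "https://web.archive.org/web/*/" + url
```

The availability URL returns JSON describing the closest capture, which is a quick way to see whether a page existed around a given date.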

🔍 Monitoring


So Many Tools

  • Investigative techniques > tools.
  • Most investigations require multiple tools.
  • Tools can be expensive, overpromise, underdeliver, and collect data unethically.
  • Platforms rise and fall, APIs disappear.
  • Don’t get too dependent on one tool.

List of OSINT tools


Protecting Yourself Online

🛡️ IP Address Protection

  • Use a VPN or Tor (note: some VPNs track activity).
  • Use a separate browser in incognito/private mode.

📧 Email Protection

  • Use a different email address for newsletter signups.
  • Block remote content loading to prevent tracking.
  • Use the + trick (e.g., johnsmith+newsletter@gmail.com).
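The + trick can be scripted so that each signup gets its own tag, which later shows which site shared or leaked the address. A minimal sketch follows; `plus_address` is a hypothetical helper. It assumes the mail provider supports plus addressing (Gmail does), and a site operator can still strip the tag to recover the base address.

```python
import re


def plus_address(email, tag):
    """Return a plus-addressed variant of an email, tagged per signup."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {email!r}")
    # Drop any existing +tag so tags do not stack up.
    base = local.split("+", 1)[0]
    # Reduce the tag to characters that are safe in the local part.
    slug = re.sub(r"[^a-z0-9.-]+", "-", tag.lower()).strip("-")
    if not slug:
        raise ValueError("tag must contain at least one letter or digit")
    return f"{base}+{slug}@{domain}"
```

For example, `plus_address("johnsmith@gmail.com", "newsletter")` produces the address shown in the bullet above.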

💻 Virtual Machine


Ethics & Legality

Ethics

  • Some tools collect data from shady sources (data brokers).
  • Investigations should avoid doxxing individuals.

Legality

  • Scraping publicly available content is generally fine, but be aware of the Computer Fraud and Abuse Act (CFAA).
  • Accessing unauthorized credentials is a legal risk.
  • The Missouri SSN exposure case shows that even “viewing source” can be misinterpreted as hacking.

Read More


Investigating Site Content

📝 Text

  • Check for duplicate text using exact string searches.
  • Look for names, emails, phone numbers, social media handles, company names.
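Pulling identifiers out of saved page text can be partly automated. The sketch below extracts email addresses and @-style handles with regular expressions; the patterns and the name `extract_identifiers` are illustrative and deliberately rough, so expect both misses and false positives. Phone numbers and personal or company names vary too much by locale and are left to manual review here.

```python
import re

# Approximate patterns, not full validators.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# The lookbehind skips the "@domain" part of email addresses.
HANDLE_RE = re.compile(r"(?<![\w@.])@([A-Za-z0-9_]+(?:\.[A-Za-z0-9_-]+)*)")


def extract_identifiers(text):
    """Return de-duplicated, lowercased emails and handles found in text."""
    emails = sorted({match.lower() for match in EMAIL_RE.findall(text)})
    handles = sorted({match.lower() for match in HANDLE_RE.findall(text)})
    return {"emails": emails, "handles": handles}
```

Each identifier found this way can then be used as an exact string search, in quotes, to see where else it appears.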

🖼️ Media

  • Use reverse image search (Google, Bing, Yandex).
  • Check for stock images or repeated profile pictures.
  • Use facial recognition (PimEyes, Search4Faces).
  • Extract metadata from images (EXIF Data).
  • Use forensic tools to detect AI-generated content (WeVerify, Forensically).
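When images have been downloaded from several sites, a simple first pass for repeated profile pictures is to hash every file and group identical hashes. This sketch uses SHA-256 from the standard library; the function names are hypothetical. It only catches byte-for-byte copies: a resized, cropped, or recompressed image will not match, so reverse image search is still needed for those.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large media files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_files(folder):
    """Return groups of paths under folder whose contents are identical."""
    by_hash = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            by_hash[sha256_of(path)].append(str(path))
    return [paths for paths in by_hash.values() if len(paths) > 1]
```

Saving each site's images in its own subfolder makes it obvious from the returned paths when the same file appears on more than one site.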

📄 Documents

  • Use Google dorking (filetype:pdf site:<domain>.com).
  • Check PDF metadata. Use Dangerzone for safe viewing.
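Dork queries like the one above are easy to mistype, so it can help to build them in code and record the exact query in the data diary. A minimal sketch, with hypothetical helper names; it only constructs the query string and a Google search URL, and as noted earlier, search operators behave inconsistently, so results should be treated as incomplete.

```python
from urllib.parse import urlencode


def build_dork(domain, filetype="", phrase=""):
    """Combine an optional exact phrase, filetype filter, and site filter."""
    parts = []
    if phrase:
        parts.append(f'"{phrase}"')  # quotes request an exact match
    if filetype:
        parts.append(f"filetype:{filetype}")
    parts.append(f"site:{domain}")
    return " ".join(parts)


def google_search_url(query):
    """Return a Google search URL with the query safely encoded."""
    return "https://www.google.com/search?" + urlencode({"q": query})
```

For instance, `build_dork("example.com", filetype="pdf")` yields `filetype:pdf site:example.com`, matching the pattern in the checklist.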

Investigating Domains & Infrastructure

🏠 Ownership

🔗 Shared Infrastructure

📡 Network Requests


Investigating Social & Platform Connections

📱 Social Media Presence

📢 Ads & Influence

  • Check ad spending on platforms.
  • Monitor engagement levels.

Case Study: Kremlin-Aligned Influence Networks in Europe

Link: https://www.thebureauinvestigates.com/stories/2024-07-06/russian-disinformation-networks-ramp-up-attacks-on-european-elections

  • Russian-aligned websites targeting Ukraine and Europe.
  • Hundreds of domains registered post-2022 invasion.
  • Tracked using WHOIS, analytics, and infrastructure tools.
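Tracking by analytics typically means finding the same tracking or advertising ID in the source of several sites. The sketch below illustrates that idea on HTML you have saved yourself; it is not the method used in the linked investigation. The ID patterns are approximations of common Google ID formats, and the names `extract_ids` and `shared_ids` are hypothetical.

```python
import re
from collections import defaultdict

# Approximate formats; real IDs may fall outside these patterns.
ID_PATTERNS = {
    "google_analytics_ua": re.compile(r"\bUA-\d{4,10}-\d{1,4}\b"),
    "google_analytics_4": re.compile(r"\bG-[A-Z0-9]{8,12}\b"),
    "google_tag_manager": re.compile(r"\bGTM-[A-Z0-9]{4,8}\b"),
    "google_adsense": re.compile(r"\bca-pub-\d{16}\b"),
}


def extract_ids(html):
    """Return the set of tracking and ad IDs that appear in a page's HTML."""
    found = set()
    for pattern in ID_PATTERNS.values():
        found.update(pattern.findall(html))
    return found


def shared_ids(pages):
    """Given {domain: html}, return IDs seen on more than one domain."""
    domains_by_id = defaultdict(set)
    for domain, html in pages.items():
        for tracker_id in extract_ids(html):
            domains_by_id[tracker_id].add(domain)
    return {tracker_id: sorted(domains)
            for tracker_id, domains in domains_by_id.items()
            if len(domains) > 1}
```

A shared ID is a strong lead but not proof of common ownership, since IDs can be copied along with a template or added by a third-party developer.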

Resources

📰 Newsletters

📚 Books
