A tool to extract recommendations from YouTube.
It's a client / server architecture.
The server centralizes the recommendations crawled from YouTube, and any number of clients can be spawned to crawl YouTube.
Each client will:
- ask the server for a YouTube video URL to crawl
- crawl it (main info) as well as its channel and the first 10 recommendations
- send the result to the server
- repeat, using a brand new browser session each time so that browsing history doesn't play a role
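The loop above can be sketched roughly as follows. This is only an illustration, not the project's actual code: `FakeServer`, `crawl_video` and `run_client` are hypothetical names, the server is replaced by an in-memory stand-in, and the crawl step is a stub returning placeholder data where a real client would drive a fresh browser session.

```python
from dataclasses import dataclass, field


@dataclass
class FakeServer:
    """In-memory stand-in for the real server (hypothetical)."""
    to_crawl: list
    results: list = field(default_factory=list)

    def next_url(self):
        # Hand out the next URL to crawl, or None when nothing is left.
        return self.to_crawl.pop(0) if self.to_crawl else None

    def submit(self, result):
        # Store a crawl result sent back by a client.
        self.results.append(result)


def crawl_video(url, max_recommendations=10):
    """Stub crawl: returns placeholder data instead of visiting YouTube."""
    recommendations = [f"{url}#rec{i}" for i in range(max_recommendations)]
    return {
        "url": url,
        "channel": "placeholder-channel",
        "recommendations": recommendations,
    }


def run_client(server):
    """Ask for a URL, crawl it, send the result, repeat until no URL is left."""
    done = 0
    while True:
        url = server.next_url()
        if url is None:
            break
        server.submit(crawl_video(url))
        done += 1
    return done


server = FakeServer(to_crawl=[
    "https://www.youtube.com/watch?v=example1",
    "https://www.youtube.com/watch?v=example2",
])
crawled = run_client(server)
print(crawled, len(server.results[0]["recommendations"]))  # prints: 2 10
```

In this sketch the loop stops when the stand-in server runs out of URLs. Whether the real server ever does is not something this example tries to model.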
The most compute intensive operations are performed by the clients, so it's OK to have one server and many clients (hundreds or thousands probably work fine).
We don't know yet exactly what we will do with the dataset; this is a basis for research.
- a Linux machine or something that runs bash
- the latest docker with docker compose
docker and docker compose are the easiest way to run either the server or the clients.
A script named setup-ubuntu is provided to install docker on a brand new Ubuntu Jammy machine.
Otherwise follow the official instructions, on which the script is heavily based.
You'll need the node server itself, and another server that understands virtual hosts and SSL to do the SSL termination and forward the traffic to node.
I describe how to do that with apache because I'm more familiar with it, but a similar result
could be obtained with nginx for instance.
If you have docker compose, running:
./server <password>
should be enough.
It is recommended to secure the connection with SSL.
Since having SSL certificates in node apps is usually a pain in the ass, I'm using Apache2 for the SSL termination, and it uses mod_proxy to forward the requests to node.
If you have apache2 installed, you can use
the example vhost, adapting what's necessary (a priori only the ServerName) to route the traffic to node.
You'll need to enable two apache modules:
sudo a2enmod proxy
sudo a2enmod remoteip
sudo systemctl reload apache2
There is another vhost to expose the database administration interface; it uses Basic Auth to protect access to the services exposed.
In this one you'll have to modify ServerName and ServerAdmin.
Quick reminder on how to add a user for Basic Auth (-c creates the password file, so drop it when adding more users to an existing file):
sudo htpasswd -c /etc/apache2/.htpasswd <user>
Copy the two vhosts you have just adapted to /etc/apache2/sites-available, then run sudo a2ensite on each one,
then enable SSL on both of them by following the instructions from the certbot website.
Customize the seed for this client:
Edit seed_video and client_name in config/production-docker.yaml.
A client is identified by its name and IP address (the IP as seen by the server).
The seed video is associated with the client at its creation and never changes.
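To make the identity rule concrete, here is a small sketch of how a server could key clients by name and IP and pin the seed at creation. `ClientRegistry` and `register` are hypothetical names, and the addresses and URLs are placeholders. The real server's storage and logic may differ.

```python
class ClientRegistry:
    """Hypothetical server-side registry of clients."""

    def __init__(self):
        # Maps (name, ip) -> seed video URL recorded at creation.
        self._seeds = {}

    def register(self, name, ip, seed_video):
        # A client is identified by (name, ip). The seed stored on first
        # registration is kept, and later values are ignored.
        return self._seeds.setdefault((name, ip), seed_video)


registry = ClientRegistry()
first = registry.register("client-a", "203.0.113.7", "https://www.youtube.com/watch?v=seed1")
again = registry.register("client-a", "203.0.113.7", "https://www.youtube.com/watch?v=seed2")
other = registry.register("client-a", "203.0.113.8", "https://www.youtube.com/watch?v=seed2")

print(first == again)  # prints: True (same name and IP, seed unchanged)
print(other == first)  # prints: False (same name, different IP, so a new client)
```

Under this reading, changing seed_video for an existing name and IP pair would have no effect, while the same name from a different IP would count as a new client with its own seed.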
Start the client(s)
Still assuming you have docker compose installed,
just run:
./explore <url> <password> [concurrency=4]