As stated on its website, Rotten Tomatoes and the Tomatometer score are the world's most trusted recommendation resources for quality entertainment. As the leading online aggregator of movie and TV show reviews from critics, the site provides fans with a comprehensive guide to what's Fresh, and what's Rotten, in theaters and at home. It offers general information on movies and TV shows along with their reviews, including the Tomatometer score and the Audience Score. The Tomatometer score is based on the opinions of hundreds of film and television critics and is a trusted measure of critical recommendation for millions of fans, while the Audience Score represents the percentage of users who have rated a movie or TV show positively.
The author decided to scrape the most popular TV shows on this website because of the increase in entertainment consumption since the COVID-19 outbreak, especially of TV shows and series. A number of companies in the entertainment industry took this chance and released many new TV shows and series. Since the beginning of 2022, however, people have been returning to their normal routines, which leaves them less leisure time. With that in mind, the author built this project to help people decide what to watch by providing information on the most popular TV shows and series at the time, along with their ratings.
The DBMS used to store the result of web scraping in this project is MongoDB. The author chose this DBMS for its high performance and flexibility. On top of that, it is compatible with the `.json` format used when exporting the scraping results. Furthermore, MongoDB offers MongoDB Atlas, a cloud database service that simplifies creating a cluster in the cloud and is relatively safer.
The following Python libraries and tools are required to run the scraper program.
- Jupyter Notebook: used to make the code easier to write and maintain. The scraper file is stored in `.ipynb` format.
- BeautifulSoup: since the main language used in this project is Python, this library is used as the main library to scrape the contents of a website. Its syntax is fairly simple, easy to understand, and easy to use.
- lxml: used as the HTML parser in this project. It is relatively faster than the HTML parser provided by Python because it is written in C.
- Requests: used to access websites and request objects from them.
- Selenium: since the Rotten Tomatoes website uses "load more" pagination and prevents users from directly accessing pages beyond page 5, this library is used to open the Chrome WebDriver. On top of that, it is also used to click the "load more" button to reveal more pages to scrape.
- time: to avoid the website's anti-scraping mechanism and to keep the server from crashing, the program uses the `time.sleep()` method to pause for a few seconds. The `time` library comes preinstalled with Python.
- json: used to dump the scraped data to a `.json` file in order to store it in that format. The `json` library comes preinstalled with Python.
- os: used to join the path and the name of the file when exporting the `.json` file. The `os` library comes preinstalled with Python.
To install all these libraries, open the directory where `libs.txt` is located in Command Prompt or Terminal and simply type:
`pip install -r libs.txt`
- Chrome WebDriver: used with Selenium to access the desired pages of the website. If you don't have this tool on your device yet, you can download it here.
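The "load more" clicking described above can be sketched as follows. This is a minimal illustration, not the author's actual scraper: the page URL, driver path, and button selector are assumptions and would need adjusting to the live Rotten Tomatoes markup.

```python
# Sketch of revealing pages hidden behind "load more" pagination with Selenium.
# All selectors and parameters here are hypothetical.
import time


def reveal_all_pages(driver_path, url, clicks=5, pause=3):
    """Open the page with Chrome WebDriver and click 'load more' repeatedly."""
    # Selenium is imported inside the function so the sketch can be read
    # (and the function defined) without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome(service=Service(driver_path))
    driver.get(url)
    for _ in range(clicks):
        # pause between clicks to avoid the site's anti-scraping mechanism
        time.sleep(pause)
        try:
            driver.find_element(By.CSS_SELECTOR, "button.load-more").click()
        except Exception:
            break  # button no longer present: no more pages to reveal
    html = driver.page_source
    driver.quit()
    return html
```

The returned HTML can then be handed to BeautifulSoup for parsing.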
- Make sure you have all the required libraries and tools installed on your device. Also make sure you have a stable internet connection before running the code to prevent runtime errors (RTE).
- Clone this repository to your local directory.
- Change the path of the Chrome WebDriver according to the local directory on your device.
- Change the path and the name of the exported `.json` file to your liking.
- Open `scraper.ipynb` in Jupyter Notebook or any IDE you may have.
- Run all the cells.
The scraped data will be stored in a `.json` file with the structure written below.
{
_id:{
$oid (string) : _id is set as the default primary key in MongoDB and is generated automatically; it appears when the data is exported from MongoDB
}
title (string) : title of the series/TV show
airing (string) : airing years of the series (as a whole)
synopsis (string) : synopsis of the series/TV show
average_tomatometer (int) : average tomatometer score of the whole series/TV show (in percent)
average_audience-score (int): average audience score of the whole series/TV Show (in percent)
tv_network (string) : TV network where the series/TV show can be watched
premiere_date (string) : premiere date of the whole series/TV show (in format yyyy-mm-dd)
genre (string) : genre of the series/TV show
main_casts [(string)] : name of the main casts of the whole series/TV show
num_of_seasons (int) : number of seasons the series/TV show has
seasons_info :
[
{
season_title (string) : title of the season
airing_year (int) : airing year of the season
episodes (int) : number of episodes in the season
tomatometer (int) : tomatometer score of the season (in percent)
audience_score (int) : audience score of the season (in percent)
}
]
}
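As a worked illustration of the structure above, the sketch below builds a record with made-up values (the `_id` field is omitted, since MongoDB generates it), then uses `os.path.join` and `json.dump` the way the scraper exports its results:

```python
import json
import os
import tempfile

# A hypothetical record following the documented structure; every value
# here is invented for illustration.
show = {
    "title": "Example Show",
    "airing": "2019-2022",
    "synopsis": "A made-up synopsis.",
    "average_tomatometer": 90,
    "average_audience-score": 85,
    "tv_network": "Example Network",
    "premiere_date": "2019-01-15",
    "genre": "Drama",
    "main_casts": ["Actor One", "Actor Two"],
    "num_of_seasons": 1,
    "seasons_info": [
        {
            "season_title": "Season 1",
            "airing_year": 2019,
            "episodes": 10,
            "tomatometer": 92,
            "audience_score": 88,
        }
    ],
}

# Join the directory and file name, then dump the data to a .json file.
out_path = os.path.join(tempfile.gettempdir(), "tvshows.json")
with open(out_path, "w", encoding="utf-8") as f:
    json.dump([show], f, ensure_ascii=False, indent=2)

# Read it back to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

A file in this shape can be imported into MongoDB directly, which is one reason the author chose the `.json` export format.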
The following is the ERD of the database used to store the scraped data, with `_id` as the primary key.

The author made a simple API to access the online database. The API supports Insert and Read operations and is deployed at the URL below.
https://rottentomatoes-tvshows.herokuapp.com/
The API is written in JavaScript using Node.js. These are some libraries and tools used to create the API. If you don't have Node.js installed on your device yet, you can download it here.
You can see all used libraries in the package.json inside the API folder.
- body-parser: used to parse `req.body` in order to perform the `POST` operation.
- dotenv: used to create the `.env` file so that the `MONGO_URI`, which includes the username and password of the database, is not leaked to the public.
- Express: used to simplify the process of building the web application behind the API.
- Mongoose: used to create the schema and the model of the data posted to the web. It is also used to translate between objects in code and their representation in MongoDB.
- nodemon: used to simplify starting the API during development, as it wraps the Node app, watches the file system, and automatically restarts the program if any change is made.
- Postman / Thunder Client: used to test and use the API by sending requests. `GET` is used for `Get All` and `Get by ID` requests; `POST` is used for `Insert` requests.
- Open Postman or Thunder Client in VS Code.
- Copy the URL below.
https://rottentomatoes-tvshows.herokuapp.com/tvshows
- Send requests with `GET` and `POST`.
  - Get All: send a `GET` request to the URL above.
  - Get by ID: add `/<id of the tv show>` to the URL above and send the `GET` request.
  - Insert: add `/post` to the URL above, type the JSON data into the `Body` of the request, then send the `POST` request.
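The same requests can be sent from code instead of Postman. A minimal stdlib-only sketch, assuming the routes described above (`/tvshows`, `/tvshows/<id>`, `/tvshows/post`); the Heroku deployment may no longer be online, so the actual calls are shown only as comments:

```python
import json
from urllib import request

BASE = "https://rottentomatoes-tvshows.herokuapp.com/tvshows"


def endpoint(*parts):
    """Build a URL under the /tvshows route, e.g. endpoint('post')."""
    return "/".join([BASE.rstrip("/")] + list(parts))


def get_json(url):
    """Send a GET request and decode the JSON response."""
    with request.urlopen(url) as resp:
        return json.load(resp)


def insert_show(doc):
    """Send a POST request with a JSON body to the /post route."""
    req = request.Request(
        endpoint("post"),
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


# Example usage (network calls, left commented out on purpose):
#   shows = get_json(endpoint())                       # Get All
#   one = get_json(endpoint("<id of the tv show>"))    # Get by ID
#   insert_show({"title": "..."})                      # Insert
```

Any HTTP client works the same way; the only project-specific parts are the base URL and the `/post` route.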
- PyPI
- Selenium
- BeautifulSoup
- MongoDB
- Chrome WebDriver
- Mongoose API
- Express
- Web scraping: Web Scraping with Python - Beautiful Soup Crash Course
- Python & JSON:
  - Python JSON dump() and dumps() for JSON Encoding
  - Python Encode Unicode and non-ASCII characters as-is into JSON
- Selenium: Web Scraping with Selenium in Python
- REST API:
  - Build A Restful Api With Node.js Express & MongoDB | Rest Api Tutorial
  - Create a complete REST API with Node, Express and MongoDB | Deploy on Heroku
- Stack Overflow
- Geeks For Geeks
The data visualization of this database is made using MongoDB Charts, as it connects directly to the MongoDB database. The full dashboard can be accessed through the TV Shows Dashboard.
Gresya Angelina Eunike Leman (18220104)
Information System and Technology
Institut Teknologi Bandung
