I had some requirements and restrictions:
- I had to collect the data myself (downloading ready-made datasets was not allowed).
- The dataset had to have between 30 and 100 observations (rows) and 5 to 10 features (columns).
- I could enrich the dataset with information obtained by methods other than manual typing (for example, web scraping).
- I needed to complete one analysis that answers the questions I set out to solve with this project, and supplement it with some hypotheses.
My questions were about the lyrics of the songs we listen to every day:
- Do the lyrics have an overall positive sentiment?
- Are women's lyrics more positive than men's?
- Are pop lyrics more positive than hip hop lyrics?
-
Creation of the dataset using Python: I decided to collect information from Spotify Charts because I wanted to analyze lyrics globally, so I chose the top artists of week 47 of 2022. I typed the artists of that chart into a file and used Last.fm to add more information by hand: gender, main genre, and whether the artist is a band.
Then I completed the dataset by looking up the 10 most popular songs of every artist on Spotify through its API.
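
The lookup step can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes a client object that behaves like a `spotipy.Spotify` instance (the text only says "their API", so the choice of spotipy is an assumption), and `top_track_names` is a hypothetical helper name.

```python
from typing import Any


def top_track_names(client: Any, artist_name: str, limit: int = 10) -> list[str]:
    """Return up to `limit` of the artist's most popular track names.

    `client` is assumed to behave like a `spotipy.Spotify` instance,
    i.e. it must offer `search` and `artist_top_tracks`.
    """
    # Resolve the artist name to a Spotify artist ID.
    found = client.search(q=f"artist:{artist_name}", type="artist", limit=1)
    items = found["artists"]["items"]
    if not items:
        return []  # artist not found on Spotify
    artist_id = items[0]["id"]

    # Ask for the artist's top tracks and keep only their names.
    top = client.artist_top_tracks(artist_id)
    return [track["name"] for track in top["tracks"][:limit]]
```

Passing the client in as an argument keeps the credentials outside the function and makes it easy to try the logic with a stand-in object before calling the real API.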
After that, I used the lyricsgenius library to connect to Genius and download the lyrics of 5 of each artist's 10 most popular Spotify songs (I started from 10 because a song's name in Spotify does not always have a matching name in Genius). At this point I also wrote a function with Selenium to obtain the connection token, in case the Genius token changes. Right now the token is stored in a file (secrets.txt), but it could be obtained without storing it.
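
The "5 out of 10" selection can be sketched like this. It is a hedged example: `first_lyrics_found` and `fetch_lyrics` are hypothetical names. With lyricsgenius, `fetch_lyrics` could be a small wrapper that calls `genius.search_song(title, artist)` and returns the `lyrics` attribute of the result, or `None` when nothing is found.

```python
from typing import Callable, Optional


def first_lyrics_found(
    titles: list[str],
    fetch_lyrics: Callable[[str], Optional[str]],
    wanted: int = 5,
) -> dict[str, str]:
    """Walk the titles in popularity order and keep the first `wanted` with lyrics."""
    lyrics_by_title: dict[str, str] = {}
    for title in titles:
        if len(lyrics_by_title) == wanted:
            break  # enough lyrics collected, skip the remaining titles
        text = fetch_lyrics(title)
        if text:  # None or "" means no match was found for this title
            lyrics_by_title[title] = text
    return lyrics_by_title
```

Stopping as soon as five lyrics are collected avoids unnecessary requests to Genius for the remaining titles.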
-
Once I had downloaded all the information, I needed to process it for the sentiment analysis (using the Flair library) and the natural language processing (NLTK).
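
As an illustration of the kind of processing involved, here is a minimal cleaning sketch. The exact steps used in the project are not listed here, so this is an assumption: it removes the section markers that Genius lyrics usually contain (such as `[Chorus]`) and tidies the whitespace. `clean_lyrics` is a hypothetical name.

```python
import re

# Section markers such as "[Chorus]" or "[Verse 1: Artist]" sit on their own line.
SECTION_TAG = re.compile(r"^\s*\[[^\]]*\]\s*$")


def clean_lyrics(raw: str) -> str:
    """Drop section markers and blank lines, and collapse runs of whitespace."""
    kept = []
    for line in raw.splitlines():
        if SECTION_TAG.match(line):
            continue  # skip "[Chorus]"-style markers
        line = " ".join(line.split())  # collapse repeated spaces and trim
        if line:
            kept.append(line)
    return "\n".join(kept)
```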
-
Since the songs in the list were in different languages, I decided to translate them all to English to simplify the analysis. I used the fastText library with its pre-trained model to detect the language of the lyrics, and then translated them to English with the translators library, which falls back to Google Translate when it cannot do the translation.
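
The detect-then-translate flow might look like the sketch below. Both callables are assumptions: `predict` is expected to behave like the `predict` method of a fastText language identification model (for example one loaded from `lid.176.bin`), returning labels such as `__label__es`, and `translate` is a hypothetical wrapper around whichever translation call is used.

```python
from typing import Callable, Sequence, Tuple

Prediction = Tuple[Sequence[str], Sequence[float]]


def to_english(
    text: str,
    predict: Callable[[str], Prediction],
    translate: Callable[[str, str], str],
) -> str:
    """Translate `text` to English unless it is already detected as English."""
    # fastText predicts on a single line, so newlines are flattened first.
    labels, _scores = predict(text.replace("\n", " "))
    language = labels[0].replace("__label__", "")  # "__label__es" -> "es"
    if language == "en":
        return text  # nothing to translate
    return translate(text, language)
```

Skipping lyrics that are already in English saves translation requests and avoids altering text that needs no change.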
-
I calculated the top words with NLTK functions and manually extended its stop word list with more words that were not interesting for my analysis.
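
The idea can be shown with a small standard-library sketch. Note that it swaps NLTK's tokenizer and stop word list for a plain regular expression and a tiny hand-written set, so that it runs without any downloads. The words in both sets are illustrative assumptions, not the project's actual list.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; NLTK's English list is much longer.
BASE_STOP_WORDS = {"the", "a", "and", "i", "you", "to", "it", "in", "my", "me"}
# Hypothetical manual additions: filler words that are common in lyrics.
EXTRA_STOP_WORDS = {"oh", "yeah", "la", "na"}


def top_words(text: str, n: int = 10) -> list[tuple[str, int]]:
    """Return the `n` most frequent words that are not stop words."""
    stop_words = BASE_STOP_WORDS | EXTRA_STOP_WORDS
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)
```

Keeping the manual additions in a separate set makes it easy to see which words were added on top of the base list.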
-
I also created a function to generate a word cloud with the wordcloud library and PIL, so the cloud can be drawn in different shapes that I downloaded.
-
In another Jupyter notebook I developed the hypothesis analysis and prepared the dataset for the visual analysis with Tableau.
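
One possible way to check a hypothesis such as "group A's lyrics are more positive than group B's" is a permutation test on the sentiment scores. The text does not say which test the notebook uses, so the sketch below is only an assumption, and `permutation_p_value` is a hypothetical name.

```python
import random
from statistics import mean


def permutation_p_value(
    group_a: list[float],
    group_b: list[float],
    rounds: int = 10_000,
    seed: int = 0,
) -> float:
    """One-sided p-value for 'mean of group_a is greater than mean of group_b'."""
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    size_a = len(group_a)

    at_least_as_large = 0
    for _ in range(rounds):
        # Reassign the scores to the two groups at random.
        rng.shuffle(pooled)
        diff = mean(pooled[:size_a]) - mean(pooled[size_a:])
        if diff >= observed:
            at_least_as_large += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (at_least_as_large + 1) / (rounds + 1)
```

A small p-value suggests that the observed difference would be unlikely if the group labels were unrelated to the sentiment scores. With a dataset of 30 to 100 rows, a permutation test avoids relying on normality assumptions.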
I used the public version of Tableau to create the presentation and the visual analysis of the project; you can find it in my Tableau
The project includes a demo in Flask that lets you search for a song by an artist and shows the translated lyrics with the sentiment analysis and a word cloud.