This project investigates whether Dua Lipa's lyrics exhibit characteristics consistent with Zipf's Law, a fundamental linguistic phenomenon.
We performed a comprehensive Exploratory Data Analysis (EDA) on 246 of her songs, exploring word frequencies, token distributions, and lyrical structures.
- Analyze Dua Lipa’s lyrical data for statistical patterns.
- Test Zipf’s Law, which states that a word’s frequency is roughly inversely proportional to its frequency rank (f ∝ 1/r).
- Explore lexical diversity, word distributions, and structural insights into her songwriting.
- Python 3
- Jupyter Notebook
- Libraries:
  - pandas – data preprocessing
  - numpy – numerical analysis
  - matplotlib / seaborn – visualization
  - nltk – natural language processing
- Handled missing values to ensure dataset integrity.
- Normalized text: converted to lowercase, removed punctuation.
- Tokenized lyrics into individual words.
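
The three preprocessing steps above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual notebook code (which lists pandas and nltk); the function name `preprocess_lyrics` and the sample strings are hypothetical.

```python
import string

# Translation table that deletes ASCII punctuation. Curly quotes are added
# as an assumption, since lyrics text often contains typographic apostrophes.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation + "‘’“”")


def preprocess_lyrics(lyrics):
    """Return a list of lowercase word tokens, or [] when lyrics are missing."""
    # Missing values (None, or a float NaN as pandas would supply) yield no tokens.
    if not isinstance(lyrics, str):
        return []
    # Normalize: lowercase, then strip punctuation.
    normalized = lyrics.lower().translate(_PUNCT_TABLE)
    # Tokenize on whitespace (a simple stand-in for an nltk tokenizer).
    return normalized.split()


# Hypothetical sample rows, including one missing value.
songs = ["Hello, hello! Is it me?", None, "Don't stop -- don't."]
tokens_per_song = [preprocess_lyrics(s) for s in songs]
```

Whitespace splitting is the simplest possible tokenizer; a dedicated tokenizer would treat contractions and hyphenated words differently, which can shift the word counts slightly.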
- Distribution of word frequencies → revealed a long-tailed distribution.
- The top 20 most frequent words included common terms such as “you” and “I”.
- Log-log plot of rank vs frequency showed a linear relationship.
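- Regression slope ≈ -1.57, steeper than the ideal Zipfian slope of -1 (frequency falls off faster with rank than the classic law predicts).
- R² = 0.96, indicating a strong linear fit on the log-log scale, i.e. power-law behavior consistent with Zipf’s Law.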
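
For readers who want to reproduce this kind of fit, the sketch below ranks words by frequency, regresses log10(frequency) on log10(rank) by ordinary least squares, and reports the slope and R². It is an assumption-laden illustration in plain Python, not the project's own code; the function name `zipf_fit` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def zipf_fit(tokens):
    """Fit log10(freq) = slope * log10(rank) + intercept by least squares.

    Returns (slope, intercept, r_squared).
    """
    # Frequencies sorted from most to least common; rank 1 is the top word.
    counts = sorted(Counter(tokens).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct words to fit a line")

    xs = [math.log10(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log10(freq) for freq in counts]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R^2 = 1 - residual sum of squares / total sum of squares.
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    return slope, intercept, r_squared


# Toy corpus built so that frequency = 120 / rank, i.e. an ideal Zipf shape.
toy_tokens = ["a"] * 120 + ["b"] * 60 + ["c"] * 40 + ["d"] * 30
slope, intercept, r_squared = zipf_fit(toy_tokens)
```

On the idealized toy corpus the slope comes out at -1 up to floating-point error. Real lyrics will deviate from that, and the fitted slope is sensitive to how the many rank-tied rare words in the tail are handled.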
- Correlation heatmaps revealed relationships between word counts, unique words, and average word length.
- Scatterplots/pairplots showed structural patterns in songwriting.
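
The per-song features behind these plots can be illustrated with a short sketch. It assumes each song is already tokenized and computes word count, unique-word count, and average word length, plus a Pearson correlation between any two feature columns. The names `song_features` and `pearson` are hypothetical, and the code uses only the standard library rather than the pandas/seaborn stack the project lists.

```python
import math


def song_features(tokens):
    """Compute simple structural features for one tokenized song."""
    word_count = len(tokens)
    unique_words = len(set(tokens))
    avg_word_length = (
        sum(len(t) for t in tokens) / word_count if word_count else 0.0
    )
    return {
        "word_count": word_count,
        "unique_words": unique_words,
        "avg_word_length": avg_word_length,
    }


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical tokenized songs, for illustration only.
songs = [
    ["la", "la", "love"],
    ["dance", "all", "night", "all", "night"],
    ["one", "kiss", "is", "all", "it", "takes"],
]
features = [song_features(s) for s in songs]
r = pearson(
    [f["word_count"] for f in features],
    [f["unique_words"] for f in features],
)
```

In a notebook, the same feature dictionaries could be loaded into a pandas DataFrame, whose `corr()` method produces the full correlation matrix that a heatmap visualizes.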
- Zipf’s Law Adherence
  - Dua Lipa’s lyrics strongly align with Zipf’s Law.
- Lexical Diversity
  - Rich vocabulary with a few high-frequency words and many rare words.
- Structural Insights
  - Complexity in songwriting highlighted through relationships between unique word counts, word length, etc.
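
One common way to put numbers on the "few frequent words, many rare words" pattern is the type-token ratio together with the share of words that occur only once (hapax legomena). The sketch below is illustrative only; the source does not state which diversity measures the project used, and the function name `lexical_diversity` and the sample tokens are hypothetical.

```python
from collections import Counter


def lexical_diversity(tokens):
    """Return the type-token ratio and the share of word types seen exactly once."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {"type_token_ratio": 0.0, "hapax_share": 0.0}
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        # Distinct words divided by total words.
        "type_token_ratio": len(counts) / total,
        # Words occurring once, as a fraction of distinct words.
        "hapax_share": hapaxes / len(counts),
    }


# Hypothetical token list: 6 tokens, 3 distinct words, 1 of them occurring once.
sample = ["you", "you", "you", "i", "i", "levitating"]
stats = lexical_diversity(sample)
```

The type-token ratio falls as a text gets longer, so comparing songs of very different lengths with it alone can be misleading.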
Through this project, we gained experience in:
- Text preprocessing & NLP basics.
- Applying statistical linguistics (Zipf’s Law).
- Data visualization & correlation analysis.
- Collaborative research and presentation of results.
- Extend analysis to other artists for comparison.
- Perform sentiment analysis on lyrics.
- Create an interactive dashboard (Streamlit/Dash).
- Aditya Singh – Data collection & cleaning, EDA framework.
- Pratik Kumar Pan – Multivariate analysis, heatmaps, visualizations.
- Shreyas Sarkar – Zipf’s Law analysis & interpretation.
- Priyank Gaur – Documentation & presentation design.
This project is for educational purposes only.
The lyrics dataset is used under fair use for linguistic analysis.