A content-based recommender that suggests relevant TED Talks for a free-text query describing your interests, using TF-IDF vectors ranked by cosine similarity and Pearson correlation.
- Clean modular project structure (not a single Jupyter notebook)
- Text preprocessing (stopwords removal + punctuation cleaning)
- TF-IDF vectorization
- Dual similarity scoring:
  - Cosine similarity (angle-based)
  - Pearson correlation (linear relationship)
- Combined ranking of top-N most relevant talks
- Easy to extend (add new similarity metrics, evaluation, UI, etc.)
Query:
"Climate change and impact on health and carbon footprint"
Recommended talks (example):
| main_speaker | details |
|---|---|
| Al Gore | ... climate change health impacts carbon emissions ... |
| Johan Rockström | ... planetary boundaries climate health connection ... |
| Christiana Figueres | ... Paris agreement health co-benefits ... |
ted-talks-recommender/
├── data/
│ └── tedx_dataset.csv # original dataset
├── src/
│ ├── __init__.py
│ ├── preprocessing.py # data loading & text cleaning
│ ├── model.py # TF-IDF + similarity computation
│ └── utils.py # helper functions
├── recommend.py # main recommendation script
├── README.md
└── requirements.txt (recommended)
- Python 3.8+
- pandas, scikit-learn, nltk, scipy
# Clone the repository
git clone https://github.com/YOUR-USERNAME/ted-talks-recommender.git
cd ted-talks-recommender
# (Recommended) Create virtual environment
python -m venv venv
source venv/bin/activate # Linux / macOS
venv\Scripts\activate # Windows
# Install dependencies
pip install pandas scikit-learn nltk scipy
# Download NLTK stopwords (run in Python)
import nltk
nltk.download('stopwords')
# Run the recommender
python recommend.py
Or modify the query directly in recommend.py:
query = [
"Climate change and impact on health and carbon footprint",
# "machine learning ethics bias fairness",
# "future of education technology children",
]
Load & Preprocess
- Read CSV → keep main_speaker, title, details
- Merge title + details
- Lowercase → remove stopwords → remove punctuation
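The cleaning steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual contents of `src/preprocessing.py`: the function names `merge_fields` and `clean_text` and the tiny `DEMO_STOPWORDS` set are assumptions made for the example (the project itself uses NLTK's English stopword list).

```python
import string

# Tiny illustrative stopword set (assumption); the project uses NLTK's English list.
DEMO_STOPWORDS = {"and", "on", "the", "of", "a"}

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def merge_fields(title, details):
    """Join title and details into one text field (hypothetical helper)."""
    return f"{title} {details}".strip()


def clean_text(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, drop stopwords, then strip punctuation, in that order."""
    tokens = text.lower().split()
    kept = [t for t in tokens if t not in stop_words]
    no_punct = " ".join(kept).translate(_PUNCT_TABLE)
    # Collapse any whitespace left behind by removed punctuation-only tokens.
    return " ".join(no_punct.split())


merged = merge_fields("Climate change", "and the impact on health!")
cleaned = clean_text(merged)
print(cleaned)
```

One consequence of removing stopwords before punctuation is that a stopword attached to punctuation (for example `and,`) is not matched and survives the cleaning; swapping the two steps would avoid that.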
Vectorization
- TfidfVectorizer on cleaned details column
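As a rough sketch of this step, the snippet below fits scikit-learn's `TfidfVectorizer` on a few toy strings standing in for the cleaned `details` column. The sample texts and variable names are assumptions for illustration, not rows from the dataset or code from `src/model.py`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the cleaned `details` column (not real dataset rows).
cleaned_details = [
    "climate change health carbon emissions",
    "machine learning ethics bias fairness",
    "future education technology children",
]

vectorizer = TfidfVectorizer()
# Learn the vocabulary and IDF weights, and build a sparse (n_docs, n_terms) matrix.
tfidf_matrix = vectorizer.fit_transform(cleaned_details)

# A query has to go through the already fitted vectorizer so that it shares
# the same vocabulary; terms unseen during fitting are silently ignored.
query_vector = vectorizer.transform(["climate change carbon footprint"])

print(tfidf_matrix.shape)
print(query_vector.shape)
```

Calling `transform` (rather than `fit_transform`) on the query keeps its columns aligned with the document matrix, which the similarity step depends on.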
Similarity Calculation
- Transform user query to TF-IDF vector
- Compute cosine similarity for each document
- Compute Pearson correlation for each document
- Sort primarily by cosine, secondarily by Pearson
- Return top-N results
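The ranking steps above might look like the following sketch. It assumes scikit-learn's `cosine_similarity` and SciPy's `pearsonr`; the `recommend` function, its signature, and the toy documents are hypothetical and only illustrate the cosine-first, Pearson-second ordering, not the repository's actual implementation.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def recommend(query, documents, top_n=3):
    """Return (doc_index, cosine, pearson) tuples for the top_n documents."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(documents)
    query_vec = vectorizer.transform([query])

    # Cosine similarity between the query and every document at once.
    cosine = cosine_similarity(query_vec, doc_matrix).ravel()

    # Pearson correlation needs dense vectors; fine for a sketch, but
    # memory-heavy for a large vocabulary.
    q = query_vec.toarray().ravel()
    dense_docs = doc_matrix.toarray()
    pearson = np.array([pearsonr(q, d)[0] for d in dense_docs])

    # Sort by cosine first, using Pearson to break ties, highest first.
    order = sorted(
        range(len(documents)),
        key=lambda i: (cosine[i], pearson[i]),
        reverse=True,
    )
    return [(i, float(cosine[i]), float(pearson[i])) for i in order[:top_n]]


# Toy documents (assumption), already cleaned.
docs = [
    "machine learning ethics bias fairness",
    "climate change health carbon emissions",
    "future education technology children",
]
results = recommend("climate change carbon footprint", docs, top_n=2)
for idx, cos, pear in results:
    print(idx, round(cos, 3), round(pear, 3))
```

Note that a query sharing no terms with the vocabulary produces an all-zero vector, for which Pearson correlation is undefined (`nan`); a fuller implementation would need to handle that case before sorting.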
- Add more preprocessing (lemmatization, stemming)
- Try sentence transformers / BERT embeddings
- Add evaluation metrics (precision@K, NDCG, user study)
- Create a simple Streamlit / Gradio web interface
- Support multi-query / query expansion
- Add speaker / event / year filters
MIT License
Feel free to use this code for learning, personal projects, or portfolios.
Made with ❤️ by Youssef Mohammed
Happy recommending! 🎬