A comprehensive implementation of collaborative filtering techniques for music recommendation using the Last.fm dataset.
This project implements three different collaborative filtering approaches:
- User-User Collaborative Filtering using cosine similarity
- Item-Item Collaborative Filtering using KNN
- Matrix Factorization using Alternating Least Squares (ALS)
The system uses the Last.fm dataset containing:
- User ID: Anonymized user identifiers
- Artist Name: Music artist names
- Listen Count: Number of times a user listened to an artist
The preprocessing stage:
- Filters users and artists based on minimum interaction thresholds
- Handles cold start problems by removing users/artists with insufficient data
- Creates train/test splits for evaluation
- Converts user and artist identifiers to categorical integer codes for efficient processing
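The filtering, encoding, and splitting steps above can be sketched as follows. This is a minimal illustration and not the project's filter_lastfm_raw(): the helpers filter_by_threshold and add_integer_codes, the toy frame, and the thresholds are all assumptions made for the example, and a random 70/30 split stands in for whatever split the project actually uses.

```python
import pandas as pd


def filter_by_threshold(df, user_t, artist_t):
    """Keep artists with >= artist_t distinct users, then users with >= user_t distinct artists.

    Hypothetical helper: a single filtering pass, artists first and users second.
    """
    artist_counts = df.groupby("artistName")["userId"].nunique()
    keep_artists = artist_counts[artist_counts >= artist_t].index
    df = df[df["artistName"].isin(keep_artists)]

    user_counts = df.groupby("userId")["artistName"].nunique()
    keep_users = user_counts[user_counts >= user_t].index
    return df[df["userId"].isin(keep_users)].copy()


def add_integer_codes(df):
    """Map user and artist labels to contiguous integer codes via pandas categoricals."""
    df = df.copy()
    df["user_ID"] = df["userId"].astype("category").cat.codes
    df["artist_ID"] = df["artistName"].astype("category").cat.codes
    return df


# Toy data: u1 and u2 each listen to artists A and B; u3 listens only to the rare artist C.
raw = pd.DataFrame({
    "userId": ["u1", "u1", "u2", "u2", "u3"],
    "artistName": ["A", "B", "A", "B", "C"],
    "listens": [10, 3, 7, 1, 5],
})

filtered = add_integer_codes(filter_by_threshold(raw, user_t=2, artist_t=2))

# Random 70/30 split (an assumption for this sketch).
train = filtered.sample(frac=0.7, random_state=0)
test = filtered.drop(train.index)
```

With these thresholds, artist C and user u3 are dropped before the integer codes are assigned, so the codes stay contiguous.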
The user-user approach:
- Computes cosine similarity between users
- Finds similar users based on listening patterns
- Recommends artists liked by similar users
- Handles missing values through imputation
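A minimal NumPy sketch of the user-user steps listed above, assuming missing listens are imputed with zeros. The function names and the toy matrix are hypothetical; this is not the project's get_rec_u2u_cb().

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (one row per user)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for users with no listens
    Xn = X / norms
    return Xn @ Xn.T


def recommend_user_user(user, X, sim, n_neighbors=2, top_n=2):
    """Score unseen artists by the similarity-weighted listens of the nearest users."""
    sim_to_others = sim[user].copy()
    sim_to_others[user] = -np.inf  # exclude the target user from the neighbor list
    neighbors = np.argsort(sim_to_others)[::-1][:n_neighbors]
    weights = sim[user, neighbors]
    artist_scores = weights @ X[neighbors]
    artist_scores[X[user] > 0] = -np.inf  # drop artists the user already listens to
    ranked = np.argsort(artist_scores)[::-1][:top_n]
    # Keep only artists that received a positive score from the neighbors.
    return [int(a) for a in ranked if artist_scores[a] > 0]


# Rows are users, columns are artists; zeros stand in for missing listens.
listens = np.array([
    [5.0, 3.0, 0.0, 0.0],
    [4.0, 2.0, 6.0, 0.0],
    [0.0, 0.0, 1.0, 7.0],
])
sim = cosine_similarity_matrix(listens)
recs = recommend_user_user(0, listens, sim, n_neighbors=1)
```

Here user 0 shares two artists with user 1 and none with user 2, so the single nearest neighbor is user 1 and the only candidate is the artist in column 2.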
The item-item approach:
- Uses the Surprise library's KNNBasic algorithm
- Finds similar artists based on user preferences
- Recommends artists similar to user's listening history
- Configurable similarity metrics (MSD, cosine, Pearson)
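To make the three similarity options concrete, here is a small NumPy sketch that compares two artists over the users they have in common. The formulas follow the common-user definitions described in the Surprise documentation (MSD similarity taken as 1 / (msd + 1)); item_similarity is a hypothetical helper for illustration, not the library's code.

```python
import numpy as np


def item_similarity(ratings, i, j, name="msd"):
    """Similarity between item columns i and j over users who interacted with both.

    ratings: 2-D array (users x items) with np.nan marking missing entries.
    """
    common = ~np.isnan(ratings[:, i]) & ~np.isnan(ratings[:, j])
    if not common.any():
        return 0.0
    a, b = ratings[common, i], ratings[common, j]
    if name == "msd":
        # Mean squared difference, mapped so that identical vectors score 1.
        return float(1.0 / (np.mean((a - b) ** 2) + 1.0))
    if name == "cosine":
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0
    if name == "pearson":
        a_c, b_c = a - a.mean(), b - b.mean()
        denom = np.linalg.norm(a_c) * np.linalg.norm(b_c)
        return float(a_c @ b_c / denom) if denom else 0.0
    raise ValueError(f"unknown similarity: {name}")


# Three users, three artists; artist 2 has a single listener.
ratings = np.array([
    [1.0, 2.0, np.nan],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, np.nan],
])
msd_01 = item_similarity(ratings, 0, 1, "msd")
cos_01 = item_similarity(ratings, 0, 1, "cosine")
pearson_01 = item_similarity(ratings, 0, 1, "pearson")
```

Artists 0 and 1 are perfectly proportional, so cosine and Pearson both reach their maximum, while MSD penalizes the difference in scale. This is one reason the choice of metric matters for raw listen counts.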
The matrix factorization approach:
- Implements Alternating Least Squares using the implicit library
- Decomposes user-artist matrix into latent factors
- Handles sparse data efficiently
- Applies confidence weighting to implicit feedback
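The idea behind confidence-weighted ALS can be sketched with a small dense NumPy version. It assumes the usual implicit-feedback formulation: binary preference p = 1 if listens > 0, confidence c = 1 + alpha * listens. The function als_implicit and its defaults are hypothetical; the implicit library does the same kind of computation on sparse data far more efficiently.

```python
import numpy as np


def als_implicit(R, factors=2, alpha=10.0, reg=0.1, iterations=15, seed=0):
    """Dense toy ALS for implicit feedback (confidence-weighted least squares).

    R: (users x items) array of non-negative listen counts.
    Returns user factors X and item factors Y.
    """
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = (R > 0).astype(float)  # binary preference
    C = 1.0 + alpha * R        # confidence grows with the listen count
    X = rng.normal(scale=0.1, size=(n_users, factors))
    Y = rng.normal(scale=0.1, size=(n_items, factors))
    ridge = reg * np.eye(factors)

    def solve(fixed, conf, pref):
        # Solve one regularized weighted least-squares problem per row of conf/pref.
        out = np.empty((conf.shape[0], factors))
        for k in range(conf.shape[0]):
            weighted = conf[k][:, None] * fixed     # C_k applied to each fixed vector
            A = fixed.T @ weighted + ridge          # F^T C_k F + reg * I
            b = fixed.T @ (conf[k] * pref[k])       # F^T C_k p_k
            out[k] = np.linalg.solve(A, b)
        return out

    for _ in range(iterations):
        X = solve(Y, C, P)      # update user factors with item factors fixed
        Y = solve(X, C.T, P.T)  # update item factors with user factors fixed
    return X, Y


# Two taste groups: users 0-1 listen to artists 0-1, users 2-3 to artists 2-3.
R = np.array([
    [5.0, 3.0, 0.0, 0.0],
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 2.0, 6.0],
    [0.0, 0.0, 0.0, 4.0],
])
X, Y = als_implicit(R)
scores = X @ Y.T  # predicted preference for every user-artist pair
```

For user 1, who has only listened to artist 0, the predicted score for artist 1 (same group) should come out higher than for artist 3 (other group), which is the behavior the latent factors are meant to capture.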
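Install the dependencies (the Surprise library is published on PyPI as scikit-surprise):
pip install pandas numpy scipy matplotlib seaborn plotly
pip install implicit scikit-surprise scikit-learn tqdm ipywidgets
Import the libraries used in the snippets below:
import pandas as pd
import numpy as np
import implicit
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity
from surprise import KNNBasic, Dataset, Reader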
# Load dataset
df = pd.read_csv("usersha1-artmbid-artname-plays.tsv",
                 delimiter="\t", header=None,
                 usecols=[0, 2, 3],
                 names=['userId', 'artistName', 'listens'])
# Filter users and artists based on thresholds
df_filtered, users_filtered, artists_filtered = filter_lastfm_raw(
    df, user_sum, artist_sum, user_t=50, artist_t=50
)
# Create user-item matrix
df_pivot = df_train.pivot_table(
    index='userId', columns='artistName', values='listens'
)
# Compute similarity and generate recommendations
user_similarity = cosine_similarity(df_pivot.fillna(0).values)
recommendations = get_rec_u2u_cb(user_index, user_similarity, ...)
# Setup Surprise dataset
reader = Reader(rating_scale=(1, df['listens'].max()))
# load_from_df expects exactly three columns, in the order user, item, rating
trainset = Dataset.load_from_df(
    df_train[['userId', 'artistName', 'listens']], reader
).build_full_trainset()
# Train KNN model
algo = KNNBasic(k=100, sim_options={'name': 'msd', 'user_based': False})
algo.fit(trainset)
# Create sparse matrix
sparse_matrix = sparse.csr_matrix((df_train['listens'],
                                   (df_train['user_ID'], df_train['artist_ID'])))
# Train ALS model
model = implicit.als.AlternatingLeastSquares(factors=50, regularization=0.1)
model.fit(sparse_matrix * 10)  # Apply confidence weighting
The system uses standard recommendation metrics:
Precision@K measures the fraction of recommended items that are relevant:
Precision@K = (Relevant items in top-K recommendations) / K
Recall@K measures the fraction of relevant items that were recommended:
Recall@K = (Relevant items in top-K recommendations) / (Total relevant items)
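The two formulas translate directly into a few lines of Python. The helper below is a hypothetical sketch for a single user, not the project's pak() function, and the artist names are made-up example data.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Compute Precision@K and Recall@K for one user.

    recommended: ranked list of recommended items (best first).
    relevant: collection of items the user actually interacted with in the test set.
    """
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k if k > 0 else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


recommended = ["Radiohead", "Muse", "Coldplay", "Björk", "Portishead"]
relevant = {"Muse", "Portishead", "Massive Attack"}

p_at_5, r_at_5 = precision_recall_at_k(recommended, relevant, k=5)
p_at_2, r_at_2 = precision_recall_at_k(recommended, relevant, k=2)
```

Two of the five recommendations are relevant, so Precision@5 is 2/5 and Recall@5 is 2/3. Cutting the list to K=2 raises precision to 1/2 but lowers recall to 1/3.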
Key functions:
- filter_lastfm_raw(): Filters dataset based on user/artist thresholds
- plot_heat(): Visualizes user-artist interaction patterns
- get_rec_u2u_cb(): Generates user-user collaborative filtering recommendations
- get_preds_i2i(): Generates item-item collaborative filtering recommendations
- pak(): Computes Precision@K and Recall@K metrics
The system addresses cold start issues by:
- Filtering out users with few artist interactions
- Filtering out artists with few user interactions
- Using popularity-based recommendations for new users/items
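A popularity fallback for new users could look like the sketch below. Both helpers and the data layout (a list of (user, artist, listens) tuples and a dict of precomputed personalized lists) are assumptions for illustration. Popularity is measured here as the number of distinct listeners, which is one of several reasonable choices.

```python
from collections import Counter


def popular_artists(interactions, top_n=3):
    """Rank artists by number of distinct listeners (ties broken by name)."""
    listeners = Counter()
    seen = set()
    for user, artist, _listens in interactions:
        if (user, artist) not in seen:
            seen.add((user, artist))
            listeners[artist] += 1
    ranked = sorted(listeners.items(), key=lambda kv: (-kv[1], kv[0]))
    return [artist for artist, _count in ranked[:top_n]]


def recommend_with_fallback(user, personalized, popular):
    """Return the user's personalized list if one exists, else the popularity list."""
    recs = personalized.get(user)
    return recs if recs else popular


interactions = [
    ("u1", "A", 10), ("u2", "A", 3), ("u3", "A", 1),
    ("u1", "B", 5), ("u2", "B", 2),
    ("u3", "C", 9),
]
popular = popular_artists(interactions, top_n=2)

# Hypothetical output of one of the collaborative filtering models.
personalized = {"u1": ["C"]}
known_user_recs = recommend_with_fallback("u1", personalized, popular)
new_user_recs = recommend_with_fallback("new_user", personalized, popular)
```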
For performance and data handling, the system:
- Uses sparse matrix representations for memory efficiency
- Applies confidence weighting in matrix factorization
- Implements various imputation strategies for missing values
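To illustrate why the sparse representation matters, the sketch below builds a CSR matrix from integer user and artist codes and compares its footprint with the equivalent dense array. The sizes and the randomly generated interactions are assumptions chosen only to mimic a sparse listening matrix.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
n_users, n_artists, n_obs = 1000, 500, 5000

# Synthetic (user, artist, listens) triples; duplicate pairs are summed by csr_matrix.
user_ids = rng.integers(0, n_users, size=n_obs)
artist_ids = rng.integers(0, n_artists, size=n_obs)
listens = rng.integers(1, 50, size=n_obs).astype(np.float32)

user_items = sparse.csr_matrix(
    (listens, (user_ids, artist_ids)), shape=(n_users, n_artists)
)

# Fraction of user-artist pairs that have any listens.
density = user_items.nnz / (n_users * n_artists)

# CSR stores only the non-zero values plus two index arrays.
sparse_bytes = (
    user_items.data.nbytes + user_items.indices.nbytes + user_items.indptr.nbytes
)
# Size the same matrix would need as a dense float32 array (computed, not allocated).
dense_bytes = n_users * n_artists * listens.dtype.itemsize
```

At roughly 1% density the CSR arrays are a small fraction of the dense size, and the gap widens as the user and artist counts grow.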
The implementation provides comparative analysis of three approaches:
- User-User CF: Good for users with similar taste profiles
- Item-Item CF: Better scalability and stability
- Matrix Factorization: Handles sparsity well, discovers latent factors
Dependencies:
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- scipy: Sparse matrix operations
- implicit: Matrix factorization algorithms
- surprise: Collaborative filtering library
- scikit-learn: Machine learning utilities
- matplotlib/seaborn/plotly: Data visualization
Additional notes:
- The system is designed for implicit feedback (listen counts) rather than explicit ratings
- Memory usage is optimized through sparse matrix representations
- Evaluation uses temporal splitting (70% train, 30% test)
- Visualization includes heatmaps for similarity analysis
Possible future enhancements:
- Content-based filtering integration
- Deep learning approaches (neural collaborative filtering)
- Real-time recommendation updates
- A/B testing framework for recommendation quality assessment