This service provides personalised recommendations for flatmate matching based on a user's profile and preferences. It uses unsupervised learning, namely clustering, to group users together based on their similarities.
The goal is to improve the user experience of our flatmate matching platform by providing personalised recommendations.
The R notebook BCM_DATA_V2.Rmd contains the research that we conducted to determine the best clustering method, number of clusters, and number of PCA components for our recommender service.
The notebook is divided into the following sections:
- Cleaning
- Exploratory Data Analysis
- Feature Engineering
  - Feature Selection
  - Feature Encoding
  - Feature Scaling
  - TF-IDF
  - Outlier Detection
  - Interaction Features
- Model Selection & PCA
  - Centroid Based Clustering
  - Hierarchical Clustering
  - Density Based Clustering
  - Fuzzy Clustering
  - Affinity Propagation Clustering
  - Spectral Clustering
- Conclusion
To overcome the challenges of data privacy, we opted to use synthetic data in our models.
The data was generated using Mostly.ai, a synthetic data generation platform that uses GANs to generate data that statistically and structurally mirrors the original.
The generated data has the following features:
- age: Numeric
- gender: Categorical (male, female, other)
- language: Categorical
- nationality: Categorical
- occupation: Categorical (student, professional, other)
- smoker: Categorical (yes, no)
- any pets: Categorical (yes, no)
- interests: Text data (e.g. music, hiking, cooking, etc.)
- budget: Numeric
- room wanted: Categorical (single, double)
- areas: Text data (e.g. Zone 1, Elephant & Castle, Highbury, etc.)
- min term: Numeric, unit is months
- max term: Numeric, unit is months
- amenities: Text data (e.g. roof terrace, gym, furnished, etc.)
Flatmate Preferences:
- smokers OK?: Categorical (yes, no)
- pets OK?: Categorical (yes, no)
- occupation Req?: Categorical (student, professional, no preference)
- min age: Numeric
- max age: Numeric
- gender Req: Categorical (female, male, other)
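As a rough illustration of how mixed profile fields like these can become a numeric vector, the sketch below one-hot encodes a categorical field, min-max scales a numeric field, and computes TF-IDF weights for a comma-separated text field. It is written in plain Python rather than the notebook's R. The helper names, the sample profiles, and the exact TF-IDF variant (a smoothed idf) are assumptions for illustration and are not taken from BCM_DATA_V2.Rmd.

```python
import math
from collections import Counter


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


def min_max(value, lo, hi):
    """Scale `value` into [0, 1] given the observed range [lo, hi]."""
    return 0.0 if hi == lo else (value - lo) / (hi - lo)


def tokenize(text):
    """Split a comma-separated field such as 'music, hiking' into terms."""
    return [t.strip().lower() for t in text.split(",") if t.strip()]


def tf_idf(docs):
    """Return (vocabulary, rows) where each row holds TF-IDF weights.

    Assumed variant: tf = count / doc length,
    idf = ln((1 + N) / (1 + df)) + 1.
    """
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n = len(docs)
    df = Counter(t for doc in tokenized for t in set(doc))
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1
        rows.append([
            (counts[t] / total) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t in vocab
        ])
    return vocab, rows


# Hypothetical sample profiles using a few of the field names listed above.
profiles = [
    {"age": 22, "smoker": "no", "interests": "music, hiking"},
    {"age": 30, "smoker": "yes", "interests": "cooking, music"},
    {"age": 26, "smoker": "no", "interests": "hiking"},
]

ages = [p["age"] for p in profiles]
vocab, interest_rows = tf_idf([p["interests"] for p in profiles])

# One vector per profile: [scaled age] + one-hot smoker + TF-IDF interests.
vectors = [
    [min_max(p["age"], min(ages), max(ages))]
    + one_hot(p["smoker"], ["yes", "no"])
    + row
    for p, row in zip(profiles, interest_rows)
]
```

Each resulting vector has one scaled numeric column, two one-hot columns, and one column per distinct interest term. Vectors of this kind are what PCA and clustering would then operate on.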
The service clusters users using PAM (Partitioning Around Medoids) on 4 PCA components, with a cosine-similarity-based distance metric.
PAM is a popular and accurate clustering algorithm, but it can be computationally expensive on large data sets. For this reason, the service uses CLARANS, an extension of PAM designed for clustering large data sets.
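To make the clustering step more concrete, here is a minimal sketch of k-medoids with cosine distance (taken here as 1 minus cosine similarity) using a greedy PAM-style swap search. It is an illustration in plain Python under simplifying assumptions: a naive first-k initialization and tiny made-up 2-D points rather than 4 PCA components. It is not the service's actual implementation. Roughly speaking, CLARANS would examine only a random sample of candidate swaps instead of testing all of them.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def total_cost(points, medoids):
    """Sum of each point's distance to its nearest medoid."""
    return sum(
        min(cosine_distance(p, points[m]) for m in medoids) for p in points
    )


def pam(points, k):
    """Greedy PAM-style swap search; returns (medoid indices, labels)."""
    medoids = list(range(k))  # naive initialization: the first k points
    best = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                # Try replacing medoid i with a non-medoid point.
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                cost = total_cost(points, trial)
                if cost < best - 1e-12:
                    medoids, best, improved = trial, cost, True
    # Assign every point to the index of its nearest medoid.
    labels = [
        min(range(k), key=lambda j: cosine_distance(p, points[medoids[j]]))
        for p in points
    ]
    return medoids, labels


# Made-up points: three pointing roughly along x, three roughly along y.
points = [(1.0, 0.1), (0.9, 0.0), (1.0, 0.2), (0.1, 1.0), (0.0, 0.8), (0.2, 1.0)]
medoids, labels = pam(points, k=2)
```

Because cosine distance depends only on direction, the two groups separate cleanly even though their magnitudes differ.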
The service manages three document types in MongoDB: TrainedModel, PreProcessingMeta, and ClusteredProfile.
- TrainedModel: Stores the trained model and its PCA components.
- PreProcessingMeta: Stores the metadata used to preprocess the data, so that the whole dataset does not have to be preprocessed again.
- ClusteredProfile: Stores the clustered profiles, each formed by: (1) the original user data, (2) the processed profile (ready to be ingested by our model), and (3) the cluster the user belongs to.
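One way to read the role of PreProcessingMeta: if the scaling ranges and category lists learned from the full dataset are stored, a single new profile can be encoded without touching the rest of the data. The sketch below illustrates that idea in Python. The field names (`numeric_ranges`, `categories`), the helper functions, and the sample values are hypothetical and do not reflect the service's real schema.

```python
from dataclasses import dataclass, field


@dataclass
class PreProcessingMeta:
    """Hypothetical stand-in for the stored preprocessing metadata."""
    numeric_ranges: dict = field(default_factory=dict)  # feature -> (min, max)
    categories: dict = field(default_factory=dict)      # feature -> ordered values


def fit_meta(profiles, numeric, categorical):
    """Learn ranges and category lists once from the full dataset."""
    meta = PreProcessingMeta()
    for name in numeric:
        values = [p[name] for p in profiles]
        meta.numeric_ranges[name] = (min(values), max(values))
    for name in categorical:
        meta.categories[name] = sorted({p[name] for p in profiles})
    return meta


def encode(profile, meta):
    """Encode one profile using only the stored metadata."""
    vector = []
    for name, (lo, hi) in meta.numeric_ranges.items():
        scaled = 0.0 if hi == lo else (profile[name] - lo) / (hi - lo)
        vector.append(min(1.0, max(0.0, scaled)))  # clamp values outside the range
    for name, values in meta.categories.items():
        vector.extend(1.0 if profile[name] == v else 0.0 for v in values)
    return vector


# Made-up training data, then a new user arriving later.
training = [
    {"budget": 600, "smoker": "no"},
    {"budget": 1000, "smoker": "yes"},
]
meta = fit_meta(training, numeric=["budget"], categorical=["smoker"])
new_vector = encode({"budget": 800, "smoker": "no"}, meta)
```

Only `meta` is needed at encoding time, which is the property the PreProcessingMeta document is described as providing.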
- /api/recommendation/matches/{uid} - Returns a list of recommended users for the given user, ordered from most to least similar. (There is no rate limiting for now; the service returns the entire cluster in a single call.)
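The "most to least similar" ordering could be produced along these lines: take the requesting user's processed vector, compare it with every other member of the same cluster, and sort by cosine similarity. This is a hedged Python sketch; the `rank_matches` helper, the dictionary layout, and the sample vectors are assumptions for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 0.0 if norm == 0 else dot / norm


def rank_matches(uid, cluster):
    """Return the other uids in `cluster`, most similar to `uid` first.

    `cluster` maps uid -> processed vector for the users in one cluster.
    """
    target = cluster[uid]
    others = [(other, vec) for other, vec in cluster.items() if other != uid]
    others.sort(key=lambda item: cosine_similarity(target, item[1]), reverse=True)
    return [other for other, _ in others]


# Made-up cluster of processed profiles.
cluster = {
    "u1": [1.0, 0.0, 0.5],
    "u2": [0.9, 0.1, 0.5],
    "u3": [0.0, 1.0, 0.2],
    "u4": [0.5, 0.5, 0.5],
}
matches = rank_matches("u1", cluster)
```

The requesting user is excluded from their own results, and the whole cluster is returned in one list, matching the single-call behaviour described above.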
The service was developed using a modular approach to provide easy navigation and separation of concerns. The project is structured as follows:
- config: Configuration files
- controller: Web controllers to manage endpoints
- converter: Classes that help MongoDB serialise and deserialise more complex objects
- entity: Persistent data model
- exception: Custom exceptions
- listener: Event listeners. Checks if a model is trained at startup
- logic: Model and clustering logic
- repository: Data access layer
- service: Business logic layer
- util: Miscellaneous classes
The service is not currently operational because it depends on another service, onboarding-service, which is used to create user profiles and expose them through an API.
The onboarding-service is tied to my personal MongoDB instance, and making it public at this time would not be prudent.
- Use the BCM_DATA_V2.Rmd notebook and syntethic_data.csv to inspect the data and run the models.
- Use the tests to see if the service works as expected locally.
Do let me know if there is anything else I can help with.