This project automates the process of reading book blurbs to classify their genres. Multi-label classification is a prominent area of research in machine learning, so I was intrigued by the challenge of combining text analysis with multi-label prediction.
I've also added explanations for the code in the Jupyter notebook.
Approach: To build a model that accurately predicts multiple book genres, we will train three different models and compare them.
- Random Forest Classifier
- Logistic Regression Classifier (Incorporated with OneVsRest)
- Feedforward Neural Network
- Preprocessing
- Training
- Testing functions
- Evaluation
- Setup & Dependencies
- Practicality & Improvements
Source: Carnegie Mellon University
Name: CMU Book Summary Dataset
Books: 16,559
Metadata:
- Wiki ID
- Freebase ID
- Book Title
- Book Author
- Publication Date
- Genres
Extract useful data: We will only be using 'Title', 'Genres' and 'Summary' for our models.
Missing values: Removed all rows with missing values from our dataframe (3,718 were found in the 'Genres' column).
Freebase Tags: The 'Genres' field stores a JSON dictionary mapping Freebase IDs to genre names, so we parse it with json.loads and keep only the genre names.
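A minimal sketch of this step, assuming the dataframe is called `df` and the column is named 'Genres':

```python
import json
import pandas as pd

# The 'Genres' field is a JSON dict mapping Freebase IDs to genre names,
# e.g. '{"/m/...": "Fiction", ...}' -- we keep only the names.
def strip_freebase_tags(raw):
    if pd.isna(raw):
        return None
    return list(json.loads(raw).values())

df['Genres'] = df['Genres'].apply(strip_freebase_tags)
```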
Low-frequency genres: Removed over 150 genres that contained fewer than 50 books, and merged several subgenres into their parent genres (e.g. Speculative Fiction -> Fiction).
To clean the text features, we will apply the following steps (sketched in code after the list):
- Change all characters to lower case
- Remove any numbers from text
- Remove white spaces
- Remove punctuation (with Python's string library)
- Remove words with fewer than 3 characters
- Remove stopwords (with NLTK)
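A minimal sketch of these cleaning steps (the 'Summary' column name is assumed):

```python
import re
import string
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

STOPWORDS = set(stopwords.words('english'))

def clean_text(text):
    text = text.lower()                                                # lower-case
    text = re.sub(r'\d+', '', text)                                    # remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation))   # remove punctuation
    words = text.split()                                               # splitting also collapses extra whitespace
    words = [w for w in words if len(w) >= 3 and w not in STOPWORDS]   # short words + stopwords
    return ' '.join(words)

df['Summary'] = df['Summary'].apply(clean_text)
```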
Genres: Since each book can be assigned multiple genres, we will encode the labels with a MultiLabelBinarizer.
Text data: For the Logistic Regression and Random Forest classifiers, we will use a TF-IDF vectorizer with a threshold of 0.8 and a maximum of 10,000 features.
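Roughly, and assuming the 0.8 threshold refers to the vectorizer's max_df parameter:

```python
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.feature_extraction.text import TfidfVectorizer

# One 0/1 column per genre.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(df['Genres'])

# TF-IDF features for the classical models: ignore terms appearing in more
# than 80% of summaries (max_df=0.8 is an assumption about the threshold)
# and cap the vocabulary at 10,000 features.
tfidf = TfidfVectorizer(max_df=0.8, max_features=10000)
X_tfidf = tfidf.fit_transform(df['Summary'])
```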
For the Neural Network, we will create a word index using Keras's Tokenizer, where the most frequent words come first. We will also convert the summaries into sequences and pad them to a maximum length of 500 tokens.
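A sketch of the sequence preparation (no vocabulary cap is passed to the Tokenizer here; the notebook may limit num_words):

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_LEN = 500  # pad/truncate every summary to 500 tokens

tokenizer = Tokenizer()             # word index is ordered by word frequency
tokenizer.fit_on_texts(df['Summary'])
sequences = tokenizer.texts_to_sequences(df['Summary'])
X_seq = pad_sequences(sequences, maxlen=MAX_LEN)
```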
Splitting data: We will use the standard 80/20 train/test split to train and evaluate our models.
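For example (the random seed is an arbitrary choice here):

```python
from sklearn.model_selection import train_test_split

# Reusing the same split for every model keeps the comparison fair.
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, y, test_size=0.2, random_state=42)
```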
To compare our models, we will use their F1 scores, which balance precision and recall, since there are multiple labels to be classified. We will also use the micro average because there is a significant class imbalance.
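With scikit-learn this amounts to:

```python
from sklearn.metrics import f1_score

# Micro-averaging pools true/false positives across all genre labels,
# so the score is not dominated by performance on rare genres.
# y_pred is the binary genre matrix predicted by a trained model.
score = f1_score(y_test, y_pred, average='micro')
```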
For our baseline model, we simply predict the most frequent genre (Fiction: 4,191 books). Since there are 11,282 books in the cleaned dataset, the baseline accuracy is 4,191 / 11,282 ≈ 37%. Note that this accuracy only considers a single genre, so precision, recall and F1 score are not applicable; it is included purely for reference.
We will compare two Random Forest models: one on its own and one wrapped in OneVsRest.
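A sketch of the two variants, using the TF-IDF features and binarized labels from the preprocessing step (hyperparameters are left at scikit-learn defaults here):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

# Random Forest can fit a multi-label indicator matrix directly...
rf_plain = RandomForestClassifier()
rf_plain.fit(X_train, y_train)

# ...while the OneVsRest variant trains one forest per genre.
rf_ovr = OneVsRestClassifier(RandomForestClassifier())
rf_ovr.fit(X_train, y_train)
```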

From the results, the OneVsRest variant yields slightly better scores.
Since Logistic Regression is inherently a binary classifier, we must wrap it in OneVsRest to handle multiple labels.
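A minimal sketch of the wrapped model (max_iter is raised only to help convergence on TF-IDF features; it is an assumption, not a tuned value):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# One independent binary logistic regression per genre.
logreg_ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
logreg_ovr.fit(X_train, y_train)
y_pred = logreg_ovr.predict(X_test)
```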

Hyperparameters: To find the optimal hyperparameters for our Neural Network, we will use a combination of trial and error and grid search; a model sketch follows the settings listed below.

- Total params: 4,336,636
- Batch size: 25
- Epochs: 2
- Optimizer: adam
- Loss function: binary_crossentropy
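A minimal Keras sketch consistent with these settings. The layer types and sizes (embedding dimension, hidden units) are illustrative assumptions and will not reproduce the exact 4,336,636-parameter count:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalMaxPooling1D, Dense

vocab_size = len(tokenizer.word_index) + 1
n_genres = y.shape[1]

model = Sequential([
    Embedding(vocab_size, 128),              # embedding dimension is an assumption
    GlobalMaxPooling1D(),
    Dense(64, activation='relu'),            # hidden size is an assumption
    Dense(n_genres, activation='sigmoid'),   # one independent probability per genre
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train_seq, y_train, batch_size=25, epochs=2)
```

where X_train_seq is the padded-sequence counterpart of the training split.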
We will create two testing functions to inspect our models' predictions.
Inference function: Our first function takes in a book title and checks whether it exists in our dataset. If it does, the function uses the book's summary to predict its genres, so we can compare the predicted and actual genres.
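A hedged sketch of what this might look like; the function and parameter names are illustrative, and the TF-IDF pipeline is used for simplicity:

```python
def inference(title, df, model, vectorizer, mlb):
    """Look a title up in the dataset and compare predicted vs. actual genres."""
    match = df[df['Title'].str.lower() == title.lower()]
    if match.empty:
        print(f"'{title}' was not found in the dataset.")
        return
    summary = clean_text(match.iloc[0]['Summary'])
    pred = model.predict(vectorizer.transform([summary]))
    print('Predicted:', mlb.inverse_transform(pred)[0])
    print('Actual:   ', match.iloc[0]['Genres'])
```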

Analyse function: The analyse function takes in arbitrary text and predicts its genres. For this example we will use the summary of 'The Woman in White' (from Wikipedia), which does not appear in our dataset.
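A sketch along the same lines (again with illustrative names):

```python
def analyse(text, model, vectorizer, mlb):
    """Predict genres for an arbitrary piece of text, e.g. a Wikipedia summary."""
    cleaned = clean_text(text)
    pred = model.predict(vectorizer.transform([cleaned]))
    return mlb.inverse_transform(pred)[0]

# Example usage with a summary pasted from Wikipedia:
# analyse(woman_in_white_summary, logreg_ovr, tfidf, mlb)
```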

The actual genres for this book are Novel, Fiction, Gothic and Mystery, so this prediction is reasonably accurate.
From the results, the Neural Network achieved the best F1 score, and if we also measure its accuracy:

we can conclude that it is fairly effective.
In comparison to other similar multi-label classifiers, such as this movie genre predictor, which also achieved an F1 score in the 50s, we can conclude that our Neural Network is reasonably accurate.
The libraries/imports are located at the beginning of the Jupyter notebook. For the versions used at the time this project was created, refer to the requirements.txt file.
Practicality: This model could be scaled up into a book recommendation system based on a reader's preferred or previously read genres. Additional data, such as reading history and book ratings, could be integrated to achieve this.
What else could be tried?:
- A dataset with a more balanced genre distribution could be used, since almost 40% of this one is Fiction.
- A larger dataset with more samples and genres may produce a model that can predict rarer genres.
- More models could be built and tested, such as Recurrent Neural Networks or Long Short-Term Memory networks.
- The text could be processed further with n-grams and stemming/lemmatization, given more processing power.
- Other word embedding techniques, such as GloVe, could be explored.





