A practical demonstration of how to transform messy MOT (Ministry of Transport) defect notes into structured insights using text embeddings and machine learning.
This notebook demonstrates how to use embeddings to analyse unstructured text data - specifically MOT defect notes. You'll learn how to:
- Convert text to numbers: Transform defect notes into numerical vectors that capture meaning
- Find hidden patterns: Use clustering to group similar defects automatically
- Search by meaning: Find related defects using semantic search instead of keyword matching
MOT testers write defect notes in their own words: "brake pipe corroded", "brake hose deteriorated", "brakes imbalanced". Traditional keyword searches miss the connection between these different phrasings. Embeddings solve this by understanding that these all describe brake-related issues.
This is a simplified example of techniques used in CarHunch, a vehicle insights platform that analyses millions of MOT records to help people make better decisions about their vehicles.
Click the badge below to open the notebook in your browser. No setup required - just sign in to Google and start experimenting:
Note: Google Colab requires a Google account to save your work and provide computational resources. Your data remains private, and you can always download your work or run it locally if you prefer.
Clone this repository and install the dependencies:
```shell
git clone https://github.com/DonaldSimpson/mot_embeddings_demo.git
cd mot_embeddings_demo
```

Install the required dependencies:
```shell
pip install sentence-transformers scikit-learn matplotlib jupyter
```

Then open the notebook:
```shell
jupyter notebook mot_embeddings_demo.ipynb
```

Convert MOT defect notes into 384-dimensional vectors using the MiniLM model. Each note gets a unique "fingerprint" that captures its semantic meaning.
Use K-means clustering to automatically group similar defects together. Visualise the results with PCA to see how brake issues, lighting problems, and steering defects cluster separately.
Find defects similar in meaning to any query - not just exact word matches. Search for "brake failure" and discover brake-related defects even when they don't contain the word "failure".
Once you understand the basics, try these modifications:
- Add your own defect notes - Replace the sample data with notes from your own vehicle's MOT history
- Change the number of clusters - Try `n_clusters=2`, `4`, or `5` to see how groupings change
- Experiment with different queries - Try "safety concern", "performance issue", or "electrical fault"
- Try a different model - Replace `"all-MiniLM-L6-v2"` with `"multi-qa-mpnet-base-dot-v1"` for potentially better results
- Larger datasets: Try with hundreds or thousands of MOT records
- Different domains: Apply the same techniques to customer feedback, support tickets, or any unstructured text
- Production deployment: Check out the MLOps blog post to see how to turn this into a production pipeline
- Model: all-MiniLM-L6-v2 from Hugging Face
- Clustering: K-means with scikit-learn
- Visualisation: PCA for dimensionality reduction
- Search: Cosine similarity for semantic matching
Contains public sector information licensed under the Open Government Licence v3.0.
Donald Simpson - DevOps engineer who got interested in ML through building CarHunch. This demo shares what I've learned about embeddings through that journey, presented in a way that other DevOps engineers and people interested in AI/ML can understand and experiment with.
- Blog Post: From MOT Notes to Insights with MiniLM: A Practical Guide to Text Embeddings - Detailed explanation of the concepts and real-world applications
- MLOps Demo: MLOps for DevOps Engineers - MiniLM & MLflow demo - How to turn this into a production pipeline
- CarHunch: Vehicle insights platform - Real-world application of these techniques
This project is licensed under the MIT License - see the LICENSE file for details.