MirrorMetrics is a scientific benchmarking tool for evaluating Face LoRAs (Stable Diffusion fine-tuned models). It uses InsightFace (ArcFace) to perform local biometric analysis and generates a rich, interactive Plotly dashboard – all running entirely on your machine.
Training a LoRA often feels like guessing. You might ask:
- How do I know if my LoRA is overtrained?
- Why does my character look rigid?
- Is my dataset consistent?
MirrorMetrics solves this by measuring Identity Consistency, Face Geometry, and Flexibility mathematically.
In two quick and easy words:
Put the dataset you used to train your LoRA in the `Reference_Images` folder and the generated images in the `Lora_Candidates` folder (create one subfolder for each LoRA you want to compare; the folder name is used as the LoRA's name in the dashboard).
You can include multiple LoRAs from the same training run – for example with different settings or different step counts – or LoRAs trained for entirely different base models; that's fine too.
Then run the script and it will generate a dashboard with all the metrics. The way I use it most at the moment is:
- Compare the plots for the dataset, to see if there are outliers that could skew the results.
- Compare the plots for the LoRAs, to see which one is most consistent with the dataset – especially plot 1, which shows the overall similarity score of each image against the median of the dataset.
- Look at plots 5 and 6 to see whether the LoRA can generate the face at different angles and how strongly it tends to do so (mind the prompts you're using: if you specifically ask for a right-profile image, you will of course get a cluster of dots on the right side of the graph).
"How consistent is the identity your LoRA generates?" β MirrorMetrics answers this question with data.
| Feature | Description |
|---|---|
| Biometric Similarity | Cosine similarity between face embeddings and a reference centroid |
| Leave-One-Out (LOO) | Robust reference scoring that excludes each image from its own centroid calculation – great for cleaning noisy datasets |
| Interactive Plotly Dashboard | 7-panel dark-themed HTML dashboard with floating control panel |
| Pose Analysis | Yaw / Pitch scatter plots to evaluate identity stability across head orientations |
| t-SNE Identity Map | 2D projection of face embeddings to visualize identity clusters |
| Age Detection | Per-image age estimation via deep learning |
| Privacy-Focused | Everything runs locally – no images are ever uploaded |
- Python 3.10+
- NVIDIA GPU (recommended, not required) – the script runs on CPU too, just slower. See the installation tips below.
```bash
# Clone the repository
git clone https://github.com/AndyLone22/MirrorMetrics.git
cd MirrorMetrics

# Create a virtual environment (recommended)
python -m venv venv
venv\Scripts\activate      # Windows
# source venv/bin/activate # macOS / Linux

# Install dependencies
pip install -r requirements.txt
```

> [!IMPORTANT]
> The `requirements.txt` includes all NVIDIA CUDA libraries (cuBLAS, cuDNN, etc.), so the setup is fully standalone – no system-level CUDA Toolkit installation is needed. This makes the total venv size ~4.5 GB.
> [!TIP]
> Already have CUDA 12 installed system-wide? You can save ~3 GB by removing all `nvidia-*` lines from `requirements.txt` before running `pip install`. The script will use your system CUDA libraries instead.
>
> No NVIDIA GPU? Replace `onnxruntime-gpu` with `onnxruntime` in `requirements.txt` and remove all `nvidia-*` lines. The script will run on CPU (slower but functional).
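If you'd rather not edit `requirements.txt` by hand, a small helper like the following can do it. This is a hypothetical sketch, not a script shipped with MirrorMetrics:

```python
# strip_cuda_wheels.py – hypothetical helper, not part of MirrorMetrics.
# Removes the bundled nvidia-* CUDA wheels from requirements.txt and,
# for CPU-only setups, swaps onnxruntime-gpu for onnxruntime.
from pathlib import Path

req = Path("requirements.txt")
kept = []
for line in req.read_text().splitlines():
    pkg = line.strip().lower()
    if pkg.startswith("nvidia-"):
        continue                      # drop bundled CUDA libraries
    if pkg.startswith("onnxruntime-gpu"):
        line = "onnxruntime"          # CPU-only runtime instead
    kept.append(line)

req.write_text("\n".join(kept) + "\n")
```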
> [!NOTE]
> On the first run, InsightFace will automatically download the buffalo_l model (~300 MB). This is a one-time operation.
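For context, the biometric analysis rests on InsightFace's `FaceAnalysis` pipeline. Here is a minimal sketch of those underlying calls (how `mirror_metrics.py` wires them together may differ):

```python
# Minimal InsightFace sketch – illustrates the underlying library calls;
# mirror_metrics.py adds batching, scoring, and the Plotly dashboard on top.
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")         # downloads the model pack on first use
app.prepare(ctx_id=0, det_size=(640, 640))   # ctx_id=0 -> first GPU, -1 -> CPU

img = cv2.imread("Reference_Images/photo1.jpg")
for face in app.get(img):
    print(face.normed_embedding.shape)  # (512,) ArcFace identity embedding
    print(face.pose)                    # pitch / yaw / roll angles (degrees)
    print(face.age, face.det_score)     # estimated age, detector confidence
```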
It's imperative to use well-structured datasets for the analysis: the reference images should be the ones used in the dataset for training the LoRAs. Then, with those LoRAs, generate at least 10 different prompts, in batches of 3 images per prompt, for good variety – and make the prompts differentiated, covering many positions, angles, and situations. If you only produce portraits, the variance will of course be flat, but that won't mean the model isn't flexible, only that the evaluation set was created poorly. I suggest you settle on 10 standard prompts and always use those, so that with experience you'll learn to read the results better.
```
MirrorMetrics/
├── Reference_Images/        ← Your real reference photos (the "ground truth")
│   ├── photo1.jpg
│   ├── photo2.jpg
│   └── ...
├── Lora_Candidates/         ← One subfolder per LoRA to evaluate
│   ├── MyLoRA_v1/
│   │   ├── gen_001.png
│   │   └── gen_002.png
│   ├── MyLoRA_v2/
│   │   └── ...
│   └── ...
└── mirror_metrics.py
```
- `Reference_Images/` – Place your real face photos here. More variety (angles, lighting) = better centroid.
- `Lora_Candidates/` – Create one subfolder per LoRA (or per experiment). Each subfolder will appear as a separate group in the dashboard.
Windows: Just double-click `run.bat` – it activates the venv and launches the script automatically.
Linux / macOS: Run `chmod +x run.sh` once, then `./run.sh`.
Or manually from any platform:
```bash
python mirror_metrics.py
```

The script generates two output files in the project root:
| File | Description |
|---|---|
| `Dashboard_<timestamp>.html` | Interactive Plotly dashboard – open in any browser |
| `Data_<timestamp>.csv` | Raw data export for further analysis |
The generated dashboard contains 7 analysis panels:
1. Face Similarity – Box plot of cosine similarity scores per group
2. Age Distribution – Violin plot of estimated ages
3. Face Ratio – Bounding-box aspect ratio distribution
4. Detection Confidence – Face detector confidence scores
5. Profile Stability – Similarity vs. absolute yaw angle (does identity hold in profile views?)
6. Pose Variety – Yaw vs. Pitch bubble chart (bubble size = identity strength)
7. Identity Map – t-SNE 2D projection of face embeddings
A floating control panel lets you toggle individual groups on/off or cycle through them in solo mode.
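If you're curious how a panel like the first one is put together, here is a minimal Plotly sketch (illustrative only, with made-up scores – the real dashboard uses subplots, dark theming, and the control panel):

```python
# Minimal Plotly sketch of a per-group similarity box plot (panel 1 style).
# The scores below are hypothetical; MirrorMetrics computes them from images.
import plotly.graph_objects as go

scores = {
    "Reference": [0.78, 0.82, 0.75, 0.80],
    "MyLoRA_v1": [0.71, 0.69, 0.74, 0.66],
}
fig = go.Figure([go.Box(y=v, name=k) for k, v in scores.items()])
fig.update_layout(template="plotly_dark", title="Face Similarity")
fig.write_html("similarity_box.html")  # open in any browser
```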
For reference images, each photo is scored against the centroid of all other reference images (excluding itself). This prevents inflated self-similarity and helps identify outlier photos in your reference set.
For generated images (LoRA candidates), each photo is compared against the full reference centroid.
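As a concrete illustration, here is a minimal NumPy sketch of both scoring modes (an assumption-laden sketch, not the script's actual code; it assumes the embeddings are already L2-normalized, as ArcFace embeddings typically are):

```python
# Leave-one-out (LOO) and candidate scoring sketch – not mirror_metrics.py code.
# `refs` and `gens` are (N, 512) arrays of L2-normalized face embeddings.
import numpy as np

def loo_scores(refs: np.ndarray) -> np.ndarray:
    total = refs.sum(axis=0)
    scores = []
    for emb in refs:
        centroid = total - emb                 # centroid of all *other* refs
        centroid /= np.linalg.norm(centroid)   # re-normalize the direction
        scores.append(float(emb @ centroid))   # cosine similarity
    return np.array(scores)

def candidate_scores(refs: np.ndarray, gens: np.ndarray) -> np.ndarray:
    centroid = refs.sum(axis=0)
    centroid /= np.linalg.norm(centroid)       # full reference centroid
    return gens @ centroid                     # one score per generated image
```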
All face embeddings (reference + generated) are projected into 2D using t-SNE. Tight clusters indicate consistent identity; scattered points suggest identity drift.
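With scikit-learn, that projection step can be sketched as follows (the parameter values here are illustrative assumptions, not the script's actual settings):

```python
# t-SNE identity map sketch – `embeddings` is an (N, 512) array of reference
# plus generated embeddings; parameter values are illustrative assumptions.
import numpy as np
from sklearn.manifold import TSNE

def identity_map(embeddings: np.ndarray) -> np.ndarray:
    tsne = TSNE(n_components=2,
                perplexity=min(30, len(embeddings) - 1),  # must be < N
                metric="cosine", init="pca", random_state=42)
    return tsne.fit_transform(embeddings)      # (N, 2) coordinates to plot
```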
You can customize these variables at the top of `mirror_metrics.py`:

| Variable | Default | Description |
|---|---|---|
| `PATH_REFERENCE` | `Reference_Images` | Path to the reference images folder |
| `PATH_CANDIDATES_ROOT` | `Lora_Candidates` | Path to the LoRA candidates root folder |
| `EXTENSIONS` | `jpg, jpeg, png, webp, bmp` | Supported image formats |
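As an illustration, the overrides might look like this at the top of the script (the exact form of the definitions in `mirror_metrics.py` may differ):

```python
# Illustrative configuration block – the exact form in mirror_metrics.py may differ.
PATH_REFERENCE = "Reference_Images"        # real photos (the "ground truth")
PATH_CANDIDATES_ROOT = "Lora_Candidates"   # one subfolder per LoRA
EXTENSIONS = ("jpg", "jpeg", "png", "webp", "bmp")  # accepted image formats
```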
Look at the Face Similarity chart. If the score is extremely high (>0.85) on all images and the Face Ratio variance is near zero, your model is likely overfitted (memorizing pixels instead of concepts).
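If you prefer to check this numerically from the CSV export, a sketch like the one below works; the column names and thresholds are assumptions, so match them to the header of your actual `Data_<timestamp>.csv`:

```python
# Overfit smoke test from the CSV export – the column names ("group",
# "similarity", "face_ratio") and thresholds are assumptions; adapt them
# to the actual header of your Data_<timestamp>.csv.
import pandas as pd

df = pd.read_csv("Data_20240101_120000.csv")    # hypothetical file name
for group, g in df.groupby("group"):
    high_sim = (g["similarity"] > 0.85).mean()  # fraction of very-high scores
    ratio_var = g["face_ratio"].var()           # near zero -> rigid geometry
    flag = "possible overfit" if high_sim > 0.9 and ratio_var < 1e-4 else "ok"
    print(f"{group}: {high_sim:.0%} above 0.85, ratio var {ratio_var:.5f} -> {flag}")
```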
Yes. MirrorMetrics works on the output images, so it is compatible with Stable Diffusion 1.5, SDXL, Pony, Flux, QWEN, Z-Image, and any other image generation model.
Use the "Reference" box in the charts. If the purple box is very tall or has outliers, check those images to see why they give a high difference evaluation to the mean of the rest of the images, then discard them if you feel it's best.
This usually indicates inconsistent skin texture in your dataset. The biometric engine uses high-frequency details (pores, wrinkles, skin smoothing) to estimate age. If you mixed "soft" lighting images with "harsh" realistic photos, the tool might interpret the texture difference as an age difference.
This often happens with extreme angles (e.g. looking back over the shoulder, steep profiles). The detector expects a standard facial geometry, so perspective compression can lower the confidence score even if the anatomy is correct. Low confidence on a profile shot is acceptable; low confidence on a front-facing shot means your model is broken, so interpretation of the data is always needed!
Yes! You can run the tool pointing only at your dataset folder to analyze the purple "Reference" box. If the box is very tall or has dots floating far below it, you have "poison" images (outliers) in your dataset. Removing them before training could save you time and GPU hours – but it's data, not a prescription, so always interpret it before deciding how to act.
Look at the Face Ratio and Pose Variety charts. If the Face Ratio is a flat line and Pose Variety is clustered at the center, the model is just "photocopying" the data (Low Creativity). If the charts show wide variance (like a violin shape) and scattered dots, the model understands the 3D structure well enough to generate new expressions and angles (High Flexibility). Of course, all of this requires a good set of generated images to evaluate, with varied poses, angles, positions, backgrounds, and so on.
This project is licensed under the MIT License – see the `LICENSE.txt` file for details.
Contributions are welcome! Feel free to open an issue or submit a pull request.