PyTorch implementation of EfficientNet optimized for Kuzushiji-Kanji character recognition, showcasing state-of-the-art performance on historical Japanese text classification.
Kuzushiji (くずし字), meaning "cursive characters", refers to a style of handwritten Japanese that was widely used from the 8th to the 19th century. This script is characterized by its flowing, interconnected strokes, making it challenging for modern readers to decipher.
Kuzushiji was the standard writing style in Japan for over a millennium, used in literature, official documents, and personal correspondence. However, following the Meiji Restoration in 1868, Japan underwent rapid modernization, including reforms in education and writing systems. As a result, Kuzushiji is no longer taught in schools, creating a significant barrier to accessing Japan's rich textual heritage.
The inability to read Kuzushiji has profound implications for Japan's cultural heritage:
- Over 3 million books and a billion historical documents written before 1867 are preserved but largely inaccessible to the general public.
- Even most native Japanese speakers cannot read texts over 150 years old.
- This linguistic gap threatens to disconnect modern Japan from its pre-modern history and culture.
To address this challenge, the National Institute of Japanese Literature (NIJL) and the Center for Open Data in the Humanities (CODH) created the Kuzushiji dataset:
- Part of a national project to digitize about 300,000 old Japanese books.
- Contains 3,999 character types and 403,242 characters.
- Released in November 2016 to promote international collaboration and machine learning research.
The dataset is released in three variants:
- Kuzushiji-MNIST: A drop-in replacement for the MNIST dataset (70,000 28x28 grayscale images, 10 classes).
- Kuzushiji-49: 270,912 28x28 grayscale images, 49 classes (48 Hiragana characters and one iteration mark).
- Kuzushiji-Kanji: 140,426 64x64 grayscale images, 3,832 Kanji character classes.
The Kuzushiji dataset was introduced in the following paper:
```bibtex
@article{clanuwat2018deep,
  title={Deep Learning for Classical Japanese Literature},
  author={Clanuwat, Tarin and Bober-Irizar, Mikel and Kitamoto, Asanobu and Lamb, Alex and Yamamoto, Kazuaki and Ha, David},
  journal={arXiv preprint arXiv:1812.01718},
  year={2018}
}
```
This research aims to engage the machine learning community in the field of classical Japanese literature, bridging the gap between cutting-edge technology and cultural heritage preservation. For more information on Kuzushiji research, refer to the 2nd CODH Seminar: Kuzushiji Challenge and the Kuzushiji Challenge on Kaggle.
EfficientNet, introduced by Tan and Le in 2019, is a groundbreaking convolutional neural network architecture that achieves state-of-the-art accuracy with an order of magnitude fewer parameters and FLOPS than previous models. It uses a compound scaling method that uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients.
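For intuition, the compound scaling rule fixes base constants α, β, and γ (found by grid search in the original paper, subject to α·β²·γ² ≈ 2) and scales all three dimensions together through a single coefficient φ. A minimal sketch, using the constants reported by Tan and Le:

```python
# Compound scaling from Tan & Le (2019): one coefficient phi scales
# depth, width, and resolution together. The constants below are the
# values reported in the original paper.
alpha, beta, gamma = 1.2, 1.1, 1.15  # alpha * beta**2 * gamma**2 ≈ 2

def compound_scale(phi: int):
    depth_mult = alpha ** phi       # layer-count multiplier
    width_mult = beta ** phi        # channel-count multiplier
    resolution_mult = gamma ** phi  # input-resolution multiplier
    return depth_mult, width_mult, resolution_mult

# phi = 0 corresponds to the B0 baseline; larger variants use larger phi
print(compound_scale(1))  # ≈ (1.2, 1.1, 1.15)
```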
This project adapts EfficientNet to the challenging task of Kuzushiji-Kanji character recognition. The script's complex, flowing, and highly variable forms present unique challenges for optical character recognition.
The EfficientNet architecture is built on mobile inverted bottleneck convolution (MBConv) blocks, which were first introduced in MobileNetV2. The network progressively scales up in width, depth, and resolution across different variants (B0 to B7).
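As an illustrative sketch (not the exact implementation in this repository), a single MBConv block with squeeze-and-excitation can be written in PyTorch roughly as follows; the layer sizes and defaults here are assumptions:

```python
import torch
import torch.nn as nn

class MBConv(nn.Module):
    """Minimal mobile inverted bottleneck block: 1x1 expansion, depthwise
    convolution, squeeze-and-excitation, 1x1 projection, and a residual
    connection when shapes allow."""

    def __init__(self, in_ch, out_ch, expand_ratio=6, kernel_size=3,
                 stride=1, se_ratio=0.25):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        se_ch = max(1, int(in_ch * se_ratio))
        self.use_residual = stride == 1 and in_ch == out_ch

        # 1x1 expansion (skipped when expand_ratio == 1)
        self.expand = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),
        ) if expand_ratio != 1 else nn.Identity()

        # Depthwise convolution (groups == channels)
        self.depthwise = nn.Sequential(
            nn.Conv2d(mid_ch, mid_ch, kernel_size, stride,
                      kernel_size // 2, groups=mid_ch, bias=False),
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),
        )

        # Squeeze-and-excitation: global pooling + bottleneck MLP
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid_ch, se_ch, 1),
            nn.SiLU(),
            nn.Conv2d(se_ch, mid_ch, 1),
            nn.Sigmoid(),
        )

        # 1x1 projection back down to out_ch
        self.project = nn.Sequential(
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.expand(x)
        out = self.depthwise(out)
        out = out * self.se(out)  # channel-wise reweighting
        out = self.project(out)
        return x + out if self.use_residual else out

# Example: a stride-1 block preserves shape and uses the residual path
block = MBConv(32, 32)
y = block(torch.randn(1, 32, 56, 56))  # -> torch.Size([1, 32, 56, 56])
```

Note that the residual connection applies only when the stride is 1 and the input and output channel counts match, mirroring the MobileNetV2 design.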
For a detailed view of the EfficientNet architecture, refer to our EfficientNet diagram.
To get started with this project, follow these steps:
Clone the repository:

```bash
git clone https://github.com/davidgeorgewilliams/EfficientNet-KuzushijiKanji.git
cd EfficientNet-KuzushijiKanji
```

Install the required dependencies:

```bash
pip install -r requirements.txt
```
To use the EfficientNet model in your project, you can utilize the `EfficientNetFactory` class. Here's an example of how to create and use an EfficientNet model:
```python
from efficientnet import EfficientNetFactory
import torch

# Choose the EfficientNet variant you want to use
variant = 'b0'  # Options: 'b0', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7'
num_classes = 1000  # Adjust based on your Kuzushiji-Kanji dataset

# Create the specified EfficientNet model
model = EfficientNetFactory.create(variant, num_classes)
model.eval()  # switch to evaluation mode so BatchNorm/Dropout behave correctly

# Get the appropriate input size for the chosen variant
input_size = EfficientNetFactory.get_input_size(variant)

# Prepare your input data (example)
input_data = torch.randn(1, 3, input_size, input_size)

# Use the model for inference
with torch.no_grad():
    output = model(input_data)

# Process the output as needed
predicted_class = torch.argmax(output, dim=1)
print(f"Predicted class: {predicted_class.item()}")
```
This project uses a combination of Kuzushiji-Kanji and Kuzushiji-49 datasets, which are then balanced and augmented. Follow these steps to prepare your data:
Download the Kuzushiji-49 and Kuzushiji-Kanji datasets from the ROIS-CODH/kmnist GitHub repository and extract the files into the `EfficientNet-KuzushijiKanji/data` directory. After extraction, your data directory should contain the following files:

```
data/
├── k49-test-imgs.npz
├── k49-test-labels.npz
├── k49-train-imgs.npz
├── k49-train-labels.npz
├── k49_classmap.csv
├── kkanji.tar
└── kkanji2/
```
Run the script to combine Kuzushiji-Kanji and Kuzushiji-49 datasets:
```bash
python 01_combine_k49_and_kkanji2_datasets.py
```
This script merges:
- Kuzushiji-Kanji: a large, imbalanced dataset of 140,426 64x64 grayscale images covering 3,832 Kanji character classes
- Kuzushiji-49: 270,912 28x28 grayscale images spanning 49 classes, an extension of Kuzushiji-MNIST (a balanced set of 70,000 28x28 grayscale images across 10 classes)
Next, run the script to balance the combined dataset:
```bash
python 02_balance_kuzushiji_dataset.py
```
This script applies various augmentation techniques to create a balanced dataset:
- Elastic transforms
- Affine transforms
- Noise addition
The script generates 10,000 images per Kanji character. For characters with more than 7,000 original samples, it randomly subsamples to maintain diversity.
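For illustration, the three techniques above could be sketched with torchvision like this; the actual parameters and implementation in `02_balance_kuzushiji_dataset.py` may differ, and `ElasticTransform` requires torchvision 0.13 or newer:

```python
import torch
from torchvision import transforms

# Illustrative augmentation pipeline covering the three techniques above;
# all parameter values here are assumptions, not the script's settings.
augment = transforms.Compose([
    transforms.ElasticTransform(alpha=30.0, sigma=4.0),  # elastic distortion
    transforms.RandomAffine(degrees=10, translate=(0.05, 0.05), scale=(0.9, 1.1)),
    transforms.Lambda(lambda t: (t + 0.05 * torch.randn_like(t)).clamp(0.0, 1.0)),  # additive Gaussian noise
])

image = torch.rand(1, 64, 64)  # a single 64x64 grayscale image in [0, 1]
augmented = augment(image)
```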
If you prefer to skip the data preparation steps, you can download our pre-balanced dataset from this Google Drive link.
After balancing the dataset, run the following script to prepare the data for model training:
```bash
python 03_prepare_kuzushiji_array_data.py
```
This script performs several crucial tasks:
1. Create Index Mapping:
   - Generates an `index.json` file in the `../data/kuzushiji-arrays` directory.
   - This JSON file contains a mapping between Kanji characters, their Unicode codepoints, and integer indices.
   - The indices correspond to the outputs of the classification softmax layer, facilitating sparse categorical cross-entropy loss calculation during training.

2. Generate Numpy Arrays:
   - Creates two numpy arrays saved in the `../data/kuzushiji-arrays` directory:
     - `images.npy`: contains all images as uint8 arrays, one image per row.
     - `labels.npy`: contains the corresponding integer labels as uint32 values, with rows matching `images.npy`.

3. Data Organization:
   - Reads all 10,000 images for each Kanji character from the balanced dataset.
   - Ensures consistent ordering between images and their corresponding labels.
This preprocessing step is crucial for efficient model training, as it:
- Provides a standardized mapping between characters and their numeric representations.
- Organizes the image data in a format that's ready for batch processing during training.
- Allows for quick loading of the entire dataset into memory, potentially speeding up the training process.
After running this script, you'll have a structured dataset ready for model input, with a clear mapping between the model's output indices and the corresponding Kanji characters and codepoints.
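As a quick sanity check, the generated artifacts can be loaded like this (a sketch; the internal schema of `index.json` is only summarized above, so it is loaded without further assumptions):

```python
import json
import numpy as np

# Load the artifacts produced by 03_prepare_kuzushiji_array_data.py
with open("../data/kuzushiji-arrays/index.json") as f:
    index = json.load(f)  # character/codepoint/index mapping

images = np.load("../data/kuzushiji-arrays/images.npy")  # uint8, one image per row
labels = np.load("../data/kuzushiji-arrays/labels.npy")  # uint32, row-aligned with images

assert images.shape[0] == labels.shape[0]  # consistent ordering
print(f"{images.shape[0]} images, {len(index)} index entries")
```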
The scripts in the `src` directory are numbered (e.g., `01_`, `02_`, ...) to indicate the order in which they should be run. Always execute them in ascending numerical order to ensure proper data processing.
We trained our EfficientNet model on the Kuzushiji dataset using state-of-the-art hardware and optimized training parameters. The training process yielded impressive results, demonstrating the model's ability to learn and generalize effectively on this complex character recognition task.
- Hardware: 8x NVIDIA H100 (80 GB SXM5) GPUs
- CPU: 208 cores
- RAM: 1.9 TB
- Storage: 24.2 TB SSD
- Provider: LambdaLabs
- Batch size: 50,000
- Optimizer: Adam with the `amsgrad` setting
- Learning rate: 0.001 (the default Adam learning rate)
- Epochs: 50
We found that using larger batch sizes with the Adam optimizer and its default learning rate resulted in a very stable learning process without experiencing overfitting. Based on our observations, we believe that training for an additional 50 epochs could lead to further incremental improvements without introducing overfitting.
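As a minimal, self-contained sketch of this configuration (the real training uses the EfficientNet model and the prepared arrays; the tiny stand-in model and random batch below are placeholders so the snippet runs on its own):

```python
import torch
from torch import nn

# Stand-in model and data; the actual run trained EfficientNet on the
# prepared Kuzushiji arrays with batches of 50,000 for 50 epochs.
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 3832))
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, amsgrad=True)  # Adam + amsgrad
criterion = nn.CrossEntropyLoss()  # sparse categorical cross-entropy

images = torch.rand(32, 1, 64, 64)      # placeholder batch
labels = torch.randint(0, 3832, (32,))  # int64 class indices

for step in range(3):  # stands in for the 50-epoch loop
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```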
The learning curves plot shows the training and validation loss (top) and accuracy (bottom) over the course of training. We observe:
- Rapid initial improvement in both loss and accuracy
- Consistent convergence of training and validation metrics, indicating good generalization
- No signs of overfitting, as validation performance closely tracks training performance
This plot focuses on the loss for both training and validation sets:
- Sharp decrease in loss during the first few epochs
- Gradual and steady decline in loss throughout training
- Close alignment between training and validation loss, suggesting a well-balanced model
The accuracy curve demonstrates:
- Rapid increase in accuracy during early epochs
- Consistent high accuracy achieved for both training and validation sets
- Slight edge of training accuracy over validation accuracy, but within acceptable limits
This plot shows the precision and recall for both training and validation sets:
- High precision and recall values achieved early in training
- Consistent performance across training and validation sets
- Balanced precision and recall, indicating the model's ability to correctly identify characters without sacrificing either metric
We analyzed the top 50 misclassified pairs from the validation set at epoch 50 (a sketch of the pair-counting step follows this list):
- Many misclassifications appear intuitive based on the visual similarities between Kanji characters
- Results highlight the need for additional training data to further improve the model's ability to distinguish between visually similar characters
- This analysis provides valuable insights for potential future improvements in data collection and model architecture
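A sketch of the pair-counting step; here `y_true` and `y_pred` are random stand-ins for the validation labels and the model's argmax predictions at epoch 50:

```python
from collections import Counter

import numpy as np

# Count (true, predicted) pairs that disagree and keep the 50 most frequent.
y_true = np.random.randint(0, 3832, size=1000)  # stand-in validation labels
y_pred = np.random.randint(0, 3832, size=1000)  # stand-in predictions

pairs = Counter((int(t), int(p)) for t, p in zip(y_true, y_pred) if t != p)
for (true_idx, pred_idx), count in pairs.most_common(50):
    print(f"true={true_idx} predicted={pred_idx} count={count}")
```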
Our EfficientNet model has demonstrated strong performance in Kuzushiji character recognition. The stable learning process and high accuracy achieved suggest that the model has effectively learned to classify a wide range of Kanji characters. However, the analysis of misclassifications indicates that there is still room for improvement, particularly in distinguishing between visually similar characters.
We have made our trained model available for public use and further research. You can access the following resources via this Google Drive link:
- Character to index mapping
- Best model checkpoint (epoch 50)
We encourage researchers and enthusiasts to build upon our work and contribute to the ongoing efforts in historical Japanese text recognition.
We welcome contributions to improve EfficientNet-KuzushijiKanji! If you'd like to contribute, please follow these steps:
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Please ensure your code adheres to our coding standards and includes appropriate tests.
This project is licensed under the MIT License - see the LICENSE file for details.