Wordtyle is a deep learning tool for training Korean writing style embedding models. It analyzes text writing styles and converts them into embedding vectors. With its user-friendly Gradio-based web interface, anyone can easily create writing style analysis models.
- Book Text-based Training: Upload text files (novels, essays, etc.) to train writing style analysis models
- 8 Writing Style Classifications: Automatic classification of formal, informal, literary, dialogue, narrative, poetic, technical, and emotional styles
- GPU Acceleration Support: High-speed training with CUDA, XFormers, and Flash Attention 2
- Intuitive Web Interface: Easy-to-use Gradio-based GUI
- Model Saving & Reuse: Save trained models and load them for later use
- Real-time Testing: Instantly test text style analysis after training completes
```bash
git clone https://github.com/yourusername/wordtyle.git
cd wordtyle
python -m venv venv   # Python 3.12 recommended (on Windows: py -3.12 -m venv venv)
```

Activate the virtual environment:

```bash
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate
```

CPU-only installation:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
pip install -r requirements.txt
```

GPU installation:

```bash
# Install PyTorch with CUDA support
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128

# Install basic requirements
pip install -r requirements.txt

# Optional: install acceleration libraries (quote the specifiers so the
# shell does not treat ">" as a redirect)
pip install "xformers>=0.0.20"   # Memory-efficient attention
pip install "triton>=2.0.0"      # CUDA kernel optimization
pip install "flash-attn>=2.0.0"  # Flash Attention 2
```

Run the application:

```bash
python main.py
```

Or use the batch file on Windows:

```bash
run_gpu.bat
```

Open your browser and navigate to:

```
http://localhost:7860
```
1. Data Preparation tab:
   - Upload a Korean text file (`.txt`)
   - Set the minimum sentence length (recommended: 20+ characters)
   - Click "Analyze File"
2. Model Training tab:
   - Configure the training parameters:
     - Base model (default: `klue/bert-base`)
     - Number of epochs (1-10)
     - Batch size (adjust based on your GPU memory)
     - Learning rate (default: 2e-5)
     - Embedding dimension (128-768)
   - Click "Start Training"
3. Model Testing tab:
   - Enter test text to analyze its writing style
   - View the style probability distributions
   - Get the embedding vectors
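The minimum-sentence-length filter from the Data Preparation step can be pictured with a small sketch. The `split_sentences` helper below is hypothetical; the actual preprocessing in `main.py` may differ:

```python
import re

def split_sentences(text: str, min_length: int = 20) -> list[str]:
    """Split text on sentence-ending punctuation and drop short sentences.

    Hypothetical helper illustrating the minimum-length filter from the
    Data Preparation tab; the real logic in main.py may differ.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(s) >= min_length]

sample = ("짧다. 그는 조용히 문을 열고 방 안으로 들어갔다. 너무 짧은 문장. "
          "이 문장은 충분히 길어서 학습 데이터로 사용할 수 있다.")
print(split_sentences(sample))  # only the two sentences of 20+ characters survive
```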
The model classifies text into 8 different Korean writing styles:
| Style | Korean | Description | Example Patterns |
|---|---|---|---|
| Formal | 격식체 | Formal/polite language | 습니다, 입니다, 하였습니다 |
| Informal | 비격식체 | Casual/informal language | 야, 어, 지, 잖아, 거야 |
| Literary | 문학적 | Literary/artistic style | 처럼, 마치, 듯이, 것만 같았다 |
| Dialogue | 대화체 | Conversational style | ", ', 라고, 했다, 말했다 |
| Narrative | 서술체 | Narrative/descriptive | 그는, 그녀는, 이때, 그 순간 |
| Poetic | 시적 | Poetic/lyrical style | 달빛, 바람, 꽃잎, 별, 구름 |
| Technical | 기술적 | Technical/analytical | 시스템, 데이터, 분석, 결과 |
| Emotional | 감성적 | Emotional/expressive | 가슴, 마음, 눈물, 기쁨, 슬픔 |
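A classifier over these eight styles outputs a probability distribution. As a minimal sketch of what that looks like, with toy logits standing in for real model output (the English style names are the labels from the table above):

```python
import math

STYLES = ["formal", "informal", "literary", "dialogue",
          "narrative", "poetic", "technical", "emotional"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits; a trained model produces these from the input text
logits = [2.1, -0.3, 0.4, 0.0, 1.2, -1.0, -0.5, 0.3]
probs = dict(zip(STYLES, softmax(logits)))
print(max(probs, key=probs.get))  # the predicted style
```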
- Base Model: choose from pre-trained Korean language models:
  - `klue/bert-base` (default)
  - `klue/roberta-base`
  - `beomi/KcELECTRA-base`
  - `monologg/kobert`
- Training Parameters:
  - Epochs: 1-10 (default: 3)
  - Batch Size: 4-64 (adjust based on GPU memory)
  - Learning Rate: 1e-6 to 1e-4 (default: 2e-5)
  - Embedding Dimension: 128-768 (default: 384)
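The allowed ranges above can be captured in a small validation sketch. The `validate_config` helper and the config dict are illustrative, not part of Wordtyle's API (the web UI passes these values to the trainer for you):

```python
def validate_config(cfg: dict) -> dict:
    """Check hyperparameters against the ranges the web UI allows (sketch)."""
    assert 1 <= cfg["num_epochs"] <= 10, "epochs out of range"
    assert 4 <= cfg["batch_size"] <= 64, "batch size out of range"
    assert 1e-6 <= cfg["learning_rate"] <= 1e-4, "learning rate out of range"
    assert 128 <= cfg["embedding_dim"] <= 768, "embedding dim out of range"
    return cfg

# Defaults mirroring the UI values above
defaults = validate_config({
    "base_model": "klue/bert-base",
    "num_epochs": 3,
    "batch_size": 16,       # adjust to your GPU memory
    "learning_rate": 2e-5,
    "embedding_dim": 384,
})
```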
Minimum:
- CPU: 4 cores
- RAM: 8GB
- Storage: 2GB free space

Recommended:
- GPU: NVIDIA GPU with 8GB+ VRAM
- CPU: 8+ cores
- RAM: 16GB+
- Storage: 5GB+ free space
The application automatically detects and uses available acceleration libraries:
- ✅ CUDA: GPU acceleration
- ✅ XFormers: Memory-efficient attention
- ✅ Flash Attention 2: Faster attention computation
- ✅ Mixed Precision: Reduced memory usage
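Detection of this kind usually amounts to probing whether each library is importable. A minimal sketch using only the standard library (`detect_accelerators` is illustrative; the actual checks in `main.py` may differ):

```python
import importlib.util

def detect_accelerators() -> dict[str, bool]:
    """Return {library: available} for the optional acceleration stack.

    Hypothetical helper; find_spec() checks importability without importing.
    """
    libs = ["torch", "xformers", "triton", "flash_attn"]
    return {name: importlib.util.find_spec(name) is not None for name in libs}

for name, available in detect_accelerators().items():
    print(f"{'✅' if available else '❌'} {name}")
```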
```
wordtyle/
├── main.py              # Main application file
├── requirements.txt     # Python dependencies
├── run_gpu.bat          # Windows batch file for easy execution
├── README.md            # This file
├── style_models/        # Directory for saved models
├── example/             # Example files and demos
└── venv/                # Virtual environment (after setup)
```
```python
from main import StyleEmbeddingTrainer

# Initialize trainer
trainer = StyleEmbeddingTrainer(model_name="klue/bert-base")

# Load your book text
with open("my_novel.txt", "r", encoding="utf-8") as f:
    book_text = f.read()

# Prepare data and train
data_dict = trainer.prepare_data(book_text)
trainer.create_model(embedding_dim=384)
history = trainer.train_model(data_dict, num_epochs=3)

# Save the model
model_path = trainer.save_model("my_style_model")
```

```python
# Extract style embeddings from text
texts = ["그는 조용히 문을 열고 방 안으로 들어갔다.",  # "He quietly opened the door and entered the room."
         "야, 뭐하고 있어?"]                            # "Hey, what are you doing?"
embeddings = trainer.extract_embeddings(texts)
print(f"Embedding shape: {embeddings.shape}")
```
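Embedding vectors like those returned by `extract_embeddings` can be compared with cosine similarity to quantify how close two texts are in style. A dependency-free sketch on toy vectors (real embeddings are 384-dimensional by default):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-d vectors standing in for 384-d style embeddings
narrative = [0.9, 0.1, 0.0, 0.2]
dialogue = [0.1, 0.8, 0.3, 0.0]
print(f"style similarity: {cosine_similarity(narrative, dialogue):.3f}")
```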
- **Buttons not responding**
  - Check the browser console (F12) for errors
  - Make sure to click "Analyze File" after uploading
- **Out-of-memory errors**
  - Reduce the batch size (try 4-8 for CPU, 8-16 for GPU)
  - Reduce the embedding dimension
  - Use smaller text files
- **Slow training**
  - Install the GPU acceleration libraries
  - Increase the batch size if memory allows
  - Use smaller models (bert-base instead of bert-large)
- **Import errors**
  - Make sure the virtual environment is activated
  - Reinstall requirements: `pip install -r requirements.txt`
  - For GPU: install PyTorch with CUDA support
- For CPU training: Use batch size 4-8, embedding dim 256
- For GPU training: Use batch size 16-32, embedding dim 384-512
- Text size: Minimum 10,000 characters recommended for good results
- Sentence length: Set minimum 20 characters for quality data
We welcome contributions! Please follow these steps:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
```bash
git clone https://github.com/yourusername/wordtyle.git
cd wordtyle
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt
pip install -e .
```

This project is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).
This means:
- You can use, modify, and distribute this software
- You can use it for commercial purposes
- If you run this software on a server and provide it as a service, you must make the source code available to users
- Any modifications must also be licensed under AGPL-3.0
- You must include the original license and copyright notice
See the LICENSE file for full details.
- Hugging Face Transformers for the transformer models
- KLUE for Korean language understanding models
- Gradio for the web interface framework
- PyTorch for the deep learning framework
- Bug Reports: Open an issue
- Feature Requests: Open an issue
- Discussions: GitHub Discussions
- KoBERT: Korean BERT model
- Sentence Transformers: Sentence embedding models
- KLUE: Korean Language Understanding Evaluation
Made with ❤️ for the Korean NLP community