This repository documents my work on the DECRYPT project as a scholarship intern at the Centre de Visió per Computador (CVC), which is also my final degree project (FDP) with them. It covers transcription, decryption, and 'Direct Image Decryption' of the Copiale manuscript.
An end-to-end neural architecture for direct decryption of historical encrypted manuscript images, bypassing traditional transcription stages to reduce error propagation in cipher decipherment.
This project introduces Direct Image Decryption, a novel paradigm that directly maps encrypted manuscript images to decrypted plaintext in a single model, eliminating the intermediate transcription bottleneck that plagues traditional two-stage cryptanalysis pipelines.
Submitted to HistoCrypt 2026 - International Conference on Historical Cryptology
Historical encrypted manuscripts present unique challenges at the intersection of cryptography, paleography, and computer vision. Current decipherment approaches rely on a two-stage pipeline: transcription followed by decryption. This methodology faces three critical limitations:
- Error Propagation: Transcription errors compound during decryption, corrupting final output
- Manual Bottleneck: Symbol transcription requires extensive expert labor and limits scalability
- Information Loss: Converting images to discrete symbols discards visual features (confidence, spacing, degradation patterns)
The Copiale Cipher—a 105-page 18th-century German manuscript with ~75,000 handwritten characters from a 90+ symbol alphabet—exemplifies these challenges. Despite successful decipherment in 2011, the traditional approach remains vulnerable to cascading errors.
Direct Image Decryption learns the mapping from encrypted manuscript images to decrypted plaintext with an attention-based sequence-to-sequence architecture. By maintaining continuous visual representations throughout the pipeline, the model:
- Eliminates transcription-related error propagation
- Removes the manual transcription bottleneck
- Captures visual features lost during symbolic conversion
- Enables end-to-end gradient flow for joint optimization
Encrypted Manuscript Image → CRNN Feature Extractor → Attention-Based LSTM Decoder → Decrypted Plaintext
Core Innovation: Unlike traditional pipelines that commit to discrete symbol decisions early (blocking gradient flow), Direct Image Decryption maintains continuous representations, allowing visual features to adapt directly to decryption requirements through end-to-end backpropagation.
A five-block CNN (1→64→128→256→256→512 channels) with progressive pooling extracts hierarchical visual features. The output feature maps are then passed through a bidirectional LSTM of 2 to 4 layers (256 units per direction) to produce contextualized visual representations.
Key Design: Initialized with pretrained transcription model weights but fine-tuned end-to-end during training, unlike frozen representations in traditional pipelines.
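The feature extractor described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the repository's implementation: the channel progression and the 256-unit bidirectional LSTM come from the description, while the class name `CRNNEncoder`, the kernel sizes, the batch normalization, the pooling schedule, and the height-collapsing step are all assumptions made for the sake of a runnable example.

```python
import torch
import torch.nn as nn


class CRNNEncoder(nn.Module):
    """CNN + BiLSTM feature extractor (illustrative sketch, hypothetical names)."""

    def __init__(self, lstm_layers: int = 2, lstm_hidden: int = 256):
        super().__init__()
        channels = [1, 64, 128, 256, 256, 512]  # progression stated in the text
        # ASSUMPTION: pooling schedule. Height is halved in every block,
        # width only in the first two, so the output sequence stays long.
        pools = [(2, 2), (2, 2), (2, 1), (2, 1), (2, 1)]
        blocks = []
        for c_in, c_out, pool in zip(channels[:-1], channels[1:], pools):
            blocks += [
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=pool, stride=pool),
            ]
        self.cnn = nn.Sequential(*blocks)
        # ASSUMPTION: average over the remaining height so that each image
        # column becomes a single 512-dimensional feature vector.
        self.collapse = nn.AdaptiveAvgPool2d((1, None))
        self.rnn = nn.LSTM(512, lstm_hidden, num_layers=lstm_layers,
                           bidirectional=True, batch_first=True)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (batch, 1, height, width) grayscale line images
        feats = self.cnn(images)                 # (B, 512, H', W')
        feats = self.collapse(feats).squeeze(2)  # (B, 512, W')
        feats = feats.permute(0, 2, 1)           # (B, W', 512)
        out, _ = self.rnn(feats)                 # (B, W', 2 * lstm_hidden)
        return out


encoder = CRNNEncoder()
dummy = torch.zeros(2, 1, 64, 256)  # two blank 64x256 line images
memory = encoder(dummy)
print(memory.shape)
```

With this assumed pooling schedule, a 64×256 input yields a sequence of 64 feature vectors of size 512, which is what an attention decoder would attend over. Because nothing in the module is frozen, gradients from a downstream loss reach the convolutional filters as well.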
A 2-layer LSTM decoder (512 hidden units) with 8-head attention generates plaintext characters autoregressively. At each timestep:
- Query vector from LSTM state
- Multi-head attention computes relevance scores over encoded visual features
- Attended features concatenated with LSTM output
- Linear projection to German alphabet vocabulary
Cross-entropy loss enables end-to-end training with gradients flowing through attention, decoder, and feature extractor.
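The decoding steps listed above might look like the following PyTorch sketch under teacher forcing. It is an illustration under stated assumptions rather than the project's code: `AttentionDecoder`, the embedding layer, the vocabulary size of 32, and the use of index 0 as a start token are all hypothetical, and a random tensor stands in for the encoder output.

```python
import torch
import torch.nn as nn


class AttentionDecoder(nn.Module):
    """Teacher-forced LSTM decoder with multi-head attention (sketch)."""

    def __init__(self, vocab_size: int, enc_dim: int = 512,
                 hidden: int = 512, heads: int = 8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        self.attn = nn.MultiheadAttention(hidden, heads, kdim=enc_dim,
                                          vdim=enc_dim, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, memory, prev_tokens):
        # memory: (B, T_img, enc_dim) encoded visual features
        # prev_tokens: (B, T_txt) target characters shifted right by one
        queries, _ = self.lstm(self.embed(prev_tokens))        # query from LSTM state
        context, weights = self.attn(queries, memory, memory)  # attend over features
        fused = torch.cat([queries, context], dim=-1)          # concatenate with LSTM output
        return self.out(fused), weights                        # project to vocabulary


torch.manual_seed(0)
vocab_size = 32  # hypothetical; the real size comes from the vocabulary JSON
memory = torch.randn(2, 64, 512, requires_grad=True)  # stand-in for encoder output
targets = torch.randint(1, vocab_size, (2, 20))
sos = torch.zeros(2, 1, dtype=torch.long)  # ASSUMPTION: index 0 is the start token
prev_tokens = torch.cat([sos, targets[:, :-1]], dim=1)

decoder = AttentionDecoder(vocab_size)
logits, weights = decoder(memory, prev_tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets.reshape(-1))
loss.backward()
print(logits.shape, weights.shape, memory.grad is not None)
```

Since the attention context is not fed back into the LSTM in this sketch, the whole shifted target sequence can be processed in one call during training; at inference time the same module would be run one character at a time. The non-empty `memory.grad` shows the cross-entropy gradient reaching the visual features, which is the end-to-end property the text relies on.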
A synthetic data generation pipeline produces more than 115,000 Copiale-like training images from historical German texts (Goethe's Faust, Kant's Critique of Pure Reason, the Lutheran Bible, and Stifter's Nachsommer):
Text Processing: Filtered to 106-character Copiale vocabulary, segmented to 12-40 character lines matching manuscript distribution
Visual Encoding: The Copiale cipher font (CopialeV2.ttf in this repository) maps Unicode to cipher glyphs through the DECRYPT project vocabulary
Augmentation Suite: Gaussian noise, random erosion/dilation, gamma correction, Kanungo degradation, geometric transformations (rotation ±3°, shearing, scaling, cropping)
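As a rough illustration of the text-processing and augmentation stages, the sketch below filters a text to a vocabulary, wraps it into lines of 12 to 40 characters, and applies two of the listed augmentations (gamma correction and Gaussian noise) to a placeholder image. The function names, the greedy word-wrapping strategy, the lowercase vocabulary, and the noise and gamma ranges are assumptions chosen for the example; the repository's `generator.py` and `augmentor.py` may work differently.

```python
import numpy as np


def segment_lines(text, vocab, min_len=12, max_len=40):
    """Filter text to `vocab`, then greedily wrap words into lines."""
    kept = "".join(ch for ch in text if ch in vocab)
    lines, current = [], ""
    for word in kept.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) <= max_len:
            current = candidate
            continue
        if len(current) >= min_len:  # fragments shorter than min_len are dropped
            lines.append(current)
        current = word[:max_len]     # truncate a single over-long word
    if len(current) >= min_len:
        lines.append(current)
    return lines


def augment(image, rng, noise_std=8.0, gamma_range=(0.7, 1.4)):
    """Apply random gamma correction, then additive Gaussian noise."""
    img = image.astype(np.float32) / 255.0
    img = np.power(img, rng.uniform(*gamma_range))
    img = img * 255.0 + rng.normal(0.0, noise_std, size=img.shape)
    return np.clip(img, 0, 255).astype(np.uint8)


# Hypothetical lowercase vocabulary; the project uses its own 106-character set.
vocab = set("abcdefghijklmnopqrstuvwxyzäöüß ")
text = ("Habe nun, ach! Philosophie, Juristerei und Medizin, "
        "und leider auch Theologie durchaus studiert.")
lines = segment_lines(text.lower(), vocab)

rng = np.random.default_rng(0)
page = np.full((32, 128), 255, dtype=np.uint8)  # blank white line image
page[12:20, 10:118] = 0                         # dark bar standing in for glyphs
noisy = augment(page, rng)

print(lines)
print(noisy.shape, noisy.dtype)
```

In a full pipeline, each line would be rendered with the cipher font before augmentation, and the plaintext line would be kept as the training target.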
Both approaches were evaluated on synthetic and authentic historical data:
| Dataset | Metric | Two-Stage | Direct Image Decryption | Δ |
|---|---|---|---|---|
| Faust (in-dist) | Token Acc. | 91.3% | 92.4% | +1.1 pp |
| Faust (in-dist) | WER | 20.6% | 10.5% | -49% (relative) |
| Novalis (OOD) | Token Acc. | 69.5% | 75.8% | +6.3 pp |
| Novalis (OOD) | WER | 59.7% | 31.6% | -47% (relative) |
| Metric | Two-Stage | Direct Image Decryption | Δ |
|---|---|---|---|
| Token Accuracy | 39.6% | 51.4% | +11.8 pp |
| WER | 89.0% | 76.0% | -13.0 pp |
| CER | 43.0% | 39.3% | -3.7 pp |
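For readers unfamiliar with the metrics in the tables above, the following sketch shows one common way to compute them. CER and WER are the standard Levenshtein-based definitions; the position-wise `token_accuracy` is an assumption, since the exact definition used for the reported numbers is not stated here. `relative_change` is a hypothetical helper that makes explicit how a relative WER reduction differs from an absolute difference.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


def token_accuracy(ref, hyp):
    """ASSUMPTION: fraction of reference positions predicted exactly."""
    matches = sum(r == h for r, h in zip(ref, hyp))
    return matches / max(len(ref), 1)


def relative_change(baseline, new):
    """Change of `new` with respect to `baseline`, as a fraction."""
    return (new - baseline) / baseline


reference = "der geheime orden"
hypothesis = "der gehiime orden"  # one wrong character in the middle word

print(f"CER  {cer(reference, hypothesis):.3f}")
print(f"WER  {wer(reference, hypothesis):.3f}")
print(f"Acc  {token_accuracy(reference, hypothesis):.3f}")
print(f"Faust WER change: {relative_change(20.6, 10.5):+.0%}")
```

A single wrong character makes the whole word count as an error, which is why WER is much higher than CER for the same output, as in the real-manuscript results.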
Key Findings:
- Direct Image Decryption outperforms the traditional two-stage approach on both synthetic and real data
- The advantage grows under challenging conditions (+11.8 percentage points of token accuracy on real data vs. +1.1 on in-distribution synthetic data)
- The performance gap on real data reflects data scarcity (roughly 57× fewer real than synthetic training examples) rather than fundamental domain incompatibility
While transcription generalizes well (91.1% accuracy on real manuscripts), decryption performance collapses from 92% (synthetic) to 51% (real). Experiments confirm this stems from insufficient training data rather than quality issues:
- Reducing synthetic training from 115k → 8k images: 92% → 31% accuracy (comparable to synthetic-to-real drop)
- Models require tens of thousands of examples to learn robust linguistic patterns
- Only 2,000 real manuscript images available vs. 115,000 synthetic training samples
Despite severe data limitations, Direct Image Decryption's consistent advantage validates that end-to-end learning reduces error propagation regardless of training set size.
Dependencies:
torch>=2.0.0
opencv-python>=4.5.0
numpy>=1.21.0
pillow>=9.0.0
Hardware Requirements:
- CUDA-capable GPU (8GB+ VRAM recommended)
- ~50GB storage for datasets
- 16GB+ system RAM for training
Data Sources:
- Copiale manuscript: Robust Reading Competition 2024
- DECRYPT project vocabulary: https://de-crypt.org/
- Historical German texts: Project Gutenberg
Only dataset samples are included, as the full datasets are too large to host in the repository.
Direct-Image-Decryption-COPIALE/
├── Datasets/
│ ├── CopialeOriginalManuscriptSample/ # Original Copiale manuscript images
│ ├── FaustSample/ # Synthetic Faust dataset samples
│ └── TextFiles/ # Source text files for generation
│
├── Direct Image Decryption/
│ ├── copialeDirectImageDecryptionVocabulary.json
│ ├── inferenceDirectImageDecryptionCaseInsensitive.py
│ ├── inferenceDirectImageDecryptionCaseSensitive.py
│ └── directImageDecryptionModelCreation.py
│
├── Synthetic Data Generation/
│ ├── GenerationTools/
│ │ ├── asteriskRemover.ipynb
│ │ ├── check_stats.py
│ │ ├── copialeCompetitionDecryptionText.ipynb
│ │ ├── datasetSplitter.ipynb
│ │ ├── frankensteiner.ipynb
│ │ ├── lineSplitter.ipynb
│ │ ├── mismatchAnalyzer.py
│ │ ├── symbolGeneration.ipynb
│ │ └── textPreprocess.ipynb
│ ├── augmentor.py
│ ├── CopialeV2.ttf # Copiale cipher font
│ └── generator.py
│
└── Transcription-Decryption/
├── Decryption/
│ ├── copialeDecryptionVocabulary.json
│ ├── decryptionInferenceCaseInsensitive.py
│ ├── decryptionInferenceCaseSensitive.py
│ └── decryptionModelCreation.py
└── Transcription/
├── BAROTranscription/
└── copialeTranscriptionVocabulary.json
- GAN-based synthesis: Generate higher-fidelity manuscript images capturing ink flow, pressure variations, and authentic aging patterns beyond current augmentation
- Semi-supervised learning: Leverage untranscribed digitized manuscripts to expand training data
- Cross-cipher evaluation: Test on Borg, Ramanacoil, and other historical ciphers to assess generalization
- Human-in-the-loop systems: Interactive correction for refining linguistic priors and creating high-quality training data
- Multilingual extension: Adapt to non-German historical ciphers
- Attention visualization: Explainable AI showing which manuscript regions influence decryption decisions
This work establishes that:
- End-to-end learning outperforms traditional pipelines: a mean improvement of roughly 6 percentage points in token accuracy across the three test sets, with the largest gain on the data-scarce real manuscripts
- Data quantity matters more than quality: Performance gaps stem from insufficient examples (2k real vs. 115k synthetic) rather than fundamental domain incompatibility
- Visual feature extraction transfers well: Transcription maintains 91% accuracy on real manuscripts despite synthetic training
- Decipherment requires scale: Current accuracy (51%) represents the performance ceiling with 2,000 training examples; practical deployment requires 10,000-50,000 real manuscript images
Current State: Direct Image Decryption serves as a tool for augmenting human expert analysis rather than replacing it, offering meaningful accuracy improvements that reduce manual effort in historical cipher decipherment.
Author: Marino Oliveros Blanco
Supervisor: Dr. Alicia Fornés (CVC Document Analysis Group)
Institution: Universitat Autònoma de Barcelona, Escola d'Enginyeria & Computer Vision Center (CVC)
Research Environment: Computer Vision Center (CVC) - 8 GPUs (6× RTX 3090, 2× Quadro RTX 6000)
Special Thanks:
- Alicia Fornés for guidance and supervision
- DECRYPT project for standardized datasets and vocabulary
- Robust Reading Competition for Copiale manuscript images
- Jialuo Chen, Lei Kang, Pau Torras, Marçal Rusiñol for CRNN and augmentation code
- Dr. Josep Lladós and Dr. Ernest Valveny for institutional support
- Knight, K., Megyesi, B., & Schaefer, C. (2011). The Copiale Cipher. Workshop on Building and Using Comparable Corpora.
- Shi, B., Bai, X., & Yao, C. (2017). An End-to-End Trainable Neural Network for Image-based Sequence Recognition. IEEE TPAMI.
- Aldarrab, N., & May, J. (2021). Can Sequence-to-Sequence Models Crack Substitution Ciphers? ACL.
- Bluche, T., Louradour, J., & Messina, R. (2017). Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition. ICDAR.
- Graves, A., et al. (2006). Connectionist Temporal Classification. ICML.
This work represents a proof-of-concept system for historical cipher decipherment. It is intended to augment expert analysis rather than replace human cryptographers.
Contact: marino.oliverosblanco@gmail.com
Project Year: 2025/26


