# torchTextClassifiers

A unified, extensible framework for text classification using PyTorch and PyTorch Lightning.

## 🚀 Features

- **Unified API**: Consistent interface for different classifier types
- **FastText Support**: Built-in FastText classifier implementation
- **PyTorch Lightning**: Automated training with callbacks, early stopping, and logging
- **Mixed Features**: Support for both text and categorical features
- **Extensible**: Easy to add new classifier types
- **Production Ready**: Model serialization, validation, and inference

## 📦 Installation

```bash
# Clone the repository
git clone https://github.com/your-repo/torch-fastText.git
cd torch-fastText

# Install with uv (recommended)
uv sync

# Or install with pip
pip install -e .
```

## 🎯 Quick Start

### Basic FastText Classification

```python
import numpy as np
from torchTextClassifiers import create_fasttext

# Create a FastText classifier
classifier = create_fasttext(
    embedding_dim=100,
    sparse=False,
    num_tokens=10000,
    min_count=2,
    min_n=3,
    max_n=6,
    len_word_ngrams=2,
    num_classes=2
)

# Prepare your data
X_train = np.array([
    "This is a positive example",
    "This is a negative example",
    "Another positive case",
    "Another negative case"
])
y_train = np.array([1, 0, 1, 0])

X_val = np.array([
    "Validation positive",
    "Validation negative"
])
y_val = np.array([1, 0])

# Build the tokenizer and model
classifier.build(X_train, y_train)

# Train the model
classifier.train(
    X_train, y_train, X_val, y_val,
    num_epochs=50,
    batch_size=32,
    patience_train=5,
    verbose=True
)

# Make predictions
X_test = np.array(["This is a test sentence"])
predictions = classifier.predict(X_test)
print(f"Predictions: {predictions}")

# Evaluate accuracy on a labeled test set
accuracy = classifier.validate(X_test, np.array([1]))
print(f"Accuracy: {accuracy:.3f}")
```
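The same flow covers multi-class problems. Assuming labels are integer-encoded from `0` to `num_classes - 1` (consistent with the binary example above), only `num_classes` and the label array change:

```python
# Sketch: a 3-class variant of the classifier above.
# Assumes integer labels 0..num_classes-1, as in the binary example.
classifier = create_fasttext(
    embedding_dim=100, sparse=False, num_tokens=10000,
    min_count=2, min_n=3, max_n=6, len_word_ngrams=2,
    num_classes=3,
)
y_train = np.array([0, 1, 2, 1])  # one integer label per training sentence
```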

### Working with Mixed Features (Text + Categorical)

```python
import numpy as np
from torchTextClassifiers import create_fasttext

# Text data with categorical features
X_train = np.column_stack([
    np.array(["Great product!", "Terrible service", "Love it!"]),  # Text column
    np.array([[1, 2], [2, 1], [1, 3]])  # Two categorical feature columns
])
y_train = np.array([1, 0, 1])

# Create a classifier with categorical support
classifier = create_fasttext(
    embedding_dim=50,
    sparse=False,
    num_tokens=5000,
    min_count=1,
    min_n=3,
    max_n=6,
    len_word_ngrams=2,
    num_classes=2,
    categorical_vocabulary_sizes=[3, 4],  # Vocabulary size of each categorical feature
    categorical_embedding_dims=[10, 10]   # Embedding dimension of each categorical feature
)

# Build and train as usual
classifier.build(X_train, y_train)
# ... continue with training
```
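At inference time, the input presumably needs the same column layout the model was trained on: text in the first column, categorical features after it. A minimal sketch under that assumption:

```python
# Assumption: predict() expects the same text-then-categorical layout used in build()/train()
X_test = np.column_stack([
    np.array(["Works perfectly", "Broke after one day"]),  # Text column
    np.array([[1, 2], [2, 3]])                             # Categorical columns
])
predictions = classifier.predict(X_test)
```

Note that `np.column_stack` promotes the stacked array to a common (string) dtype, so the categorical codes travel as strings inside `X_train` and `X_test`.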

### Model Persistence

```python
from torchTextClassifiers import torchTextClassifiers

# Save the configuration (hyperparameters, not trained weights)
classifier.to_json('model_config.json')

# Load the configuration (creates a new, untrained instance)
new_classifier = torchTextClassifiers.from_json('model_config.json')

# You'll need to retrain the loaded classifier
new_classifier.build(X_train, y_train)
new_classifier.train(X_train, y_train, X_val, y_val, ...)
```

## 🔧 Advanced Usage

### Custom Configuration

```python
from torchTextClassifiers import torchTextClassifiers, ClassifierType
from torchTextClassifiers.classifiers.fasttext.config import FastTextConfig

# Create a custom configuration
config = FastTextConfig(
    embedding_dim=200,
    sparse=True,
    num_tokens=20000,
    min_count=3,
    min_n=2,
    max_n=8,
    len_word_ngrams=3,
    num_classes=5,
    direct_bagging=False,  # Custom FastText parameter
)

# Create the classifier with the custom config
classifier = torchTextClassifiers(ClassifierType.FASTTEXT, config)
```

### Using Pre-trained Tokenizers

```python
from torchTextClassifiers import build_fasttext_from_tokenizer

# Assume you have a pre-trained tokenizer
# my_tokenizer = ...  (a previously trained NGramTokenizer)

classifier = build_fasttext_from_tokenizer(
    tokenizer=my_tokenizer,
    embedding_dim=100,
    num_classes=3,
    sparse=False
)

# Model and tokenizer are already built, ready for training
classifier.train(X_train, y_train, X_val, y_val, ...)
```

### Training Customization

```python
# Custom PyTorch Lightning trainer parameters
trainer_params = {
    'accelerator': 'gpu',
    'devices': 1,
    'precision': 16,  # Mixed precision training
    'gradient_clip_val': 1.0,
}

classifier.train(
    X_train, y_train, X_val, y_val,
    num_epochs=100,
    batch_size=64,
    patience_train=10,
    trainer_params=trainer_params,
    verbose=True
)
```

## 📊 API Reference

### Main Classes

#### `torchTextClassifiers`
The main classifier class providing a unified interface.

**Key Methods:**
- `build(X_train, y_train)`: Build tokenizer and model
- `train(X_train, y_train, X_val, y_val, ...)`: Train the model
- `predict(X)`: Make predictions
- `validate(X, Y)`: Evaluate on test data
- `to_json(filepath)`: Save configuration
- `from_json(filepath)`: Load configuration
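Taken together, a typical lifecycle chains these methods (arguments as in the Quick Start; `X_test`/`y_test` assumed to be a labeled held-out set):

```python
# End-to-end lifecycle using the methods listed above
clf = create_fasttext(
    embedding_dim=100, sparse=False, num_tokens=10000,
    min_count=2, min_n=3, max_n=6, len_word_ngrams=2, num_classes=2,
)
clf.build(X_train, y_train)              # fit the tokenizer, instantiate the model
clf.train(X_train, y_train, X_val, y_val,
          num_epochs=50, batch_size=32, patience_train=5)
predictions = clf.predict(X_test)        # class predictions
accuracy = clf.validate(X_test, y_test)  # accuracy on the held-out set
clf.to_json('model_config.json')         # save configuration (weights are not saved)
```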

#### `ClassifierType`
Enumeration of supported classifier types.
- `FASTTEXT`: FastText classifier

#### `ClassifierFactory`
Factory for creating classifier instances.

### FastText Specific

#### `create_fasttext(**kwargs)`
Convenience function to create FastText classifiers.

**Parameters:**
- `embedding_dim`: Embedding dimension
- `sparse`: Use sparse embeddings
- `num_tokens`: Vocabulary size
- `min_count`: Minimum token frequency
- `min_n`, `max_n`: Character n-gram range
- `len_word_ngrams`: Word n-gram length
- `num_classes`: Number of output classes

#### `build_fasttext_from_tokenizer(tokenizer, **kwargs)`
Create a FastText classifier from an existing tokenizer.

## 🏗️ Architecture

The framework follows a modular architecture:

```
torchTextClassifiers/
├── torchTextClassifiers.py            # Main classifier interface
├── classifiers/
│   ├── base.py                        # Abstract base classes
│   └── fasttext/                      # FastText implementation
│       ├── config.py                  # Configuration
│       ├── wrapper.py                 # Classifier wrapper
│       ├── factory.py                 # Convenience methods
│       ├── tokenizer.py               # N-gram tokenizer
│       ├── pytorch_model.py           # PyTorch model
│       ├── lightning_module.py        # Lightning module
│       └── dataset.py                 # Dataset implementation
├── utilities/
│   └── checkers.py                    # Input validation utilities
└── factories.py                       # Generic factory system
```
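New classifier types plug in through the abstract base classes in `classifiers/base.py` and the generic factory system in `factories.py`. The actual interface is documented in the [Developer Guide](docs/developer_guide.md); as a rough orientation, a new backend might look like the following sketch (the class and method names here are illustrative, not the real API):

```python
# Hypothetical sketch only: names are illustrative; see classifiers/base.py
# and docs/developer_guide.md for the actual base-class interface.
from dataclasses import dataclass

@dataclass
class MyClassifierConfig:
    num_classes: int
    embedding_dim: int

class MyClassifierWrapper:  # would subclass the abstract base in classifiers/base.py
    def __init__(self, config: MyClassifierConfig):
        self.config = config

    def build(self, X_train, y_train):
        ...  # construct preprocessing and the underlying PyTorch model

    def train(self, X_train, y_train, X_val, y_val, **training_params):
        ...  # wrap the model in a Lightning module and fit it
```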

## 🔬 Testing

Run the test suite:

```bash
# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=torchTextClassifiers

# Run specific test file
uv run pytest tests/test_torchTextClassifiers.py -v
```

## 🤝 Contributing

We welcome contributions! See our [Developer Guide](docs/developer_guide.md) for information on:

- Adding new classifier types
- Code organization and patterns
- Testing requirements
- Documentation standards

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

- Built with [PyTorch](https://pytorch.org/) and [PyTorch Lightning](https://lightning.ai/)
- Inspired by [FastText](https://fasttext.cc/) for efficient text classification
- Uses [uv](https://github.com/astral-sh/uv) for dependency management

## 📚 Examples

See the [examples/](examples/) directory for:

- Basic text classification
- Multi-class classification
- Mixed features (text + categorical)
- Custom classifier implementation
- Advanced training configurations

## 🐛 Support

If you encounter any issues:

1. Check the [examples](examples/) for similar use cases
2. Review the API documentation above
3. Open an issue on GitHub with:
   - Python version
   - Package versions (`uv tree` or `pip list`)
   - Minimal reproduction code
   - Error messages/stack traces