
Commit fda4bbf

refactor: reorganize codebase into modular classifier architecture
1 parent 0d18769 commit fda4bbf

27 files changed (+4594 −2963 lines)

README.md

Lines changed: 135 additions & 135 deletions
@@ -1,135 +1,135 @@
# torchTextClassifiers: Efficient text classification with PyTorch

A flexible PyTorch implementation of models for text classification, with support for categorical features.

## Features

- Supports text classification with the FastText architecture
- Handles both text and categorical features
- N-gram tokenization
- Flexible optimizer and scheduler options
- GPU and CPU support
- Model checkpointing and early stopping
- Prediction and model explanation capabilities

## Installation

- With `pip`:

```bash
pip install torchTextClassifiers
```

- With `uv`:

```bash
uv add torchTextClassifiers
```

## Key Components

- `build()`: Constructs the FastText model architecture
- `train()`: Trains the model with built-in callbacks and logging
- `predict()`: Generates class predictions
- `predict_and_explain()`: Provides predictions with feature attributions

## Subpackages

- `preprocess`: Preprocesses text input, using the `nltk` and `unidecode` libraries.
- `explainability`: Simple methods to visualize feature attributions at word and letter levels, using the `captum` library.

Run `pip install torchTextClassifiers[preprocess]` or `pip install torchTextClassifiers[explainability]` to install these optional dependencies.
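
For instance, here is a hedged sketch of obtaining attributions (not verbatim API: `clf` stands for an already-trained classifier as built in the Quick Start below, and the `(predictions, attributions)` return shape follows the mocks in `tests/conftest.py`):

```python
import numpy as np

# Sketch only: `clf` is assumed to be a trained classifier (see the Quick Start
# below); the tuple return shape mirrors the mocked predict_and_explain in
# tests/conftest.py and may differ from the real API.
X_new = np.array([["handmade wooden furniture", 1, 2]], dtype=object)

predictions, attributions = clf.predict_and_explain(X_new)
print(predictions)   # predicted class indices
print(attributions)  # feature attribution scores
```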

## Quick Start

```python
from torchTextClassifiers import torchTextClassifiers

# Initialize the model
model = torchTextClassifiers(
    num_tokens=1000000,
    embedding_dim=100,
    min_count=5,
    min_n=3,
    max_n=6,
    len_word_ngrams=True,
    sparse=True
)

# Train the model
model.train(
    X_train=train_data,
    y_train=train_labels,
    X_val=val_data,
    y_val=val_labels,
    num_epochs=10,
    batch_size=64,
    lr=4e-3
)

# Make predictions
predictions = model.predict(test_data)
```

where `train_data` is an array of shape $(N, d)$ whose first column holds the raw text as strings and whose remaining columns hold the categorical variables encoded as `int`.

Please make sure `y_train` contains each possible label at least once.
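
For illustration, a minimal sketch of this layout (the values are made up):

```python
import numpy as np

# Illustrative layout only: first column is raw text, the remaining columns
# are integer-encoded categorical variables; dtype=object keeps mixed types.
train_data = np.array(
    [
        ["fresh bread and pastries", 0, 3],
        ["industrial equipment rental", 1, 7],
        ["organic grocery store", 0, 3],
        ["car repair workshop", 1, 7],
    ],
    dtype=object,
)

# Each possible label appears at least once.
train_labels = np.array([0, 1, 0, 1])
```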

## Dependencies

- PyTorch Lightning
- NumPy

## Categorical features

If any, each categorical feature $i$ is associated with an embedding matrix of size (number of unique values, embedding dimension), where the embedding dimension is a user-chosen hyperparameter (`categorical_embedding_dims`) that can take three types of values:

- `None`: same embedding dimension as the token embedding matrix. The categorical embeddings are then summed with the sentence-level embedding (which is itself an average of the token embeddings). See [Figure 1](#figure-1).
- `int`: the categorical embeddings all share the same embedding dimension; they are averaged and the resulting vector is concatenated to the sentence-level embedding (the last linear layer has an adapted input size). See [Figure 2](#figure-2).
- `list`: the categorical embeddings have different embedding dimensions; all of them are concatenated, without aggregation, to the sentence-level embedding (the last linear layer has an adapted input size). See [Figure 3](#figure-3).

Default is `None`.
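
A hedged sketch of the three options (the other constructor arguments mirror the Quick Start above; whether `categorical_embedding_dims` is passed exactly like this is an assumption):

```python
from torchTextClassifiers import torchTextClassifiers

# Sketch only: illustrates the three accepted value types for
# categorical_embedding_dims; the other arguments mirror the Quick Start above.
common = dict(num_tokens=1000000, embedding_dim=100, min_count=5,
              min_n=3, max_n=6, len_word_ngrams=True, sparse=True)

clf_sum = torchTextClassifiers(**common, categorical_embedding_dims=None)      # Figure 1: summed
clf_avg = torchTextClassifiers(**common, categorical_embedding_dims=50)        # Figure 2: averaged, then concatenated
clf_cat = torchTextClassifiers(**common, categorical_embedding_dims=[20, 10])  # Figure 3: all concatenated
```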

<a name="figure-1"></a>
![Default-architecture](images/NN.drawio.png "Default architecture")
*Figure 1: The 'sum' architecture*

<a name="figure-2"></a>
![avg-architecture](images/avg_concat.png "Average and concatenate architecture")
*Figure 2: The 'average and concatenate' architecture*

<a name="figure-3"></a>
![concat-architecture](images/full_concat.png "Concatenate all architecture")
*Figure 3: The 'concatenate all' architecture*

## Documentation

For detailed usage and examples, please refer to the [example notebook](notebooks/example.ipynb). After cloning the repository, run `pip install -r requirements.txt` to install the necessary dependencies (some are specific to the notebook).

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT

## References

Inspired by the original FastText paper [1] and implementation.

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)

```
@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}
```

tests/conftest.py

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
import pytest
import numpy as np
from unittest.mock import Mock, MagicMock


@pytest.fixture
def sample_text_data():
    """Sample text data for testing."""
    return np.array([
        "This is a positive example",
        "This is a negative example",
        "Another positive case",
        "Another negative case",
        "Good example here",
        "Bad example here"
    ])


@pytest.fixture
def sample_labels():
    """Sample labels for testing."""
    return np.array([1, 0, 1, 0, 1, 0])


@pytest.fixture
def sample_categorical_data():
    """Sample categorical data for testing."""
    return np.array([
        [1, 2],
        [2, 1],
        [1, 3],
        [3, 1],
        [2, 2],
        [3, 3]
    ])


@pytest.fixture
def sample_X_with_categorical(sample_text_data, sample_categorical_data):
    """Sample X data with categorical variables."""
    return np.column_stack([sample_text_data, sample_categorical_data])


@pytest.fixture
def sample_X_text_only(sample_text_data):
    """Sample X data with text only."""
    return sample_text_data.reshape(-1, 1)


@pytest.fixture
def fasttext_config():
    """Mock FastText configuration."""
    config = Mock()
    config.embedding_dim = 10
    config.sparse = False
    config.num_tokens = 1000
    config.min_count = 1
    config.min_n = 3
    config.max_n = 6
    config.len_word_ngrams = 2
    config.num_classes = 2
    config.num_rows = None
    config.num_categorical_features = None
    config.categorical_vocabulary_sizes = None
    config.to_dict = Mock(return_value={'embedding_dim': 10, 'sparse': False})
    return config


@pytest.fixture
def mock_tokenizer():
    """Mock NGramTokenizer for testing."""
    tokenizer = Mock()
    tokenizer.min_count = 1
    tokenizer.min_n = 3
    tokenizer.max_n = 6
    tokenizer.num_tokens = 1000
    tokenizer.word_ngrams = 2
    tokenizer.padding_index = 999
    return tokenizer


@pytest.fixture
def mock_pytorch_model():
    """Mock PyTorch model for testing."""
    model = Mock()
    model.eval = Mock()
    model.to = Mock(return_value=model)
    model.predict = Mock(return_value=np.array([1, 0, 1]))
    model.predict_and_explain = Mock(return_value=(np.array([1, 0, 1]), np.array([0.8, 0.2, 0.9])))
    return model


@pytest.fixture
def mock_lightning_module():
    """Mock Lightning module for testing."""
    module = Mock()
    module.model = Mock()
    return module


@pytest.fixture
def mock_dataset():
    """Mock dataset for testing."""
    dataset = Mock()
    dataset.create_dataloader = Mock()
    return dataset


@pytest.fixture
def mock_dataloader():
    """Mock dataloader for testing."""
    return Mock()
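
A hedged illustration of how these fixtures might be consumed (the test module and test names below are hypothetical, not part of this commit):

```python
# Hypothetical test sketch: shows how the conftest fixtures compose.
# pytest injects fixture values by matching argument names.
import numpy as np


def test_sample_X_with_categorical_shape(sample_X_with_categorical, sample_labels):
    # Six rows: one text column plus two integer-encoded categorical columns.
    assert sample_X_with_categorical.shape == (6, 3)
    assert len(sample_labels) == 6


def test_mock_model_predict_and_explain(mock_pytorch_model):
    predictions, attributions = mock_pytorch_model.predict_and_explain(None)
    assert np.array_equal(predictions, np.array([1, 0, 1]))
    assert attributions.shape == (3,)
```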
