
Commit fda4bbf

refactor: reorganize codebase into modular classifier architecture
1 parent 0d18769 commit fda4bbf

27 files changed (+4594 −2963 lines)

README.md

Lines changed: 135 additions & 135 deletions
@@ -1,135 +1,135 @@
# torchTextClassifiers: Efficient text classification with PyTorch

A flexible PyTorch implementation of models for text classification, with support for categorical features.

## Features

- Supports text classification with the FastText architecture
- Handles both text and categorical features
- N-gram tokenization
- Flexible optimizer and scheduler options
- GPU and CPU support
- Model checkpointing and early stopping
- Prediction and model explanation capabilities

## Installation

- With `pip`:

```bash
pip install torchTextClassifiers
```

- With `uv`:

```bash
uv add torchTextClassifiers
```

## Key Components

- `build()`: Constructs the FastText model architecture
- `train()`: Trains the model with built-in callbacks and logging
- `predict()`: Generates class predictions
- `predict_and_explain()`: Provides predictions with feature attributions

## Subpackages

- `preprocess`: Preprocesses text input, using the `nltk` and `unidecode` libraries.
- `explainability`: Simple methods to visualize feature attributions at word and letter levels, using the `captum` library.

Run `pip install torchTextClassifiers[preprocess]` or `pip install torchTextClassifiers[explainability]` to install these optional dependencies.
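
For instance, here is a hedged sketch of obtaining attributions (not verbatim API: `clf` stands for an already-trained classifier as built in the Quick Start below, and the `(predictions, attributions)` return shape follows the mocks in `tests/conftest.py`):

```python
import numpy as np

# Sketch only: `clf` is assumed to be a trained classifier (see the Quick Start
# below); the tuple return shape mirrors the mocked predict_and_explain in
# tests/conftest.py and may differ from the real API.
X_new = np.array([["handmade wooden furniture", 1, 2]], dtype=object)

predictions, attributions = clf.predict_and_explain(X_new)
print(predictions)   # predicted class indices
print(attributions)  # feature attribution scores
```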

## Quick Start

```python
from torchTextClassifiers import torchTextClassifiers

# Initialize the model
model = torchTextClassifiers(
    num_tokens=1000000,
    embedding_dim=100,
    min_count=5,
    min_n=3,
    max_n=6,
    len_word_ngrams=True,
    sparse=True
)

# Train the model
model.train(
    X_train=train_data,
    y_train=train_labels,
    X_val=val_data,
    y_val=val_labels,
    num_epochs=10,
    batch_size=64,
    lr=4e-3
)

# Make predictions
predictions = model.predict(test_data)
```

where `train_data` is an array of shape $(N, d)$ whose first column holds the raw text as strings and whose remaining columns hold the categorical variables encoded as `int`.

Please make sure `y_train` contains each possible label at least once.
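
For illustration, a minimal sketch of this layout (the values are made up):

```python
import numpy as np

# Illustrative layout only: first column is raw text, the remaining columns
# are integer-encoded categorical variables; dtype=object keeps mixed types.
train_data = np.array(
    [
        ["fresh bread and pastries", 0, 3],
        ["industrial equipment rental", 1, 7],
        ["organic grocery store", 0, 3],
        ["car repair workshop", 1, 7],
    ],
    dtype=object,
)

# Each possible label appears at least once.
train_labels = np.array([0, 1, 0, 1])
```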

## Dependencies

- PyTorch Lightning
- NumPy

## Categorical features

If any, each categorical feature $i$ is associated with an embedding matrix of size (number of unique values, embedding dimension), where the embedding dimension is a user-chosen hyperparameter (`categorical_embedding_dims`) that can take three types of values:

- `None`: same embedding dimension as the token embedding matrix. The categorical embeddings are then summed with the sentence-level embedding (which is itself an average of the token embeddings). See [Figure 1](#figure-1).
- `int`: the categorical embeddings all share the same embedding dimension; they are averaged and the resulting vector is concatenated to the sentence-level embedding (the last linear layer has an adapted input size). See [Figure 2](#figure-2).
- `list`: the categorical embeddings have different embedding dimensions; all of them are concatenated, without aggregation, to the sentence-level embedding (the last linear layer has an adapted input size). See [Figure 3](#figure-3).

Default is `None`.
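
A hedged sketch of the three options (the other constructor arguments mirror the Quick Start above; whether `categorical_embedding_dims` is passed exactly like this is an assumption):

```python
from torchTextClassifiers import torchTextClassifiers

# Sketch only: illustrates the three accepted value types for
# categorical_embedding_dims; the other arguments mirror the Quick Start above.
common = dict(num_tokens=1000000, embedding_dim=100, min_count=5,
              min_n=3, max_n=6, len_word_ngrams=True, sparse=True)

clf_sum = torchTextClassifiers(**common, categorical_embedding_dims=None)      # Figure 1: summed
clf_avg = torchTextClassifiers(**common, categorical_embedding_dims=50)        # Figure 2: averaged, then concatenated
clf_cat = torchTextClassifiers(**common, categorical_embedding_dims=[20, 10])  # Figure 3: all concatenated
```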

<a name="figure-1"></a>
![Default-architecture](images/NN.drawio.png "Default architecture")
*Figure 1: The 'sum' architecture*

<a name="figure-2"></a>
![avg-architecture](images/avg_concat.png "Average and concatenate architecture")
*Figure 2: The 'average and concatenate' architecture*

<a name="figure-3"></a>
![concat-architecture](images/full_concat.png "Concatenate all architecture")
*Figure 3: The 'concatenate all' architecture*

## Documentation

For detailed usage and examples, please refer to the [example notebook](notebooks/example.ipynb). After cloning the repository, run `pip install -r requirements.txt` to install the necessary dependencies (some are specific to the notebook).

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

MIT

## References

Inspired by the original FastText paper [1] and implementation.

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, [*Bag of Tricks for Efficient Text Classification*](https://arxiv.org/abs/1607.01759)

```
@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}
```

tests/conftest.py

Lines changed: 112 additions & 0 deletions
@@ -0,0 +1,112 @@
import pytest
import numpy as np
from unittest.mock import Mock, MagicMock


@pytest.fixture
def sample_text_data():
    """Sample text data for testing."""
    return np.array([
        "This is a positive example",
        "This is a negative example",
        "Another positive case",
        "Another negative case",
        "Good example here",
        "Bad example here"
    ])


@pytest.fixture
def sample_labels():
    """Sample labels for testing."""
    return np.array([1, 0, 1, 0, 1, 0])


@pytest.fixture
def sample_categorical_data():
    """Sample categorical data for testing."""
    return np.array([
        [1, 2],
        [2, 1],
        [1, 3],
        [3, 1],
        [2, 2],
        [3, 3]
    ])


@pytest.fixture
def sample_X_with_categorical(sample_text_data, sample_categorical_data):
    """Sample X data with categorical variables."""
    return np.column_stack([sample_text_data, sample_categorical_data])


@pytest.fixture
def sample_X_text_only(sample_text_data):
    """Sample X data with text only."""
    return sample_text_data.reshape(-1, 1)


@pytest.fixture
def fasttext_config():
    """Mock FastText configuration."""
    config = Mock()
    config.embedding_dim = 10
    config.sparse = False
    config.num_tokens = 1000
    config.min_count = 1
    config.min_n = 3
    config.max_n = 6
    config.len_word_ngrams = 2
    config.num_classes = 2
    config.num_rows = None
    config.num_categorical_features = None
    config.categorical_vocabulary_sizes = None
    config.to_dict = Mock(return_value={'embedding_dim': 10, 'sparse': False})
    return config


@pytest.fixture
def mock_tokenizer():
    """Mock NGramTokenizer for testing."""
    tokenizer = Mock()
    tokenizer.min_count = 1
    tokenizer.min_n = 3
    tokenizer.max_n = 6
    tokenizer.num_tokens = 1000
    tokenizer.word_ngrams = 2
    tokenizer.padding_index = 999
    return tokenizer


@pytest.fixture
def mock_pytorch_model():
    """Mock PyTorch model for testing."""
    model = Mock()
    model.eval = Mock()
    model.to = Mock(return_value=model)
    model.predict = Mock(return_value=np.array([1, 0, 1]))
    model.predict_and_explain = Mock(return_value=(np.array([1, 0, 1]), np.array([0.8, 0.2, 0.9])))
    return model


@pytest.fixture
def mock_lightning_module():
    """Mock Lightning module for testing."""
    module = Mock()
    module.model = Mock()
    return module


@pytest.fixture
def mock_dataset():
    """Mock dataset for testing."""
    dataset = Mock()
    dataset.create_dataloader = Mock()
    return dataset


@pytest.fixture
def mock_dataloader():
    """Mock dataloader for testing."""
    return Mock()
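
A hedged illustration of how these fixtures might be consumed (the test module and test names below are hypothetical, not part of this commit):

```python
# Hypothetical test sketch: shows how the conftest fixtures compose.
# pytest injects fixture values by matching argument names.
import numpy as np


def test_sample_X_with_categorical_shape(sample_X_with_categorical, sample_labels):
    # Six rows: one text column plus two integer-encoded categorical columns.
    assert sample_X_with_categorical.shape == (6, 3)
    assert len(sample_labels) == 6


def test_mock_model_predict_and_explain(mock_pytorch_model):
    predictions, attributions = mock_pytorch_model.predict_and_explain(None)
    assert np.array_equal(predictions, np.array([1, 0, 1]))
    assert attributions.shape == (3,)
```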
