tagine_fe is a scikit-learn-compatible feature selection library designed for taxonomic data, implementing the TAGINE algorithm. It enables efficient and interpretable feature engineering for microbiome and other hierarchical datasets.
Clone the repository and install using pip:

```bash
git clone https://github.com/borenstein-lab/tagine_fe
cd tagine_fe
pip install .
```

For mutual information-based methods (the 'mi' selection method), install with the mutual information extra:

```bash
pip install .[mutual_info]
```

Or install all optional dependencies:

```bash
pip install .[all]
```

- Python >= 3.8
- numpy
- pandas
- scikit-learn
- ete3
- statsmodels
- NPEET (for mutual information calculations) - only required when using selection_method='mi'
To avoid dependency issues, using a virtual environment is recommended; alternatively, the library can be installed and run within a Docker container.
Here's a minimal example of using tagine_fe:
```python
import pandas as pd
from tagine_fe.feature_engineering import TagineFE

# X: DataFrame with taxonomic features (columns are taxa, rows are samples)
# Column names should contain full taxon names separated by semicolons,
# e.g., "d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;..."
# y: Series or array with target labels (binary or multiclass)
tfe = TagineFE()
tfe.fit(X, y)

# tfe.transform returns a DataFrame with the new features
X_transformed = tfe.transform(X)
```

The TagineFE object can also be used as part of a scikit-learn pipeline.
See example (using data from The Curated Gut Microbiome Metabolome Data Resource) in the examples folder.
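For concreteness, a toy input in the expected shape can be built with pandas alone (the taxon names and label values below are invented for illustration and are not from the library):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Invented taxa, formatted as semicolon-separated taxonomic paths
taxa = [
    "d__Bacteria;p__Firmicutes;c__Clostridia",
    "d__Bacteria;p__Firmicutes;c__Bacilli",
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia",
]

# 10 samples with random abundances, plus binary case/control labels
X = pd.DataFrame(rng.random((10, len(taxa))), columns=taxa)
y = pd.Series(rng.integers(0, 2, size=10), name="label")

print(X.shape)  # (10, 3)
```

An `X` and `y` of this shape can then be passed directly to `TagineFE().fit(X, y)`.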
- X (DataFrame): Feature matrix where rows are samples and columns are taxonomic features. Column names must contain full taxonomic paths separated by semicolons (e.g., "d__Bacteria;p__Firmicutes;c__Clostridia")
- y (Series/array): Target labels supporting both binary and multiclass classification
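As a sketch of what the semicolon-separated format encodes, a small helper (hypothetical, not part of the tagine_fe API) can split a path into rank prefixes and names:

```python
# Hypothetical helper, not part of tagine_fe: split a semicolon-separated
# taxonomic path into {rank_prefix: name} pairs.
def parse_taxon_path(path: str) -> dict:
    ranks = {}
    for level in path.split(";"):
        prefix, _, name = level.partition("__")
        ranks[prefix] = name
    return ranks

print(parse_taxon_path("d__Bacteria;p__Firmicutes;c__Clostridia"))
# {'d': 'Bacteria', 'p': 'Firmicutes', 'c': 'Clostridia'}
```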
The TagineFE class provides several configuration options:
```python
# Initialize with custom parameters
tfe = TagineFE(
    selection_method='aic',   # Statistical test method
    leaf_filter='l1_lr',      # Method to filter significant features
    n_initial_tree_layers=2,  # Number of initial tree layers
    n_permutations=100,       # Permutations for 'll_perm' and 'mi' methods
    n_workers=8,              # Parallel workers for permutation methods
    sig_pvalue=0.1,           # Significance threshold
    verbose=True,             # Enable logging
    debug_verbose=False,      # Enable detailed debug logs
    random_seed=42            # Set random seed for reproducibility
)

# Fit the model
tfe.fit(X, y)

# Access the constructed taxonomic tree (ete3 tree object) after fitting;
# the tree can be used for further analysis or visualization
tree = tfe.species_tree
print(f"Tree has {len(list(tree.traverse()))} nodes")
print(f"Selected features: {len(tfe.aggregated_columns)} groups")

leaves = tree.get_leaves()
print(f"Number of leaf nodes: {len(leaves)}")

# Transform data
X_transformed = tfe.transform(X)
```

The statistical test used to determine node splits:
- 'aic': Akaike Information Criterion (default)
- 'wilk': Wilks' test
- 'll_perm': Log-likelihood permutation test
- 'mi': Mutual information permutation test (requires NPEET: pip install .[mutual_info])
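The idea behind AIC-based splitting can be sketched in a simplified form (this is an illustration of the criterion, not tagine_fe's actual implementation): a split is worth keeping when the gain in log-likelihood outweighs the penalty for the extra parameters. For example, comparing per-group Bernoulli models of the label:

```python
import numpy as np

def bernoulli_aic(y, groups):
    """AIC of a model that fits P(y=1) separately within each group.
    One parameter per group; illustration only, not tagine_fe's internals."""
    y = np.asarray(y, dtype=float)
    ll, k = 0.0, 0
    for g in np.unique(groups):
        yg = y[groups == g]
        p = np.clip(yg.mean(), 1e-9, 1 - 1e-9)  # MLE of the group probability
        ll += np.sum(yg * np.log(p) + (1 - yg) * np.log(1 - p))
        k += 1
    return 2 * k - 2 * ll  # AIC = 2k - 2 ln L

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
parent = np.zeros(8, dtype=int)                 # unsplit node: one probability
children = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # split into two child nodes

# The split separates the labels well, so its AIC is lower despite
# the extra parameter
print(bernoulli_aic(y, parent) > bernoulli_aic(y, children))  # True
```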
If a node is split, its children can be further filtered in several ways:
- 'by_method': Automatically selects the filter based on the selection method ('l1_lr' for likelihood-based methods and 'mi' for information-based methods) (default)
- 'l1_lr': Keeps nodes with significant coefficients in an L1-regularized logistic regression
- 'mi': Keeps nodes with higher mutual information with the label than the parent (requires NPEET: pip install .[mutual_info])
- 'none': No filtering (all child nodes are kept after the split)
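The parent-versus-child comparison behind the 'mi' filter can be sketched with a plug-in estimator for discrete variables (tagine_fe relies on NPEET's estimators; the function below is a simplified stand-in for illustration only):

```python
import numpy as np

def discrete_mi(x, y):
    """Mutual information between two discrete arrays, in nats.
    Simplified plug-in estimator; tagine_fe uses NPEET instead."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
parent = np.array([0, 0, 1, 1, 0, 0, 1, 1])  # aggregate uncorrelated with label
child = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # perfectly tracks the label

# A child is kept only if its MI with the label exceeds the parent's
print(discrete_mi(child, y) > discrete_mi(parent, y))  # True
```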
MIT License. See the LICENSE file for details.