tagine_fe is a scikit-learn-compatible feature selection library designed for taxonomic data, implementing the TAGINE algorithm. It enables efficient and interpretable feature engineering for microbiome and other hierarchical datasets.
Clone the repository and install using pip:

```bash
git clone https://github.com/borenstein-lab/tagine_fe
cd tagine_fe
pip install .
```

For mutual information-based methods (the 'mi' selection method), install with the mutual information extra:

```bash
pip install .[mutual_info]
```

Or install all optional dependencies:

```bash
pip install .[all]
```

- Python >= 3.8
- numpy
- pandas
- scikit-learn
- ete3
- statsmodels
- NPEET (for mutual information calculations) - only required when using selection_method='mi'
To avoid dependency issues, using a virtual environment is recommended; alternatively, the library can be installed and run within a Docker container.
Here's a minimal example of using tagine_fe:
```python
import pandas as pd
from tagine_fe.feature_engineering import TagineFE

# X: DataFrame with taxonomic features (columns are taxa, rows are samples)
# Column names should contain full taxon names separated by semicolons,
# e.g., "d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;..."
# y: Series or array with target labels (binary or multiclass)
tfe = TagineFE()
tfe.fit(X, y)

# tfe.transform returns a DataFrame with the new features
X_transformed = tfe.transform(X)
```

The TagineFE object can also be used as part of a scikit-learn pipeline.
See example (using data from The Curated Gut Microbiome Metabolome Data Resource) in the examples folder.
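For concreteness, a toy input in the expected shape can be built with pandas alone (the taxon names and label values below are invented for illustration and are not from the library):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Invented taxa, formatted as semicolon-separated taxonomic paths
taxa = [
    "d__Bacteria;p__Firmicutes;c__Clostridia",
    "d__Bacteria;p__Firmicutes;c__Bacilli",
    "d__Bacteria;p__Bacteroidota;c__Bacteroidia",
]

# 10 samples with random abundances, plus binary case/control labels
X = pd.DataFrame(rng.random((10, len(taxa))), columns=taxa)
y = pd.Series(rng.integers(0, 2, size=10), name="label")

print(X.shape)  # (10, 3)
```

An `X` and `y` of this shape can then be passed directly to `TagineFE().fit(X, y)`.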
- X (DataFrame): Feature matrix where rows are samples and columns are taxonomic features. Column names must contain full taxonomic paths separated by semicolons (e.g., "d__Bacteria;p__Firmicutes;c__Clostridia")
- y (Series/array): Target labels supporting both binary and multiclass classification
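As a sketch of what the semicolon-separated format encodes, a small helper (hypothetical, not part of the tagine_fe API) can split a path into rank prefixes and names:

```python
# Hypothetical helper, not part of tagine_fe: split a semicolon-separated
# taxonomic path into {rank_prefix: name} pairs.
def parse_taxon_path(path: str) -> dict:
    ranks = {}
    for level in path.split(";"):
        prefix, _, name = level.partition("__")
        ranks[prefix] = name
    return ranks

print(parse_taxon_path("d__Bacteria;p__Firmicutes;c__Clostridia"))
# {'d': 'Bacteria', 'p': 'Firmicutes', 'c': 'Clostridia'}
```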
The TagineFE class provides several configuration options:
```python
# Initialize with custom parameters
tfe = TagineFE(
    selection_method='aic',   # Statistical test method
    leaf_filter='l1_lr',      # Method to filter significant features
    n_initial_tree_layers=2,  # Number of initial tree layers
    n_permutations=100,       # Permutations for 'll_perm' and 'mi' methods
    n_workers=8,              # Parallel workers for permutation methods
    sig_pvalue=0.1,           # Significance threshold
    verbose=True,             # Enable logging
    debug_verbose=False,      # Enable detailed debug logs
    random_seed=42            # Set random seed for reproducibility
)

# Fit the model
tfe.fit(X, y)

# Access the constructed taxonomic tree (ete3 tree object) after fitting;
# the tree can be used for further analysis or visualization
tree = tfe.species_tree
print(f"Tree has {len(list(tree.traverse()))} nodes")
print(f"Selected features: {len(tfe.aggregated_columns)} groups")

leaves = tree.get_leaves()
print(f"Number of leaf nodes: {len(leaves)}")

# Transform data
X_transformed = tfe.transform(X)
```

The statistical test used to determine node splits:
- 'aic': Akaike Information Criterion (default)
- 'wilk': Wilks' test
- 'll_perm': Log-likelihood permutation test
- 'mi': Mutual information permutation test (requires NPEET: pip install .[mutual_info])
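The idea behind AIC-based splitting can be sketched in a simplified form (this is an illustration of the criterion, not tagine_fe's actual implementation): a split is worth keeping when the gain in log-likelihood outweighs the penalty for the extra parameters. For example, comparing per-group Bernoulli models of the label:

```python
import numpy as np

def bernoulli_aic(y, groups):
    """AIC of a model that fits P(y=1) separately within each group.
    One parameter per group; illustration only, not tagine_fe's internals."""
    y = np.asarray(y, dtype=float)
    ll, k = 0.0, 0
    for g in np.unique(groups):
        yg = y[groups == g]
        p = np.clip(yg.mean(), 1e-9, 1 - 1e-9)  # MLE of the group probability
        ll += np.sum(yg * np.log(p) + (1 - yg) * np.log(1 - p))
        k += 1
    return 2 * k - 2 * ll  # AIC = 2k - 2 ln L

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
parent = np.zeros(8, dtype=int)                 # unsplit node: one probability
children = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # split into two child nodes

# The split separates the labels well, so its AIC is lower despite
# the extra parameter
print(bernoulli_aic(y, parent) > bernoulli_aic(y, children))  # True
```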
If a node is split, its children can be further filtered in several ways:
- 'by_method': Automatically selects the filter based on the selection method ('l1_lr' for likelihood-based methods and 'mi' for information-based methods) (default)
- 'l1_lr': Keeps nodes with significant coefficients in an L1-regularized logistic regression
- 'mi': Keeps nodes with higher mutual information with the label than the parent (requires NPEET: pip install .[mutual_info])
- 'none': No filtering (all child nodes are kept after the split)
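The parent-versus-child comparison behind the 'mi' filter can be sketched with a plug-in estimator for discrete variables (tagine_fe relies on NPEET's estimators; the function below is a simplified stand-in for illustration only):

```python
import numpy as np

def discrete_mi(x, y):
    """Mutual information between two discrete arrays, in nats.
    Simplified plug-in estimator; tagine_fe uses NPEET instead."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

y = np.array([0, 0, 0, 0, 1, 1, 1, 1])
parent = np.array([0, 0, 1, 1, 0, 0, 1, 1])  # aggregate uncorrelated with label
child = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # perfectly tracks the label

# A child is kept only if its MI with the label exceeds the parent's
print(discrete_mi(child, y) > discrete_mi(parent, y))  # True
```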
MIT License. See the LICENSE file for details.