TAGINE Feature Engineering

tagine_fe is a scikit-learn-compatible feature selection library designed for taxonomic data, implementing the TAGINE algorithm. It enables efficient and interpretable feature engineering for microbiome and other hierarchical datasets.

Installation

Clone the repository and install using pip:

git clone https://github.com/borenstein-lab/tagine_fe
cd tagine_fe
pip install .

Optional Dependencies

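The mutual information-based options (selection_method='mi' or leaf_filter='mi') need the mutual information extra:

pip install ".[mutual_info]"

Or install all optional dependencies:

pip install ".[all]"

Quoting the argument keeps shells such as zsh from treating the square brackets as a glob pattern.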

Requirements

  • Python >= 3.8
  • numpy
  • pandas
  • scikit-learn
  • ete3
  • statsmodels

Optional Requirements

  • NPEET (for mutual information calculations) - only required when using selection_method='mi' or leaf_filter='mi'

To avoid dependency conflicts, install the library in a virtual environment or run it inside a Docker container.

Usage

Basic Usage

Here's a minimal example of using tagine_fe:

import pandas as pd
from tagine_fe.feature_engineering import TagineFE

# X: DataFrame with taxonomic features (columns are taxa, rows are samples)
# Column names should contain full taxon names separated by semicolons
# e.g., "d__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;..."
# y: Series or array with target labels (binary or multiclass)

tfe = TagineFE()
tfe.fit(X, y)
# tfe.transform returns a dataframe with the new features
X_transformed = tfe.transform(X)

The TagineFE object can also be used as a step in a scikit-learn pipeline. See the example in the examples folder, which uses data from The Curated Gut Microbiome Metabolome Data Resource.

Data Format Requirements

  • X (DataFrame): Feature matrix where rows are samples and columns are taxonomic features
  • Column names: Must contain full taxonomic paths separated by semicolons (e.g., "d__Bacteria;p__Firmicutes;c__Clostridia")
  • y (Series/array): Target labels supporting both binary and multiclass classification
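
The column-name requirement can be checked before fitting. The following is a minimal sketch and not part of tagine_fe: split_taxon_path and find_malformed_columns are hypothetical helpers, and both the rank-prefix pattern (such as d__ or p__) and the two-level minimum are assumptions inferred from the example names above.

```python
import re

# Assumed rank-prefix convention (e.g. "d__Bacteria"), inferred from the
# example column names in this README; adjust it if your data differs.
RANK_PATTERN = re.compile(r"^[a-z]__.*$")


def split_taxon_path(column_name):
    """Split a semicolon-separated taxonomic path into its levels."""
    return [level.strip() for level in column_name.split(";") if level.strip()]


def find_malformed_columns(column_names):
    """Return the column names that do not look like full taxonomic paths."""
    malformed = []
    for name in column_names:
        levels = split_taxon_path(name)
        # Assumption: a usable path has at least two levels, and every
        # level carries a rank prefix.
        if len(levels) < 2 or not all(RANK_PATTERN.match(lv) for lv in levels):
            malformed.append(name)
    return malformed


columns = [
    "d__Bacteria;p__Firmicutes;c__Clostridia",
    "d__Bacteria;p__Bacteroidota",
    "Firmicutes",  # only a bare name, not a path
]
print(find_malformed_columns(columns))  # ['Firmicutes']
```

With a DataFrame, the same check would be called as find_malformed_columns(X.columns).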

Advanced Usage

The TagineFE class provides several configuration options:

# Initialize with custom parameters
tfe = TagineFE(
    selection_method='aic',          # Statistical test used to decide node splits
    leaf_filter='l1_lr',             # Filter applied to child nodes after a split
    n_initial_tree_layers=2,         # Number of initial tree layers
    n_permutations=100,              # Permutations for 'll_perm' and 'mi' methods
    n_workers=8,                     # Parallel workers for permutation methods
    sig_pvalue=0.1,                  # Significance threshold
    verbose=True,                    # Enable logging
    debug_verbose=False,             # Enable detailed debug logs
    random_seed=42                   # Set random seed for reproducibility
)

# Fit the model
tfe.fit(X, y)

# Access the constructed taxonomic tree (ete3 tree object) after fitting
# The tree can be used for further analysis or visualization
tree = tfe.species_tree
print(f"Tree has {len(list(tree.traverse()))} nodes")
print(f"Selected features: {len(tfe.aggregated_columns)} groups")
leaves = tree.get_leaves()
print(f"Number of leaf nodes: {len(leaves)}")

# Transform data
X_transformed = tfe.transform(X)

Selection Methods

The selection_method parameter sets the statistical test used to decide whether a node is split:

  • 'aic': Akaike Information Criterion (Default)
  • 'wilk': Wilks' Test
  • 'll_perm': Log-likelihood permutation test
  • 'mi': Mutual information permutation test (requires NPEET: pip install .[mutual_info])
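
For readers unfamiliar with these criteria, here is a minimal, generic sketch of an AIC comparison and a permutation p-value. It illustrates the general techniques only and is not the tagine_fe implementation: the function names, the made-up log-likelihoods, and the toy mean_difference statistic are all assumptions, and this README does not specify which models the library fits or which statistic it permutes.

```python
import random


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2*ln(L). Lower is better."""
    return 2 * n_params - 2 * log_likelihood


def permutation_p_value(statistic, x, y, n_permutations=100, seed=42):
    """One-sided permutation p-value for statistic(x, y).

    The labels y are shuffled to break any association with x. The p-value
    is the smoothed fraction of shuffles whose statistic is at least as
    large as the observed one.
    """
    rng = random.Random(seed)
    observed = statistic(x, y)
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(x, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


def mean_difference(x, y):
    """Toy statistic: absolute difference in mean x between the two classes."""
    pos = [xi for xi, yi in zip(x, y) if yi == 1]
    neg = [xi for xi, yi in zip(x, y) if yi == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


# AIC comparison with made-up log-likelihoods: the larger model wins only
# if its likelihood gain outweighs the penalty for its extra parameters.
parent_aic = aic(log_likelihood=-60.0, n_params=2)    # 124.0
children_aic = aic(log_likelihood=-52.0, n_params=4)  # 112.0
print("split" if children_aic < parent_aic else "keep parent")

# Permutation test on toy data in which x separates the two classes.
x = [0.1, 0.2, 0.15, 0.3, 0.9, 1.1, 0.95, 1.2]
y = [0, 0, 0, 0, 1, 1, 1, 1]
p_value = permutation_p_value(mean_difference, x, y, n_permutations=100)
print(f"permutation p-value: {p_value:.3f}")
```

In this sketch the smallest attainable p-value is 1 / (n_permutations + 1), so the number of permutations bounds how small a significance threshold can meaningfully be.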

Leaf Filters

The leaf_filter parameter controls how the children of a split node are filtered:

  • 'by_method': Automatically selects the filter based on the selection method (l1_lr for likelihood-based methods and mi for information-based methods) (Default)
  • 'l1_lr': Keeps nodes with significant coefficients in an L1-regularized logistic regression
  • 'mi': Keeps nodes with higher mutual information with the label than the parent (requires NPEET: pip install .[mutual_info])
  • 'none': No filtering (all child nodes kept after split)
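
The rule behind the 'mi' filter (keep a child only if it is more informative about the label than its parent) can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it swaps in a simple plug-in estimator on discretized values rather than the NPEET estimators that tagine_fe depends on, and discrete_mutual_information, keep_children_above_parent, and the toy data are hypothetical.

```python
import math
from collections import Counter


def discrete_mutual_information(a, b):
    """Plug-in mutual information (in nats) between two discrete sequences."""
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    count_ab = Counter(zip(a, b))
    mi = 0.0
    for (va, vb), c in count_ab.items():
        # p(a,b) * log( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * math.log(c * n / (count_a[va] * count_b[vb]))
    return mi


def keep_children_above_parent(parent, children, labels):
    """Keep children whose MI with the labels exceeds the parent's MI."""
    parent_mi = discrete_mutual_information(parent, labels)
    return [
        name
        for name, values in children.items()
        if discrete_mutual_information(values, labels) > parent_mi
    ]


labels = [0, 0, 0, 0, 1, 1, 1, 1]
# Binarized (high/low) abundance of a parent taxon and two children (toy data).
parent = [0, 0, 1, 0, 1, 1, 1, 0]
children = {
    "child_a": [0, 0, 0, 0, 1, 1, 1, 1],  # matches the label exactly
    "child_b": [1, 0, 1, 0, 1, 0, 1, 0],  # independent of the label
}
print(keep_children_above_parent(parent, children, labels))  # ['child_a']
```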

License

MIT License. See the LICENSE file.
