Feature engineering tool based on simple arithmetic operations between each pair of numeric features. This is intended both to increase the accuracy and interpretability of models for some datasets. Experimental results are included below demonstrating its utility for some models on many datasets.
The tool uses the same signature, based on the fit-tranform pattern, as sklearn's PolynomialFeatures and RotationFeatures.
The tool simply generates additional numeric features through the application of basic arithmetic operations (+, -, *, /, and optionally min and max) to each pair of numeric features. It is posible to apply repeatedly, optionally interspersed with feature selection, to create higher-order generated features, which may capture more complex feature intereactions. In our experiments, executing once is typically sufficient to capture most feature interactions, and feature selection is often not necessary depending on the model using the generated features.
from sklearn.datasets import load_iris
import pandas as pd
iris = load_iris()
X, y = iris.data, iris.target
arith = ArithmeticFeatures()
extended_X = pd.DataFrame(arith.fit_transform(X), columns=arith.get_feature_names())
X_train, X_test, y_train, y_test = train_test_split(extended_X, y, random_state=42)
dt = tree.DecisionTreeClassifier(max_depth=4, random_state=0)
dt.fit(X_train, y_train)
y_pred = dt.predict(X_test)Simple_Test_Arithmetic-Based_Feature_Generation Provides simple examples using the tool
Accuracy_Test_ArithmeticFeatures Provides more thorough testing of the tool. This utilizes the DatasetsEvaluator tool, which can simplify testing on large numbers of datasets. The file compares both classification and regression problems on several sklearn predictors (Decision Trees, RandomForest, kNN, Logistion Regression, Lasso Linear Regression, Gaussian Naive Bayes, ExtraTrees, AdaBoost, and GradientBoost). Further, it performs tests using cross-validated grid search to determine the best settings for feature generation using ArithmeticFeatures. Some results are included below.
This shows two plots: the top for accuracy (higher is better) using a macro f1 score, and the second for complexity (smaller is better). It can be seen that while ArithmeticFeatures often provides for higher accuracy, the overal accuracy is quite similar. However, the model complexity (measured by number of nodes) is consistently lower, allowing for more interpretable models. The x-axis orders the datasets from least to highest accuracy on the baseline, the standard sklearn model with no generated features.
This again shows the accuracy, though not consitently, often higher using ArithmeticFeatures.
Similar results as other models.
This shows similar results, but was slow to execute over 100 datasets, so was removed from the test file, along with QDA.