Ensemble proposal #281
Conversation
This tutorial contains an example with multiple indexes and subsequent tampering of the data, in order to show resiliency and a real-life use case of TC applied to cryptocurrency prices up to 2021.
Fix for the issue raised on the empty dataframe, which resulted from an inner merge where the existing dataframe that accumulated results and the newer one had different indices. A follow-up issue should be raised to either: (1) report a single-model failure (on index matching), or (2) fix the moirai discrepancy (the only model that showed this issue).
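A minimal pandas sketch of the failure mode described above (hypothetical frame and column names; the outer merge is shown only as one possible mitigation, not as the fix adopted in this PR):

```python
import pandas as pd

# Accumulated results and a new model's output that disagree on (unique_id, ds) keys.
acc = pd.DataFrame({"unique_id": ["A", "A"], "ds": [1, 2], "model_x": [10.0, 11.0]})
new = pd.DataFrame({"unique_id": ["A", "A"], "ds": [3, 4], "model_y": [9.0, 12.0]})

# Inner merge drops every row because no keys overlap -> empty dataframe.
inner = acc.merge(new, on=["unique_id", "ds"], how="inner")

# An outer merge keeps all rows, with NaN where a model produced nothing.
outer = acc.merge(new, on=["unique_id", "ds"], how="outer")

print(len(inner), len(outer))
```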
… the title suggested
…age ensemble. This alternative is proposed as another idea for the value determined by TimeCopilot itself as to what value to provide to the user. There are two caveats to consider. 1.- The current GiftEval execution involves 3 models, so this alternative will not make any difference in the result emission and would have to consider more models (at least 2) to see if there is any meaningful difference in the results provided by the ensembles. 2.- This commit does not include a script to try this, because of current hardware limitations and because of point number 1. This should be discussed beforehand in order to set up a future implementation if it is determined viable and worthy of further interest. There is, however, an ensembles first-run script that works with monthly data as a check that the current trimmed ensemble is working, though results are identical in the smape metric for 50 timeseries adjustments.
AzulGarza
left a comment
thanks so much for working on this, @elmartinj! would it be possible to fix the conflicts with the main branch?
Done!
AzulGarza
left a comment
thanks @elmartinj! i left a couple of initial comments. i'm also adding copilot as reviewer to have its comments.
import numpy as np
import pandas as pd

sys.path.insert(0, "/home/el-cacomixtle/timecopilot/")
love the el cacomixtle name haha but the line can cause some issues to users.
pretty cool experiment! can we create a new dir for it and also add a README inside with reproduction instructions? other experiments in the dir can serve as inspiration.
also, to standardize evaluation and data extraction, wdyt about using losses from utilsforecast and data from datasetsforecast? the latter includes test-train splits for M4.
| """ | ||
| TrimmedEnsemble (alternate ensemble to MedianEnsemble) | ||
|
|
||
| Purpose | ||
| ------- | ||
| A robust ensemble that aggregates model forecasts using a | ||
| *trimmed mean* for quantiles | ||
| (and optionally for point forecasts), with safety rails: | ||
|
|
||
| 1) Fixed trimming per row (unique_id, ds): we first compute the | ||
| *minimum* number of | ||
| available contributors across all requested quantiles for that row (n_min). | ||
| Then we decide how much to trim based on n_min, and apply the same trimming | ||
| intensity to every quantile in that row. | ||
|
|
||
| 2) Minimum contributor quota: if n_min < min_quota, we fall back to the *median* | ||
| aggregation for quantiles for that row (and skip isotonic repair). | ||
|
|
||
| 3) Monotone quantiles: when a full quantile vector exists (no NaNs) and the row | ||
| did not fallback, we run isotonic regression to enforce: | ||
| q10 <= q50 <= q90 <= ... | ||
| This is a "repair" step only for monotonicity, not a modeling step. | ||
|
|
||
| Notes | ||
| ----- | ||
| - This ensemble tolerates missing quantile columns per model (point-only models). |
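As context for point 3 of the docstring: on a fixed quantile grid, this kind of isotonic repair is the classic pool-adjacent-violators (PAV) projection. A numpy-only sketch of the idea (not the PR's sklearn-based code; `pav_increasing` is a hypothetical name):

```python
import numpy as np

def pav_increasing(y):
    """Project y onto non-decreasing sequences (L2) via pool-adjacent-violators."""
    y = np.asarray(y, dtype=float)
    vals, sizes = [], []  # block means and block sizes
    for v in y:
        vals.append(float(v))
        sizes.append(1)
        # Merge backwards while adjacent blocks violate the ordering.
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v2, n2 = vals.pop(), sizes.pop()
            v1, n1 = vals.pop(), sizes.pop()
            vals.append((v1 * n1 + v2 * n2) / (n1 + n2))
            sizes.append(n1 + n2)
    return np.concatenate([np.full(n, v) for v, n in zip(vals, sizes)])

# A crossed quantile vector (q50 above q90) gets pooled into a monotone one.
print(pav_increasing([1.0, 3.0, 2.0]))
```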
nice, would it be possible to use google's docstrings format? here is an example from our docs.
also documenting the arguments would be very beneficial for our users.
Pull request overview
This PR introduces a TrimmedEnsemble forecasting model as an alternative to the existing MedianEnsemble. The trimmed mean approach drops extreme values (smallest and largest forecasts) before averaging, with the trimming percentage adapting based on the number of contributing models. The PR includes both the implementation and an experimental script to validate the new ensemble method on M4 monthly data.
- Adds TrimmedEnsemble class with adaptive trimming logic (trim 1, 20%, or 10% based on contributor count)
- Implements fallback to median aggregation when insufficient contributors are available
- Includes isotonic regression for quantile monotonicity enforcement
- Provides experiment script for comparing MedianEnsemble vs TrimmedEnsemble on 50 M4 monthly series
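The adaptive trimming summarized in the first bullet can be sketched as follows (a standalone illustration with hypothetical names mirroring the PR's rule, not the PR code itself):

```python
import numpy as np

def trim_k(n_min: int) -> int:
    # Adaptive rule as summarized above: trim 1, 20%, or 10% per tail
    # depending on how many models contributed.
    if n_min <= 2:
        return 0
    if n_min <= 4:
        return 1
    if n_min <= 7:
        return int(np.floor(0.20 * n_min))
    return int(np.floor(0.10 * n_min))

def trimmed_mean(values, k):
    # Drop k smallest and k largest, then average; fall back to a plain
    # mean if trimming would remove everything.
    x = np.sort(np.asarray(values, dtype=float))
    if k <= 0 or 2 * k >= x.size:
        return float(x.mean())
    return float(x[k:-k].mean())

forecasts = [100.0, 102.0, 103.0, 104.0, 180.0]  # one outlier among 5 models
k = trim_k(len(forecasts))           # 5 contributors -> trim floor(0.2 * 5) = 1 per tail
print(trimmed_mean(forecasts, k))    # 103.0, vs. a plain mean of 117.8
```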
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| timecopilot/models/ensembles/trimmed.py | New TrimmedEnsemble class implementing adaptive trimmed mean aggregation with quantile support and isotonic regression for monotonicity |
| experiments/ensembles-first-run.py | Experimental script to compare MedianEnsemble and TrimmedEnsemble performance on M4 monthly time series data |
class TrimmedEnsemble(Forecaster):
    """
    TrimmedEnsemble (alternate ensemble to MedianEnsemble)

    Purpose
    -------
    A robust ensemble that aggregates model forecasts using a
    *trimmed mean* for quantiles
    (and optionally for point forecasts), with safety rails:

    1) Fixed trimming per row (unique_id, ds): we first compute the
       *minimum* number of
       available contributors across all requested quantiles for that row (n_min).
       Then we decide how much to trim based on n_min, and apply the same trimming
       intensity to every quantile in that row.

    2) Minimum contributor quota: if n_min < min_quota, we fall back to the *median*
       aggregation for quantiles for that row (and skip isotonic repair).

    3) Monotone quantiles: when a full quantile vector exists (no NaNs) and the row
       did not fallback, we run isotonic regression to enforce:
           q10 <= q50 <= q90 <= ...
       This is a "repair" step only for monotonicity, not a modeling step.

    Notes
    -----
    - This ensemble tolerates missing quantile columns per model (point-only models).
    - When quantiles include 0.5, point forecast is set to the ensemble
      q50 for coherence.
    """

    def __init__(
        self,
        models: list[Forecaster],
        alias: str = "TrimmedEnsemble",
        min_quota: int = 2,
        trim_10p_threshold: int = 8,  # n_min >= this -> 10% trim
    ):
        self.tcf = TimeCopilotForecaster(models=models, fallback_model=None)
        self.alias = alias
        self.min_quota = int(min_quota)
        self.trim_10p_threshold = int(trim_10p_threshold)

    # ---------- trimming policy (fixed per row based on n_min) ----------

    def _trim_k_from_nmin(self, n_min: int) -> int:
        """
        Decide how many values to trim from each tail (k) given n_min contributors.

        Rule agreed:
          - n_min 3–4  -> trim 1 each side
          - n_min 5–7  -> trim 20%
          - n_min >= 8 -> trim 10% (simple fixed choice; avoids a "10–20%" ambiguity)

        Returns:
            k (int): how many to drop from each tail.
        """
        if n_min <= 2:
            return 0
        if n_min <= 4:
            return 1
        if n_min <= 7:
            return int(np.floor(0.20 * n_min))
        return int(np.floor(0.10 * n_min))

    @staticmethod
    def _trimmed_mean_1d(values: np.ndarray, k: int) -> float:
        """
        Trim k from each tail, then mean. Ignores NaNs.

        If trimming would remove everything (2k >= n), we fall back to plain mean.
        """
        x = values.astype(float)
        x = x[~np.isnan(x)]
        n = x.size
        if n == 0:
            return np.nan
        if k <= 0 or (2 * k) >= n:
            return float(np.mean(x))
        x.sort()
        return float(np.mean(x[k : n - k]))

    @staticmethod
    def _nanmedian_1d(values: np.ndarray) -> float:
        """Median ignoring NaNs; returns NaN if all values are NaN."""
        x = values.astype(float)
        return float(np.nanmedian(x)) if np.any(~np.isnan(x)) else np.nan

    # ---------- main API ----------

    def forecast(
        self,
        df: pd.DataFrame,
        h: int,
        freq: str | None = None,
        level: list[int | float] | None = None,
        quantiles: list[float] | None = None,
    ) -> pd.DataFrame:
        qc = QuantileConverter(level=level, quantiles=quantiles)

        # Call all models; merged output includes each model alias column (point),
        # and (if provided by the model) alias-q-{pct} columns for each quantile.
        _fcst_df = self.tcf._call_models(
            "forecast",
            merge_on=["unique_id", "ds"],
            df=df,
            h=h,
            freq=freq,
            level=None,
            quantiles=qc.quantiles,
        )

        fcst_df = _fcst_df[["unique_id", "ds"]].copy()
        model_cols = [m.alias for m in self.tcf.models]

        # Point forecast:
        #   Keep median for robustness (same as MedianEnsemble baseline).
        #   If q50 is requested later, we overwrite with ensemble q50 for coherence.
        fcst_df[self.alias] = _fcst_df[model_cols].median(axis=1)

        # No probabilistic output requested -> done.
        if qc.quantiles is None:
            return fcst_df

        # Quantile setup
        qs = sorted(qc.quantiles)
        q_cols = [f"{self.alias}-q-{int(q * 100)}" for q in qs]

        # Map pct -> existing per-model quantile columns (some may be missing)
        models_q_cols_map: dict[int, list[str]] = {}
        for q in qs:
            pct = int(q * 100)
            expected = [f"{alias}-q-{pct}" for alias in model_cols]
            models_q_cols_map[pct] = [c for c in expected if c in _fcst_df.columns]

        n_rows = len(_fcst_df)
        fallback_mask = np.zeros(n_rows, dtype=bool)
        k_by_row = np.zeros(n_rows, dtype=int)

        # Decide trimming ONCE per row:
        #   - Compute n_min = min contributors across requested
        #     quantiles (after NaN filtering)
        #   - If n_min < min_quota -> fallback to median for ALL quantiles in that row
        #   - Else compute k = trim_k_from_nmin(n_min)
        for i in range(n_rows):
            counts = []
            row_idx = _fcst_df.index[i]

            for q in qs:
                pct = int(q * 100)
                cols_here = models_q_cols_map[pct]
                if not cols_here:
                    counts.append(0)
                    continue
                vals = _fcst_df.loc[row_idx, cols_here].to_numpy(dtype=float)
                counts.append(int(np.sum(~np.isnan(vals))))

            n_min = int(min(counts)) if counts else 0

            if n_min < self.min_quota:
                fallback_mask[i] = True
                k_by_row[i] = 0
            else:
                k_by_row[i] = self._trim_k_from_nmin(n_min)

        # Aggregate quantiles (trimmed mean if not fallback; otherwise median)
        for q in qs:
            pct = int(q * 100)
            cols_here = models_q_cols_map[pct]
            out_col = f"{self.alias}-q-{pct}"

            if not cols_here:
                # Nobody produced this quantile column.
                fcst_df[out_col] = np.nan
                continue

            vals_mat = _fcst_df[cols_here].to_numpy(dtype=float)
            out = np.empty(n_rows, dtype=float)

            for i in range(n_rows):
                if fallback_mask[i]:
                    out[i] = self._nanmedian_1d(vals_mat[i])
                else:
                    out[i] = self._trimmed_mean_1d(vals_mat[i], k=int(k_by_row[i]))

            fcst_df[out_col] = out

        # Isotonic monotonicity repair:
        #   Only valid when:
        #     - row did NOT fallback, AND
        #     - all requested quantiles exist for that row
        #       (no NaNs in the ensemble quantiles)
        #   Otherwise: skip (do not "repair" partial/broken vectors).
        ir = IsotonicRegression(increasing=True)
        q_vals = fcst_df[q_cols].to_numpy(dtype=float)
        repaired = q_vals.copy()

        for i in range(n_rows):
            if fallback_mask[i]:
                continue
            if np.any(np.isnan(repaired[i])):
                continue
            repaired[i] = ir.fit_transform(qs, repaired[i])

        fcst_df[q_cols] = repaired

        # If q50 requested, make point forecast equal to median quantile output.
        if 0.5 in qc.quantiles:
            fcst_df[self.alias] = fcst_df[f"{self.alias}-q-50"].values

        # One-line disclosure if any fallback occurred.
        n_fallback = int(fallback_mask.sum())
        if n_fallback > 0:
            print(
                f"{self.alias}: quantiles fallback->median \
                for {n_fallback}/{n_rows} rows "
                f"(min_quota={self.min_quota}); isotonic \
                skipped on fallback/NaN rows."
            )

        # Convert quantiles to levels if user requested `level=...`
        fcst_df = qc.maybe_convert_quantiles_to_level(fcst_df, models=[self.alias])
        return fcst_df
Copilot
AI
Jan 8, 2026
The new TrimmedEnsemble class lacks test coverage. The repository has comprehensive tests for other models in tests/models/, and MedianEnsemble is tested via tests/models/conftest.py. The TrimmedEnsemble should have similar test coverage to verify its trimming logic, fallback behavior, and isotonic regression functionality.
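A minimal pytest-style sketch of what such coverage could look like (the helper below is a standalone stand-in re-implemented for illustration, mirroring `TrimmedEnsemble._trimmed_mean_1d` rather than importing it from the PR):

```python
import numpy as np

def trimmed_mean_1d(values, k):
    # Stand-in for TrimmedEnsemble._trimmed_mean_1d (hypothetical copy for testing).
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]
    n = x.size
    if n == 0:
        return np.nan
    if k <= 0 or (2 * k) >= n:
        return float(np.mean(x))
    x.sort()
    return float(np.mean(x[k : n - k]))

def test_trims_outliers():
    # One value trimmed from each tail removes the outlier 100.0.
    assert trimmed_mean_1d([1.0, 2.0, 3.0, 4.0, 100.0], k=1) == 3.0

def test_degenerate_trim_falls_back_to_mean():
    # 2k >= n -> plain mean instead of trimming everything away.
    assert trimmed_mean_1d([1.0, 2.0], k=1) == 1.5

def test_ignores_nans():
    assert trimmed_mean_1d([np.nan, 2.0, 4.0], k=0) == 3.0
```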
import numpy as np
import pandas as pd

sys.path.insert(0, "/home/el-cacomixtle/timecopilot/")
Copilot
AI
Jan 8, 2026
The sys.path.insert uses a hardcoded absolute path that is specific to a local development environment. This should be removed or replaced with a relative path or proper package installation. Other developers won't have this path on their systems.
-sys.path.insert(0, "/home/el-cacomixtle/timecopilot/")
+PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+if PROJECT_ROOT not in sys.path:
+    sys.path.insert(0, PROJECT_ROOT)
return 100 * np.mean(
    2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred) + 1e-8)
)
Copilot
AI
Jan 8, 2026
The SMAPE implementation adds a small epsilon (1e-8) to prevent division by zero, but adds it to the denominator rather than checking for zero division explicitly. When both y_true and y_pred are zero, the formula should ideally return 0 (perfect match), but this implementation would give a small non-zero value. Consider handling the zero case explicitly or documenting this behavior.
-return 100 * np.mean(
-    2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred) + 1e-8)
-)
+den = np.abs(y_true) + np.abs(y_pred)
+# When both y_true and y_pred are zero, define SMAPE contribution as 0
+mask = den == 0
+smape_vals = np.zeros_like(den, dtype=float)
+smape_vals[~mask] = (
+    2.0 * np.abs(y_pred[~mask] - y_true[~mask]) / den[~mask]
+)
+return 100.0 * np.mean(smape_vals)
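A standalone check of the suggested masked formulation (hypothetical inputs; the first element exercises the both-zero case, which contributes exactly 0 by definition):

```python
import numpy as np

y_true = np.array([0.0, 10.0])
y_pred = np.array([0.0, 12.0])  # first pair is a perfect zero forecast

den = np.abs(y_true) + np.abs(y_pred)
mask = den == 0  # both-zero pairs
smape_vals = np.zeros_like(den, dtype=float)
smape_vals[~mask] = 2.0 * np.abs(y_pred[~mask] - y_true[~mask]) / den[~mask]
smape = 100.0 * np.mean(smape_vals)

print(smape)  # the zero pair contributes 0; only the second pair adds error
```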
| f"{self.alias}: quantiles fallback->median \ | ||
| for {n_fallback}/{n_rows} rows " | ||
| f"(min_quota={self.min_quota}); isotonic \ | ||
| skipped on fallback/NaN rows." |
Copilot
AI
Jan 8, 2026
The string contains line continuation backslashes that will include unwanted whitespace in the printed message. The backslashes and indentation will appear in the output. Consider using parentheses to wrap the string across multiple lines, or use a single line.
| f"{self.alias}: quantiles fallback->median \ | |
| for {n_fallback}/{n_rows} rows " | |
| f"(min_quota={self.min_quota}); isotonic \ | |
| skipped on fallback/NaN rows." | |
| ( | |
| f"{self.alias}: quantiles fallback->median " | |
| f"for {n_fallback}/{n_rows} rows " | |
| f"(min_quota={self.min_quota}); isotonic " | |
| f"skipped on fallback/NaN rows." | |
| ) |
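A quick standalone illustration of why this matters (hypothetical message text): the backslash continuation keeps the next source line's indentation inside the string, while adjacent string literals concatenate with exactly the whitespace you write.

```python
# Backslash continuation inside the literal: the indentation leaks into the string.
backslash_msg = "fallback->median \
    for 3/10 rows"

# Implicit concatenation of adjacent literals: no stray whitespace.
clean_msg = (
    "fallback->median "
    "for 3/10 rows"
)

print(repr(backslash_msg))
print(repr(clean_msg))
```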
def __init__(
    self,
    models: list[Forecaster],
    alias: str = "TrimmedEnsemble",
    min_quota: int = 2,
    trim_10p_threshold: int = 8,  # n_min >= this -> 10% trim
):
    self.tcf = TimeCopilotForecaster(models=models, fallback_model=None)
    self.alias = alias
    self.min_quota = int(min_quota)
    self.trim_10p_threshold = int(trim_10p_threshold)
Copilot
AI
Jan 8, 2026
Missing comprehensive documentation for the TrimmedEnsemble class. The MedianEnsemble class has an extensive docstring with usage examples, args documentation, and a full example. The TrimmedEnsemble should have similar documentation including a usage example and proper documentation of all parameters (min_quota, trim_10p_threshold).
    return 0
if n_min <= 4:
    return 1
if n_min <= 7:
Copilot
AI
Jan 8, 2026
There's a potential logic issue with the trimming policy. The docstring says "n_min >= 8 -> trim 10%" and the parameter is called trim_10p_threshold with default value 8, but the condition uses if n_min <= 7 instead of if n_min < self.trim_10p_threshold. This makes the threshold parameter unused and the logic inflexible. Consider using if n_min < self.trim_10p_threshold to make the threshold configurable as intended.
-if n_min <= 7:
+if n_min < self.trim_10p_threshold:
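A standalone sketch of the suggested change with the threshold actually wired in (a hypothetical free function mirroring `_trim_k_from_nmin`, not the PR code itself):

```python
import numpy as np

def trim_k(n_min: int, trim_10p_threshold: int = 8) -> int:
    # Same rule as the PR, but the 10%-regime boundary is now configurable.
    if n_min <= 2:
        return 0
    if n_min <= 4:
        return 1
    if n_min < trim_10p_threshold:
        return int(np.floor(0.20 * n_min))
    return int(np.floor(0.10 * n_min))

# Default threshold reproduces the hardcoded `n_min <= 7` behavior...
print([trim_k(n) for n in range(2, 11)])
# ...while a custom threshold changes where the 10% regime starts.
print([trim_k(n, trim_10p_threshold=10) for n in range(2, 11)])
```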
added suggestion made by Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
committed print suggestion Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
added Copilot's suggestion to use more accurate wording in percentage displays Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
improved the lambda used in the melt step Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… ensemble logic to curtail fewer values
… ensemble logic to curtail fewer values. Added information in the comments of the first-run script to explain to users what the experiment is about.
Done!
Hello!
I just created an ensemble proposal to provide an alternative to the median ensemble used in the GiftEval testing.
This PR contains two scripts. One is the proposal for the trimmed-average ensemble. The ensemble drops a percentage of the smallest and largest values in an ordered set of forecasts and then takes the mean of the remaining values. This percentage is sensitive to the number of forecasts provided. There are two caveats to consider:
1.- The current GiftEval execution involves 3 models, so this alternative will not make any difference in the result emission and would have to consider more models (at least 2) to see if there is any meaningful difference in the results provided by the ensembles.
2.- This commit does not include a script to try this, because of current hardware limitations and because of point number 1. This should be discussed beforehand in order to set up a future implementation if it is determined viable and worthy of further interest.
There is, however, an ensembles first-run script that works with monthly data as a check that the current trimmed ensemble is working, though results are identical in the smape metric for 50 monthly timeseries adjustments.