Ensemble proposal #281
Conversation
This tutorial contains an example with multiple indexes and subsequent tampering of the data, in order to show resiliency and a real-life use case of TC applied to cryptocurrency prices up to 2021.
Fix for the issue raised on the empty dataframe, which resulted from an inner merge where the existing dataframe that accumulated results and the newer one had different indices. A follow-up issue should be raised to either: (1) report a single-model failure (on index matching), or (2) fix the moirai discrepancy (the only model that showed this issue).
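A minimal pandas sketch of the failure mode described above (hypothetical frame and column names; the outer merge is shown only as one possible mitigation, not as the fix adopted in this PR):

```python
import pandas as pd

# Accumulated results and a new model's output that disagree on (unique_id, ds) keys.
acc = pd.DataFrame({"unique_id": ["A", "A"], "ds": [1, 2], "model_x": [10.0, 11.0]})
new = pd.DataFrame({"unique_id": ["A", "A"], "ds": [3, 4], "model_y": [9.0, 12.0]})

# Inner merge drops every row because no keys overlap -> empty dataframe.
inner = acc.merge(new, on=["unique_id", "ds"], how="inner")

# An outer merge keeps all rows, with NaN where a model produced nothing.
outer = acc.merge(new, on=["unique_id", "ds"], how="outer")

print(len(inner), len(outer))
```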
… the title suggested
…age ensemble. This alternative is proposed as another idea for the value determined by TimeCopilot itself as to what value to provide to the user. There are two caveats to consider. 1.- The current GiftEval execution involves 3 models, so this alternative will not make any difference in the result emission and would have to consider more models (at least 2) to see if there is any meaningful difference in the results provided by the ensembles. 2.- This commit does not include a script to try this, because of current hardware limitations and because of point number 1. This should be discussed beforehand in order to set up a future implementation if it is determined viable and worthy of further interest. There is, however, an ensembles first-run script that works with monthly data as a check that the current trimmed ensemble is working, though results are identical in the smape metric for 50 timeseries adjustments.
AzulGarza
left a comment
thanks so much for working on this, @elmartinj! would it be possible to fix the conflicts with the main branch?
Done!
AzulGarza
left a comment
thanks @elmartinj! i left a couple of initial comments. i'm also adding copilot as reviewer to have its comments.
import numpy as np
import pandas as pd

sys.path.insert(0, "/home/el-cacomixtle/timecopilot/")
love the el cacomixtle name haha but the line can cause some issues to users.
pretty cool experiment! can we create a new dir for it and also add a README inside with reproduction instructions? other experiments in the dir can serve as inspiration.
also, to standardize evaluation and data extraction, wdyt about using losses from utilsforecast and data from datasetsforecast? the latter includes test-train splits for M4.
| """ | ||
| TrimmedEnsemble (alternate ensemble to MedianEnsemble) | ||
|
|
||
| Purpose | ||
| ------- | ||
| A robust ensemble that aggregates model forecasts using a | ||
| *trimmed mean* for quantiles | ||
| (and optionally for point forecasts), with safety rails: | ||
|
|
||
| 1) Fixed trimming per row (unique_id, ds): we first compute the | ||
| *minimum* number of | ||
| available contributors across all requested quantiles for that row (n_min). | ||
| Then we decide how much to trim based on n_min, and apply the same trimming | ||
| intensity to every quantile in that row. | ||
|
|
||
| 2) Minimum contributor quota: if n_min < min_quota, we fall back to the *median* | ||
| aggregation for quantiles for that row (and skip isotonic repair). | ||
|
|
||
| 3) Monotone quantiles: when a full quantile vector exists (no NaNs) and the row | ||
| did not fallback, we run isotonic regression to enforce: | ||
| q10 <= q50 <= q90 <= ... | ||
| This is a "repair" step only for monotonicity, not a modeling step. | ||
|
|
||
| Notes | ||
| ----- | ||
| - This ensemble tolerates missing quantile columns per model (point-only models). |
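As context for point 3 of the docstring: on a fixed quantile grid, this kind of isotonic repair is the classic pool-adjacent-violators (PAV) projection. A numpy-only sketch of the idea (not the PR's sklearn-based code; `pav_increasing` is a hypothetical name):

```python
import numpy as np

def pav_increasing(y):
    """Project y onto non-decreasing sequences (L2) via pool-adjacent-violators."""
    y = np.asarray(y, dtype=float)
    vals, sizes = [], []  # block means and block sizes
    for v in y:
        vals.append(float(v))
        sizes.append(1)
        # Merge backwards while adjacent blocks violate the ordering.
        while len(vals) > 1 and vals[-2] > vals[-1]:
            v2, n2 = vals.pop(), sizes.pop()
            v1, n1 = vals.pop(), sizes.pop()
            vals.append((v1 * n1 + v2 * n2) / (n1 + n2))
            sizes.append(n1 + n2)
    return np.concatenate([np.full(n, v) for v, n in zip(vals, sizes)])

# A crossed quantile vector (q50 above q90) gets pooled into a monotone one.
print(pav_increasing([1.0, 3.0, 2.0]))
```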
nice, would it be possible to use google's docstrings format? here is an example from our docs.
also documenting the arguments would be very beneficial for our users.
Pull request overview
This PR introduces a TrimmedEnsemble forecasting model as an alternative to the existing MedianEnsemble. The trimmed mean approach drops extreme values (smallest and largest forecasts) before averaging, with the trimming percentage adapting based on the number of contributing models. The PR includes both the implementation and an experimental script to validate the new ensemble method on M4 monthly data.
- Adds TrimmedEnsemble class with adaptive trimming logic (trim 1, 20%, or 10% based on contributor count)
- Implements fallback to median aggregation when insufficient contributors are available
- Includes isotonic regression for quantile monotonicity enforcement
- Provides experiment script for comparing MedianEnsemble vs TrimmedEnsemble on 50 M4 monthly series
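The adaptive trimming summarized in the first bullet can be sketched as follows (a standalone illustration with hypothetical names mirroring the PR's rule, not the PR code itself):

```python
import numpy as np

def trim_k(n_min: int) -> int:
    # Adaptive rule as summarized above: trim 1, 20%, or 10% per tail
    # depending on how many models contributed.
    if n_min <= 2:
        return 0
    if n_min <= 4:
        return 1
    if n_min <= 7:
        return int(np.floor(0.20 * n_min))
    return int(np.floor(0.10 * n_min))

def trimmed_mean(values, k):
    # Drop k smallest and k largest, then average; fall back to a plain
    # mean if trimming would remove everything.
    x = np.sort(np.asarray(values, dtype=float))
    if k <= 0 or 2 * k >= x.size:
        return float(x.mean())
    return float(x[k:-k].mean())

forecasts = [100.0, 102.0, 103.0, 104.0, 180.0]  # one outlier among 5 models
k = trim_k(len(forecasts))           # 5 contributors -> trim floor(0.2 * 5) = 1 per tail
print(trimmed_mean(forecasts, k))    # 103.0, vs. a plain mean of 117.8
```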
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 16 comments.
| File | Description |
|---|---|
| timecopilot/models/ensembles/trimmed.py | New TrimmedEnsemble class implementing adaptive trimmed mean aggregation with quantile support and isotonic regression for monotonicity |
| experiments/ensembles-first-run.py | Experimental script to compare MedianEnsemble and TrimmedEnsemble performance on M4 monthly time series data |
class TrimmedEnsemble(Forecaster):
    """
    TrimmedEnsemble (alternate ensemble to MedianEnsemble)

    Purpose
    -------
    A robust ensemble that aggregates model forecasts using a
    *trimmed mean* for quantiles
    (and optionally for point forecasts), with safety rails:

    1) Fixed trimming per row (unique_id, ds): we first compute the
       *minimum* number of
       available contributors across all requested quantiles for that row (n_min).
       Then we decide how much to trim based on n_min, and apply the same trimming
       intensity to every quantile in that row.

    2) Minimum contributor quota: if n_min < min_quota, we fall back to the *median*
       aggregation for quantiles for that row (and skip isotonic repair).

    3) Monotone quantiles: when a full quantile vector exists (no NaNs) and the row
       did not fallback, we run isotonic regression to enforce:
           q10 <= q50 <= q90 <= ...
       This is a "repair" step only for monotonicity, not a modeling step.

    Notes
    -----
    - This ensemble tolerates missing quantile columns per model (point-only models).
    - When quantiles include 0.5, point forecast is set to the ensemble
      q50 for coherence.
    """

    def __init__(
        self,
        models: list[Forecaster],
        alias: str = "TrimmedEnsemble",
        min_quota: int = 2,
        trim_10p_threshold: int = 8,  # n_min >= this -> 10% trim
    ):
        self.tcf = TimeCopilotForecaster(models=models, fallback_model=None)
        self.alias = alias
        self.min_quota = int(min_quota)
        self.trim_10p_threshold = int(trim_10p_threshold)

    # ---------- trimming policy (fixed per row based on n_min) ----------

    def _trim_k_from_nmin(self, n_min: int) -> int:
        """
        Decide how many values to trim from each tail (k) given n_min contributors.

        Rule agreed:
          - n_min 3–4  -> trim 1 each side
          - n_min 5–7  -> trim 20%
          - n_min >= 8 -> trim 10% (simple fixed choice; avoids a "10–20%" ambiguity)

        Returns:
            k (int): how many to drop from each tail.
        """
        if n_min <= 2:
            return 0
        if n_min <= 4:
            return 1
        if n_min <= 7:
            return int(np.floor(0.20 * n_min))
        return int(np.floor(0.10 * n_min))

    @staticmethod
    def _trimmed_mean_1d(values: np.ndarray, k: int) -> float:
        """
        Trim k from each tail, then mean. Ignores NaNs.

        If trimming would remove everything (2k >= n), we fall back to plain mean.
        """
        x = values.astype(float)
        x = x[~np.isnan(x)]
        n = x.size
        if n == 0:
            return np.nan
        if k <= 0 or (2 * k) >= n:
            return float(np.mean(x))
        x.sort()
        return float(np.mean(x[k : n - k]))

    @staticmethod
    def _nanmedian_1d(values: np.ndarray) -> float:
        """Median ignoring NaNs; returns NaN if all values are NaN."""
        x = values.astype(float)
        return float(np.nanmedian(x)) if np.any(~np.isnan(x)) else np.nan

    # ---------- main API ----------

    def forecast(
        self,
        df: pd.DataFrame,
        h: int,
        freq: str | None = None,
        level: list[int | float] | None = None,
        quantiles: list[float] | None = None,
    ) -> pd.DataFrame:
        qc = QuantileConverter(level=level, quantiles=quantiles)

        # Call all models; merged output includes each model alias column (point),
        # and (if provided by the model) alias-q-{pct} columns for each quantile.
        _fcst_df = self.tcf._call_models(
            "forecast",
            merge_on=["unique_id", "ds"],
            df=df,
            h=h,
            freq=freq,
            level=None,
            quantiles=qc.quantiles,
        )

        fcst_df = _fcst_df[["unique_id", "ds"]].copy()
        model_cols = [m.alias for m in self.tcf.models]

        # Point forecast:
        #   Keep median for robustness (same as MedianEnsemble baseline).
        #   If q50 is requested later, we overwrite with ensemble q50 for coherence.
        fcst_df[self.alias] = _fcst_df[model_cols].median(axis=1)

        # No probabilistic output requested -> done.
        if qc.quantiles is None:
            return fcst_df

        # Quantile setup
        qs = sorted(qc.quantiles)
        q_cols = [f"{self.alias}-q-{int(q * 100)}" for q in qs]

        # Map pct -> existing per-model quantile columns (some may be missing)
        models_q_cols_map: dict[int, list[str]] = {}
        for q in qs:
            pct = int(q * 100)
            expected = [f"{alias}-q-{pct}" for alias in model_cols]
            models_q_cols_map[pct] = [c for c in expected if c in _fcst_df.columns]

        n_rows = len(_fcst_df)
        fallback_mask = np.zeros(n_rows, dtype=bool)
        k_by_row = np.zeros(n_rows, dtype=int)

        # Decide trimming ONCE per row:
        #   - Compute n_min = min contributors across requested
        #     quantiles (after NaN filtering)
        #   - If n_min < min_quota -> fallback to median for ALL quantiles in that row
        #   - Else compute k = trim_k_from_nmin(n_min)
        for i in range(n_rows):
            counts = []
            row_idx = _fcst_df.index[i]

            for q in qs:
                pct = int(q * 100)
                cols_here = models_q_cols_map[pct]
                if not cols_here:
                    counts.append(0)
                    continue
                vals = _fcst_df.loc[row_idx, cols_here].to_numpy(dtype=float)
                counts.append(int(np.sum(~np.isnan(vals))))

            n_min = int(min(counts)) if counts else 0

            if n_min < self.min_quota:
                fallback_mask[i] = True
                k_by_row[i] = 0
            else:
                k_by_row[i] = self._trim_k_from_nmin(n_min)

        # Aggregate quantiles (trimmed mean if not fallback; otherwise median)
        for q in qs:
            pct = int(q * 100)
            cols_here = models_q_cols_map[pct]
            out_col = f"{self.alias}-q-{pct}"

            if not cols_here:
                # Nobody produced this quantile column.
                fcst_df[out_col] = np.nan
                continue

            vals_mat = _fcst_df[cols_here].to_numpy(dtype=float)
            out = np.empty(n_rows, dtype=float)

            for i in range(n_rows):
                if fallback_mask[i]:
                    out[i] = self._nanmedian_1d(vals_mat[i])
                else:
                    out[i] = self._trimmed_mean_1d(vals_mat[i], k=int(k_by_row[i]))

            fcst_df[out_col] = out

        # Isotonic monotonicity repair:
        #   Only valid when:
        #     - row did NOT fallback, AND
        #     - all requested quantiles exist for that row
        #       (no NaNs in the ensemble quantiles)
        #   Otherwise: skip (do not "repair" partial/broken vectors).
        ir = IsotonicRegression(increasing=True)
        q_vals = fcst_df[q_cols].to_numpy(dtype=float)
        repaired = q_vals.copy()

        for i in range(n_rows):
            if fallback_mask[i]:
                continue
            if np.any(np.isnan(repaired[i])):
                continue
            repaired[i] = ir.fit_transform(qs, repaired[i])

        fcst_df[q_cols] = repaired

        # If q50 requested, make point forecast equal to median quantile output.
        if 0.5 in qc.quantiles:
            fcst_df[self.alias] = fcst_df[f"{self.alias}-q-50"].values

        # One-line disclosure if any fallback occurred.
        n_fallback = int(fallback_mask.sum())
        if n_fallback > 0:
            print(
                f"{self.alias}: quantiles fallback->median \
                for {n_fallback}/{n_rows} rows "
                f"(min_quota={self.min_quota}); isotonic \
                skipped on fallback/NaN rows."
            )

        # Convert quantiles to levels if user requested `level=...`
        fcst_df = qc.maybe_convert_quantiles_to_level(fcst_df, models=[self.alias])
        return fcst_df
Copilot
AI
Jan 8, 2026
The new TrimmedEnsemble class lacks test coverage. The repository has comprehensive tests for other models in tests/models/, and MedianEnsemble is tested via tests/models/conftest.py. The TrimmedEnsemble should have similar test coverage to verify its trimming logic, fallback behavior, and isotonic regression functionality.
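A minimal pytest-style sketch of what such coverage could look like (the helper below is a standalone stand-in re-implemented for illustration, mirroring `TrimmedEnsemble._trimmed_mean_1d` rather than importing it from the PR):

```python
import numpy as np

def trimmed_mean_1d(values, k):
    # Stand-in for TrimmedEnsemble._trimmed_mean_1d (hypothetical copy for testing).
    x = np.asarray(values, dtype=float)
    x = x[~np.isnan(x)]
    n = x.size
    if n == 0:
        return np.nan
    if k <= 0 or (2 * k) >= n:
        return float(np.mean(x))
    x.sort()
    return float(np.mean(x[k : n - k]))

def test_trims_outliers():
    # One value trimmed from each tail removes the outlier 100.0.
    assert trimmed_mean_1d([1.0, 2.0, 3.0, 4.0, 100.0], k=1) == 3.0

def test_degenerate_trim_falls_back_to_mean():
    # 2k >= n -> plain mean instead of trimming everything away.
    assert trimmed_mean_1d([1.0, 2.0], k=1) == 1.5

def test_ignores_nans():
    assert trimmed_mean_1d([np.nan, 2.0, 4.0], k=0) == 3.0
```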
import numpy as np
import pandas as pd

sys.path.insert(0, "/home/el-cacomixtle/timecopilot/")
Copilot
AI
Jan 8, 2026
The sys.path.insert uses a hardcoded absolute path that is specific to a local development environment. This should be removed or replaced with a relative path or proper package installation. Other developers won't have this path on their systems.
-sys.path.insert(0, "/home/el-cacomixtle/timecopilot/")
+PROJECT_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), ".."))
+if PROJECT_ROOT not in sys.path:
+    sys.path.insert(0, PROJECT_ROOT)
return 100 * np.mean(
    2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred) + 1e-8)
)
Copilot
AI
Jan 8, 2026
The SMAPE implementation adds a small epsilon (1e-8) to prevent division by zero, but adds it to the denominator rather than checking for zero division explicitly. When both y_true and y_pred are zero, the formula should ideally return 0 (perfect match), but this implementation would give a small non-zero value. Consider handling the zero case explicitly or documenting this behavior.
-return 100 * np.mean(
-    2 * np.abs(y_pred - y_true) / (np.abs(y_true) + np.abs(y_pred) + 1e-8)
-)
+den = np.abs(y_true) + np.abs(y_pred)
+# When both y_true and y_pred are zero, define SMAPE contribution as 0
+mask = den == 0
+smape_vals = np.zeros_like(den, dtype=float)
+smape_vals[~mask] = (
+    2.0 * np.abs(y_pred[~mask] - y_true[~mask]) / den[~mask]
+)
+return 100.0 * np.mean(smape_vals)
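A standalone check of the suggested masked formulation (hypothetical inputs; the first element exercises the both-zero case, which contributes exactly 0 by definition):

```python
import numpy as np

y_true = np.array([0.0, 10.0])
y_pred = np.array([0.0, 12.0])  # first pair is a perfect zero forecast

den = np.abs(y_true) + np.abs(y_pred)
mask = den == 0  # both-zero pairs
smape_vals = np.zeros_like(den, dtype=float)
smape_vals[~mask] = 2.0 * np.abs(y_pred[~mask] - y_true[~mask]) / den[~mask]
smape = 100.0 * np.mean(smape_vals)

print(smape)  # the zero pair contributes 0; only the second pair adds error
```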
| f"{self.alias}: quantiles fallback->median \ | ||
| for {n_fallback}/{n_rows} rows " | ||
| f"(min_quota={self.min_quota}); isotonic \ | ||
| skipped on fallback/NaN rows." |
Copilot
AI
Jan 8, 2026
The string contains line continuation backslashes that will include unwanted whitespace in the printed message. The backslashes and indentation will appear in the output. Consider using parentheses to wrap the string across multiple lines, or use a single line.
| f"{self.alias}: quantiles fallback->median \ | |
| for {n_fallback}/{n_rows} rows " | |
| f"(min_quota={self.min_quota}); isotonic \ | |
| skipped on fallback/NaN rows." | |
| ( | |
| f"{self.alias}: quantiles fallback->median " | |
| f"for {n_fallback}/{n_rows} rows " | |
| f"(min_quota={self.min_quota}); isotonic " | |
| f"skipped on fallback/NaN rows." | |
| ) |
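A quick standalone illustration of why this matters (hypothetical message text): the backslash continuation keeps the next source line's indentation inside the string, while adjacent string literals concatenate with exactly the whitespace you write.

```python
# Backslash continuation inside the literal: the indentation leaks into the string.
backslash_msg = "fallback->median \
    for 3/10 rows"

# Implicit concatenation of adjacent literals: no stray whitespace.
clean_msg = (
    "fallback->median "
    "for 3/10 rows"
)

print(repr(backslash_msg))
print(repr(clean_msg))
```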
def __init__(
    self,
    models: list[Forecaster],
    alias: str = "TrimmedEnsemble",
    min_quota: int = 2,
    trim_10p_threshold: int = 8,  # n_min >= this -> 10% trim
):
    self.tcf = TimeCopilotForecaster(models=models, fallback_model=None)
    self.alias = alias
    self.min_quota = int(min_quota)
    self.trim_10p_threshold = int(trim_10p_threshold)
Copilot
AI
Jan 8, 2026
Missing comprehensive documentation for the TrimmedEnsemble class. The MedianEnsemble class has an extensive docstring with usage examples, args documentation, and a full example. The TrimmedEnsemble should have similar documentation including a usage example and proper documentation of all parameters (min_quota, trim_10p_threshold).
    return 0
if n_min <= 4:
    return 1
if n_min <= 7:
Copilot
AI
Jan 8, 2026
There's a potential logic issue with the trimming policy. The docstring says "n_min >= 8 -> trim 10%" and the parameter is called trim_10p_threshold with default value 8, but the condition uses if n_min <= 7 instead of if n_min < self.trim_10p_threshold. This makes the threshold parameter unused and the logic inflexible. Consider using if n_min < self.trim_10p_threshold to make the threshold configurable as intended.
-if n_min <= 7:
+if n_min < self.trim_10p_threshold:
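A standalone sketch of the suggested change with the threshold actually wired in (a hypothetical free function mirroring `_trim_k_from_nmin`, not the PR code itself):

```python
import numpy as np

def trim_k(n_min: int, trim_10p_threshold: int = 8) -> int:
    # Same rule as the PR, but the 10%-regime boundary is now configurable.
    if n_min <= 2:
        return 0
    if n_min <= 4:
        return 1
    if n_min < trim_10p_threshold:
        return int(np.floor(0.20 * n_min))
    return int(np.floor(0.10 * n_min))

# Default threshold reproduces the hardcoded `n_min <= 7` behavior...
print([trim_k(n) for n in range(2, 11)])
# ...while a custom threshold changes where the 10% regime starts.
print([trim_k(n, trim_10p_threshold=10) for n in range(2, 11)])
```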
added suggestion made by Copilot Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
committed print suggestion Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
added Copilot's suggestion to use more accurate wording in percentage displays Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
improved the lambda used in the melt step Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
… ensemble logic to curtail fewer values
… ensemble logic to curtail fewer values. Added information in the comments of the first-run script to explain to users what the experiment is about.
Done!
Hello!
I just created an ensemble proposal to provide an alternative to the median ensemble used in the GiftEval testing.
This PR contains two scripts. One is the proposal for the trimmed-average ensemble. The ensemble drops a percentage of the smallest and largest values in an ordered set of forecasts and then takes the mean of the remaining values. This percentage is sensitive to the number of forecasts provided. There are two caveats to consider:
1.- The current GiftEval execution involves 3 models, so this alternative will not make any difference in the result emission and would have to consider more models (at least 2) to see if there is any meaningful difference in the results provided by the ensembles.
2.- This commit does not include a script to try this, because of current hardware limitations and because of point number 1. This should be discussed beforehand in order to set up a future implementation if it is determined viable and worthy of further interest.
There is, however, an ensembles first-run script that works with monthly data as a check that the current trimmed ensemble is working, though results are identical in the smape metric for 50 monthly timeseries adjustments.