Current released version: v1.0.2
Next planned version: v1.1.0
This document defines the implementation path from v1.0.2 forward.
Completed v1.0.1 work is retained below as release history.
It is written to be used directly by Codex for one-task-at-a-time execution.
- Treat each task below as one Codex task and one pull request.
- For every task:
- add focused regression tests first or alongside the fix
- run `python -m pytest -q`
- update public docs if behavior changes
- Do not combine tasks across releases until the earlier release is green and tagged.
- Prefer backward-compatible changes within patch releases.
- If a task changes file formats, CLI behavior, or public API semantics, document the migration in `README.md` and the relevant docs page.
Status: completed in the current codebase and covered by regression tests.
Eliminate silent schema corruption during ingest and make public CSV-loading behavior consistent across API and CLI.
Primary files
`metdatapy/mapper.py`, `tests/test_mapper.py`
Implementation
- Update `Detector.detect()` in `metdatapy/mapper.py`.
- Only emit a canonical field when there is real evidence for it.
- Require at least one of:
- a positive column-name pattern match
- a positive unit hint
- Add a minimum confidence threshold before a field is accepted.
- Prevent the same source column from being assigned to multiple canonical fields by default.
- If multiple canonical fields compete for the same source column, keep the best-supported mapping and drop the weaker ones.
- If evidence is weak or absent, omit the canonical field entirely instead of guessing.
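The gating rules above can be sketched as follows; the pattern table, evidence weights, and the 0.5 threshold are illustrative assumptions, not metdatapy's actual detector internals:

```python
import re

# Hypothetical evidence-gated detection sketch; names and weights are
# illustrative, not metdatapy's real Detector implementation.
PATTERNS = {
    "temp_c": re.compile(r"temp", re.I),
    "rh_pct": re.compile(r"humid|rh\b", re.I),
    "wspd_ms": re.compile(r"wind.?speed|wspd", re.I),
}
UNIT_HINTS = {"temp_c": {"c", "degc"}, "rh_pct": {"%"}, "wspd_ms": {"m/s"}}
MIN_CONFIDENCE = 0.5

def detect(columns, units=None):
    """Map source columns to canonical fields only when evidence exists."""
    units = units or {}
    candidates = []  # (confidence, canonical, source)
    for source in columns:
        for canonical, pattern in PATTERNS.items():
            score = 0.0
            if pattern.search(source):                    # positive name-pattern match
                score += 0.6
            if units.get(source, "").lower() in UNIT_HINTS[canonical]:
                score += 0.4                              # positive unit hint
            if score >= MIN_CONFIDENCE:                   # threshold gate: no guessing
                candidates.append((score, canonical, source))
    mapping, used_sources = {}, set()
    # Best-supported mapping wins; each source column is assigned at most once.
    for score, canonical, source in sorted(candidates, reverse=True):
        if canonical in mapping or source in used_sources:
            continue
        mapping[canonical] = {"source": source, "confidence": score}
        used_sources.add(source)
    return mapping
```

With only `timestamp`, `temperature`, `humidity`, and `pressure` as inputs, no wind, rain, solar, or UV field is ever emitted, because no evidence clears the gate.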
Acceptance criteria
- A dataset with only `timestamp`, `temperature`, `humidity`, and `pressure` does not invent mappings for: `wspd_ms`, `wdir_deg`, `gust_ms`, `rain_mm`, `solar_wm2`, `uv_index`
- Existing obvious mappings like temperature and humidity still detect correctly.
- Confidence values remain present for accepted mappings.
Tests
- Add regressions to `tests/test_mapper.py` showing:
  - absent variables are not fabricated
  - one source column is not reused for multiple canonical outputs
  - normal cases still work
Primary files
`metdatapy/io.py`, `metdatapy/core.py`, `metdatapy/mapper.py`, `metdatapy/cli.py`, `tests/test_encoding.py`, `tests/test_core.py`, `tests/test_cli.py`
Implementation
- Create one shared internal CSV reader in `metdatapy/io.py` that:
  - detects encoding
  - reads CSV with `encoding_errors="replace"`
  - optionally parses a timestamp column
- Route all CSV reads through it:
  - `WeatherSet.from_csv()` in `metdatapy/core.py`
  - `Detector.detect_from_csv()` in `metdatapy/mapper.py`
  - ingest commands in `metdatapy/cli.py`
- Keep behavior consistent between Python API and CLI.
Acceptance criteria
- UTF-16, CP1252, and Latin-1 files behave consistently in:
- Python API
- detector path
- CLI ingest commands
- Previously supported UTF-8 cases continue to work unchanged.
Tests
- Extend `tests/test_encoding.py` for shared-loader coverage.
- Add regressions in `tests/test_core.py` for `WeatherSet.from_csv()` with non-UTF-8 input.
- Add regressions in `tests/test_cli.py` for `mdp ingest detect` and `mdp ingest apply` on non-UTF-8 input.
Primary files
`metdatapy/cli.py`, `metdatapy/manifest.py`, `tests/test_cli.py`, `tests/test_manifest.py`
Implementation
- In `metdatapy/cli.py`, make `mdp ingest template` emit YAML mapping content when writing mapping files.
- Reuse `Mapper.save()` instead of writing JSON into `.yml` files.
- In `metdatapy/manifest.py`, fix `Manifest.validate_reproducibility()` so scaler comparison is correct:
  - if only one manifest has a scaler, `same_scaler` must be `False`
  - if both have scalers, compare method, columns, and parameters
- In `ManifestBuilder.set_dataset_info()`, validate the index type and fail with a clear error on non-datetime indexes.
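The corrected scaler comparison reduces to a three-way presence check. The manifest layout assumed here (a `scaler` key holding `method`/`columns`/`params`) is illustrative:

```python
# Sketch of the corrected scaler comparison; the manifest dict layout is
# an assumption, not metdatapy's actual Manifest schema.
def compare_scalers(manifest_a: dict, manifest_b: dict) -> bool:
    a, b = manifest_a.get("scaler"), manifest_b.get("scaler")
    if a is None and b is None:
        return True                # neither run scaled: compatible
    if a is None or b is None:
        return False               # presence mismatch must never compare equal
    return (
        a.get("method") == b.get("method")
        and a.get("columns") == b.get("columns")
        and a.get("params") == b.get("params")
    )
```

The presence-mismatch branch is the bug fix: previously a missing scaler on one side could still report compatibility.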
Acceptance criteria
- Template files written via CLI are valid YAML mappings.
- Manifest comparison does not report false compatibility for scaler presence mismatches.
- Non-datetime indexed input to `ManifestBuilder.set_dataset_info()` fails cleanly with an actionable message.
Tests
- Add CLI template-output assertions in `tests/test_cli.py`.
- Add manifest scaler mismatch and non-datetime-index regressions in `tests/test_manifest.py`.
Remove false negatives and false positives in missing-row handling and QC flagging.
Primary files
`metdatapy/core.py`, `metdatapy/utils.py`, `tests/test_core.py`, `tests/test_integration.py`
Implementation
- Update `WeatherSet.insert_missing()` in `metdatapy/core.py`.
- Replace raw `pd.infer_freq()` usage with `metdatapy.utils.infer_frequency()`.
- Normalize deprecated frequency aliases consistently.
- Only skip reindexing when no usable frequency can be derived.
- Preserve existing `gap` semantics.
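A gap-tolerant inference helper, shaped like the proposed `metdatapy.utils.infer_frequency()`, might fall back to the modal spacing when `pd.infer_freq()` gives up; the exact tie-breaking rule here is an assumption:

```python
import pandas as pd

# Sketch of gap-tolerant frequency inference; pd.infer_freq() alone
# returns None as soon as a single row is missing.
def infer_frequency(index: pd.DatetimeIndex):
    if len(index) >= 3:
        freq = pd.infer_freq(index)     # fast path: perfectly regular series
        if freq is not None:
            return freq
    deltas = pd.Series(index).diff().dropna()
    if deltas.empty:
        return None                     # no usable cadence: caller skips reindexing
    # Most common spacing wins; ties resolve to the smallest delta.
    return pd.tseries.frequencies.to_offset(deltas.mode().iloc[0]).freqstr
```

For `00:00, 02:00, 03:00` the deltas are 2 h and 1 h, the smaller modal delta wins, and reindexing with the derived hourly frequency inserts the missing `01:00` row.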
Acceptance criteria
- A series like `00:00, 02:00, 03:00` can infer an hourly cadence and insert the missing `01:00` row.
- Existing no-gap regular series remain unchanged.
- Existing explicit `frequency=` behavior remains supported.
Tests
- Add regressions in `tests/test_core.py` for a mostly regular hourly series with one missing timestamp.
- Add end-to-end coverage in `tests/test_integration.py`.
Primary files
`metdatapy/qc.py`, `tests/test_qc.py`
Implementation
- Update `qc_flatline()` in `metdatapy/qc.py`.
- Do not convert `NaN` rolling variance to `0.0`.
- Only flag flatlines when the rolling window has enough valid observations and variance is genuinely below tolerance.
- Preserve existing behavior for true flatline sequences.
Acceptance criteria
- Short series are not automatically marked as flatline.
- Windows dominated by missing values are not marked as flatline by default.
- True constant windows still flag correctly.
Tests
- Add regressions in `tests/test_qc.py` for:
  - short series
  - windows with NaNs
  - true flatlines
Primary files
`metdatapy/core.py`, `tests/test_core.py`
Implementation
- Replace the `try/except: pass` in `WeatherSet.resample()` around `qc_*` propagation with explicit boolean aggregation logic.
- Aggregate QC flags with OR semantics over the resample window.
- Keep `gap` propagation explicit and testable.
- Fail loudly if an unsupported QC column shape is encountered.
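The propagation logic can be sketched as below; the `qc_*` prefix and boolean-dtype requirement mirror the plan, but this helper is not `WeatherSet.resample()` itself:

```python
import pandas as pd

# Sketch of explicit QC propagation during resample; assumes qc_* and gap
# columns are boolean, and fails loudly otherwise.
def resample_with_qc(df: pd.DataFrame, rule: str) -> pd.DataFrame:
    flag_cols = [c for c in df.columns if c.startswith("qc_") or c == "gap"]
    for col in flag_cols:
        if df[col].dtype != bool:       # no silent try/except: pass path
            raise TypeError(f"flag column {col!r} must be boolean, got {df[col].dtype}")
    data_cols = [c for c in df.columns if c not in flag_cols]
    out = df[data_cols].resample(rule).mean()
    for col in flag_cols:
        # OR semantics: any flagged source row taints the aggregated bin.
        out[col] = df[col].resample(rule).max().fillna(False).astype(bool)
    return out
```

`max()` over booleans is OR; empty bins fall back to `False` rather than propagating `NaN` into a flag column.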
Acceptance criteria
- Multiple `qc_*` columns are propagated deterministically.
- No silent failure path remains in QC propagation.
- Existing gap propagation continues to work.
Tests
- Add regressions in `tests/test_core.py` covering:
  - multiple QC columns
  - gap propagation
  - mixed aggregation windows
Make timezone semantics explicit and preserve real instants across ingest and export.
Primary files
`metdatapy/mapper.py`, `metdatapy/cli.py`, `metdatapy/core.py`, `metdatapy/utils.py`, `tests/test_utils.py`, `tests/test_core.py`, `tests/test_cli.py`
Implementation
- Extend the mapping schema so `ts` can carry a `timezone` field.
- Update `Mapper.template()` in `metdatapy/mapper.py` to include timezone support.
- Update the interactive mapping wizard in `metdatapy/cli.py` to prompt for timezone.
- Thread timezone into `WeatherSet.from_mapping()` via `ensure_datetime_utc()`.
- Preserve backward compatibility:
  - if timezone is omitted, keep current behavior
  - but emit a warning when naive timestamps are mapped without timezone metadata
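The proposed threading can be sketched as follows; the `timezone` parameter on `ensure_datetime_utc()` is the planned extension, not current behavior:

```python
import warnings
from typing import Optional

import pandas as pd

# Sketch of mapping-aware UTC normalization; the timezone parameter is the
# proposed extension, not metdatapy's current ensure_datetime_utc() signature.
def ensure_datetime_utc(ts: pd.Series, timezone: Optional[str] = None) -> pd.Series:
    out = pd.to_datetime(ts)
    if out.dt.tz is not None:
        return out.dt.tz_convert("UTC")        # aware input: normalize as before
    if timezone is None:
        warnings.warn("naive timestamps mapped without timezone metadata; "
                      "assuming UTC", UserWarning)
        return out.dt.tz_localize("UTC")       # backward-compatible default
    # Mapping-provided zone: interpret local wall time, then convert to UTC.
    return out.dt.tz_localize(timezone).dt.tz_convert("UTC")
```

A naive `2024-01-01 12:00` mapped with `timezone: America/New_York` becomes the real instant `17:00 UTC`, while old mappings without the field keep today's assume-UTC behavior plus a warning.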
Acceptance criteria
- Naive local timestamps can be correctly interpreted using mapping-provided timezone metadata.
- Existing timezone-aware timestamps continue to normalize to UTC correctly.
- Old mapping files without timezone still work.
Tests
- Add timezone-hint regressions in `tests/test_utils.py`.
- Add mapping-based timezone ingestion regressions in `tests/test_core.py`.
- Add CLI wizard and ingest regressions in `tests/test_cli.py`.
Primary files
`metdatapy/io.py`, `tests/test_netcdf.py`
Implementation
- In `to_netcdf()`, if the index is tz-aware:
  - first convert to UTC
  - then strip timezone info for xarray compatibility
- In `from_netcdf()`, localize the returned time index to UTC so round-tripped data remains UTC-aware.
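Only the index handling is sketched here; the xarray plumbing inside `to_netcdf()`/`from_netcdf()` is elided, and the helper names are hypothetical:

```python
import pandas as pd

# Sketch of instant-preserving index handling around NetCDF I/O; helper
# names are hypothetical, the xarray calls themselves are elided.
def strip_tz_for_netcdf(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
    if index.tz is None:
        return index
    # Convert FIRST so the stored naive values are true UTC instants,
    # THEN drop tzinfo for xarray/NetCDF compatibility.
    return index.tz_convert("UTC").tz_localize(None)

def restore_tz_after_netcdf(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
    # The file stores naive UTC, so re-attaching UTC recovers the instants.
    return index.tz_localize("UTC")
```

Stripping tzinfo without converting first is the bug this release fixes: a Sydney-local index would otherwise come back shifted by the UTC offset.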
Acceptance criteria
- A non-UTC tz-aware input index round-trips through NetCDF without shifting instants.
- Existing UTC input still round-trips correctly.
Tests
- Add a round-trip regression in `tests/test_netcdf.py` using a non-UTC tz-aware index and assert identical UTC instants after reload.
Fix domain-invalid derived metrics and physically incorrect aggregation behavior.
Primary files
`metdatapy/derive.py`, `tests/test_derive.py`, `tests/test_core.py`
Implementation
- Update `heat_index_c()` in `metdatapy/derive.py` so out-of-domain cases do not return values below ambient temperature.
- Update `wind_chill_c()` so out-of-domain cases return ambient temperature instead of extrapolated wind chill.
- Match the documented validity domains in the docstrings.
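The wind-chill side of the fix can be sketched with a domain clamp; the thresholds used here (T ≤ 10 °C and wind > 4.8 km/h) are the commonly cited Environment Canada/NWS ones and are an assumption about metdatapy's documented domains, and `heat_index_c()` would get the analogous guard:

```python
import numpy as np

# Sketch of a domain-clamped wind chill; thresholds are the commonly cited
# Environment Canada/NWS ones, assumed to match metdatapy's docstrings.
def wind_chill_c(temp_c, wspd_ms):
    temp_c = np.asarray(temp_c, dtype=float)
    v_kmh = np.asarray(wspd_ms, dtype=float) * 3.6
    wc = (13.12 + 0.6215 * temp_c
          - 11.37 * v_kmh ** 0.16
          + 0.3965 * temp_c * v_kmh ** 0.16)
    in_domain = (temp_c <= 10.0) & (v_kmh > 4.8)
    # Out of domain (warm or near-calm): return ambient temperature
    # instead of an extrapolated artifact.
    return np.where(in_domain, wc, temp_c)
```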
Acceptance criteria
- Heat index does not create cooling effects in inappropriate domains.
- Wind chill does not create warming/cooling artifacts outside valid conditions.
- In-domain calculations remain meteorologically reasonable.
Tests
- Add boundary and out-of-domain regressions in `tests/test_derive.py`.
- Add `WeatherSet.derive()` regressions in `tests/test_core.py`.
Primary files
`metdatapy/qc.py`, `metdatapy/core.py`, `tests/test_core.py`, `tests/test_qc.py`
Implementation
- In `qc_consistency()` in `metdatapy/qc.py`, apply heat-index and wind-chill checks only when those metrics are within their valid domains.
- In `WeatherSet.resample()` in `metdatapy/core.py`, replace arithmetic mean aggregation for `wdir_deg` with circular mean.
- Prefer a speed-weighted circular mean when `wspd_ms` is available.
- Return `NaN` for an undefined calm resultant direction.
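The arithmetic mean of 350° and 10° is 180°; the vector (circular) mean is 0°. A sketch, where the calm-detection threshold is an assumption:

```python
import numpy as np

# Sketch of (optionally speed-weighted) circular mean for wind direction;
# the calm-resultant threshold is an assumption.
def circular_mean_deg(wdir_deg, wspd_ms=None):
    wdir = np.deg2rad(np.asarray(wdir_deg, dtype=float))
    w = np.ones_like(wdir) if wspd_ms is None else np.asarray(wspd_ms, dtype=float)
    u = np.nansum(w * np.sin(wdir))         # east-west component of resultant
    v = np.nansum(w * np.cos(wdir))         # north-south component
    if np.hypot(u, v) < 1e-9:
        return float("nan")                 # calm / perfectly opposed: undefined
    return float(np.rad2deg(np.arctan2(u, v)) % 360.0)
```

Speed weighting means a strong northerly dominates a light easterly, and exactly opposed equal-weight directions collapse to `NaN` rather than an arbitrary angle.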
Acceptance criteria
- `350°` and `10°` aggregate to approximately `0°`, not `180°`.
- Library-generated heat index and wind chill no longer self-trigger false consistency failures.
Tests
- Add QC regressions in `tests/test_qc.py`.
- Add circular wind-direction resample regressions in `tests/test_core.py`.
Add missing-data handling without losing auditability.
Primary files
`metdatapy/impute.py` (new), `metdatapy/core.py`, `tests/test_impute.py` (new)
Implementation
- Create `metdatapy/impute.py` with an API such as `impute(df, method, columns=None, limit=None, value=None)`.
- Support at least: `ffill`, `bfill`, `interpolate_time`, `constant`
- Add `WeatherSet.impute()` in `metdatapy/core.py` as the façade.
- Create or update: `imputed`, `impute_method`
- Make the output compatible with existing NetCDF export support in `metdatapy/io.py`.
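A sketch of that surface; the provenance column names follow the plan, but the dispatch and per-row provenance granularity are assumptions:

```python
import pandas as pd

# Sketch of the proposed impute() surface; provenance columns follow the
# plan, but the signature and dispatch are illustrative.
def impute(df, method, columns=None, limit=None, value=None):
    columns = list(columns) if columns is not None else list(df.columns)
    out = df.copy()
    was_missing = out[columns].isna()
    if method == "ffill":
        out[columns] = out[columns].ffill(limit=limit)
    elif method == "bfill":
        out[columns] = out[columns].bfill(limit=limit)
    elif method == "interpolate_time":
        out[columns] = out[columns].interpolate(method="time", limit=limit)
    elif method == "constant":
        out[columns] = out[columns].fillna(value)
    else:
        raise ValueError(f"unknown method: {method!r}")
    filled = was_missing & out[columns].notna()
    out["imputed"] = filled.any(axis=1)                 # per-row traceability
    out["impute_method"] = out["imputed"].map({True: method, False: None})
    return out
```

Because `imputed` is derived from a before/after missingness diff, rows that were already complete stay clearly distinguishable from filled ones.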
Acceptance criteria
- Imputed rows are traceable.
- Original non-imputed rows remain clearly distinguishable.
- Multiple methods behave deterministically.
Tests
- Add a new `tests/test_impute.py` covering:
  - each supported method
  - provenance columns
  - behavior with gaps and existing missing values
Primary files
`metdatapy/manifest.py`, `metdatapy/cli.py`, `tests/test_manifest.py`, `tests/test_cli.py`
Implementation
- Record imputation steps cleanly in `ManifestBuilder` pipeline steps and metadata.
- Add a CLI command, preferably under a new `prep` group: `mdp prep impute`
- The command should:
- read parquet
- impute
- write parquet
- optionally emit a JSON summary
Acceptance criteria
- Imputation done via CLI is reproducible and manifest-friendly.
- Output preserves `imputed` and `impute_method`.
Tests
- Add CLI regressions in `tests/test_cli.py`.
- Add manifest integration regressions in `tests/test_manifest.py`.
Implement the most natural next-step features already implied by the package direction and optional dependencies.
Primary files
`metdatapy/features.py` or `metdatapy/exogenous.py` (new), `metdatapy/core.py`, `tests/test_features.py` (new), `docs/weatherset.md`, `README.md`
Implementation
- Use the existing optional `astral` dependency.
- Add a function like `solar_features(index, lat, lon, elev_m=None)` or equivalent.
- Add a `WeatherSet.solar_features(lat, lon, elev_m=None)` façade.
- Produce useful columns such as:
- solar elevation
- solar azimuth
- possibly daylight/night flag
Acceptance criteria
- Features are deterministic for known coordinates and timestamps.
- The API fits naturally into the `WeatherSet` pipeline style.
Tests
- Add deterministic tests for known dates/times and expected ranges.
Primary files
`metdatapy/features.py` or `metdatapy/exogenous.py`, `metdatapy/core.py`, `tests/test_features.py`, `docs/weatherset.md`, `README.md`
Implementation
- Use the existing optional `holidays` dependency.
- Add a function like `holiday_features(index, country, subdiv=None)`.
- Add a `WeatherSet.holiday_features(country, subdiv=None)` façade or a compatible helper.
- Produce at least: `is_holiday`, `holiday_name`
Acceptance criteria
- Known fixed holidays are detected correctly.
- The feature output aligns cleanly to the UTC index.
Tests
- Add tests for known holiday dates.
Turn the corrected primitives into a declarative, reproducible end-to-end workflow surface.
Primary files
`metdatapy/pipeline.py` (new), `metdatapy/cli.py`, `metdatapy/manifest.py`, `tests/test_integration.py`, `docs/`
Implementation
- Create a Pydantic `PipelineConfig` in `metdatapy/pipeline.py`.
- The pipeline runner should orchestrate existing library steps without reimplementing them:
- ingest
- mapping
- unit normalization
- QC
- derivation
- gap insertion
- resampling
- calendar features
- exogenous features
- imputation
- supervised-table creation
- split
- scaling
- export
- manifest generation
- Add CLI support: `mdp pipeline run --config pipeline.yml`
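A config driving that runner might look like the following; every key name here is a hypothetical sketch of the schema, not a finalized `PipelineConfig`:

```yaml
# Hypothetical pipeline.yml; key names are illustrative, not a final schema.
source: data/station.csv
mapping: mappings/station.yml
timezone: America/Denver
qc:
  range: true
  flatline: {window: 6}
resample: {frequency: 1h, wdir: circular}
impute: {method: interpolate_time, limit: 3}
features:
  solar: {lat: 39.74, lon: -104.99}
  holidays: {country: US}
export: {path: out/station.parquet, manifest: out/manifest.json}
```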
Acceptance criteria
- A single config can drive an end-to-end processing run.
- Pipeline output is reproducible and manifest-backed.
- The runner composes existing APIs rather than duplicating logic.
Tests
- Add end-to-end config-driven integration tests in `tests/test_integration.py`.
Primary files
`metdatapy/mlprep.py`, `tests/test_backtesting.py` (new), `docs/mlprep.md`
Implementation
- Add a function such as `rolling_time_split()` or `walk_forward_split()`.
- It should yield chronological train/validation/test windows with zero overlap and no leakage.
- Make it optionally usable from the pipeline runner config.
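A minimal sketch of such a splitter; the name and signature follow the proposal but are not an existing metdatapy API:

```python
# Sketch of a leakage-free walk-forward splitter; the name and signature
# follow the proposal, not an existing metdatapy API.
def walk_forward_split(n_rows, train_size, test_size, step=None):
    """Yield chronological (train_idx, test_idx) position ranges.

    Every test window sits strictly after its training window, and test
    windows never overlap each other, so no future data leaks into training.
    """
    step = step or test_size        # default: advance by one test window
    start = 0
    while start + train_size + test_size <= n_rows:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += step
```

Because it yields positional ranges, the output can index a supervised table built by `make_supervised()` and feed per-fold `fit_scaler()` calls without touching future rows.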
Acceptance criteria
- Windows are ordered and non-overlapping.
- Boundary behavior is deterministic and well documented.
- The API integrates naturally with `make_supervised()` and `fit_scaler()`.
Tests
- Add a new `tests/test_backtesting.py` covering:
  - split boundaries
  - counts
  - leakage prevention
- Ship `v1.0.2` before feature work.
- Ship `v1.1.0` next because timezone correctness affects every downstream artifact.
- Ship `v1.2.0` before model-facing feature work so derived metrics and resampling are physically correct.
- Ship `v1.3.0` and `v1.4.0` next as the most natural product extensions.
- Ship `v1.5.0` last after the primitives are corrected and stable.
v1.0.1
- ingest mapping correctness
- shared CSV loading
- template and manifest edge-case cleanup
v1.0.2
- gap insertion correctness
- flatline QC correctness
- explicit QC propagation during resample
v1.1.0
- timezone-aware mapping
- NetCDF instant preservation
v1.2.0
- thermal-index correctness
- circular wind-direction aggregation
- domain-aware consistency QC
v1.3.0
- imputation with provenance
- CLI and manifest integration
v1.4.0
- solar-position features
- holiday features
v1.5.0
- declarative pipeline runner
- rolling-origin backtesting