Reproduceability and traceability #79

ian-ross · 2026-02-02T06:58:54Z

ian-ross
Feb 2, 2026
Maintainer

It seems to me that there are two kinds of data that we're going to be producing from AEIC:

the intermediate trajectory and emissions data resulting from simulating a selected set of missions using a selected set of performance, weather, etc. data (stored in what I've been calling "trajectory stores", which are collections of NetCDF files);
post-processing data products, most importantly gridded annual or sub-annual emissions inventories, although the possibilities here are fairly open-ended.

As described in the user stories that Adi wrote, different kinds of users will be interested in different kinds of data products. Some users will be running trajectory-level simulations themselves with self-selected performance data, mission profiles, etc., while some (most?) users will be using only post-processed products.

I think that we need to make some efforts to ensure traceability of the data products that we provide to users. From what I've seen, many users ask for "the AEIC emissions data for 2024", meaning the annual emissions inventories based on OAG mission data, ERA-5 weather and unspecified performance data. From the perspective of that kind of user, they don't really care about the details of exactly how we generate that emissions data: they want to think of what we give them as the AEIC emissions. That means that we need to standardize on a single set of choices for missions, weather, performance and other data for generating these "consumer" data products. For other users, the situation is more fluid: some users will produce post-processed data of one kind or another themselves, some users will run trajectory simulations themselves using custom performance data, and so on.

The range of choices is wide, and we need to make it as easy as possible for users to keep track of how AEIC data was generated and what data choices went into it. I think that that means that we need to embed provenance metadata into all data products that we produce. (If users write their own post-processing code, it's up to them to handle this, but we should at least make it easy for them to get hold of provenance metadata to use themselves.)

Here's a list of the things that go into producing AEIC intermediate trajectory/emissions output:

The AEIC software itself;
AEIC configuration (choice of emissions methods, etc.);
The selection of missions flown;
The performance models used to fly the missions and the mapping from mission data (aircraft type, etc.) to selection of performance model;
Weather data (if used);
Fuel data;
Airport locations (see below).

(For emissions calculations performed after trajectory simulations, the trajectory stores maintain a link from the emissions data to the trajectories that generated them. That could be viewed as an additional data source.)

In addition, post-processed data products will have more configuration choices (mission filtering, grid parameters, parameters related to climate/impacts calculations, and so on).

I have ideas for how to handle most of these items, but I'd like to leave the discussion here open without prejudicing people with those ideas. Have I missed anything from the above list? How would you handle this whole provenance question? What level of traceability do we actually need in AEIC?

(Note: the reason for including "airport locations" in the list above is that these do change from time to time. Airports are closed, so historical mission data may not be usable with current airport data, and some airport location data from the source that we use is not quite stable, which can lead to small differences in simulation output.)

speth · 2026-02-02T18:00:08Z

speth
Feb 2, 2026
Maintainer

I think this attention on traceability / reproducibility is definitely warranted.

A couple specific things come to mind, though you may have already considered these within the items above

The "selection of missions flown" can be broken down into a couple of components: the source schedule data (e.g. OAG), how that schedule is initially processed (what's thrown out / corrected), and then what filters are applied for the AEIC run (where for the case of taking a random subset of flights, we'll have to keep track of any seeds)
There's also source of emissions data, which I think is distinct from the performance models

There are some interesting differences in how much provenance information is available for different inputs. For some things, e.g. the EDB or BADA files, there are clear, robust version numbers that can reported, while in other cases (TASOPT performance models) there's much less structure. I'm curious where the airport location list fits into this.

I think it would be useful to keep both some human-interpretable provenance information as well as some checksums, so you can both tell what a run is after the fact and have some idea of whether or not you should expect to be able to repeat it with a given set of inputs.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reproduceability and traceability #79

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Reproduceability and traceability #79

Uh oh!

ian-ross Feb 2, 2026 Maintainer

Replies: 1 comment

Uh oh!

speth Feb 2, 2026 Maintainer

ian-ross
Feb 2, 2026
Maintainer

speth
Feb 2, 2026
Maintainer