Skip to content

Database Backend

mmyrte edited this page Jan 29, 2026 · 17 revisions

Overview

The analytic workflow is centrally managed around a relational database. The database will be a DuckDB file that includes the tables described below.

Tip

Managing model state is, in theory, possible by using declarative configuration files. During early development of the package, it became clear that configuration files would need to be hyperspecific to adequately describe all assumptions that go into a specific analysis. Adequately implementing e.g. data ingestion for multiple sources would lead to a massive scope expansion of the package. Hence, there is not currently a concept of declarative configuration in this database. This may need to be revised.

Tables

  • reporting_t: Key-value store for report metadata
  • coords_t: Landscape tessellation (base grid, polygons, centroids)
  • periods_t: Time periods metadata
  • runs_t: Scenario/run hierarchy for multi-run simulations
  • lulc_data_t + lulc_meta_t: LULC data and metadata
  • pred_data_t_* + pred_meta_t: Predictor data (typed) and metadata
  • trans_meta_t: Transition metadata, indicating the viability of modelling them from a statistical or conceptual point
  • trans_preds_t: Transitions and their associations to predictors (m to n relation)
  • trans_rates_t: Transition rates by period
  • intrv_masks_t + intrv_meta_t: Interventions (masks, values) with their parameters (struct/json)
    • Examples: Forced assignment of deglaciating areas, reduced likelihood of transition in protected areas
  • trans_models_t: Storing model metadata and model objects
  • alloc_params_t: Storing the allocation parametrization
  • neighbors_t: Neighbor relationships between coordinates
erDiagram
    reporting_t {
        string key PK
        string value
    }
    coords_t {
        int id_coord PK
        float lon
        float lat
        float elevation
        polygon geom_polygon
    }
    periods_t {
        int id_period PK
        date start_date
        date end_date
        bool is_extrapolated
    }
    runs_t {
        int id_run PK
        int parent_id_run FK
        string description
    }
    lulc_meta_t {
        int id_lulc PK
        string name
        string pretty_name
        string description
        list src_classes
    }
    lulc_data_t {
        int id_run FK
        int id_coord FK
        int id_lulc FK
        int id_period FK
    }
    pred_meta_t {
        int id_pred PK
        string name
        string pretty_name
        string description
        string orig_format
        list sources
        string unit
        map factor_levels
    }
    pred_data_t {
        int id_run FK
        int id_pred FK
        int id_coord FK
        int id_period FK
        value value
    }
    trans_meta_t {
        int id_trans PK
        int id_lulc_anterior FK
        int id_lulc_posterior FK
        int cardinality
        float frequency_rel
        float frequency_abs
        bool is_viable
    }
    trans_preds_t {
        int id_pred PK_FK
        int id_trans PK_FK
    }
    trans_rates_t {
        int id_period PK_FK
        int id_trans PK_FK
        float rate
    }
    intrv_meta_t {
        int id_intrv PK
        list id_period_list
        list id_trans_list
        bool pre_allocation
        string name
        string pretty_name
        string description
        list sources
        map params
    }
    intrv_masks_t {
        int id_intrv PK_FK
        int id_coord PK_FK
    }
    trans_models_t {
        int id_trans FK
        string model_family
        map model_params
        map goodness_of_fit
        string fit_call
        blob model_obj_part
        blob model_obj_full
    }
    alloc_params_t {
        int id_run PK_FK
        int id_trans PK_FK
        float mean_patch_size
        float patch_size_variance
        float patch_isometry
        float frac_expander
        float frac_patcher
    }
    neighbors_t {
        int id_coord_origin PK_FK
        int id_coord_neighbor PK_FK
        float distance
        factor distance_class
    }

    lulc_data_t zero or more to 1 coords_t : "id_coord"
    lulc_data_t zero or more to 1 lulc_meta_t : "id_lulc"
    lulc_data_t zero or more to 1 periods_t : "id_period"
    lulc_data_t zero or more to 1 runs_t : "id_run"

    pred_data_t zero or more to 1 pred_meta_t : "id_pred"
    pred_data_t zero or more to 1 coords_t : "id_coord"
    pred_data_t zero or more to 1 periods_t : "id_period"
    pred_data_t zero or more to 1 runs_t : "id_run"

    trans_meta_t zero or more to 1 lulc_meta_t : "id_lulc_anterior"
    trans_meta_t zero or more to 1 lulc_meta_t : "id_lulc_posterior"

    trans_preds_t zero or more to 1 pred_meta_t : "id_pred"
    trans_preds_t zero or more to 1 trans_meta_t : "id_trans"

    trans_rates_t zero or more to 1 periods_t : "id_period"
    trans_rates_t zero or more to 1 trans_meta_t : "id_trans"

    intrv_masks_t zero or more to 1 intrv_meta_t : "id_intrv"
    intrv_masks_t zero or more to 1 coords_t : "id_coord"

    trans_models_t zero or more to 1 trans_meta_t : "id_trans"

    alloc_params_t zero or more to 1 trans_meta_t : "id_trans"
    alloc_params_t zero or more to 1 runs_t : "id_run"

    neighbors_t zero or more to 1 coords_t : "id_coord_origin"
    neighbors_t zero or more to 1 coords_t : "id_coord_neighbor"

    runs_t zero or more to 1 runs_t : "parent_id_run"
Loading

reporting_t

This table is a simple key/value (varchar) store for reporting metadata. It contains information such as author, datetime, scenario title, etc.

pk colname type description
* key str Unique key for the entry
value str Value associated with the key

Default keys set on DB initialization:

  • report_name: Short name for the report (default: "evoland_scenario")
  • report_name_pretty: Display name for the report
  • report_include_date: Whether to include date in reports
  • creator_username: Username of the creator
  • last_opened: Timestamp of last access
  • last_opened_username: Username of last accessor

coords_t

pk colname type description
* id_coord int Unique ID for each reference coordinate pair
lon float aka x coord. TODO find CRS storage location
lat float aka y coord. TODO find CRS storage location
elevation float, nullable aka z coord. TODO find CRS storage location
geom_polygon polygon, nullable If the point describes a surface, its polygon

The coords_t table specifies the basic coordinates at which the model is set up and run. Because these coordinates will often also refer to a surface, a polygon describing each surface may be stored alongside the lat/lon coordinates. This table is either constructed based on reference grid specification (e.g. existing LULC data) or derived from parameters (e.g. a hexagonal grid with edge length S and origin at (P, Q)).

Additional attributes are stored as table metadata (via setattr):

  • epsg: EPSG code for the coordinate reference system
  • xmin, xmax, ymin, ymax: Bounding box extent
  • resolution: Grid resolution (for regular grids)

periods_t

pk colname type description
* id_period int Unique ID for each period
start_date date Inclusive boundary for start
end_date date Exclusive boundary for end
is_extrapolated bool False if observation, true if extrapolated

This table identifies periods in the past and future. It is populated using a start and end date for the observation period, and an end date for the extrapolation, plus a time step length. Each period is left-inclusive, i.e. in case of doubt assigned to the anterior class in a transition, not the posterior. For static predictors (variables that do not change over time), a "0 period" is used in the database to indicate their timeless nature, starting and ending at the end of the observed periods. This allows static predictors to be stored and referenced consistently alongside time-varying predictors, even though they do not correspond to a specific time period.

runs_t

pk colname type description
* id_run int Unique ID for each run (0 = base/unperturbed)
parent_id_run int, nullable FK to runs_t for hierarchical scenarios
description str Human-readable description of the run

This table supports multi-run simulations and scenario management. Run ID 0 is reserved for the base/unperturbed scenario. Child runs can reference a parent run to establish scenario hierarchies: the most specific available set of data is used. For instance, if a predictor value is not available for a specific run, the system falls back to the parent run's value, continuing up the hierarchy until a value is found or the base run is reached.

lulc_data_t + lulc_meta_t

lulc_meta_t describes metadata for each Land Use / Land Cover class.

pk colname type description
* id_lulc int Unique ID for each land use class
name str Name for use in code and queries, e.g. forest_dense
pretty_name str Long name for plots/output e.g. Dense/Old Growth Forest
description str Long description / operationalisation
src_classes list List of source class IDs that map to this class

The src_classes column allows mapping from source data class IDs to the harmonized LULC classes used in the model.

lulc_data_t indicates that at a given place and time, a certain land use is detected/projected. To transform these sparse data for a given id_period to a dense matrix, you would join the id_lulc to the coords_t. If coords_t represents a dense raster, you can reshape the 1D-vector. If coords_t is irregular, you can rasterize according to your raster specs.

uniqueness colname type description
* id_run int, optional FK from runs_t (0 for base)
* id_coord int FK from coords_t
id_lulc int FK from lulc_meta_t
* id_period int FK from periods_t

The table is partitioned by (id_run, id_period) for efficient storage and retrieval.

Some observational datasets, e.g. the Swiss Surface Statistics, may have exact dates associated with individual id_coord, id_lulc combinations. If desired, these data should be added as predictors - e.g. in "number of years before or after the id_period".

pred_data_t_* + pred_meta_t

pred_meta_t describes metadata for each predictor.

pk colname type description
* id_pred int (autoincrement) Unique ID for each predictor
name str (unique key) Name for use in code and queries, e.g. distance_road
pretty_name str Name of the predictor, e.g. Distance To Closest Road
description str, nullable Long description / operationalisation
orig_format str, nullable Orig. format: "polygons, annual", "100m raster, daily"
sources list of structs (url, md5sum) Where the data were fetched (maybe validate URL)
unit str, nullable SI-compatible unit, nullable for categorical values
factor_levels map, nullable {int: str} map to recover factor levels

pred_data_t is not actually a single table: multiple tables are employed to efficiently store different data types:

  • pred_data_t_float: Floating-point predictor values
  • pred_data_t_int: Integer predictor values (also holds factor values, see factor_levels above)
  • pred_data_t_bool: Boolean predictor values
uniqueness colname type description
* id_run int, optional FK from runs_t (0 for base)
* id_pred int FK from pred_meta_t
* id_coord int FK from coords_t
* id_period int FK from periods_t
value float, int, bool Predictor value (type-specific)

The tables are partitioned by (id_run, id_period) for efficient storage and retrieval.

trans_meta_t

Relates land use classes to each other (henceforth transitions), including statistics about their frequency in the original data and, essentially, an indicator on whether they are viable for modelling. Example of non-viability: A model may contain an "other" class that is valuable for a simplified representation of what an initial class might transition into. Since this "other" class holds no information on what class may come next, the "other" class could only increase in area, but never decrease. Another exemption for the viability of modelling a transition is because the set of observations may be too limited for statistical inference.

Note

The previous generation of evoland considered the viability of modelling a transition for each id_period. By tying the identity of a transition to a (training) period, we lose out on the data provided by previous periods. The identification of a transition with a period needs to be discussed.

pk uniqueness colname type description
* id_trans int (autoincrement) Unique ID for each transition
* id_lulc_anterior int FK from lulc_meta_t, id before transition
* id_lulc_posterior int FK from lulc_meta_t, id after transition
cardinality int How many times this transition occurred
frequency_rel float Frequency of this transition in relation to all transitions in this timestep
frequency_abs float Frequency of this transition in relation to all coordinates
is_viable bool See explanation

trans_preds_t

An m to n relation enumerates all the predictors useful for modelling a particular transition. This table is populated by a feature selection step that precedes the actual model training. In the absence of a feature selection step, it would default to the cartesian product of predictors and transitions.

uniqueness colname type description
* id_pred int FK from pred_meta_t
* id_trans int FK from trans_meta_t

trans_rates_t

Stores transition rates (probabilities) for each transition type in each time period. Historical rates are calculated from observed transitions, and future rates can be extrapolated using linear regression.

uniqueness colname type description
* id_period int FK from periods_t
* id_trans int FK from trans_meta_t
rate float Transition rate (0 to 1, non-negative)

intrv_masks_t + intrv_meta_t

Tables containing auxiliary information for interventions, that is: manipulations of LULC transition potential, or outright overrides of LULC change predictions. intrv_meta_t describes metadata for each intervention.

pk colname type §§
* id_intrv int Unique ID for each intervention
id_period_list list(int) Associated periods for intervention
id_trans_list list(int) Associated transitions for intervention
pre_allocation bool Whether intervention occurs before allocation step
name str Name for use in code and queries, e.g. pa_expansion
pretty_name str Name of the intervention, e.g. Protected Area Expansion
description str Long description / operationalisation
sources list of structs (url, md5sum) Where the data were fetched (maybe validate URL)
params map Intervention parameters, structured as key-value pairs

intrv_masks_t is used to store intervention masks, defining which coordinate pairs are affected by each intervention. The presence of (id_intrv, id_coord) indicates a positive mask value, i.e. a given coordinate pair is affected by a given intervention.

uniqueness colname type description
* id_intrv int FK from intrv_meta_t
* id_coord int FK from coords_t

trans_models_t

Transition models: For each viable id_trans, there may be multiple models in this table. The table is populated in the following manner:

  1. For each transition, split into a training and validation set
  2. Try out many model specifications, write model and metadata to this table
  3. Identify best performing models (group by id_trans; sort by goodness_of_fit['some_metric']; limit to 1)
  4. Refit best models on full data
uniqueness colname type description
* id_trans int FK from trans_meta_t
model_family str may also be enum; e.g., rf, glm_binomial, ranger
model_params map Map of model (hyper) parameters
goodness_of_fit map Map of various measures of fit, e.g. ROC AUC
* fit_call str Deparsed fit function call for reproducibility
model_obj_part blob BLOB of serialized model object used to establish goodness of fit
model_obj_full blob BLOB of serialized model object used in extrapolation phase

The fit_call column stores a deparsed R function that can be reconstructed using str2lang() and called with different data to reproduce the model fitting process.

alloc_params_t

uniqueness colname type description
* id_run int FK from runs_t (0 = unperturbed base estimate)
* id_trans int FK from trans_meta_t
mean_patch_size float Mean area of new patches (in cell units)
patch_size_variance float Standard deviation of patch area
patch_isometry float Measure of patch shape regularity (Dinamica patcher parameter)
frac_expander float Fraction of transition cells adjacent to existing patches [0, 1]
frac_patcher float Fraction of transition cells forming new patches [0, 1]

The allocation strategy (e.g. establishment of new patches of a particular LULC, expansion of existing patches) stores parameters in this table. The table supports multiple perturbed versions of parameters for sensitivity analysis, with id_run = 0 representing the unperturbed best estimate.

The table is partitioned by id_run for efficient storage.

neighbors_t

Stores neighbor relationships between coordinates, computed using a spatial hash map for efficiency.

uniqueness colname type description
* id_coord_origin int FK from coords_t (origin coordinate)
* id_coord_neighbor int FK from coords_t (neighbor coordinate)
distance float Distance between origin and neighbor
distance_class factor, optional Factor indicating distance class (if breaks provided)

This table is populated by create_neighbors_t() which accepts a max_distance parameter and optional distance_breaks for classifying neighbors into distance bands.

Views

These may be employed once it emerges what data specific processing steps require.

Notable derived tables (not persisted, computed on demand):

  • trans_v: Transitions view computed from consecutive LULC observations
  • trans_pred_data_v: Wide-format predictor data for transition modeling
  • pred_data_wide_v: Wide-format predictor data for a given period
  • trans_pot_t: Transition potential computed from fitted models (see predict_trans_pot())