-
Notifications
You must be signed in to change notification settings - Fork 0
Database Backend
The analytic workflow is centrally managed around a relational database. The database will be a DuckDB file that includes the tables described below.
Tip
Managing model state is, in theory, possible by using declarative configuration files. During early development of the package, it became clear that configuration files would need to be hyperspecific to adequately describe all assumptions that go into a specific analysis. Adequately implementing e.g. data ingestion for multiple sources would lead to a massive scope expansion of the package. Hence, there is not currently a concept of declarative configuration in this database. This may need to be revised.
-
reporting_t: Key-value store for report metadata -
coords_t: Landscape tessellation (base grid, polygons, centroids) -
periods_t: Time periods metadata -
runs_t: Scenario/run hierarchy for multi-run simulations -
lulc_data_t+lulc_meta_t: LULC data and metadata -
pred_data_t_*+pred_meta_t: Predictor data (typed) and metadata -
trans_meta_t: Transition metadata, indicating the viability of modelling them from a statistical or conceptual point -
trans_preds_t: Transitions and their associations to predictors (m to n relation) -
trans_rates_t: Transition rates by period -
intrv_masks_t+intrv_meta_t: Interventions (masks, values) with their parameters (struct/json)- Examples: Forced assignment of deglaciating areas, reduced likelihood of transition in protected areas
-
trans_models_t: Storing model metadata and model objects -
alloc_params_t: Storing the allocation parametrization -
neighbors_t: Neighbor relationships between coordinates
erDiagram
reporting_t {
string key PK
string value
}
coords_t {
int id_coord PK
float lon
float lat
float elevation
polygon geom_polygon
}
periods_t {
int id_period PK
date start_date
date end_date
bool is_extrapolated
}
runs_t {
int id_run PK
int parent_id_run FK
string description
}
lulc_meta_t {
int id_lulc PK
string name
string pretty_name
string description
list src_classes
}
lulc_data_t {
int id_run FK
int id_coord FK
int id_lulc FK
int id_period FK
}
pred_meta_t {
int id_pred PK
string name
string pretty_name
string description
string orig_format
list sources
string unit
map factor_levels
}
pred_data_t {
int id_run FK
int id_pred FK
int id_coord FK
int id_period FK
value value
}
trans_meta_t {
int id_trans PK
int id_lulc_anterior FK
int id_lulc_posterior FK
int cardinality
float frequency_rel
float frequency_abs
bool is_viable
}
trans_preds_t {
int id_pred PK_FK
int id_trans PK_FK
}
trans_rates_t {
int id_period PK_FK
int id_trans PK_FK
float rate
}
intrv_meta_t {
int id_intrv PK
list id_period_list
list id_trans_list
bool pre_allocation
string name
string pretty_name
string description
list sources
map params
}
intrv_masks_t {
int id_intrv PK_FK
int id_coord PK_FK
}
trans_models_t {
int id_trans FK
string model_family
map model_params
map goodness_of_fit
string fit_call
blob model_obj_part
blob model_obj_full
}
alloc_params_t {
int id_run PK_FK
int id_trans PK_FK
float mean_patch_size
float patch_size_variance
float patch_isometry
float frac_expander
float frac_patcher
}
neighbors_t {
int id_coord_origin PK_FK
int id_coord_neighbor PK_FK
float distance
factor distance_class
}
lulc_data_t zero or more to 1 coords_t : "id_coord"
lulc_data_t zero or more to 1 lulc_meta_t : "id_lulc"
lulc_data_t zero or more to 1 periods_t : "id_period"
lulc_data_t zero or more to 1 runs_t : "id_run"
pred_data_t zero or more to 1 pred_meta_t : "id_pred"
pred_data_t zero or more to 1 coords_t : "id_coord"
pred_data_t zero or more to 1 periods_t : "id_period"
pred_data_t zero or more to 1 runs_t : "id_run"
trans_meta_t zero or more to 1 lulc_meta_t : "id_lulc_anterior"
trans_meta_t zero or more to 1 lulc_meta_t : "id_lulc_posterior"
trans_preds_t zero or more to 1 pred_meta_t : "id_pred"
trans_preds_t zero or more to 1 trans_meta_t : "id_trans"
trans_rates_t zero or more to 1 periods_t : "id_period"
trans_rates_t zero or more to 1 trans_meta_t : "id_trans"
intrv_masks_t zero or more to 1 intrv_meta_t : "id_intrv"
intrv_masks_t zero or more to 1 coords_t : "id_coord"
trans_models_t zero or more to 1 trans_meta_t : "id_trans"
alloc_params_t zero or more to 1 trans_meta_t : "id_trans"
alloc_params_t zero or more to 1 runs_t : "id_run"
neighbors_t zero or more to 1 coords_t : "id_coord_origin"
neighbors_t zero or more to 1 coords_t : "id_coord_neighbor"
runs_t zero or more to 1 runs_t : "parent_id_run"
This table is a simple key/value (varchar) store for reporting metadata. It contains information such as author, datetime, scenario title, etc.
| pk | colname | type | description |
|---|---|---|---|
| * | key | str | Unique key for the entry |
| value | str | Value associated with the key |
Default keys set on DB initialization:
-
report_name: Short name for the report (default: "evoland_scenario") -
report_name_pretty: Display name for the report -
report_include_date: Whether to include date in reports -
creator_username: Username of the creator -
last_opened: Timestamp of last access -
last_opened_username: Username of last accessor
| pk | colname | type | description |
|---|---|---|---|
| * | id_coord | int | Unique ID for each reference coordinate pair |
| lon | float | aka x coord. TODO find CRS storage location | |
| lat | float | aka y coord. TODO find CRS storage location | |
| elevation | float, nullable | aka z coord. TODO find CRS storage location | |
| geom_polygon | polygon, nullable | If the point describes a surface, its polygon |
The coords_t table specifies the basic coordinates at which the model is set up and run.
Because these coordinates will often also refer to a surface, a polygon describing each surface may be stored alongside the lat/lon coordinates.
This table is either constructed based on reference grid specification (e.g. existing LULC data) or derived from parameters (e.g. a hexagonal grid with edge length S and origin at (P, Q)).
Additional attributes are stored as table metadata (via setattr):
-
epsg: EPSG code for the coordinate reference system -
xmin,xmax,ymin,ymax: Bounding box extent -
resolution: Grid resolution (for regular grids)
| pk | colname | type | description |
|---|---|---|---|
| * | id_period | int | Unique ID for each period |
| start_date | date | Inclusive boundary for start | |
| end_date | date | Exclusive boundary for end | |
| is_extrapolated | bool | False if observation, true if extrapolated |
This table identifies periods in the past and future. It is populated using a start and end date for the observation period, and an end date for the extrapolation, plus a time step length. Each period is left-inclusive, i.e. in case of doubt assigned to the anterior class in a transition, not the posterior. For static predictors (variables that do not change over time), a "0 period" is used in the database to indicate their timeless nature, starting and ending at the end of the observed periods. This allows static predictors to be stored and referenced consistently alongside time-varying predictors, even though they do not correspond to a specific time period.
| pk | colname | type | description |
|---|---|---|---|
| * | id_run | int | Unique ID for each run (0 = base/unperturbed) |
| parent_id_run | int, nullable | FK to runs_t for hierarchical scenarios | |
| description | str | Human-readable description of the run |
This table supports multi-run simulations and scenario management. Run ID 0 is reserved for the base/unperturbed scenario. Child runs can reference a parent run to establish scenario hierarchies: the most specific available set of data is used. For instance, if a predictor value is not available for a specific run, the system falls back to the parent run's value, continuing up the hierarchy until a value is found or the base run is reached.
lulc_meta_t describes metadata for each Land Use / Land Cover class.
| pk | colname | type | description |
|---|---|---|---|
| * | id_lulc | int | Unique ID for each land use class |
| name | str | Name for use in code and queries, e.g. forest_dense
|
|
| pretty_name | str | Long name for plots/output e.g. Dense/Old Growth Forest | |
| description | str | Long description / operationalisation | |
| src_classes | list | List of source class IDs that map to this class |
The src_classes column allows mapping from source data class IDs to the harmonized LULC classes used in the model.
lulc_data_t indicates that at a given place and time, a certain land use is detected/projected.
To transform these sparse data for a given id_period to a dense matrix, you would join the id_lulc to the coords_t.
If coords_t represents a dense raster, you can reshape the 1D-vector.
If coords_t is irregular, you can rasterize according to your raster specs.
| uniqueness | colname | type | description |
|---|---|---|---|
| * | id_run | int, optional | FK from runs_t (0 for base) |
| * | id_coord | int | FK from coords_t
|
| id_lulc | int | FK from lulc_meta_t
|
|
| * | id_period | int | FK from periods_t
|
The table is partitioned by (id_run, id_period) for efficient storage and retrieval.
Some observational datasets, e.g. the Swiss Surface Statistics, may have exact dates associated with individual id_coord, id_lulc combinations. If desired, these data should be added as predictors - e.g. in "number of years before or after the id_period".
pred_meta_t describes metadata for each predictor.
| pk | colname | type | description |
|---|---|---|---|
| * | id_pred | int (autoincrement) | Unique ID for each predictor |
| name | str (unique key) | Name for use in code and queries, e.g. distance_road
|
|
| pretty_name | str | Name of the predictor, e.g. Distance To Closest Road | |
| description | str, nullable | Long description / operationalisation | |
| orig_format | str, nullable | Orig. format: "polygons, annual", "100m raster, daily" | |
| sources | list of structs (url, md5sum) | Where the data were fetched (maybe validate URL) | |
| unit | str, nullable | SI-compatible unit, nullable for categorical values | |
| factor_levels | map, nullable | {int: str} map to recover factor levels |
pred_data_t is not actually a single table: multiple tables are employed to efficiently store different data types:
-
pred_data_t_float: Floating-point predictor values -
pred_data_t_int: Integer predictor values (also holds factor values, seefactor_levelsabove) -
pred_data_t_bool: Boolean predictor values
| uniqueness | colname | type | description |
|---|---|---|---|
| * | id_run | int, optional | FK from runs_t (0 for base) |
| * | id_pred | int | FK from pred_meta_t
|
| * | id_coord | int | FK from coords_t
|
| * | id_period | int | FK from periods_t
|
| value | float, int, bool | Predictor value (type-specific) |
The tables are partitioned by (id_run, id_period) for efficient storage and retrieval.
Relates land use classes to each other (henceforth transitions), including statistics about their frequency in the original data and, essentially, an indicator on whether they are viable for modelling. Example of non-viability: A model may contain an "other" class that is valuable for a simplified representation of what an initial class might transition into. Since this "other" class holds no information on what class may come next, the "other" class could only increase in area, but never decrease. Another exemption for the viability of modelling a transition is because the set of observations may be too limited for statistical inference.
Note
The previous generation of evoland considered the viability of modelling a transition for each id_period.
By tying the identity of a transition to a (training) period, we lose out on the data provided by previous periods.
The identification of a transition with a period needs to be discussed.
| pk | uniqueness | colname | type | description |
|---|---|---|---|---|
| * | id_trans | int (autoincrement) | Unique ID for each transition | |
| * | id_lulc_anterior | int | FK from lulc_meta_t, id before transition |
|
| * | id_lulc_posterior | int | FK from lulc_meta_t, id after transition |
|
| cardinality | int | How many times this transition occurred | ||
| frequency_rel | float | Frequency of this transition in relation to all transitions in this timestep | ||
| frequency_abs | float | Frequency of this transition in relation to all coordinates | ||
| is_viable | bool | See explanation |
An m to n relation enumerates all the predictors useful for modelling a particular transition. This table is populated by a feature selection step that precedes the actual model training. In the absence of a feature selection step, it would default to the cartesian product of predictors and transitions.
| uniqueness | colname | type | description |
|---|---|---|---|
| * | id_pred | int | FK from pred_meta_t
|
| * | id_trans | int | FK from trans_meta_t
|
Stores transition rates (probabilities) for each transition type in each time period. Historical rates are calculated from observed transitions, and future rates can be extrapolated using linear regression.
| uniqueness | colname | type | description |
|---|---|---|---|
| * | id_period | int | FK from periods_t
|
| * | id_trans | int | FK from trans_meta_t
|
| rate | float | Transition rate (0 to 1, non-negative) |
Tables containing auxiliary information for interventions, that is: manipulations of LULC transition potential, or outright overrides of LULC change predictions.
intrv_meta_t describes metadata for each intervention.
| pk | colname | type | §§ |
|---|---|---|---|
| * | id_intrv | int | Unique ID for each intervention |
| id_period_list | list(int) | Associated periods for intervention | |
| id_trans_list | list(int) | Associated transitions for intervention | |
| pre_allocation | bool | Whether intervention occurs before allocation step | |
| name | str | Name for use in code and queries, e.g. pa_expansion
|
|
| pretty_name | str | Name of the intervention, e.g. Protected Area Expansion | |
| description | str | Long description / operationalisation | |
| sources | list of structs (url, md5sum) | Where the data were fetched (maybe validate URL) | |
| params | map | Intervention parameters, structured as key-value pairs |
intrv_masks_t is used to store intervention masks, defining which coordinate pairs are affected by each intervention.
The presence of (id_intrv, id_coord) indicates a positive mask value, i.e. a given coordinate pair is affected by a given intervention.
| uniqueness | colname | type | description |
|---|---|---|---|
| * | id_intrv | int | FK from intrv_meta_t
|
| * | id_coord | int | FK from coords_t
|
Transition models: For each viable id_trans, there may be multiple models in this table.
The table is populated in the following manner:
- For each transition, split into a training and validation set
- Try out many model specifications, write model and metadata to this table
- Identify best performing models (group by
id_trans; sort bygoodness_of_fit['some_metric']; limit to 1) - Refit best models on full data
| uniqueness | colname | type | description |
|---|---|---|---|
| * | id_trans | int | FK from trans_meta_t
|
| model_family | str | may also be enum; e.g., rf, glm_binomial, ranger
|
|
| model_params | map | Map of model (hyper) parameters | |
| goodness_of_fit | map | Map of various measures of fit, e.g. ROC AUC | |
| * | fit_call | str | Deparsed fit function call for reproducibility |
| model_obj_part | blob | BLOB of serialized model object used to establish goodness of fit | |
| model_obj_full | blob | BLOB of serialized model object used in extrapolation phase |
The fit_call column stores a deparsed R function that can be reconstructed using str2lang() and called with different data to reproduce the model fitting process.
| uniqueness | colname | type | description |
|---|---|---|---|
| * | id_run | int | FK from runs_t (0 = unperturbed base estimate) |
| * | id_trans | int | FK from trans_meta_t
|
| mean_patch_size | float | Mean area of new patches (in cell units) | |
| patch_size_variance | float | Standard deviation of patch area | |
| patch_isometry | float | Measure of patch shape regularity (Dinamica patcher parameter) | |
| frac_expander | float | Fraction of transition cells adjacent to existing patches [0, 1] | |
| frac_patcher | float | Fraction of transition cells forming new patches [0, 1] |
The allocation strategy (e.g. establishment of new patches of a particular LULC, expansion of existing patches) stores parameters in this table. The table supports multiple perturbed versions of parameters for sensitivity analysis, with id_run = 0 representing the unperturbed best estimate.
The table is partitioned by id_run for efficient storage.
Stores neighbor relationships between coordinates, computed using a spatial hash map for efficiency.
| uniqueness | colname | type | description |
|---|---|---|---|
| * | id_coord_origin | int | FK from coords_t (origin coordinate) |
| * | id_coord_neighbor | int | FK from coords_t (neighbor coordinate) |
| distance | float | Distance between origin and neighbor | |
| distance_class | factor, optional | Factor indicating distance class (if breaks provided) |
This table is populated by create_neighbors_t() which accepts a max_distance parameter and optional distance_breaks for classifying neighbors into distance bands.
These may be employed once it emerges what data specific processing steps require.
Notable derived tables (not persisted, computed on demand):
-
trans_v: Transitions view computed from consecutive LULC observations -
trans_pred_data_v: Wide-format predictor data for transition modeling -
pred_data_wide_v: Wide-format predictor data for a given period -
trans_pot_t: Transition potential computed from fitted models (seepredict_trans_pot())