Database Backend

Overview

The analytic workflow is centrally managed around a relational database. The database will be a DuckDB file that includes the tables described below.

Tip

Managing model state is, in theory, possible by using declarative configuration files. During early development of the package, it became clear that configuration files would need to be hyperspecific to adequately describe all assumptions that go into a specific analysis. Adequately implementing e.g. data ingestion for multiple sources would lead to a massive scope expansion of the package. Hence, there is not currently a concept of declarative configuration in this database. This may need to be revised.

Tables

reporting_t: Key-value store for report metadata
coords_t: Landscape tessellation (base grid, polygons, centroids)
periods_t: Time periods metadata
runs_t: Scenario/run hierarchy for multi-run simulations
lulc_data_t + lulc_meta_t: LULC data and metadata
pred_data_t_* + pred_meta_t: Predictor data (typed) and metadata
trans_meta_t: Transition metadata, indicating the viability of modelling them from a statistical or conceptual point
trans_preds_t: Transitions and their associations to predictors (m to n relation)
trans_rates_t: Transition rates by period
intrv_masks_t + intrv_meta_t: Interventions (masks, values) with their parameters (struct/json)
- Examples: Forced assignment of deglaciating areas, reduced likelihood of transition in protected areas
trans_models_t: Storing model metadata and model objects
alloc_params_t: Storing the allocation parametrization
neighbors_t: Neighbor relationships between coordinates

erDiagram
    reporting_t {
        string key PK
        string value
    }
    coords_t {
        int id_coord PK
        float lon
        float lat
        float elevation
        polygon geom_polygon
    }
    periods_t {
        int id_period PK
        date start_date
        date end_date
        bool is_extrapolated
    }
    runs_t {
        int id_run PK
        int parent_id_run FK
        string description
    }
    lulc_meta_t {
        int id_lulc PK
        string name
        string pretty_name
        string description
        list src_classes
    }
    lulc_data_t {
        int id_run FK
        int id_coord FK
        int id_lulc FK
        int id_period FK
    }
    pred_meta_t {
        int id_pred PK
        string name
        string pretty_name
        string description
        string orig_format
        list sources
        string unit
        map factor_levels
    }
    pred_data_t {
        int id_run FK
        int id_pred FK
        int id_coord FK
        int id_period FK
        value value
    }
    trans_meta_t {
        int id_trans PK
        int id_lulc_anterior FK
        int id_lulc_posterior FK
        int cardinality
        float frequency_rel
        float frequency_abs
        bool is_viable
    }
    trans_preds_t {
        int id_pred PK_FK
        int id_trans PK_FK
    }
    trans_rates_t {
        int id_period PK_FK
        int id_trans PK_FK
        float rate
    }
    intrv_meta_t {
        int id_intrv PK
        list id_period_list
        list id_trans_list
        bool pre_allocation
        string name
        string pretty_name
        string description
        list sources
        map params
    }
    intrv_masks_t {
        int id_intrv PK_FK
        int id_coord PK_FK
    }
    trans_models_t {
        int id_trans FK
        string model_family
        map model_params
        map goodness_of_fit
        string fit_call
        blob model_obj_part
        blob model_obj_full
    }
    alloc_params_t {
        int id_run PK_FK
        int id_trans PK_FK
        float mean_patch_size
        float patch_size_variance
        float patch_isometry
        float frac_expander
        float frac_patcher
    }
    neighbors_t {
        int id_coord_origin PK_FK
        int id_coord_neighbor PK_FK
        float distance
        factor distance_class
    }

    lulc_data_t zero or more to 1 coords_t : "id_coord"
    lulc_data_t zero or more to 1 lulc_meta_t : "id_lulc"
    lulc_data_t zero or more to 1 periods_t : "id_period"
    lulc_data_t zero or more to 1 runs_t : "id_run"

    pred_data_t zero or more to 1 pred_meta_t : "id_pred"
    pred_data_t zero or more to 1 coords_t : "id_coord"
    pred_data_t zero or more to 1 periods_t : "id_period"
    pred_data_t zero or more to 1 runs_t : "id_run"

    trans_meta_t zero or more to 1 lulc_meta_t : "id_lulc_anterior"
    trans_meta_t zero or more to 1 lulc_meta_t : "id_lulc_posterior"

    trans_preds_t zero or more to 1 pred_meta_t : "id_pred"
    trans_preds_t zero or more to 1 trans_meta_t : "id_trans"

    trans_rates_t zero or more to 1 periods_t : "id_period"
    trans_rates_t zero or more to 1 trans_meta_t : "id_trans"

    intrv_masks_t zero or more to 1 intrv_meta_t : "id_intrv"
    intrv_masks_t zero or more to 1 coords_t : "id_coord"

    trans_models_t zero or more to 1 trans_meta_t : "id_trans"

    alloc_params_t zero or more to 1 trans_meta_t : "id_trans"
    alloc_params_t zero or more to 1 runs_t : "id_run"

    neighbors_t zero or more to 1 coords_t : "id_coord_origin"
    neighbors_t zero or more to 1 coords_t : "id_coord_neighbor"

    runs_t zero or more to 1 runs_t : "parent_id_run"

`reporting_t`

This table is a simple key/value (varchar) store for reporting metadata. It contains information such as author, datetime, scenario title, etc.

pk	colname	type	description
*	key	str	Unique key for the entry
	value	str	Value associated with the key

Default keys set on DB initialization:

report_name: Short name for the report (default: "evoland_scenario")
report_name_pretty: Display name for the report
report_include_date: Whether to include date in reports
creator_username: Username of the creator
last_opened: Timestamp of last access
last_opened_username: Username of last accessor

`coords_t`

pk	colname	type	description
*	id_coord	int	Unique ID for each reference coordinate pair
	lon	float	aka x coord. TODO find CRS storage location
	lat	float	aka y coord. TODO find CRS storage location
	elevation	float, nullable	aka z coord. TODO find CRS storage location
	geom_polygon	polygon, nullable	If the point describes a surface, its polygon

The coords_t table specifies the basic coordinates at which the model is set up and run. Because these coordinates will often also refer to a surface, a polygon describing each surface may be stored alongside the lat/lon coordinates. This table is either constructed based on reference grid specification (e.g. existing LULC data) or derived from parameters (e.g. a hexagonal grid with edge length S and origin at (P, Q)).

Additional attributes are stored as table metadata (via setattr):

epsg: EPSG code for the coordinate reference system
xmin, xmax, ymin, ymax: Bounding box extent
resolution: Grid resolution (for regular grids)

`periods_t`

pk	colname	type	description
*	id_period	int	Unique ID for each period
	start_date	date	Inclusive boundary for start
	end_date	date	Exclusive boundary for end
	is_extrapolated	bool	False if observation, true if extrapolated

This table identifies periods in the past and future. It is populated using a start and end date for the observation period, and an end date for the extrapolation, plus a time step length. Each period is left-inclusive, i.e. in case of doubt assigned to the anterior class in a transition, not the posterior. For static predictors (variables that do not change over time), a "0 period" is used in the database to indicate their timeless nature, starting and ending at the end of the observed periods. This allows static predictors to be stored and referenced consistently alongside time-varying predictors, even though they do not correspond to a specific time period.

`runs_t`

pk	colname	type	description
*	id_run	int	Unique ID for each run (0 = base/unperturbed)
	parent_id_run	int, nullable	FK to runs_t for hierarchical scenarios
	description	str	Human-readable description of the run

This table supports multi-run simulations and scenario management. Run ID 0 is reserved for the base/unperturbed scenario. Child runs can reference a parent run to establish scenario hierarchies: the most specific available set of data is used. For instance, if a predictor value is not available for a specific run, the system falls back to the parent run's value, continuing up the hierarchy until a value is found or the base run is reached.

`lulc_data_t` + `lulc_meta_t`

lulc_meta_t describes metadata for each Land Use / Land Cover class.

pk	colname	type	description
*	id_lulc	int	Unique ID for each land use class
	name	str	Name for use in code and queries, e.g. `forest_dense`
	pretty_name	str	Long name for plots/output e.g. Dense/Old Growth Forest
	description	str	Long description / operationalisation
	src_classes	list	List of source class IDs that map to this class

The src_classes column allows mapping from source data class IDs to the harmonized LULC classes used in the model.

lulc_data_t indicates that at a given place and time, a certain land use is detected/projected. To transform these sparse data for a given id_period to a dense matrix, you would join the id_lulc to the coords_t. If coords_t represents a dense raster, you can reshape the 1D-vector. If coords_t is irregular, you can rasterize according to your raster specs.

uniqueness	colname	type	description
*	id_run	int, optional	FK from `runs_t` (0 for base)
*	id_coord	int	FK from `coords_t`
	id_lulc	int	FK from `lulc_meta_t`
*	id_period	int	FK from `periods_t`

The table is partitioned by (id_run, id_period) for efficient storage and retrieval.

Some observational datasets, e.g. the Swiss Surface Statistics, may have exact dates associated with individual id_coord, id_lulc combinations. If desired, these data should be added as predictors - e.g. in "number of years before or after the id_period".

`pred_data_t_*` + `pred_meta_t`

pred_meta_t describes metadata for each predictor.

pk	colname	type	description
*	id_pred	int (autoincrement)	Unique ID for each predictor
	name	str (unique key)	Name for use in code and queries, e.g. `distance_road`
	pretty_name	str	Name of the predictor, e.g. Distance To Closest Road
	description	str, nullable	Long description / operationalisation
	orig_format	str, nullable	Orig. format: "polygons, annual", "100m raster, daily"
	sources	list of structs (url, md5sum)	Where the data were fetched (maybe validate URL)
	unit	str, nullable	SI-compatible unit, nullable for categorical values
	factor_levels	map, nullable	{int: str} map to recover factor levels

pred_data_t is not actually a single table: multiple tables are employed to efficiently store different data types:

pred_data_t_float: Floating-point predictor values
pred_data_t_int: Integer predictor values (also holds factor values, see factor_levels above)
pred_data_t_bool: Boolean predictor values

uniqueness	colname	type	description
*	id_run	int, optional	FK from `runs_t` (0 for base)
*	id_pred	int	FK from `pred_meta_t`
*	id_coord	int	FK from `coords_t`
*	id_period	int	FK from `periods_t`
	value	float, int, bool	Predictor value (type-specific)

The tables are partitioned by (id_run, id_period) for efficient storage and retrieval.

`trans_meta_t`

Relates land use classes to each other (henceforth transitions), including statistics about their frequency in the original data and, essentially, an indicator on whether they are viable for modelling. Example of non-viability: A model may contain an "other" class that is valuable for a simplified representation of what an initial class might transition into. Since this "other" class holds no information on what class may come next, the "other" class could only increase in area, but never decrease. Another exemption for the viability of modelling a transition is because the set of observations may be too limited for statistical inference.

Note

The previous generation of evoland considered the viability of modelling a transition for each id_period. By tying the identity of a transition to a (training) period, we lose out on the data provided by previous periods. The identification of a transition with a period needs to be discussed.

pk	uniqueness	colname	type	description
*		id_trans	int (autoincrement)	Unique ID for each transition
	*	id_lulc_anterior	int	FK from `lulc_meta_t`, id before transition
	*	id_lulc_posterior	int	FK from `lulc_meta_t`, id after transition
		cardinality	int	How many times this transition occurred
		frequency_rel	float	Frequency of this transition in relation to all transitions in this timestep
		frequency_abs	float	Frequency of this transition in relation to all coordinates
		is_viable	bool	See explanation

`trans_preds_t`

An m to n relation enumerates all the predictors useful for modelling a particular transition. This table is populated by a feature selection step that precedes the actual model training. In the absence of a feature selection step, it would default to the cartesian product of predictors and transitions.

uniqueness	colname	type	description
*	id_pred	int	FK from `pred_meta_t`
*	id_trans	int	FK from `trans_meta_t`

`trans_rates_t`

Stores transition rates (probabilities) for each transition type in each time period. Historical rates are calculated from observed transitions, and future rates can be extrapolated using linear regression.

uniqueness	colname	type	description
*	id_period	int	FK from `periods_t`
*	id_trans	int	FK from `trans_meta_t`
	rate	float	Transition rate (0 to 1, non-negative)

`intrv_masks_t` + `intrv_meta_t`

Tables containing auxiliary information for interventions, that is: manipulations of LULC transition potential, or outright overrides of LULC change predictions. intrv_meta_t describes metadata for each intervention.

pk	colname	type	§§
*	id_intrv	int	Unique ID for each intervention
	id_period_list	list(int)	Associated periods for intervention
	id_trans_list	list(int)	Associated transitions for intervention
	pre_allocation	bool	Whether intervention occurs before allocation step
	name	str	Name for use in code and queries, e.g. `pa_expansion`
	pretty_name	str	Name of the intervention, e.g. Protected Area Expansion
	description	str	Long description / operationalisation
	sources	list of structs (url, md5sum)	Where the data were fetched (maybe validate URL)
	params	map	Intervention parameters, structured as key-value pairs

intrv_masks_t is used to store intervention masks, defining which coordinate pairs are affected by each intervention. The presence of (id_intrv, id_coord) indicates a positive mask value, i.e. a given coordinate pair is affected by a given intervention.

uniqueness	colname	type	description
*	id_intrv	int	FK from `intrv_meta_t`
*	id_coord	int	FK from `coords_t`

`trans_models_t`

Transition models: For each viable id_trans, there may be multiple models in this table. The table is populated in the following manner:

For each transition, split into a training and validation set
Try out many model specifications, write model and metadata to this table
Identify best performing models (group by id_trans; sort by goodness_of_fit['some_metric']; limit to 1)
Refit best models on full data

uniqueness	colname	type	description
*	id_trans	int	FK from `trans_meta_t`
	model_family	str	may also be enum; e.g., `rf`, `glm_binomial`, `ranger`
	model_params	map	Map of model (hyper) parameters
	goodness_of_fit	map	Map of various measures of fit, e.g. ROC AUC
*	fit_call	str	Deparsed fit function call for reproducibility
	model_obj_part	blob	BLOB of serialized model object used to establish goodness of fit
	model_obj_full	blob	BLOB of serialized model object used in extrapolation phase

The fit_call column stores a deparsed R function that can be reconstructed using str2lang() and called with different data to reproduce the model fitting process.

`alloc_params_t`

uniqueness	colname	type	description
*	id_run	int	FK from `runs_t` (0 = unperturbed base estimate)
*	id_trans	int	FK from `trans_meta_t`
	mean_patch_size	float	Mean area of new patches (in cell units)
	patch_size_variance	float	Standard deviation of patch area
	patch_isometry	float	Measure of patch shape regularity (Dinamica patcher parameter)
	frac_expander	float	Fraction of transition cells adjacent to existing patches [0, 1]
	frac_patcher	float	Fraction of transition cells forming new patches [0, 1]

The allocation strategy (e.g. establishment of new patches of a particular LULC, expansion of existing patches) stores parameters in this table. The table supports multiple perturbed versions of parameters for sensitivity analysis, with id_run = 0 representing the unperturbed best estimate.

The table is partitioned by id_run for efficient storage.

`neighbors_t`

Stores neighbor relationships between coordinates, computed using a spatial hash map for efficiency.

uniqueness	colname	type	description
*	id_coord_origin	int	FK from `coords_t` (origin coordinate)
*	id_coord_neighbor	int	FK from `coords_t` (neighbor coordinate)
	distance	float	Distance between origin and neighbor
	distance_class	factor, optional	Factor indicating distance class (if breaks provided)

This table is populated by create_neighbors_t() which accepts a max_distance parameter and optional distance_breaks for classifying neighbors into distance bands.

Views

These may be employed once it emerges what data specific processing steps require.

Notable derived tables (not persisted, computed on demand):

trans_v: Transitions view computed from consecutive LULC observations
trans_pred_data_v: Wide-format predictor data for transition modeling
pred_data_wide_v: Wide-format predictor data for a given period
trans_pot_t: Transition potential computed from fitted models (see predict_trans_pot())

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Database Backend

Overview

Tables

`reporting_t`

`coords_t`

`periods_t`

`runs_t`

`lulc_data_t` + `lulc_meta_t`

`pred_data_t_*` + `pred_meta_t`

`trans_meta_t`

`trans_preds_t`

`trans_rates_t`

`intrv_masks_t` + `intrv_meta_t`

`trans_models_t`

`alloc_params_t`

`neighbors_t`

Views

Uh oh!

Uh oh!

Clone this wiki locally

Database Backend

Overview

Tables

reporting_t

coords_t

periods_t

runs_t

lulc_data_t + lulc_meta_t

pred_data_t_* + pred_meta_t

trans_meta_t

trans_preds_t

trans_rates_t

intrv_masks_t + intrv_meta_t

trans_models_t

alloc_params_t

neighbors_t

Views

Uh oh!

Uh oh!

Clone this wiki locally

`reporting_t`

`coords_t`

`periods_t`

`runs_t`

`lulc_data_t` + `lulc_meta_t`

`pred_data_t_*` + `pred_meta_t`

`trans_meta_t`

`trans_preds_t`

`trans_rates_t`

`intrv_masks_t` + `intrv_meta_t`

`trans_models_t`

`alloc_params_t`

`neighbors_t`