Harvest, normalise, and index climate / earth-system metadata from POSIX, S3/MinIO, and OpenStack Swift using configurable DRS dialects (CMIP6, CMIP5, CORDEX, …). Output to a temporary catalogue (JSONLines) and then index into systems such as Solr or MongoDB. Configuration is TOML with inheritance, templating, and computed rules.
- Define datasets + dialects in `drs_config.toml`
- `mdc add` → write a temporary catalogue (`jsonl.gz`)
- `mdc config` → inspect the (merged) crawler config
- `mdc walk-intake` → inspect the content of an intake catalogue
- `mdc <backend> index` → push records from the catalogue into your index backend
- `mdc <backend> delete` → remove records by facet match
- Multi-backend discovery: POSIX, S3/MinIO, Swift (async REST), Intake
- Two-stage pipeline: crawl → catalogue then catalogue → index
- Schema driven: strong types (e.g. `string`, `datetime[2]`, `float[4]`, `string[]`)
- DRS dialects: packaged CMIP6/CMIP5/CORDEX; build your own via inheritance
- Path specs & data specs: parse directory/filename parts and/or read dataset attributes/vars
- Special rules: conditionals, cache lookups and function calls (e.g. CMIP6 realm, time aggregation)
- Index backends: MongoDB (Motor), Solr
- Sync + Async APIs and a clean CLI
- Docs: Sphinx with `pydata_sphinx_theme`
pip install metadata-crawler
conda install -c conda-forge metadata-crawler
# 1) Crawl → write catalogue
mdc add \
cat.yaml \
--config-file drs_config.toml \
--dataset cmip6-fs,obs-fs \
--threads 4 --batch-size 100
# 2) Index from catalogue → Solr (or Mongo)
mdc solr index \
cat.yaml \
  --server localhost:8983
# 3) Delete by facets (supports globs on values)
mdc solr delete \
--server localhost:8983 \
--facets "file *.nc" --facets "project CMIP6"
Note: The CLI is a custom framework inspired by Typer (not Typer itself). Use `--help` on any subcommand to see all options.
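To peek at the crawled records before indexing, note that the temporary catalogue is gzipped JSONLines, one record per line. A minimal reading sketch (the file name and record keys here are assumptions, not part of the package's API):

```python
import gzip
import json

# Hypothetical path: wherever `mdc add` wrote the temporary catalogue.
with gzip.open("metadata-catalogue.jsonl.gz", "rt", encoding="utf-8") as fh:
    for line in fh:
        record = json.loads(line)  # one metadata record per line
        print(record.get("file"), record.get("variable"))
```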
# === Canonical schema ===
[drs_settings.schema.file]
key = "file"
type = "path"
required = true
indexed = true
unique = true
[drs_settings.schema.uri]
key = "uri"
type = "uri"
required = true
indexed = true
[drs_settings.schema.variable]
key = "variable"
type = "string[]"
multi_valued = true
indexed = true
[drs_settings.schema.time]
key = "time"
type = "datetime[2]" # [start, end]
indexed = true
default = []
[drs_settings.schema.bbox]
key = "bbox"
type = "float[4]" # [W,E,S,N]
default = [0, 360, -90, 90]
# === Dialect: CMIP6 (example) ===
[drs_settings.dialect.cmip6]
sources = ["path","data"] # path | data | storage
defaults.grid_label = "gn"
specs_dir = ["mip_era","activity_id","institution_id","source_id","experiment_id","member_id","table_id","variable_id","grid_label","version"]
specs_file = ["variable_id","table_id","source_id","experiment_id","member_id","grid_label","time"]
[drs_settings.dialect.cmip6.special.realm]
type = "method"
method = "_get_realm"
args = ["table_id","variable_id","__file_name__"]
[drs_settings.dialect.cmip6.special.time_aggregation]
type = "method"
method = "_get_aggregation"
args = ["table_id","variable_id","__file_name__"]
# === Dialect: CORDEX (bbox by domain) ===
[drs_settings.dialect.cordex]
sources = ["path","data"]
specs_dir = ["project","product","domain","institution","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","variable","version"]
specs_file = ["variable","domain","driving_model","experiment","ensemble","rcm_name","rcm_version","time_frequency","time"]
[drs_settings.dialect.cordex.special.bbox]
type = "call"
method = "dialect['cordex']['domains'].get('{{domain | upper }}', [0,360,-90,90])"
[drs_settings.dialect.cordex.domains]
EUR-11 = [-44.14, 64.40, 22.20, 72.42]
AFR-44 = [-24.64, 60.28, -45.76, 42.24]
# === Datasets ===
[cmip6-fs]
root_path = "/data/model/global/cmip6"
drs_format = "cmip6" # dialect name
fs_type = "posix"
[cmip6-s3]
root_path = "s3://test-bucket/data/model/global/cmip6"
drs_format = "cmip6"
fs_type = "s3"
storage_options.endpoint_url = "http://127.0.0.1:9000"
storage_options.aws_access_key_id = "minioadmin"
storage_options.aws_secret_access_key = "minioadmin"
storage_options.region_name = "us-east-1"
storage_options.url_style = "path"
storage_options.use_ssl = false
[obs-fs]
root_path = "/arch/observations"
drs_format = "custom"
# define your specs_dir/specs_file or inherit from another dialect
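Since the configuration is plain TOML, you can sanity-check it with the standard library before handing it to the crawler. A quick inspection sketch using only `tomllib` (Python 3.11+); note that this reads the raw file and does not resolve inheritance or templating the way `mdc config` does:

```python
import tomllib  # Python 3.11+ standard library

with open("drs_config.toml", "rb") as fh:
    cfg = tomllib.load(fh)

# Dialects live under drs_settings.dialect; other top-level tables are datasets.
dialects = cfg["drs_settings"]["dialect"]
datasets = [name for name in cfg if name != "drs_settings"]
print("dialects:", sorted(dialects))  # ['cmip6', 'cordex']
print("datasets:", datasets)          # ['cmip6-fs', 'cmip6-s3', 'obs-fs']

# The CORDEX domain lookup used by the bbox special rule:
print(dialects["cordex"]["domains"]["EUR-11"])  # [-44.14, 64.4, 22.2, 72.42]
```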
Each canonical facet describes:
- `key`: where to read the value (`"project"`, `"variable"`, …)
- `type`: `string`, `integer`, `float`, `datetime`, with arrays like `float[4]`, `string[]`, `datetime[2]`, or special types like `file`, `uri`, `fs_type`, `dataset`, `fmt`
- Flags: `required`, `default`, `indexed`, `unique`, `multi_valued`
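Put together, a single record conforming to the canonical schema above might look like this (an illustrative sketch; the values are invented):

```python
# Illustrative only: keys follow the canonical schema, values are made up.
record = {
    "file": "/data/cmip6/tas_Amon_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_185001-186912.nc",
    "uri": "file:///data/cmip6/tas_Amon_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_185001-186912.nc",
    "variable": ["tas"],                                     # string[] -> list of strings
    "time": ["1850-01-16T12:00:00", "1869-12-16T12:00:00"],  # datetime[2] -> [start, end]
    "bbox": [0, 360, -90, 90],                               # float[4] -> [W, E, S, N]
}
```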
A dialect tells the crawler how to interpret paths and read data:
- `sources`: which sources to consult (`path`, `data`, `storage`), in priority order
- `specs_dir` / `specs_file`: ordered facet names encoded in directory and file names
- `data_specs`: pull values from dataset content (attrs/variables); supports `__variable__` and templated specs
- `special`: computed fields (`conditional` | `method` | `function`)
- Optional lookups (e.g., CORDEX `domains` for bbox)
- Path specs parse ordered segments from the path, e.g. `/project/product/institute/model/experiment/.../variable_time.nc` (see the sketch below)
- Data specs read from the dataset itself (e.g. global or variable attributes read via xarray, per-variable stats). Example: gather all variables via `__variable__`, then read their units with a templated selector.
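As a mental model, parsing a path spec amounts to zipping the configured facet names with the path segments. A toy sketch against the CMIP6 `specs_dir` from the config above (this is not the crawler's actual implementation; the example path is invented):

```python
from pathlib import PurePosixPath

specs_dir = ["mip_era", "activity_id", "institution_id", "source_id",
             "experiment_id", "member_id", "table_id", "variable_id",
             "grid_label", "version"]

path = PurePosixPath(
    "CMIP6/CMIP/MPI-M/MPI-ESM1-2-LR/historical/r1i1p1f1/Amon/tas/gn/v20190710/"
    "tas_Amon_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_185001-186912.nc"
)

# Directory segments map 1:1 onto specs_dir; the final segment is the file
# name, which specs_file parses separately.
facets = dict(zip(specs_dir, path.parts[:-1]))
print(facets["variable_id"], facets["version"])  # tas v20190710
```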
Create new dialects/datasets by inheriting:
[drs_settings.dialect.reana]
inherits_from = "cmip5"
sources = ["path","data"]
[drs_settings.dialect.reana.data_specs.read_kws]
engine = "h5netcdf"
import asyncio
from metadata_crawler.run import async_add, async_index, async_delete
async def main():
# crawl → catalogue
await async_add(
"cat.yaml",
config_file="drs_config.toml",
dataset_names=["cmip6-fs"],
threads=4,
batch_size=100,
)
# index → backend
await async_index(
"solr",
"cat.yaml",
config_file="drs_config.toml",
server="localhost:8983",
)
# delete by facets
await async_delete(
config_path="drs_config.toml",
index_store="solr",
facets=[("file", "*.nc")],
)
asyncio.run(main())
from metadata_crawler import add

add(
    store="cat.yaml",
    config_file="drs_config.toml",
    dataset_names=["cmip6-fs"],
)
- MongoDB (Motor): upserts by unique facet (e.g., `file`), bulk deletes (glob → regex)
- Solr: fields align with the managed schema; supports multi-valued facets
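The glob → regex translation for bulk deletes can be reproduced with the standard library. A hedged sketch of how a facet glob might become a MongoDB filter (the exact filter shape used by the backend is an assumption):

```python
import fnmatch

def facet_filter(facet: str, pattern: str) -> dict:
    """Translate a shell-style glob into a Mongo regex filter (illustrative)."""
    return {facet: {"$regex": fnmatch.translate(pattern)}}

print(facet_filter("file", "*.nc"))
# {'file': {'$regex': '(?s:.*\\.nc)\\Z'}}
```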
Development install:
git clone https://github.com/freva-org/metadata-crawler.git
cd metadata-crawler
pip install -e .
PRs and issues welcome. Please add tests and keep examples minimal & reproducible (use the MinIO compose stack). Run:
python -m pip install tox
tox -e test,lint,types
For benchmarking you can create a directory tree with roughly 1.5 M files by calling the `create-cordex.sh` script in the `dev-env` folder:
./dev-env/create-cordex.sh
python dev-env/benchmark.py --max-files 20000
See `code-of-conduct.rst` and `whatsnew.rst` for guidelines and the changelog.
Use MinIO or LocalStack via `docker-compose` and seed a bucket (e.g., `test-bucket`). Then point a dataset's `fs_type = "s3"` and set its `storage_options`.
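One way to seed such a bucket from Python, using boto3 with the MinIO credentials from the `cmip6-s3` dataset example above (the bucket layout and file names are illustrative):

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://127.0.0.1:9000",  # MinIO from the compose stack
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
    region_name="us-east-1",
)
s3.create_bucket(Bucket="test-bucket")
s3.upload_file(
    "tas_Amon_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_185001-186912.nc",  # local file
    "test-bucket",
    "data/model/global/cmip6/tas_Amon_MPI-ESM1-2-LR_historical_r1i1p1f1_gn_185001-186912.nc",
)
```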
Built with Sphinx + `pydata_sphinx_theme`. Build locally:
tox -e docs