The Beast is an experimental, flexible, declarative-oriented toolkit for reading machine-readable data from various sources and transforming it into follow-the-money (FTM) entities.
~~Do not rely on this one until it is out of alpha. Everything is very volatile.~~
~~The Beast is currently in beta and is quite stable. While we can foresee some changes to the mapping format to allow for better flexibility, we are slow to implement them, and we are cautious.~~
The Beast is battle-tested. Complete documentation is available here.
The FTM proposal: alephdata/followthemoney#717
A sample mapping with tons of comments to help you understand the idea better (beware, it's just an example and the format is subject to change): https://github.com/dchaplinsky/thebeast/blob/main/thebeast/tests/sample/mappings/ukrainian_mps.yaml
A validator for the mappings in JSON Schema format (again, a work in progress with tons of comments): https://github.com/dchaplinsky/thebeast/blob/main/thebeast/conf/mapping_validator.json
The first proposal of the mapping (obsolete, but it can give you a better idea): https://gist.github.com/dchaplinsky/8021b530ea7e44c9443afcc3318042fd
To do
- Ingest from databases (MongoDB, PostgreSQL) using SQLAlchemy or Peewee
- Tests for the databases ingest
- Basic CLI
- Signals on exceptions and a policy for incorrectly parsed entity values (drop, drop all, drop entity, reraise)
- Tests for the signals
- Stats collector (number of signals of each type, number of invalid entities, etc)
- Packaging (partially done in the `packaging_and_spark_integration` branch)
- Documentation (@legless, your notes will be very valuable)
- Advanced ingest routines: regex validation to discard values that do not pass the test?
- Tests for the resolver wrappers
Done
- Basic ingest for json/jsonlines/csv, both local and remote, compressed or not, singular or multiple files
- Tests for the basic ingest
- Mapping reader
- Tests for mapping reader
- Basic digest routines
- Tests for basic digest routines
- Advanced ingest routines: constant entities (think Country or Organization)
- Advanced ingest routines: backreferencing (think talking from subcollections to parent items)
- Advanced ingest routines: nested collections (think parsing involved JSON)
- Advanced ingest routines: templates (think combining fields when setting the entity field)
- Advanced ingest routines: multiple values for the entity property
- Advanced ingest routines: split string into multiple values
- Advanced ingest routines: full entity validation and red/green sorting
- Advanced ingest routines: augmentations/transformations
- Advanced ingest routines: record transformations
- Tests for record transformations
- Tests for the individual resolvers
- Tests for digest routines
- Advanced digest routines: multiprocessing
- Tests for advanced digest routines
- Basic dump routines (stdout/files)
- Basic dump routines: statements
- Tests for basic dump routines
- Tests for basic dump routines: statements
- Remove inflate/deflate and pass dicts rather than entities between digest and dump
- Python 3.11 support (https://github.com/dchaplinsky/thebeast/actions/runs/3802499820/jobs/6468041810, ICRAR/ijson#80)
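
Two of the routines listed above, templates and splitting a string into multiple values, are easy to picture with a small example. The following is only an illustrative sketch in plain Python and not the Beast's actual API or mapping syntax: the helper names `split_values` and `render_template`, the source record, and its field names are all assumptions made up for this example.

```python
from typing import Dict, List


def split_values(raw: str, separator: str = ",") -> List[str]:
    """Split one source string into several property values, dropping blanks."""
    return [part.strip() for part in raw.split(separator) if part.strip()]


def render_template(template: str, record: Dict[str, str]) -> str:
    """Combine several record fields into a single property value."""
    return template.format(**record)


# Hypothetical source record; the field names are invented for this sketch.
record = {
    "first_name": "Ada",
    "last_name": "Lovelace",
    "aliases": "A. Lovelace; Countess of Lovelace",
}

# A plain-dict stand-in for an FTM Person entity (not built with followthemoney).
person = {
    "schema": "Person",
    "properties": {
        # Template: two source fields combined into one property value.
        "name": [render_template("{first_name} {last_name}", record)],
        # Split: one source string turned into multiple property values.
        "alias": split_values(record["aliases"], separator=";"),
    },
}

print(person)
```

In the Beast itself these behaviours are configured declaratively in the mapping file; the commented sample mapping linked above shows the real syntax.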
To install the dependencies and run the tests locally:

```
pip install -r requirements.txt
python -m pytest
```

The /bin/ directory contains scripts to run Beast inside a Docker container.
Use /bin/run data/mapping.yaml to run Beast with the selected mapping.
Note: the mapping and source file(s) must be inside the Beast root directory or one of its subdirectories, e.g. ./data/mapping.yaml
You can't point Beast to a file outside its root directory.
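
If you want to catch this mistake before starting the container, a pre-flight check along these lines may help. This is a minimal sketch and not part of Beast: the function name `is_inside_root` is hypothetical, and it assumes you pass it the path to your Beast checkout as `root`.

```python
from pathlib import Path


def is_inside_root(candidate: str, root: str = ".") -> bool:
    """Return True if `candidate` resolves to a location inside `root`."""
    root_path = Path(root).resolve()
    # Relative paths are interpreted relative to the root; absolute paths
    # replace it entirely, which is exactly what we want to detect.
    candidate_path = (root_path / candidate).resolve()
    return candidate_path == root_path or root_path in candidate_path.parents


if __name__ == "__main__":
    for path in ("data/mapping.yaml", "../elsewhere/mapping.yaml"):
        verdict = "ok" if is_inside_root(path) else "outside the Beast root"
        print(f"{path}: {verdict}")
```

Resolving both paths first means that `..` segments and symlinks are taken into account rather than compared as raw strings.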
Use /bin/tests to run tests.
Use /bin/black to format the source files with black before submitting a pull request.