The discovery service is an application for dataset discovery with three components:

- It exposes a series of services via a REST API.
- It automatically ingests newly added datasets using a scheduler implemented with Celery. The scheduler can be configured using the `DATA_INGESTION_INTERVAL` variable in `.env-default`; the default value is 60 seconds.
- It provides services for Jupyter Notebook via a dedicated plugin: https://github.com/Archer6621/jupyterlab-daisy
The entire project is containerised, so the only requirement is Docker.
You can browse the full OpenAPI documentation.
The discovery service is available for both development and production.
The environment variables can be found in `.env-default`.

Always delete the auto-generated `.env` file after changing anything in `.env-default`.

- `DAISY_PRODUCTION` - `TRUE` to run in production mode, `FALSE` to run in development mode. Default: `FALSE`.
- `DATA_INGESTION_INTERVAL` - the time interval in seconds at which the auto-ingest pipeline starts. The interval should reflect how often new data is uploaded/received.
- `DATA_ROOT_PATH` - the location of the datasets.
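For reference, a `.env-default` along these lines would match the variables above; the concrete values, in particular the `DATA_ROOT_PATH`, are illustrative assumptions, not the project's actual defaults:

```shell
# .env-default — illustrative values only
# FALSE = development mode (the default)
DAISY_PRODUCTION=FALSE
# seconds between auto-ingest runs
DATA_INGESTION_INTERVAL=60
# assumed location of the datasets
DATA_ROOT_PATH=./data
```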
Run `docker_start.sh` to start the containers. Based on the `DAISY_PRODUCTION` variable, it will automatically use the appropriate docker-compose file.
Visit the API documentation via `localhost:443` once the application is up.
- Run the `/ingest-data` endpoint.
  - The data should be in the `data` folder and has to follow this structure: `{id}/resources/{file-name}.csv`
  - This endpoint will take a while to run; the more data there is to process, the longer it takes.
- Run `/filter-connections` to remove extra edges.
- Run `/purge` to remove all the data from Neo4j and Redis.
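The expected layout under the `data` folder can be sketched with a small helper; the dataset id and file name below are made up for illustration:

```python
from pathlib import Path

def add_dataset(data_root: str, dataset_id: str, csv_name: str, content: str) -> Path:
    """Place a CSV where the ingest pipeline expects it:
    {id}/resources/{file-name}.csv
    """
    resource_dir = Path(data_root) / dataset_id / "resources"
    resource_dir.mkdir(parents=True, exist_ok=True)
    csv_path = resource_dir / csv_name
    csv_path.write_text(content)
    return csv_path

# Hypothetical dataset: after this call, /ingest-data (or the scheduler)
# can pick up data/sales-2023/resources/transactions.csv
path = add_dataset("data", "sales-2023", "transactions.csv", "id,amount\n1,9.99\n")
```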
- Get joinable tables - get all assets that share a column (key) with the specified asset: `/get-joinable` with input `asset_id`.
- Get related assets - given a source and a target, show how and whether the assets are connected: `/get-related` with two input variables, `from_asset_id` and `to_asset_id`.
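As a sketch of how these endpoints might be called — assuming the inputs are passed as query parameters and `localhost:443` as the base URL, both of which are assumptions; check the OpenAPI documentation for the authoritative request shapes:

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:443"  # assumed base URL; see the OpenAPI docs

def get_joinable_url(asset_id: str) -> str:
    # All assets sharing a column (key) with the given asset
    return f"{BASE_URL}/get-joinable?{urlencode({'asset_id': asset_id})}"

def get_related_url(from_asset_id: str, to_asset_id: str) -> str:
    # How (and whether) two assets are connected
    params = {"from_asset_id": from_asset_id, "to_asset_id": to_asset_id}
    return f"{BASE_URL}/get-related?{urlencode(params)}"

print(get_joinable_url("dataset-42"))
```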
(Development) The following admin panels are exposed for inspecting the services:

- RabbitMQ: `localhost:15672`
- Neo4j: `localhost:7474`
- Celery Flower: `localhost:5555`
- Redis: `localhost:8001`
You can edit any Python file in the `src` folder with your favorite text editor and it will live-update while the container is running (and, in the case of the API, restart/reload automatically).
If you get an error about file sharing on Windows, visit this thread.