A consolidated framework for AI and ML components developed for the new portal.
To run the app locally:
- Go to the root folder of this project.
- Create a `.env` file by copying `.env.sample` and fill in your keys.
- Add the following (with your real keys):

  ```
  API_KEY="your_actual_api_key_here"
  OPENAI_API_KEY="your_actual_openai_api_key_here"
  PROFILE="your_actual_environment_here"
  ```

  If you plan to train models, also include:

  ```
  ES_ENDPOINT="your_actual_elasticsearch_endpoint"
  ES_API_KEY="your_actual_es_api_key"
  ```

- Run:

  ```
  ./startServer.sh
  ```

  Then visit http://localhost:8000.
- Test app health at http://localhost:8000/api/v1/ml/health
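To confirm the server is healthy from a script rather than a browser, here is a minimal sketch (it assumes the `requests` package is installed; `curl http://localhost:8000/api/v1/ml/health` works just as well):

```python
import requests

# Query the health-check route of the locally running server.
resp = requests.get("http://localhost:8000/api/v1/ml/health", timeout=5)
print(resp.status_code, resp.text)
```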
- Install Conda (if not already installed): follow the instructions at Conda Installation.
- Create the Conda virtual environment:

  ```
  conda env create -f environment.yml
  ```
Poetry is used for dependency management. The pyproject.toml file is the most important piece: it orchestrates the project and its dependencies.
You can add or remove dependencies in pyproject.toml by using:

```
poetry add <pypi-dependency-name>     # e.g. poetry add numpy
poetry remove <pypi-dependency-name>  # e.g. poetry remove numpy
```

If you modify pyproject.toml manually, update the poetry.lock file with the `poetry lock` command. To update all dependencies, use the `poetry update` command.
- Activate the Conda virtual environment:

  ```
  conda activate data-discovery-ai
  ```

- Install environment dependencies:

  ```
  # after cloning the repo with the git clone command
  cd data-discovery-ai
  poetry install
  ```
FastAPI runs internal checks before serving `/process_record` API calls. These checks include:
- ✅ Required model resource files must be present in `data_discovery_ai/resources/`.
- ✅ A valid `OPENAI_API_KEY` must be in `.env` unless you're in the `development` environment.
- ✅ If `PROFILE=development`, Ollama must be running locally at http://localhost:11434.
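If `PROFILE=development`, a quick way to verify the Ollama requirement before starting the server is to probe its default local port. A minimal sketch (not part of the codebase; it only assumes Ollama's default endpoint):

```python
import urllib.request

# Ollama's root path answers with a short "Ollama is running" message
# when the local service is up on its default port.
try:
    with urllib.request.urlopen("http://localhost:11434", timeout=3) as resp:
        print("Ollama reachable, HTTP", resp.status)
except OSError as exc:
    print("Ollama not reachable:", exc)
```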
To use the Llama3 model locally without OpenAI:
- Go to the Ollama download page and download the version that matches your operating system (Windows, Linux, or macOS).
- After installation, start Ollama either by launching the app or running the following command:

  ```
  ollama serve
  ```

- Pull the "llama3" model used for local development:

  ```
  ollama pull llama3
  ```

- (Optional) Consider installing Open WebUI to test Llama3 locally through a user-friendly interface:

  ```
  docker run -d --network=host -v open-webui:/app/backend/data -e PORT=8090 -e OLLAMA_BASE_URL=http://127.0.0.1:11434 --name open-webui ghcr.io/open-webui/open-webui:main
  ```
Once the Open WebUI container is running, open your browser and go to: http://localhost:8090.
Simply run:

```
python -m data_discovery_ai.server
```

The server will run at http://localhost:8000.
Run all tests with:
```
poetry run python -m unittest discover -s tests
```

Run manual checks:

```
pre-commit run --all-files
```

Checks are also executed when you run `git commit`. The configurations for the pre-commit hooks are defined in .pre-commit-config.yaml.
We are using gitmoji (optional) with husky and commitlint. Here is an example of the most used ones:
- 🎨 - Improving structure/format of the code.
- ⚡️ - Improving performance.
- 🔥 - Removing code or files.
- 🐛 - Fixing a bug.
- 🚑 - Critical hotfix.
- ✨ - Introducing new features.
- 📝 - Adding or updating documentation.
- 🚀 - Deploying stuff.
- 💄 - Updating the UI and style files.
- 🎉 - Beginning a project.
Example of use:
:wrench: add husky and commitlint config
- hotfix/: for quickly fixing critical issues, usually with a temporary solution
- bugfix/: for fixing a bug
- feature/: for adding, removing or modifying a feature
- test/: for experimenting with something which is not an issue
- wip/: for a work in progress

Add the issue id after the /, followed by an explanation of the task.
Example of use:
feature/5348-create-react-app
Once the app is running, the following routes are available:

| Route | Description |
|---|---|
| `GET /api/v1/ml/health` | Health check |
| `POST /api/v1/ml/process_record` | Single entry point for calling AI models to process a metadata record |
| `DELETE /api/v1/ml/delete_doc` | Deletes a document from the AI-related Elasticsearch index. Requires query parameter `doc_id`. |
Example request body for `POST /api/v1/ml/process_record`:

```
{
    "selected_model": ["description_formatting"],
    "uuid": "test",
    "title": "test title",
    "abstract": "test abstract"
}
```

Required header:

```
X-API-Key: your_api_key
```

(Must match the value of API_KEY specified in the environment variables.)
AI Model Options
`selected_model`: the AI models provided by data-discovery-ai. It should be a list of strings, which are the names of the AI task agents. Currently, four AI task agents are available for distinct tasks:

- `keyword_classification`: predicts keywords from AODN vocabularies based on metadata `title` and `abstract` with a pretrained ML model.
- `delivery_classification`: predicts data delivery mode based on metadata `title`, `abstract`, and `lineage` with a pretrained ML model.
- `description_formatting`: reformats a long abstract into Markdown format based on metadata `title` and `abstract` with the LLM "gpt-4o-mini".
- `link_grouping`: categorises links into four groups: ["Python Notebook", "Document", "Data Access", "Other"] based on metadata `links`.
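As an illustration of the request shape above, here is a minimal sketch of calling the endpoint from Python (it assumes the server is running locally and that the `requests` package is available; the response format depends on the selected agents):

```python
import requests

API_URL = "http://localhost:8000/api/v1/ml/process_record"
headers = {"X-API-Key": "your_api_key"}  # must match API_KEY in .env

payload = {
    "selected_model": ["description_formatting"],
    "uuid": "test",
    "title": "test title",
    "abstract": "test abstract",
}

# POST the metadata record to the single processing entry point.
resp = requests.post(API_URL, json=payload, headers=headers, timeout=60)
resp.raise_for_status()
print(resp.json())
```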
Currently, two machine learning pipelines are available for training and evaluating models:
- `keyword`: keyword classification model, a Sequential model for a multi-label classification task
- `delivery`: data delivery classification model, a self-learning model for a binary classification task
To run one of the pipelines (for example, the keyword one), you can use the following command in your terminal:
```
python -m data_discovery_ai.ml.pipeline --pipeline keyword --use_cached_raw False --start_from_preprocess True --model_name experimental
```

You can also use a shorter version:

```
python -m data_discovery_ai.ml.pipeline -p keyword -r False -s True -n experimental
```

If the raw data has changed (e.g., updated, cleaned, or expanded), it is recommended to re-train the model using the latest data. To do this, set:

```
--use_cached_raw False --start_from_preprocess True
```

As mentioned in Environment variables, the Elasticsearch endpoint and API key are required to be set up in the .env file.
Running a pipeline trains a machine learning model and saves several resource files for reuse, so you don't have to reprocess data or retrain the model every time.
For caching purposes, the raw data, which contains all collections from OGCAPI, is saved in `data_discovery_ai/resources/raw_data.pkl`.
- `delivery` pipeline: outputs are saved in `data_discovery_ai/resources/DataDeliveryModeFilter/`

  | File Name | Description |
  |---|---|
  | `filter_preprocessed.pkl` | Preprocessed data used for training and testing |
  | `development.pkl` | Trained binary classification model |

- `keyword` pipeline: outputs are saved in `data_discovery_ai/resources/KeywordClassifier/`

  | File Name | Description |
  |---|---|
  | `keyword_sample.pkl` | Preprocessed data used for training and testing |
  | `keyword_label.pkl` | Mapping between labels and internal IDs |
  | `development.keras` | Trained Keras model file (name set by --model_name) |
These files are generated and saved automatically when a pipeline runs. They are intended for reuse in subsequent runs to avoid retraining.
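As an illustration of that reuse, the keyword pipeline's outputs listed above could be reloaded instead of retrained. A minimal sketch (it assumes TensorFlow/Keras is installed and uses the file names from the table; the exact contents of the pickles are not documented here):

```python
import pickle
from pathlib import Path

from tensorflow import keras

RESOURCE_DIR = Path("data_discovery_ai/resources/KeywordClassifier")

# Reload the trained Keras model saved by the keyword pipeline.
model = keras.models.load_model(RESOURCE_DIR / "development.keras")
model.summary()

# Reload the preprocessed samples and the label/ID mapping saved alongside it.
with open(RESOURCE_DIR / "keyword_sample.pkl", "rb") as f:
    keyword_sample = pickle.load(f)
with open(RESOURCE_DIR / "keyword_label.pkl", "rb") as f:
    keyword_label = pickle.load(f)
```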
The --model_name argument helps organise different versions of your model. Here's how they're typically used:
| Name | Purpose | When to Use |
|---|---|---|
| `development` | Active model development | For testing and iterating on new ideas |
| `experimental` | Try new techniques or tuning | For exploring new features or architectures |
| `benchmark` | Compare against the current baseline model | When validating improvements over a previous version |
| `staging` | Pre-production readiness | When testing full integration before final deployment |
| `production` | Final production model | Live version used in production APIs or systems |
Tip: When working locally, use `--model_name experimental` to avoid overwriting files used in deployments.
Each model name reflects a stage in the model lifecycle:
- Development
  - Initial model design and prototyping
  - Reaches minimum performance targets with stable training
- Experimental
  - Shows consistent performance improvements
  - Experiment logs and results are clearly documented
- Benchmark
  - Outperforms the existing benchmark (usually a copy of the production model)
  - Validated using selected evaluation metrics
- Staging
  - Successfully integrated with application components (e.g. APIs)
  - Ready for deployment, pending final checks
- Production
  - Deployed in a live environment
  - Monitored continuously, supports user feedback and live data updates
In the configuration file data_discovery_ai/common/config-*.yaml, where * specifies the environment in use, you can specify which model version each task should use. For example:

```
model:
  delivery_classification:
    pretrained_model: development
```

This means the agent handling the delivery_classification task will use the development version of the model.
We use MLflow to track model training and performance over time (hyperparameters, accuracy, precision, etc.).
MLflow tracking starts automatically with the pipeline. Once a pipeline is running, you can open the tracking dashboard in your browser: http://127.0.0.1:53000
You can change the model's training settings (like how long it trains or how fast it learns) by editing the trainer section in the file: data_discovery_ai/config/parameters.yaml
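To check which trainer settings are currently in effect, you could read the file directly. A minimal sketch (it assumes PyYAML is available and that `trainer` is a top-level section; the actual keys are defined in the repository's parameters.yaml, not here):

```python
import yaml

# Load the shared parameter file and show the trainer section,
# which holds training settings such as epochs and learning rate.
with open("data_discovery_ai/config/parameters.yaml") as f:
    params = yaml.safe_load(f)

print(params.get("trainer"))
```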
| File | Description |
|---|---|
| `data_discovery_ai/common/constants.py` | Shared constants |
| `data_discovery_ai/common/parameters.yaml` | Stores parameter settings for ML models and AI agents |
| `data_discovery_ai/common/config-*.yaml` | Stores parameter configuration for different environments |
```
data_discovery_ai/
├── config/        # Common utilities and shared configurations/constants used across modules
├── enum/          # Enums to use
├── core/          # Core logic of the application such as API routes
├── agents/        # Task-specific agent modules using ML/AI/rule-based tools
├── ml/            # Machine learning models: training, inference, evaluation logic
├── utils/         # Utility functions and helper scripts for various tasks
├── resources/     # Stored assets such as pretrained models, sample datasets, and other resources required for model inference
├── notebooks/     # Jupyter notebooks
├── tests/         # Unit tests for validating core components
│   ├── agents
│   ├── ml
│   └── utils
└── server.py      # FastAPI application entry point
```