This repository supports the development of differentially private federated machine learning on the CanDIG data services.
- Pull submodule updates
  - The `federated-learning` repository relies on various other repositories to provide the backend data services, interfaces, and training data. From the root of this repo, pull these repositories with `git submodule update --init --recursive`.
- Configure environment
  - The `docker-compose.yml` file expects a `.env` file in the root folder so that it can configure the Katsu database with secrets such as the password. For a generic configuration, run the following to copy and use the default configuration: `cp .default.env .env`
- Start Experiment
  - Use a quickstart script to start your federated experiment if you have one present. Otherwise, proceed with the following (combined into a single sketch after this list):
    - Create a `docker-compose` file with `orchestration-scripts/configure_docker_compose.py`.
    - Spin up docker: `docker-compose up -d`
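Put together, a manual startup from the repository root might look like the sketch below; the arguments accepted by `configure_docker_compose.py` are not shown here, so check the script for the options it expects:

```bash
# Pull the backend data services, interfaces, and training data submodules
git submodule update --init --recursive

# Copy the default environment configuration (Katsu secrets, etc.)
cp .default.env .env

# Generate a docker-compose file (pass whatever arguments the script requires)
python orchestration-scripts/configure_docker_compose.py

# Bring up the services in the background
docker-compose up -d
```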
CanDIGv2 submodules:
- Katsu serves clinical data that the Federated-Learning services may train on and classify.
- GraphQL-interface fetches data from Katsu and serves it to the Federated-Learning services via GraphQL, greatly reducing the amount of preprocessing code required to run an experiment.
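As an illustration of why this cuts down on preprocessing, a client can request exactly the clinical fields it needs in a single call. The endpoint, port, and field names below are hypothetical and may not match the interface's actual schema:

```bash
# Hypothetical GraphQL query against the GraphQL-interface (port and fields are illustrative)
curl -s -X POST http://localhost:7999/graphql \
  -H "Content-Type: application/json" \
  -d '{"query": "{ patients { id sex cancerType } }"}'
```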
Datasets:
- The 10yrs breast cancer dataset, taken from the Synthea/CodeX synthetic patient mCODE datasets and downloadable here, is used in all of the federated learning experiments included in this repository.
- The MoHCCN-data synthetic dataset was used for some early non-federated experiments.
In the future, we would like to provide to the user:
- an assortment of federated learning experiments that can be run on the data stored in CanDIG's data services, along with
- an API for selecting the experiment and specifying its configurable parameters.
However, at this stage in development, the federated-learning services expect the user to:
- provide the source code for the experiment that they want to run, and
- run the experiment using the docker CLI.
Several example experiments are available in the experiments/synthea-breast-cancer subdirectory, or you may provide your own.
To make another experiment, create an experiment folder like the one in the experiments/mock-experiment/ subdirectory or like the synthea experiments folder. Place this experiment folder inside its own subdirectory within the top-level experiments directory (e.g. experiments/my-new-experiment/experiment/).
Ensure that your folder contains at least the six files present in the example folder:
- `__init__.py`
- `experiment.py`
- `flower_client.py`
- `get_eval_fn.py`
- `model.py`
- `settings.py`
If there are any supplementary functions required, create a helpers folder in the same subdirectory. Visit the experiments/README.md file for more information.
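One way to scaffold this layout from the repository root is sketched below; the experiment name is just an example:

```bash
# Create the nested experiment folder (the name is an example)
mkdir -p experiments/my-new-experiment/experiment
cd experiments/my-new-experiment/experiment

# Create the six required files and an optional helpers folder
touch __init__.py experiment.py flower_client.py get_eval_fn.py model.py settings.py
mkdir helpers
```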
To run the experiment, it makes the most sense to create your own quickstart.sh script that brings up the required docker services in order, with the parameters you need. For example, for the Winter 2022 Synthea federated experiment, use the following line from the root of the federated-learning directory:
`./experiments/synthea-breast-cancer/winter2022/Federated/quickstart.sh -i <INGEST-PATH> -p <PORT> -n <SITES> -r <ROUNDS> -e <PATH-TO-EXPERIMENTS-DIRECTORY>`

Perform the following to get help for the quickstart script:

`./experiments/synthea-breast-cancer/winter2022/Federated/quickstart.sh -h`

If an option is not passed, the script uses the following default values:
- no `-i`: Will not ingest data
- no `-p`: Will use port `5000`
- no `-n`: Will generate `2` client sites
- no `-r`: Will run the experiment for `100` rounds
- no `-e`: Will look for an `./experiment` directory
- no `-s`: Will not put all of the data into one dataset
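For instance, a run that ingests data from a local folder, serves on the default port, and trains two client sites for ten rounds might look like the following; the ingest and experiment paths are illustrative and should point at your own data and experiment directory:

```bash
./experiments/synthea-breast-cancer/winter2022/Federated/quickstart.sh \
  -i ./data/synthea-breast-cancer-10yrs \
  -p 5000 \
  -n 2 \
  -r 10 \
  -e ./experiments/synthea-breast-cancer/winter2022/Federated/experiment
```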
To prepare the federated-learning services for deployment, there should be a set of models (or a carefully-parametrized model-building pipeline) prepared and packaged into the docker containers. An API should be provided for selecting and running these models. This API should then be connected to Tyk.
Additionally, improvements should be made to the GraphQL-interface that the federated-learning services depend on for data. The data services (Katsu and the variants service) should ideally expose their own GraphQL APIs, eliminating the need for a GraphQL-interface. Alternatively, improvements in training speed could be achieved by refactoring the existing GraphQL-interface proof-of-concept to use concurrency, although this is not an optimal long-term solution.
The Flower framework for federated learning used by this repository will soon support communications over secure gRPC; this repository and/or its deployment should be upgraded accordingly.
The orchestration-scripts folder has a dash in its name, which makes it extremely hard to work with in Python. Consider renaming this folder with an underscore so that we can import between the files in the folder to reduce code duplication. We should also then move any unrelated code from a module into its own module (e.g. move copy_experiment_requirements from configure_docker_compose.py into its own file and then import it into the original file).
Suggestions for implementing a more thorough differential privacy setup are provided in docs/FL_differential_privacy.md.
Currently, the federated-learning services have only been run on clinical data stored in Katsu. To add more data services, add the services to the docker-compose.yml file (as well as the configure_docker_compose.py file) and add their default configuration variables to .default.env. You may also wish to add ingestion scripts to the ingestion-scripts subdirectory.
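For the configuration half of that process, a minimal sketch is shown below: it appends default variables for a hypothetical new service to `.default.env` (the variable names are invented for illustration); the matching service entry in `docker-compose.yml` and `configure_docker_compose.py` still needs to be added by hand.

```bash
# Append illustrative defaults for a hypothetical new data service to .default.env
# (variable names are made up; use whatever your docker-compose.yml service entry expects)
cat >> .default.env <<'EOF'
NEW_SERVICE_HOST=new-data-service
NEW_SERVICE_PORT=8051
EOF
```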
All experiment-specific source code should go in the experiments subdirectory.
To run experiments on a different dataset, see the experiments/mock-experiment/template-quickstarts directory for template quickstart files for both federated and differentially-private federated experiments. These templates are not mandatory to use; if they are not convenient, you may ignore or even remove them.
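Assuming hypothetical template file names, adopting a template might look like this:

```bash
# See which template quickstart files are available
ls experiments/mock-experiment/template-quickstarts/

# Copy one next to your experiment and adapt it to your dataset (file name is illustrative)
cp experiments/mock-experiment/template-quickstarts/federated_quickstart.sh \
   experiments/my-new-experiment/quickstart.sh
```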