This repository hosts all the code required to develop/maintain/update the orphadata API.
The API:

- provides programmatic access to Orphadata products. Data are updated monthly.
- is hosted on a Gandi Simple Hosting instance.
- makes use of an Elasticsearch server instance to store and query data.
- is served by the Python Flask framework. Flask is also used as a proxy to query the Elasticsearch server.
- comes with a documentation that follows the OpenAPI v3 specification and has been generated with Swagger.
Only free orphadata are consumed through the API. These products are:
- product1: Rare diseases and alignment with terminologies and databases
- product3: Clinical classifications of rare diseases
- product4: Phenotypes associated with rare diseases
- product6: Genes associated with rare diseases
- product7: Linearisation of rare diseases
- product9_prev: Epidemiology of rare diseases
- product9_ages: Natural history of rare diseases
Below is a global tree view of the repository.
```
.
├── api/
├── datas/
├── README.md
├── requirements.txt
├── static/
└── wsgi.py
```
The repository is made of two independent parts:
- api/, a folder containing all code relative to the flask implementation of the API.
- datas/, a folder containing scripts and modules relative to the processing of data (download, conversion and elastic injection). This folder is used to get the orphadata and store them in an elasticsearch instance that is queried by the API.
There is also:
- static/, a folder containing all static files used to serve the API. The only reason this folder is not in api/ is related to the way the Gandi server instance accesses static files.
- requirements.txt, a file used to install all required python libraries.
- wsgi.py, a python script used to run the application.
All this code has been developed and tested:
- on a UNIX-like OS (but WSL(2) on Windows should be fine)
- with python >= 3.8
You will also need:
- a local elasticsearch instance (version 7.* - for development & test purposes)
- access to the AWS elasticsearch instance
- access to the Gandi host server
Although not strictly necessary, setting up a virtual environment is highly recommended: it will allow you to work on different projects without conflicts between python package versions.
First install virtualenv:
```
python3 -m pip install virtualenv
```

You can then generate a virtual environment as follows:
```
virtualenv -p python3 .env
```

This command will create a directory called .env/ based on the latest python3 version you have on your system. This directory has the following structure:
```
.env/
├── bin/
└── lib/python3.8/site-packages/
```
In order to make your virtual environment active, you have to type:
```
source .env/bin/activate
```

Once activated, your shell command line should be preceded by the name of your environment in parentheses.
From now on, every python package that will be installed with the pip install command will be stored in .env/lib/python3.8/site-packages/. If some binary scripts come with a given package, they will be stored in .env/bin/
If you want to get out of your virtual environment, simply type deactivate.
First clone the whole project:
```
git clone https://github.com/Orphanet/API_Orphadata.git
```

Once downloaded, make sure your virtual environment is activated and type the following commands:
```
cd API_Orphadata
pip install -r requirements.txt
```

Both the flask application and the data processing scripts make use of environment variables:
| Name | Accepted values | Default value | Role |
|---|---|---|---|
| FLASK_ENV | test, dev or production | production | Only used by the flask API application |
| DATA_ENV | remote or local | local | Only used for the full data update process: datas/src/orphadata_update.py |
The FLASK_ENV variable defines the configuration object (see api/config.py) used to configure the instantiated Flask application (through the config_name parameter of the factory function create_app() in api/__init__.py).
Basically, if:

- FLASK_ENV=test, the application will
  - have the DEBUG and TESTING variables set to True
  - connect to a local elasticsearch instance (http://localhost:9200/)
- FLASK_ENV=dev, the application will
  - have the DEBUG variable set to True
  - connect to the remote AWS elasticsearch instance (see below)
- FLASK_ENV=production, the application will
  - have the DEBUG variable set to False
  - connect to the remote AWS elasticsearch instance (see below)
The DATA_ENV variable is used by datas/src/orphadata_update.py to define which elasticsearch instance data will be injected into. If:

- DATA_ENV=remote, data will be injected into the remote AWS elasticsearch instance
- DATA_ENV=local, data will be injected into the local elasticsearch instance (http://localhost:9200/)
Those two environment variables can be set on a UNIX-based system as follows:

```
export FLASK_ENV=dev
export DATA_ENV=local
```

In case a variable has not been set, its default value is used.
Access to the remote AWS elasticsearch instance (case where FLASK_ENV=dev|production and DATA_ENV=remote) requires its URL and associated login credentials.
To avoid writing sensitive information in the source code, python-dotenv is used to read credentials from environment variables. Unlike the previous ones, the environment variables relative to the remote elasticsearch access must be stored in a file arbitrarily called .varenv.
Create a file .varenv at the root of this repository and write into it the following variables:
```
ELASTIC_URL=the_elastic_url_marc_gives_you
ELASTIC_USER=the_associated_elastic_user_id
ELASTIC_PASS=the_associated_elastic_password
```

Please note that this file should never be shared or made accessible, so don't forget to add it to your .gitignore if not already present. Moreover, since this file must be present on the Gandi server instance, you will have to upload it to the remote server.
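As an illustration, here is a minimal sketch of how these variables can be read with python-dotenv. The explicit .varenv path passed to load_dotenv is an assumption; the actual code may locate the file differently.

```
import os

from dotenv import load_dotenv

# Load the variables defined in .varenv into the environment
# (path assumed relative to the repository root).
load_dotenv(".varenv")

elastic_url = os.getenv("ELASTIC_URL")
elastic_user = os.getenv("ELASTIC_USER")
elastic_pass = os.getenv("ELASTIC_PASS")
```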
Let's assume at this stage that you have not set up a local elasticsearch instance. The application will need to access the remote elasticsearch instance, so you'll need:

- the .varenv file correctly set up
- the FLASK_ENV variable set to either production or dev, or left unset (the default is production)

You can then simply type python wsgi.py to run the application. It should now be accessible locally in your browser at http://127.0.0.1:5000.
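As a quick sanity check, here is a sketch using only the standard library: the root URL is expected to serve the Swagger UI, so a plain GET request should return HTTP 200 once `python wsgi.py` is running.

```
from urllib.request import urlopen

# A 200 response on the root URL means the application is up and running.
with urlopen("http://127.0.0.1:5000/") as response:
    print(response.status)  # expected: 200
```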
To deploy the application, if not already done, you'll first need to add the remote repository related to the Gandi host server to your git configurations:
git remote add gandi git+ssh://5815773@git.sd5.gpaas.net/default.git
Once added, you will be able (after having committed your changes, if any) to push your branch of interest as follows:

```
git push gandi your_branch_name
```

then enter the password to access the Gandi server instance.
Next, you'll need to add the .varenv file directly in the root directory of the repository on the Gandi server instance (/lamp0/web/vhosts/default/). Note that this file was not pushed with the source code by the preceding command git push gandi your_branch_name since the file must be in .gitignore. To add it to the Gandi server instance, you can use one of the recommended sFTP clients.
You can also connect to the instance with a command line. First make sure you are located where your .varenv file is on your local repository. Then type the following:
```
# connect to the gandi instance and go to /lamp0/web/vhosts/default/ (you'll have to enter your password)
sftp 5815773@sftp.sd5.gpaas.net:/lamp0/web/vhosts/default/

# you should see sftp>. Now you can place your local .varenv file on the remote instance you are connected to
put .varenv

# quit the instance
exit
```

Now the Gandi remote repository can be deployed with the following command:
```
ssh 5815773@git.sd5.gpaas.net deploy default.git your_branch_name
```

then enter the password to access the Gandi server instance.
After having made some changes on a given branch, you can update Marc's remote repository with the following:
```
git commit -am "Comment your changes"
git push origin your_branch_name
```

This section describes how to retrieve orphadata and inject them into an elasticsearch instance.
Source code dedicated to processing data is located in the datas/ folder:
```
API_Orphadata/datas/src/
├── lib
├── orphadata_download.py
├── orphadata_generic.py
├── orphadata_injection.py
├── orphadata_update.py
└── orphadata_xml2json.py
```
Orphadata in XML format can simply be retrieved as follows:
```
python datas/src/orphadata_download.py
```

This command will write all retrieved XML files into API_Orphadata/datas/xml_data/.
NOTE
Each time orphadata are updated, a JSON file is generated for each product. This JSON file contains detailed information about its related product, such as its size, its languages, and the URL where it can be accessed. All orphadata products are therefore retrieved in XML format from the URLs given in each of these product-related JSONs.
The URLs of these JSONs are stored in the PATH_PRODUCTS_INFOS variable found in datas/src/lib/config.py:
- product1: http://www.orphadata.org/cgi-bin/free_product1_cross_xml.json
- product3: http://www.orphadata.org/cgi-bin/free_product3_class.json
- product4: http://www.orphadata.org/cgi-bin/free_product4_hpo.json
- product6: http://www.orphadata.org/cgi-bin/free_product6_genes.json
- product7: http://www.orphadata.org/cgi-bin/free_product7_linear.json
- product9_prev: http://www.orphadata.org/cgi-bin/free_product9_prev.json
- product9_ages: http://www.orphadata.org/cgi-bin/free_product9_ages.json
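For illustration, here is a minimal sketch (standard library only) of how one of these product-info JSONs can be fetched and inspected; the exact keys holding the XML download URLs are not detailed here and may vary between products.

```
import json
from urllib.request import urlopen

# One of the product-info JSONs listed above (product1).
URL = "http://www.orphadata.org/cgi-bin/free_product1_cross_xml.json"

with urlopen(URL) as response:
    product_info = json.load(response)

# Inspect the metadata (size, languages, XML download URLs, ...)
print(json.dumps(product_info, indent=2))
```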
Before being injected into an elasticsearch instance, data must be parsed and written in an elasticsearch compatible JSON format.
The conversion of XML orphadata into elasticsearch compatible JSON files is done with the following command:
```
python datas/src/orphadata_xml2json.py
```

This command will write all JSON files into API_Orphadata/datas/json_data/.
Now that we have JSON files, we can inject them into an elasticsearch instance.
If you are using your local elasticsearch instance make sure that it is running and accessible at localhost:9200:
```
nche@orphanet13:~$ curl localhost:9200
{
  "name" : "orphanet13",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "7ODmxFEVQh2-bQS3qWFHLg",
  "version" : {
    "number" : "7.17.0",
    "build_flavor" : "default",
    "build_type" : "deb",
    "build_hash" : "bee86328705acaa9a6daede7140defd4d9ec56bd",
    "build_date" : "2022-01-28T08:36:04.875279988Z",
    "build_snapshot" : false,
    "lucene_version" : "8.11.1",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
```
To inject your data, type the following command:
```
# for local elasticsearch instance (default ES url: http://127.0.0.1:9200)
python datas/src/orphadata_injection.py -url local

# for remote elasticsearch instance (reads .varenv to access ES url)
python datas/src/orphadata_injection.py -url remote
```

This will create an elastic index named according to each JSON file and prefixed with orphadata_ (except for orphadata_en_product3.json which already contains the prefix). For instance, the elastic index for en_product1.json will be orphadata_en_product1.
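To give an idea of what the injection does, here is a hypothetical sketch using the official elasticsearch Python client. The json_data folder layout and the one-list-of-documents-per-file assumption are illustrative only; the actual orphadata_injection.py may proceed differently.

```
import json
from pathlib import Path

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

JSON_DIR = Path("datas/json_data")  # output folder of orphadata_xml2json.py


def index_name(json_file: Path) -> str:
    """Derive the elastic index name from the JSON filename (naming rule described above)."""
    stem = json_file.stem  # e.g. "en_product1" -> "orphadata_en_product1"
    return stem if stem.startswith("orphadata_") else f"orphadata_{stem}"


def inject(es: Elasticsearch, json_file: Path) -> None:
    """Bulk-inject all documents of a JSON file into its index."""
    with json_file.open() as f:
        documents = json.load(f)  # assumption: one list of documents per file
    actions = ({"_index": index_name(json_file), "_source": doc} for doc in documents)
    bulk(es, actions)


es = Elasticsearch("http://localhost:9200")  # local instance
for json_file in sorted(JSON_DIR.glob("*.json")):
    inject(es, json_file)
```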
You can check that those indices are now stored on your elasticsearch instance:
```
# for local elasticsearch instance
curl localhost:9200/_cat/indices

# for remote elasticsearch instance
curl --user $ELASTIC_USER:$ELASTIC_PASS $ELASTIC_URL:9243/_cat/indices
```

NOTE
The orphadata_injection.py script can also be used with parameters:
```
(.env) nche@orphanet13:~/projects/API_Orphadata$ python datas/src/orphadata_injection.py -h
usage: orphadata_injection.py [-h] [-path [PATH]] [-match [MATCH]] [-index [INDEX]] [-url [{local,remote}]] [--print]

Bulk inject ORPHADATA products in ES

optional arguments:
  -h, --help            show this help message and exit
  -path [PATH]          Path or filename of JSON file(s)
  -match [MATCH]        String used to filter JSON filenames matching it (only if -path is a directory)
  -index [INDEX]        Name of the index to create
  -url [{local,remote}]
                        ES URL type: either 'local' or 'remote'
  --print               Print path of JSON files that will be processed
```

While running the individual steps described above can be useful for development purposes, the whole process has been automated for production purposes. First you have to set the environment variable DATA_ENV to choose your elasticsearch instance:
- export DATA_ENV=local for your local elasticsearch instance
- export DATA_ENV=remote for the remote elasticsearch instance (requires .varenv to be set up too)

Note that instead of declaring the variable in a terminal, you can alternatively define it in the .varenv file.
The following command will execute sequentially steps 1, 2 and 3 in one shot:
```
python datas/src/orphadata_update.py
```

The API uses an interface based on the OpenAPI specification. For this, Flask reads, through Connexion, a swagger.yaml file containing the specifications of all available requests:
```
# API_Orphadata/api/__init__.py

def create_app(config_name):
    options = {'swagger_url': '/'}
    app = connexion.App(__name__, specification_dir='./swagger/', options=options)
    app.add_api('swagger.yaml', arguments={'title': 'API Orphadata'}, pythonic_params=True)
    ...
```

If you need to update the specifications (to update/remove/add an operation), it is recommended to do it through the following workflow:
- go to API_Orphadata/api/swagger/
- make your modifications in the template _swagger_template.yaml
- update swagger.yaml from the template: swagger-cli bundle _swagger_template.yaml -t yaml -o swagger.yaml
Why? Simply to avoid manually writing, for each request, the schema describing its response.
In case you need to add a new operation, first follow the main workflow described above. When adding the specification of the new operation, you don't need to specify the schema describing the response.
After having updated the swagger.yaml from the template, you can automatically build the description of the schema response of this new operation from the response itself:
- open API_Orphadata/datas/src/lib/json2yaml.py
- add to the REQ list variable the following key-value pairs, with values related to the new operation:

```
REQ = [
    {
        "url": "/rd-new-operation",  # relative path of your new operation (as specified in swagger.yaml)
        "yaml_outfile": SCHEMAS_PATH / "tag-name-new-operation" / "_descr-name.yaml"  # absolute location of the output file that will contain the response schema
    },
    ...
]
```

- save and close the file
- make sure your flask server runs locally (-> http://localhost:5000)
- generate the schema response: python API_Orphadata/datas/src/lib/json2yaml.py
- add in _swagger_template.yaml a pointer to the specification describing the response schema of the new operation:

```
responses:
  "200":
    description: Successful operation
    content:
      application/json:
        schema:
          $ref: './schemas/tag-name-new-operation/_descr-name.yaml'
```

- update swagger.yaml from the template: swagger-cli bundle _swagger_template.yaml -t yaml -o swagger.yaml
"""
Helper script used to build a swagger compatible schema description
of responses from the defined requests.
The script requires the API running on the local server (see API_ROOT variable)
to make the call to each requests defined in the 'REQ' variable.
The 'REQ' variable contains the list of all requests that will be called.
For each request, the response in JSON format is converted in a
swagger-compatible YAML format that will be used to describe/display
the schema of the response. Please note that not all the content of
the response is converted, only the minimim useful information (e.g.
only the 1st element of lists is converted).
"""