
Commit 6a466c6

FrancoisZim and Silva authored
CloudDB extractor: Add support for Redis, Postgres, MySQL (#35)
* Refactored to support DBAPIv2 for simple DB implementations
* Additional logging and timing; fixed DBAPI bug in bigquery_extractor
* Added extractors for Postgres and MySQL:
  - config.yml: added keys for the DB properties used to establish the connection
  - base_extractor.py: added an extra "sql_query" parameter to the export_load function so a SQL query can be passed through to "query_to_hyper_files"
  - extractor_cli.py: added the new extractors to the EXTRACTORS list, added defaults for the DB properties (read from config.yml or the CLI), and implemented reading a SQL file argument and passing it to export_load
* Refactored base_extractor; moved connector args to config.yml; added Redshift support
* Better arg checking, bug fixes in the delete method, and doc updates
* Fixed bug in conditional delete
* Moved match_conditions_json to input from file; better logging
* Fixed introspection error for server-side cursors
* Fixed SQL identifier quoting/parsing

Co-authored-by: Silva <christian.sila.r@gmail.com>
1 parent c22f638 commit 6a466c6

File tree

10 files changed: +1308 −669 lines


Community-Supported/clouddb-extractor/README.md

Lines changed: 141 additions & 16 deletions
@@ -11,21 +11,26 @@ Cloud Database Extractor Utility - This sample shows how to extract data from a
  This package defines a standard Extractor Interface which is extended by specific implementations
  to support specific cloud databases. For most use cases you will probably only ever call the
  following methods:
- * __load_sample__ - Loads a sample of rows from source_table to Tableau Server
- * __export_load__ - Bulk export the contents of source_table and load to a Tableau Server
- * __append_to_datasource__ - Appends the result of sql_query to a datasource on Tableau Server
- * __update_datasource__ - Updates a datasource on Tableau Server with the changeset from sql_query
- * __delete_from_datasource__ - Delete rows matching the changeset from a datasource on Tableau Server. Simple delete by condition when sql_query is None
+ * __load_sample__ - Used during testing - extracts a sample of rows from the source table to a new published datasource
+ * __export_load__ - Used for the initial load - full extract of the source table to a new published datasource
+ * __append_to_datasource__ - Appends rows from a query or table to an existing published datasource
+ * __update_datasource__ - Updates an existing published datasource with the changeset from a query or table
+ * __delete_from_datasource__ - Deletes rows from a published datasource that match a condition and/or the primary keys in the changeset from a query or table

  For a full list of methods and args see the docstrings in the BaseExtractor class.

  ## Contents
- * __base_extractor.py__ - provides an Abstract Base Class with some utility methods to extract from cloud databases to "live to hyper" Tableau Datasources. Database-specific Extractor classes extend this to manage queries, exports and schema discovery via the database vendor supplied client libraries.
+ * __base_extractor.py__ - provides an Abstract Base Class with some utility methods to extract from cloud databases to "live to hyper" Tableau Datasources. Database-specific Extractor classes extend this to manage connections and schema discovery,
+ and may override the generic query processing methods based on DBAPIv2 standards with database-specific optimizations.
  * __bigquery_extractor.py__ - Google BigQuery implementation of Base Hyper Extractor ABC
- * __config.yml__ - Defines site defaults for the extractor_cli utility
+ * __config.yml__ - Defines site defaults for the extractor utility
  * __extractor_cli.py__ - Simple CLI wrapper around the Extractor classes
+ * __mysql_extractor.py__ - MySQL implementation of Base Hyper Extractor ABC
+ * __postgres_extractor.py__ - PostgreSQL implementation of Base Hyper Extractor ABC
+ * __README.md__ - This file
+ * __redshift_extractor.py__ - AWS Redshift implementation of Base Hyper Extractor ABC
  * __requirements.txt__ - List of third party python library dependencies
- * __restapi_helpers.py__ - Helper functions for REST operations that are not yet available in the standard tableauserverclient libraries (e.g. PATCH for update/upsert). Once these get added to the standard client libraries this module will be refactored out.
+ * __tableau_restapi_helpers.py__ - Helper functions for REST operations that are not yet available in the standard tableauserverclient libraries (e.g. PATCH for update/upsert). Once these get added to the standard client libraries this module will be refactored out.

  ## CLI Utility
  We suggest that you import one of the Extractor implementations and call it directly; however, we've included a command line utility to illustrate the key functionality:
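
The README above suggests importing one of the Extractor implementations and calling it directly. As a rough illustration of what that might look like, here is a hypothetical sketch: the class name, constructor arguments, and exact method signatures below are assumptions for illustration only, so consult the BaseExtractor docstrings for the real interface.

```python
# Hypothetical usage sketch; the class name, constructor arguments and method
# signatures are assumptions, not the actual API. See the BaseExtractor
# docstrings for the real interface.
from postgres_extractor import PostgresExtractor  # assumed class name

extractor = PostgresExtractor(
    source_database_config={"host": "postgres.test", "dbname": "dev"},  # assumed shape
    tableau_hostname="https://tableau.example.com",  # assumed parameter
    tableau_site_id="",                              # assumed parameter
    tableau_project="HyperExtracts",                 # assumed parameter
)

# Initial load: full extract of the source table to a new published datasource.
extractor.export_load(source_table="staging.orders", tab_ds_name="Orders")

# Append the result of a SQL query to the published datasource.
extractor.append_to_datasource(
    sql_query="SELECT * FROM staging.orders WHERE load_date = CURRENT_DATE",
    tab_ds_name="Orders",  # assumed argument name
)
```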
@@ -91,8 +96,17 @@ For latest instructions refer to: [Tableau Server Client Libraries](https://tabl
  pip install tableauserverclient
  ```

- ## Install and configure Google Cloud SDK and BigQuery Client Libraries
- ### Google Cloud SDK Configuration Notes
+ ## Install third party python library dependencies
+ From the directory where you extracted the hyper api samples execute the following:
+ ```console
+ cd hyper-api-samples/Community-Supported/clouddb-extractor
+ pip install -r requirements.txt
+ ```
+
+ ## Google BigQuery Configuration
+ The following steps are required if using bigquery_extractor.
+
+ ### Google Cloud SDK
  Install Google Cloud SDK using these instructions: https://cloud.google.com/sdk/docs/install#deb

  At the end of the installation process above you will need to run `gcloud init` to configure account credentials and cloud environment defaults. Ensure that credentials and default compute zone/regions are defined correctly by reviewing the output or using `gcloud auth list` and `gcloud config configurations list`.
@@ -129,6 +143,16 @@ Install BigQuery Python Client Libraries using these instructions: https://githu
  ```console
  pip install google-cloud-bigquery
  ```
+ ### (Optional) Install additional Google libraries if USE_DBAPI=True
+ By default we use bigquery.table.RowIterator to handle queries; however,
+ bigquery_extractor can be configured to use the DBAPIv2 libraries, and you may find that this gives performance advantages for some datasets as it can take advantage of the Storage Read API.
+
+ To use this option you will need to install the following additional libraries:
+
+ ```console
+ pip install google-cloud-bigquery-storage
+ pip install pyarrow
+ ```
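
For context on what the optional DBAPIv2 path involves, the sketch below queries BigQuery through google.cloud.bigquery.dbapi, the interface that can take advantage of the Storage Read API once google-cloud-bigquery-storage and pyarrow are installed. It illustrates the standard DBAPIv2 cursor flow against a public dataset, not the extractor's internal code.

```python
# Minimal DBAPIv2 sketch against BigQuery. This illustrates the interface the
# extractor can switch to; it is not the extractor's own implementation.
from google.cloud import bigquery
from google.cloud.bigquery import dbapi

client = bigquery.Client()    # uses the credentials configured via gcloud
conn = dbapi.connect(client)  # DBAPIv2 connection wrapping the client

cursor = conn.cursor()
cursor.execute(
    "SELECT name, number "
    "FROM `bigquery-public-data.usa_names.usa_1910_2013` LIMIT 10"
)
for row in cursor.fetchall():  # fetches can use the Storage Read API when available
    print(row)

cursor.close()
conn.close()
```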

  ### Cloud API Access
  In testing we used service account credentials in a GCP Compute Engine VM to invoke all required cloud service APIs. In order for these utilities to work you will need to enable the following API Access Scopes for your VM:
@@ -139,11 +163,112 @@ For more details refer to: https://cloud.google.com/compute/docs/access/create-e

  Alternatively, a best practice is to set the cloud-platform access scope on the instance, then securely limit the service account's API access with IAM roles.

- ## Install third party python library dependencies
- From the directory where you extracted the hyper api samples execute the following:
+ ## AWS Redshift Configuration
+ The following steps are required if using redshift_extractor.
+
+ ### Install and configure AWS Cloud SDK
+ Install Boto3 using these instructions: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html
+
  ```console
- cd hyper-api-samples/Community-Supported/clouddb-extractor
- pip install -r requirements.txt
+ pip install boto3
+ ```
+
+ Install the AWS CLI using these instructions: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html
+
+ Configure access credentials and defaults:
+
+ ```console
+ aws configure
  ```
- ## Configure cloud and tableau environment defaults
- Edit `config.yml` to define your environment defaults
+ ### Install Redshift Client libraries
+ Install the Redshift connector using these instructions:
+ https://github.com/aws/amazon-redshift-python-driver
+
+ ```console
+ pip install redshift_connector
+ ```
+
+ ### Authentication and Database configuration
+ All connection parameters are defined in the redshift.connection section in `config.yml`, for example:
+
+ ```yaml
+ redshift: #Redshift configuration defaults
+   connection:
+     host : 'redshift-cluster-1.xxxxxxxxx.eu-west-1.redshift.amazonaws.com'
+     database : 'dev'
+     user : 'db_username'
+     password : 'db_password'
+ ```
+
+ If you are using IAM for authentication instead of username/password then you should follow the instructions here:
+ - [Options for providing IAM credentials](https://docs.aws.amazon.com/redshift/latest/mgmt/options-for-providing-iam-credentials.html)
+ - [Redshift Connector Python Tutorial](https://github.com/aws/amazon-redshift-python-driver/blob/master/tutorials/001%20-%20Connecting%20to%20Amazon%20Redshift.ipynb)
+
+ After you have configured your IAM roles etc. in the AWS Management Console you will need to specify additional parameters in the redshift.connection section in `config.yml`, i.e.:
+
+ ```yaml
+ # Connects to a Redshift cluster using IAM credentials from the default profile defined in ~/.aws/credentials
+ redshift: #Redshift configuration defaults
+   connection:
+     iam : True
+     database : 'dev'
+     db_user : 'awsuser'
+     password : ''
+     user : ''
+     cluster_identifier : 'examplecluster'
+     profile : 'default'
+ ```
+
+ Other options for federated API access using external identity providers are discussed in the following blog: https://aws.amazon.com/blogs/big-data/federated-api-access-to-amazon-redshift-using-an-amazon-redshift-connector-for-python/
+
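
To show how the redshift.connection keys above map onto the client library, here is a minimal sketch that opens a connection with redshift_connector and runs a probe query. Passing the config section straight through as keyword arguments is an assumption for illustration, not something this diff confirms about the extractor.

```python
# Minimal redshift_connector sketch mirroring the redshift.connection keys above.
# Passing the config section straight to connect() is an assumption for illustration.
import redshift_connector

connection_params = {
    "host": "redshift-cluster-1.xxxxxxxxx.eu-west-1.redshift.amazonaws.com",
    "database": "dev",
    "user": "db_username",
    "password": "db_password",
}

conn = redshift_connector.connect(**connection_params)
cursor = conn.cursor()
cursor.execute("SELECT current_database(), current_user")
print(cursor.fetchone())
cursor.close()
conn.close()
```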
+ ## MySQL Configuration
+
+ ### Install MySQL Client Libraries
+ Install the MySQL connector using these instructions: https://dev.mysql.com/doc/connector-python/en/connector-python-installation-binary.html
+
+ ```console
+ pip install mysql-connector-python
+ ```
+
+ ### Authentication and Database configuration
+ All connection parameters are defined in the mysql.connection section in `config.yml`, for example:
+
+ ```yaml
+ mysql: #MySQL configuration defaults
+   connection:
+     host : "mysql.test"
+     database : "dev"
+     port : 3306
+     username : "test"
+     password : "password"
+     raise_on_warnings : True
+ ```
+ Database connection configuration options are documented here: https://dev.mysql.com/doc/connector-python/en/connector-python-connectargs.html
+
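
Similarly, the mysql.connection keys map onto MySQL Connector/Python's connect() arguments; per the connector's documented argument list, `username` is a synonym for `user`. Unpacking the config section directly into connect() is again an assumption for illustration.

```python
# Minimal MySQL Connector/Python sketch mirroring the mysql.connection keys above.
# Unpacking the config section into connect() is an assumption for illustration;
# "username" is a documented synonym for the "user" argument.
import mysql.connector

connection_params = {
    "host": "mysql.test",
    "database": "dev",
    "port": 3306,
    "username": "test",
    "password": "password",
    "raise_on_warnings": True,
}

conn = mysql.connector.connect(**connection_params)
cursor = conn.cursor()
cursor.execute("SELECT DATABASE(), CURRENT_USER()")
print(cursor.fetchone())
cursor.close()
conn.close()
```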
+ ## PostgreSQL Configuration
+
+ ### Install PostgreSQL Client Libraries
+ Install Psycopg using these instructions: https://www.psycopg.org/docs/install.html
+
+ ```console
+ pip install psycopg2-binary
+ ```
+
+ ### Authentication and Database configuration
+ All connection parameters are defined in the postgres.connection section in `config.yml`, for example:
+
+ ```yaml
+ postgres: #PostgreSQL configuration defaults
+   connection:
+     dbname : "dev"
+     username : "test"
+     password : "password"
+     host : "postgres.test"
+     port : 5432
+ ```
+ Database connection configuration options are documented here:
+ - https://www.psycopg.org/docs/module.html?highlight=connect#psycopg2.connect
+ - https://www.postgresql.org/docs/current/libpq-connect.html#LIBPQ-PARAMKEYWORDS
+
273+
# Configuration
274+
All configuration defaults are loaded from the config.yml file.
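
As a closing illustration of how such defaults might be consumed, the sketch below reads config.yml with PyYAML and pulls one of the connection sections shown above. This loader is hypothetical and is not the utility's actual implementation.

```python
# Hypothetical sketch of reading connection defaults from config.yml.
# The key layout follows the examples above; the utility's real loader may differ.
import yaml  # PyYAML

with open("config.yml") as f:
    config = yaml.safe_load(f)

# e.g. pull the PostgreSQL connection defaults defined above
postgres_connection = config["postgres"]["connection"]
print(postgres_connection["host"], postgres_connection["port"])
```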
