CloudDB extractor: Add support for Redshift, Postgres, MySQL (#35)
* Refactored to support DBAPIv2 for simple database implementations
* Additional logging and timing; fixed a DBAPI bug in bigquery_extractor
* Added extractors for Postgres and MySQL:
  - config.yml: added keys for the DB connection properties used to establish the connection
  - base_extractor.py: added an extra "sql_query" parameter to the export_load function so a SQL query can be passed through to the "query_to_hyper_files" function
  - extractor_cli.py:
    - Added the new extractors to the EXTRACTORS list
    - Added extra defaults for the DB properties, read from config.yml or the CLI
    - Implemented reading the SQL file argument and passing it to the export_load function
* Refactored base_extractor, moved connector args to config.yml, added Redshift support
* Better arg checking, bug fixes in the delete method, and doc updates
* Fixed a bug in conditional delete
* Moved match_conditions_json to input from file; better logging
* Fixed an introspection error for server-side cursors
* Fixed SQL identifier quoting/parsing

Co-authored-by: Silva <christian.sila.r@gmail.com>
# Cloud Database Extractor Utility - This sample shows how to extract data from a cloud database…
This package defines a standard Extractor Interface which is extended by specific implementations to support specific cloud databases. For most use cases you will probably only ever call the following methods:
* __load_sample__ - Used during testing - extract a sample of rows from the source table to a new published datasource
* __export_load__ - Used for initial load - full extract of source table to a new published datasource
* __append_to_datasource__ - Append rows from a query or table to an existing published datasource
* __update_datasource__ - Updates an existing published datasource with the changeset from a query or table
* __delete_from_datasource__ - Delete rows from a published datasource that match a condition and/or that match the primary keys in the changeset from a query or table
For a full list of methods and args see the docstrings in the BaseExtractor class.
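To make this concrete, here is a hypothetical usage sketch driving one of the extractor implementations directly from Python. The constructor arguments and the table/datasource names are illustrative assumptions, not the exact signatures - check the docstrings for those:

```python
# Hypothetical usage sketch - constructor arguments and parameter names are
# assumptions for illustration; see the BaseExtractor docstrings for the real API.
from postgres_extractor import PostgresExtractor

extractor = PostgresExtractor(
    source_database_config={            # assumed to mirror postgres.connection in config.yml
        "dbname": "dev", "username": "test", "password": "password",
        "host": "postgres.test", "port": 5432,
    },
    tableau_hostname="https://tableau.example.com",  # placeholder Tableau Server URL
    tableau_site_id="mysite",
    tableau_project="HyperExtracts",
    tableau_token_name="token-name",                 # placeholder PAT credentials
    tableau_token_secret="token-secret",
)

# Initial load: full extract of the source table to a new published datasource
extractor.export_load(source_table="staging.orders", tab_ds_name="Orders")

# Append the result of a query to the published datasource
extractor.append_to_datasource(
    sql_query="SELECT * FROM staging.orders WHERE updated_at > CURRENT_DATE",
    tab_ds_name="Orders",
)
```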
## Contents
* __base_extractor.py__ - provides an Abstract Base Class with some utility methods to extract from cloud databases to "live to hyper" Tableau Datasources. Database specific Extractor classes extend this to manage connections and schema discovery, and may override the generic query processing methods based on DBAPIv2 standards with database specific optimizations.
* __bigquery_extractor.py__ - Google BigQuery implementation of Base Hyper Extractor ABC
* __config.yml__ - Defines site defaults for extractor utility
* __extractor_cli.py__ - Simple CLI Wrapper around Extractor Classes
* __mysql_extractor.py__ - MySQL implementation of Base Hyper Extractor ABC
* __postgres_extractor.py__ - PostgreSQL implementation of Base Hyper Extractor ABC
* __README.md__ - This file
* __redshift_extractor.py__ - AWS Redshift implementation of Base Hyper Extractor ABC
* __requirements.txt__ - List of third party python library dependencies
* __tableau_restapi_helpers.py__ - Helper functions for REST operations that are not yet available in the standard tableauserverclient libraries (e.g. PATCH for update/upsert). Once these get added to the standard client libraries then this module will be refactored out.
## CLI Utility
We suggest that you import one of the Extractor implementations and call it directly; however, we've included a command line utility to illustrate the key functionality:
For latest instructions refer to: [Tableau Server Client Libraries](https://tabl…)

```console
pip install tableauserverclient
```
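As a quick smoke test that tableauserverclient installed correctly, here is a minimal sign-in sketch using a personal access token (the server URL, site and token values are placeholders):

```python
import tableauserverclient as TSC

# Placeholder values - substitute your own server URL, site name and PAT.
tableau_auth = TSC.PersonalAccessTokenAuth("token-name", "token-secret", site_id="mysite")
server = TSC.Server("https://tableau.example.com", use_server_version=True)

with server.auth.sign_in(tableau_auth):
    # List projects to confirm the connection works
    all_projects, pagination_item = server.projects.get()
    print([project.name for project in all_projects])
```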
## Install third party python library dependencies

From the directory where you extracted the hyper api samples execute the following:

```console
cd hyper-api-samples/Community-Supported/clouddb-extractor
pip install -r requirements.txt
```

## Google BigQuery Configuration

The following steps are required if using bigquery_extractor.

### Google Cloud SDK
Install Google Cloud SDK using these instructions: https://cloud.google.com/sdk/docs/install#deb
At the end of the installation process above you will need to run `gcloud init` to configure account credentials and cloud environment defaults. Ensure that credentials and default compute zones/regions are defined correctly by reviewing the output or using `gcloud auth list` and `gcloud config configurations list`.
Install BigQuery Python Client Libraries using these instructions: https://githu…

```console
pip install google-cloud-bigquery
```
### (Optional) Install additional Google libraries if USE_DBAPI=True

By default we will use bigquery.table.RowIterator to handle queries; however, bigquery_extractor can be configured to use the DBAPIv2 libraries, and you may find that this gives performance advantages for some datasets as it can take advantage of the Storage Read API.

To use this option you will need to install the following additional libraries:

```console
pip install google-cloud-bigquery-storage
pip install pyarrow
```
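For context, the DBAPIv2 path here is the standard google-cloud-bigquery DBAPI shim; a minimal sketch, assuming credentials are already configured via gcloud as above:

```python
from google.cloud import bigquery
from google.cloud.bigquery import dbapi

client = bigquery.Client()          # uses the gcloud/service account credentials configured above
connection = dbapi.connect(client)  # DBAPIv2 connection; uses the Storage Read API
                                    # via google-cloud-bigquery-storage when installed
cursor = connection.cursor()
cursor.execute("SELECT 1 AS probe")
print(cursor.fetchall())
connection.close()
```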
### Cloud API Access
In testing we used service account credentials in a GCP Compute Engine VM to invoke all required cloud service APIs. In order for these utilities to work you will need to enable the following API Access Scopes for your VM:
For more details refer to: https://cloud.google.com/compute/docs/access/create-e…
Alternatively, a best practice is to set the cloud-platform access scope on the instance, then securely limit the service account's API access with IAM roles.
## AWS Redshift Configuration

The following steps are required if using redshift_extractor.

### Install and configure AWS Cloud SDK

Install Boto3 using these instructions: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/quickstart.html

```console
pip install boto3
```

Install the AWS CLI using these instructions: https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html

Configure access credentials and defaults:

```console
aws configure
```

### Install Redshift Client libraries

Install redshift connector using these instructions: …

After you have configured your IAM roles etc. in the AWS Management Console you will need to specify additional parameters in the redshift.connection section of `config.yml`, i.e.:
```yaml
# Connects to a Redshift cluster using IAM credentials from the default profile defined in ~/.aws/credentials
redshift: # Redshift configuration defaults
  connection:
    iam: True
    database: 'dev'
    db_user: 'awsuser'
    password: ''
    user: ''
    cluster_identifier: 'examplecluster'
    profile: 'default'
```
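These keys correspond directly to redshift_connector.connect() keyword arguments; a minimal connection sketch using the same placeholder values:

```python
import redshift_connector

# Mirrors the redshift.connection keys above; values are placeholders.
connection = redshift_connector.connect(
    iam=True,
    database="dev",
    db_user="awsuser",
    password="",
    user="",
    cluster_identifier="examplecluster",
    profile="default",
)
cursor = connection.cursor()
cursor.execute("SELECT current_date")
print(cursor.fetchone())
connection.close()
```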
Other options for federated API access using external identity providers are discussed in the following blog: https://aws.amazon.com/blogs/big-data/federated-api-access-to-amazon-redshift-using-an-amazon-redshift-connector-for-python/
## MySQL Configuration

### Install MySQL Client Libraries

Install MySQL connector using these instructions: https://dev.mysql.com/doc/connector-python/en/connector-python-installation-binary.html

```console
pip install mysql-connector-python
```

### Authentication and Database configuration

All connection parameters are defined in the mysql.connection section in `config.yml`, for example:
```yaml
mysql: # MySQL configuration defaults
  connection:
    host: "mysql.test"
    database: "dev"
    port: 3306
    username: "test"
    password: "password"
    raise_on_warnings: True
```
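For reference, a minimal sketch of what a connection with these settings looks like via mysql-connector-python. Note that connect() takes `user` rather than `username`, so the extractor is assumed to map that key from config.yml:

```python
import mysql.connector

# Values mirror the mysql.connection section above (placeholders).
connection = mysql.connector.connect(
    host="mysql.test",
    database="dev",
    port=3306,
    user="test",              # config.yml key `username` is assumed to map to `user`
    password="password",
    raise_on_warnings=True,
)
cursor = connection.cursor()
cursor.execute("SELECT VERSION()")
print(cursor.fetchone())
connection.close()
```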
Database connection configuration options are documented here: https://dev.mysql.com/doc/connector-python/en/connector-python-connectargs.html
## PostgreSQL Configuration
### Install PostgreSQL Client Libraries

Install Psycopg using these instructions: https://www.psycopg.org/docs/install.html

```console
pip install psycopg2-binary
```

### Authentication and Database configuration

All connection parameters are defined in the postgres.connection section in `config.yml`, for example:
```yaml
postgres: # PostgreSQL configuration defaults
  connection:
    dbname: "dev"
    username: "test"
    password: "password"
    host: "postgres.test"
    port: 5432
```
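A minimal sketch of connecting with these settings via psycopg2. As with MySQL, psycopg2's connect() expects `user`, so the `username` key is assumed to be mapped by the extractor:

```python
import psycopg2

# Values mirror the postgres.connection section above (placeholders).
connection = psycopg2.connect(
    dbname="dev",
    user="test",              # config.yml key `username` is assumed to map to `user`
    password="password",
    host="postgres.test",
    port=5432,
)
with connection.cursor() as cursor:
    cursor.execute("SELECT version()")
    print(cursor.fetchone())
connection.close()
```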
Database connection configuration options are documented here: …