This project is a reference implementation for how to consume files from S3 that have been produced by Guidewire's Cloud Data Access (CDA).
The example code can be used to jump start development of a custom implementation that can store CDA data to other storage formats (e.g., RDBMS Tables, Hadoop/Hive, JSON Files).
The Cloud Data Access Client provides a utility to download table data from an Amazon S3 bucket that has been populated by CDA. The utility reads the .parquet files generated by Cloud Data Access for each table and can either convert them into human readable .csv files, transfer them as .parquet files, or load the data into a common database platform: SQL Server, Oracle, PostgreSQL.
When converting to .csv files, the utility provides the schema for each table in a schema.yaml file, and can be configured to output the resultant files into a designated local filesystem location or another Amazon S3 bucket. When writing to a database, the data can be loaded in its raw format, with each activity recorded from the source system database, or it can be merged into tables that more closely resemble the source system database.
The utility also resumes downloading from the point which it last read up to when rerun, so that new data in the source bucket can be read efficiently and incrementally over multiple runs.
- Use Java/JDK 8
- Open project dir with IntelliJ
Run ./gradlew build
Run the utility by executing the jar from the command line.
java -jar cloud-data-access-client-1.0.jar --configPath "config.yaml"
Where the option --configPath or -c designates the path to the configuration file. See below for an example usage of a configuration file. The utility also relies on AWS credentials that need to be defined through environment variables before running the utility. See below for details about setting AWS credentials.
If you are running on a Windows machine, there is an additional utility that must be installed in order for the client to work. See the section below.
Note that for large amounts of data, especially running without savepoints, both reading and writing can take a substantial amount of time depending on your machine. It may be necessary to increase the memory available to the application for larger amounts of data. For example,
java -Xmx8g -jar cloud-data-access-client-1.0.jar --configPath "config.yaml"
will run the client with an increased maximum memory allocation of 8 GB ("8g"). By default, the Java runtime environment [allocates a maximum of 1/4 of the computer's memory|https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gc-ergonomics.html].
- NOTE: if tests fail due to a Spark error (something like a
BindException), see this link - For AWS credentials issues with Spark/Hadoop, see this link
- For using Scala test assertions, see this link
- Written in Scala
- Main program starts in gw.cda.api.CloudDataAccessClient
- Config, SavePoints, Manifest, TableReader, OutputWriter
Configuration parameters are specified through a .yaml file (typically named config.yaml). Config parameters are structured in the file as such:
sourceLocation:
bucketName: ...
manifestKey: ...
outputLocation:
path: ...
savepointsLocation:
path: ...
outputSettings:
tablesToInclude: ...
saveIntoJdbcRaw: ...
saveIntoJdbcMerged: ...
exportTarget: ...
fileFormat: ...
includeColumnNames: ...
saveAsSingleFile: ...
saveIntoTimestampDirectory: ...
jdbcConnectionRaw:
jdbcUsername: ...
jdbcPassword: ...
jdbcUrl: ...
jdbcSchema: ...
jdbcSaveMode: ...
jdbcConnectionMerged:
jdbcUsername: ...
jdbcPassword: ...
jdbcUrl: ...
jdbcSchema: ...
jdbcApplyLastestUpdatesOnly: ...
performanceTuning:
numberOfJobsInParallelMaxCount: ...
numberOfThreadsPerJob: ...
sparkTuning:
maxResultSize: ...
driverMemory: ...
executorMemory: ...
sourceLocation:
-
"bucketName" is the name of the source S3 bucket to read from
-
and the "manifestKey" is the path to the manifest.json file in the source bucket, from which the utility reads information about each table. Under "outputLocation":
-
"path" is the local file system directory to which the csv files will be written. The directory must exist in the local filesystem before the utility can write to it. To write the CSV's to a S3 bucket, simply replace the path with a valid s3 url (e.g. s3://example-bucket/cda_client_output). The utility uses the same AWS credentials for reading from the source bucket and for writing to the destination bucket.
savepointsLocation:
- "path" is the local filesystem directory of where the savepoints.json file exists. See below for more information see the Savepoint section below.
outputSettings:
- "tablesToInclude" (should be blank by default) is a comma delimited list of tables to include. Leave blank or omit to include all tables in the output. This is for testing or troubleshooting purposes only. In a Production environment there should be no values here. It is for loading one or more tables to test connectivity, reviewing individual tables in a testing scenario.
- "saveIntoJdbcRaw" boolean (defaults to false) - should be "true" to write data to a database in Raw format (all activities and operations included in the output).
- "saveIntoJdbcMerged" boolean (defaults to false) - should be "true" to write data to a database in Merged format (more closely representing the source system data).
- "exportTarget" (defaults to file) - available export targets are file and jdbc.
- "fileFormat" (defaults to csv) - available output formats are .csv and .parquet.
- "includeColumnNames" boolean (defaults to false) - should be "true" to include a row of column names at the top of the csv file for each table, and "false" for no header row.
- "saveAsSingleFile" boolean (defaults to false) - should be "true" for writing out a single file (.csv or .parquet) per table, and "false" to have multiple/smaller files be written based on SPARK partitioning.
- "saveIntoTimestampDirectory" boolean (defaults to false) - should be "true" to save the CSV files into a directory with savepoint timestamp (/outputLocation/path/table/timestamp/.csv), and "false" to save directly into the table directory (/outputLocation/path/table/.csv).
jdbcConnectionRaw: (optional section)
- "jdbcUsername" is the user name used to connect to the database. can be a placeholder value if using windows authentication for database connectivity.
- "jdbcPassword" is the password used to connect to the database. can be a placeholder value if using windows authentication for database connectivity.
- "jdbcUrl" is the connection string for database connectivity.
- "jdbcSchema" is the database schema owner designation for tables written to the database. i.e. - 'dbo' is the default for SQL Server, 'public' is the default for PostgreSQL.
- "jdbcSaveMode" values overwrite or append (defaults to append) - when saveIntoJdbcMerged is true, savemode is not relavant.
jdbcConnectionMerged: (optional section)
- "jdbcUsername" is the user name used to connect to the database. can be a placeholder value if using windows authentication for database connectivity.
- "jdbcPassword" is the password used to connect to the database. can be a placeholder value if using windows authentication for database connectivity.
- "jdbcUrl" is the connection string for database connectivity.
- "jdbcSchema" is the database schema owner designation for tables written to the database. i.e. - 'dbo' is the default for SQL Server, 'public' is the default for PostgreSQL.
- "jdbcApplyLatestUpdatesOnly" boolean (defaults to false) - should be "true" for applying the latest version of a record for a given table. "false" will process all the activities for a record in the order they occurred. for CDC processing, the most recent entry for a given record is the current state of that record. this option allows the application of only that most recent activity and version of the record.
performanceTuning: (optional section)
- "numberOfJobsInParallelMaxCount" integer - defaults to the number of processors on your machine. Depending on your machine/network, you can go to about 2 times that to get more concurrency.
- "numberOfThreadsPerJob" integer - defaults to 10. This allows for parallel parquet file downloads while processing a given table.
sparkTuning: (optional section)
- "maxResultSize": See [spark.driver.maxResultSize|https://spark.apache.org/docs/latest/configuration.html#application-properties]. The CDA client places no limit on this by default, so you usually don't have to touch it.
- "driverMemory": See [spark.driver.memory|https://spark.apache.org/docs/latest/configuration.html#application-properties]. Set this to a large value for better performance.
- "executorMemory": See [spark.executor.memory|https://spark.apache.org/docs/latest/configuration.html#application-properties]. \
Warning: Boolean parameters default to "false" if they are not set.
The utility requires AWS credentials to access the source S3 bucket that contains the table data (and optionally to write to a destination S3 bucket). These must be exported in the command line before running the program. For example:
export AWS_SECRET_ACCESS_KEY=<secret key> AWS_ACCESS_KEY_ID=<access key> AWS_REGION=<region name>
Where <secret key>, <access key>, and <region name> should be replaced by values from a user's credentials file located in ~/.aws/credentials.
If you are using awscli with a credentials file and profiles, you will need to use the environment var AWS_PROFILE, instead of setting the keys directly.
- For example:
export AWS_PROFILE=myProfile
- More information can be found here for setting up AWS Credentials
The savepoints file is used by the utility to keep track of the last batch of table data which the utility has successfully read and written. An example of a savepoints file's contents:
{
"taccounttransaction": "1562112543749",
"taccount": "1562112543749",
"note": "1562112543749",
"taccountlineitem": "1562112543749",
"taccttxnhistory": "1562112543749",
"history": "1562112543749"
}
In the source location, each table has a corresponding timestamp. Each table's timestamp corresponds to the timestamped subfolder in the source destination bucket for when it was written by CDA. For example:
/history
/1562111987488 (timestamped subfolder)
x.parquet
y.parquet
/1562112022178
z.parquet
...
The utility creates a savepoints file if run initially without any pre-existing savepoints file, during which the utility will consume all available data in the source bucket.
The CDA client uses the CDA writer's manifest.json file to determine which timestamp directories are eligible for copying. For example, if source bucket data exists, but its timestamp has not been persisted by the CDA writer to the manifest.json file, this data will not be copied by the CDA client, since it is considered uncommitted.
Each time the utility runs, the utility derives a time range (for each table) of timestampOfLastSavePoint to timestampInManifestJsonFile to determine the files to copy.
There can be multiple source files (based on the multiple timestamp directories), and we will combine them all into 1 CSV when writing the output file. This will happen since the CDA Writer is writing continuously, which results in a new timestamp directory say every few minutes, but the CDA client may only run once daily. All new timestamp directories (since the last savepoint) will get copied into the 1 CSV file.
To re-run the utility to re-copy all data in the source bucket, simply delete the savepoints file. Dont forget to first clean your output location in this case.
Each time a table has been copied (read/written) the savepoints file will be updated. This allows you to stop the utility in the middle while running. In this case, we recommend looking at the in-flight table copy/jobs output directories before re-starting again.
A note about the savepoints file: The ability to save to "Raw" database tables, and "Merged" database tables at the same time is allowed. However, only one savepoints file is written per instance of the client application. If either of the output methods fail, the savepoints data will not be written for the table that fails.
The utility writes CSV files to the configured location directory.
The names of these files are randomly generated by the SPARK partition, and look like "part-00000-216b8c03-8c73-47ca-bcb6-0d38aef6be37-c000.csv". The name does not reflect any chronological importance. So running the utility over the same data will result in different filenames. This is why we recommend using the setting "saveIntoTimestampDirectory" to help differentiate data files.
If you set saveAsSingleFileCSV=false, you will get multiple files, they will all be prefixed with "part-00000", "part-00001", "part-00002", etc.
Output to standard RDBMS platforms allows for two options: "Raw" and "Merged".
"Raw" output maintains all activities and transactions as seen in the CSV files output. Each Insert, Update, and Delete activity recorded are included in the "Raw" database output, along with the gwcbi___* columns indicating the sequence and operations.
"Merged" output merges the activities of a given record down to a view of the record as it looked at a point in time in the source system database. Instead of inserting each activity, only the latest version of the record exists, making it appear more like the source system database table it represents.
Database permissions for the account running the application must include:
- CREATE TABLE
- ALTER TABLE
- INSERT
- UPDATE
- DELETE
This version of teh client application supports Limited programmatic table definition changes. If a parquet file structure changes - i.e. - columns have been added in the underlying source system for that table - the application will automatically add any new columns to the existing table via ALTER TABLE statements.
To accomplish this, the ability to run in parallel for fingerprint folders for any given table has been turned off. If there are multiple fingerprint folders in a given load for a given table, only the earliest fingerprint folder will be processed during that run. Additional fingerprint folders will be picked up in subsequent loads.
The application generates a cdawarnings.log log file in the application root directory when:
- Errors are encountered
- A table has multiple fingerprint folders to load, requiring multiple job runs to process them
- ALTER TABLE statements have been executed - the Success or Failure of the execution of those statements and the statement that was generated and executed due to table schema changes will be listed
This utility uses Spark, which in turn uses Hadoop to interact with local filesystems. Hadoop requires an additional Windows library to function correctly with the Windows file system.
- Download the winutils.exe file for Hadoop 2.7 (e.g., winutils). Place it in a folder named "bin"
- Download and install this Visual C++ Redistributable package: Visual C++ 2010 Redistributable Package (x86) if you're on a 32-bit machine, or Visual C++ 2010 Redistributable Package (x64) if you're on a 64-bit machine.
- Before running the utility, you must set an additional environment variable named HADOOP_HOME. This variable must be set to the system path to the folder which contains the "bin" folder that contains the winutils.exe executable. For example, the winutils.exe file is located at C:\Users\myusername\Documents\cloud-data-access-client\bin\winutils.exe. Therefore the HADOOP_HOME variable must be set using
set HADOOP_HOME=C:\Users\myusername\Documents\cloud-data-access-client-demo
These fixes are documented online here: (https://cwiki.apache.org/confluence/display/HADOOP2/WindowsProblems), (https://answers.microsoft.com/en-us/insider/forum/insider_wintp-insider_repair/how-do-i-fix-this-error-msvcp100dll-is-missing/c167d686-044e-44ab-8e8f-968fac9525c5?auth=1)
The source bucket is called cda-client-test. Its contents under the directory CDA/SQL include the folder containing the files for each table, as well as the manifest.json file.
Thus the source bucket as well as the manifest.json location should be configured in the config.yaml file as such:
sourceLocation: bucketName: cda-client-test manifestKey: CDA/SQL/manifest.json
In the local filesystem, the client jar and config.yaml file in the current directory, along with a directory in which to contain the .csv outputs:
cloud-data-access-client-1.0.jar
config.yaml
cda_client_output/
As a result the output and savepoints are configured as such. I'm designating the savepoints file to also be stored in the cda_client_output directory. Here we are also specifying not to include column names in the outputted csv:
outputLocation:
path: cda_client_output
savepointsLocation:
path: cda_client_output
outputSettings:
includeColumnNames: false
saveAsSingleFileCSV: true
saveIntoTimestampDirectory: false
After exporting my credentials, I run the jar from the current directory with the command
java -jar cloud-data-access-client-1.0.jar -c "config.yaml"
After writing is completed, the contents of cda_client_output looks like so:
Where in each folder corresponding to a table, the .csv file contains the table data, and the schema.yaml contains information about the columns, namely the name, dataType, and nullable boolean for each column.
When rerunning the utility, the client will resume from the savepoints written in the savepoints.json file. The existing .csv file is deleted, and a new .csv file will be written in its place containing the new data.

