diff --git a/README.md b/README.md
index 55b1787..8a77379 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,11 @@
-# Cloud Data Access Client
+# Open Source Reader for Cloud Data Access
-The Cloud Data Access Client (CDA Client) is a utility to download table data from an Amazon S3 bucket that has been populated by Guidewire's Cloud Data Access (CDA). The utility reads the .parquet files generated by CDA for each table, and can do one of the following:
+The Open Source Reader (OSR) for Cloud Data Access (CDA) is a utility to download table data from an Amazon S3 bucket that has been populated by Guidewire's Cloud Data Access application. The utility reads the .parquet files generated by CDA for each table, and can do one of the following:
- Convert them into human-readable `.csv` files
- Regenerate them into new `.parquet` files or aggregate them into fewer `.parquet` files
- Load the data into a database: SQL Server, Oracle, PostgreSQL
-The example code can be used to jump-start development of a custom implementation that can store CDA data in other storage formats (e.g., RDBMS Tables, Hadoop/Hive, JSON Files).
+The OSR is also referred to as the CDA Client and CDA Reader. The example code can be used to jump-start development of a custom implementation that can store CDA data in other storage formats (e.g., RDBMS Tables, Hadoop/Hive, JSON Files).
Learn more about CDA [here](https://docs.guidewire.com/cloud/cda/banff/index.html).
- - -
@@ -99,7 +99,7 @@ Guidewire has completed some basic performance testing in AWS EMR. The test CDA
- - -
-# Overview of CDA Client
+# Overview of OSR
Click to expand
When converting CDA output to `.csv` files, the utility provides the schema for each table in a `schema.yaml` file, and can be configured to put these files into a local filesystem location or another Amazon S3 bucket. When writing to a database, the data can be loaded in "raw" format, with each insert/update/delete recorded from the source system database, or it can be merged into tables that more closely resemble the source system database.
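The raw-versus-merged distinction can be illustrated with a small sketch (the record shape and function name here are hypothetical, not the OSR's actual internals): replaying raw insert/update/delete change records, in order, yields the merged, current-state view of a table.

```python
# Hypothetical sketch of "merged" output: replay raw change records
# (insert/update/delete) into a current-state table keyed by row id.
def merge_changes(raw_records):
    state = {}
    for rec in raw_records:  # records assumed to be in commit order
        op, row_id = rec["op"], rec["id"]
        if op == "delete":
            state.pop(row_id, None)     # row no longer exists in the source
        else:                           # insert or update
            state[row_id] = rec["data"] # latest version wins
    return state
```

Raw output keeps every change record; merged output keeps only the final state per row.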
@@ -115,21 +115,21 @@ The utility also resumes downloading from the point which it last read up to whe
- - -
-## Build the CDA Client
+## Build the OSR
Click to expand
1. Set up your IDE:
- Use Java/JDK 8
- Open project dir with IntelliJ
-2. Download the CDA Client code.
+2. Download the OSR code.
3. Build by executing this command:
~~~~
./gradlew build
~~~~
4. **For Windows only, download additional utilities**: This utility uses Spark, which in turn uses Hadoop to interact with local filesystems. Hadoop requires an additional Windows library to function correctly with the Windows file system.
- 1. Create a `bin` folder in the folder that contains the CDA Client JAR file.
+ 1. Create a `bin` folder in the folder that contains the OSR JAR file.
	2. Download the winutils.exe file for Hadoop 2.7 and place it in the `bin` folder
(e.g., [winutils](https://github.com/cdarlint/winutils/tree/master/hadoop-2.7.7/bin)).
3. Download and install this Visual C++ Redistributable package:
@@ -146,7 +146,7 @@ For more info, see:
- - -
-## Run the CDA Client
+## Run the OSR
Click to expand
@@ -164,12 +164,13 @@ export AWS_PROFILE=myProfile
2. Download the sample configuration file from the Git repository folder `/src/test/resources/sample_config.yaml` and save under a new name such as `config.yaml`.
3. Configure the `config.yaml` file.
4. Run the utility by executing the jar from the command line with one of these commands:
- - If you are running the CDA Client for the first time (without a `savepoints.json` file from a previous run) or have a large amount of data in the S3 bucket, both reading and writing can take a substantial amount of time depending on your machine. By default, the Java runtime environment [allocates a maximum of 1/4 of the computer's memory](https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gc-ergonomics.html). It may be necessary to increase the memory available to the application for larger amounts of data. For example, run the client with an increased maximum memory allocation of 8 GB ("8g") with this command:
+
- If you are running the OSR for the first time (without a `savepoints.json` file from a previous run) or have a large amount of data in the S3 bucket, both reading and writing can take a substantial amount of time depending on your machine. By default, the Java runtime environment [allocates a maximum of 1/4 of the computer's memory](https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gc-ergonomics.html). It may be necessary to increase the memory available to the application for larger amounts of data. For example, run the OSR with an increased maximum memory allocation of 8 GB ("8g") with this command:
~~~~
java -Xmx8g -jar cloud-data-access-client-1.0.jar --configPath "config.yaml"
~~~~
    - If you are downloading incremental changes, run the utility with this command, where the `--configPath` (or `-c`) option designates the path to the configuration file:
+
~~~~
java -jar cloud-data-access-client-1.0.jar --configPath "config.yaml"
~~~~
@@ -298,7 +299,7 @@ sparkTuning:
- Boolean (defaults to false)
- Should be "true" to save the CSV files into a directory with savepoint timestamp (/outputLocation/path/table/timestamp/*.csv), and "false" to save directly into the table directory (/outputLocation/path/table/*.csv).
- largeTextFields
- A comma-delimited list of table.column entries in your target database that can contain very large strings and therefore must allow maximum-length varchar types.
-- If tables in this list does not exist, CDA Client will create the columns in the list with max length varchar based on target database platform.
+- If a table in this list does not exist, the OSR will create the listed columns with a maximum-length varchar type based on the target database platform.
- If a table already exists in the target database, you must also manually ALTER TABLE to expand the column length. The length values you add **must** be large enough for the code to pick up the changes and process properly. You **must** use the following length values based on the database type:
- For Microsoft SQL Server
-
@@ -315,7 +316,7 @@ ALTER COLUMN [column] VARCHAR2(32767) // requires MAX_STRING_SIZE Oracle paramet
-- The following lists known table.column values that require "largeTextFields" inclusion. Before you run CDA Client, add this list to the configuration file:
+
- The following lists known table.column values that require "largeTextFields" inclusion. Before you run the OSR, add this list to the configuration file:
cc_outboundrecord.content, cc_contactorigvalue.origval, pc_diagratingworksheet.diagnosticcapture, cc_note.body, bc_statementbilledworkitem.exception, bc_invoicebilledworkitem.exception, pc_outboundrecord.content, pc_datachange.externalreference, pc_datachange.gosu, bc_workflowworkitem.exception
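As a sketch, the list above could appear in `config.yaml` roughly like this (illustrative only — verify the exact key name and layout against `sample_config.yaml`):

```yaml
# Hypothetical config.yaml fragment; check sample_config.yaml for the real shape.
largeTextFields: "cc_outboundrecord.content, cc_contactorigvalue.origval, pc_diagratingworksheet.diagnosticcapture, cc_note.body, bc_statementbilledworkitem.exception, bc_invoicebilledworkitem.exception, pc_outboundrecord.content, pc_datachange.externalreference, pc_datachange.gosu, bc_workflowworkitem.exception"
```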
@@ -372,7 +373,7 @@ ALTER COLUMN [column] VARCHAR2(32767) // requires MAX_STRING_SIZE Oracle paramet
- sparkTuning
- Optional section
- maxResultSize
-- See spark.driver.maxResultSize. The CDA client places no limit on this by default, so you usually don't have to touch it.
+- See spark.driver.maxResultSize. The OSR places no limit on this by default, so you usually don't have to touch it.
- driverMemory
- See spark.driver.memory. Set this to a large value for better performance.
- executorMemory
@@ -388,7 +389,7 @@ ALTER COLUMN [column] VARCHAR2(32767) // requires MAX_STRING_SIZE Oracle paramet
### Savepoints file
Click to expand
-The CDA Client creates a savepoints.json file to keep track of the last batch of table data which the utility has successfully read and written. An example of a savepoints file's contents:
+The OSR creates a savepoints.json file to keep track of the last batch of table data which the utility has successfully read and written. An example of a savepoints file's contents:
~~~~
{
@@ -414,17 +415,17 @@ In the source location, each table has a corresponding timestamp. Each table's t
If run without a pre-existing savepoints file, the utility consumes all available data in the source bucket and creates a new savepoints file.
-The CDA client uses the CDA writer's manifest.json file to determine which timestamp directories are eligible for copying. For example, if source bucket data exists, but its timestamp has not been persisted by the CDA writer to the manifest.json file, this data will not be copied by the CDA client, since it is considered uncommitted.
+The OSR uses the CDA writer's manifest.json file to determine which timestamp directories are eligible for copying. For example, if source bucket data exists, but its timestamp has not been persisted by the CDA writer to the manifest.json file, this data will not be copied by the OSR, since it is considered uncommitted.
Each time the utility runs, the utility derives a time range (for each table) of timestampOfLastSavePoint to timestampInManifestJsonFile to determine the files to copy.
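As an illustrative sketch (function and parameter names are hypothetical, not the OSR's actual internals), the per-table time-range filter described above amounts to:

```python
# Hypothetical sketch of the savepoint/manifest filtering described above.
# Timestamp-directory names are timestamps; a directory is eligible only if
# it is after the last savepoint and no later than the timestamp the CDA
# writer has committed to manifest.json (later data is uncommitted).
def eligible_timestamps(timestamp_dirs, last_savepoint, manifest_timestamp):
    return sorted(t for t in timestamp_dirs
                  if last_savepoint < t <= manifest_timestamp)
```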
-There can be multiple source files (based on the multiple timestamp directories), and we will combine them all into 1 CSV when writing the output file. This will happen since the CDA Writer is writing continuously, which results in a new timestamp directory say every few minutes, but the CDA client may only run once daily. All new timestamp directories (since the last savepoint) will get copied into the 1 CSV file.
+There can be multiple source files (based on the multiple timestamp directories), and the utility combines them all into one CSV file when writing the output. This happens because the CDA Writer writes continuously, creating a new timestamp directory every few minutes, while the OSR may run only once daily. All new timestamp directories (since the last savepoint) are copied into the single CSV file.
To re-run the utility and re-copy all data in the source bucket, simply delete the savepoints file. Don't forget to first clean your output location in this case.
Each time a table has been copied (read/written), the savepoints file is updated. This allows you to stop the utility mid-run. In this case, we recommend inspecting the output directories of any in-flight table copy jobs before restarting.
-A note about the savepoints file: The ability to save to "Raw" database tables, and "Merged" database tables at the same time is allowed. However, only one savepoints file is written per instance of the client application. If either of the output methods fail, the savepoints data will not be written for the table that fails.
+A note about the savepoints file: saving to "Raw" and "Merged" database tables at the same time is supported. However, only one savepoints file is written per instance of the OSR application. If either output method fails, the savepoints data will not be written for the table that failed.
- - -
@@ -457,7 +458,7 @@ Database permissions for the account running the application _must_ include:
- - -
#### **RDBMS - Column Exclusions**
-This version of the CDA Client excludes certain data types that contain compound attributes due to an inability to properly insert the data into the database.
+This version of the OSR excludes certain data types that contain compound attributes due to an inability to properly insert the data into the database.
The current exclusions include columns with these words in the column name:
- spatial
@@ -466,7 +467,7 @@ The current exclusions include columns with these words in the column name:
- - -
#### **RDBMS - Table changes**
-This version of the client application supports Limited programmatic table definition changes. If a parquet file structure changes - i.e. - columns have been added in the underlying source system for that table - the application will automatically add any new columns to the existing table via ALTER TABLE statements.
+This version of the OSR application supports limited programmatic table definition changes. If a parquet file's structure changes (i.e., columns have been added in the underlying source system for that table), the application automatically adds any new columns to the existing table via ALTER TABLE statements.
To accomplish this, the ability to process fingerprint folders in parallel for any given table has been turned off. If there are multiple fingerprint folders in a given load for a given table, only the earliest fingerprint folder is processed during that run. Additional fingerprint folders are picked up in subsequent loads.
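The earliest-fingerprint selection can be sketched as follows (hypothetical names and data shape, assuming each fingerprint folder carries a first-seen timestamp):

```python
# Hypothetical sketch: with parallel fingerprint processing disabled, only
# the earliest fingerprint folder for a table is processed per run; the
# rest wait for subsequent loads.
def earliest_fingerprint(folders):
    # folders: list of (fingerprint_name, first_timestamp) pairs
    if not folders:
        return None
    return min(folders, key=lambda f: f[1])[0]
```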
@@ -477,7 +478,7 @@ The application generates a _cdawarnings.log_ log file in the application root d
- - -
-## Example of a CDA Client run with CSV output
+## Example of an OSR run with CSV output
Click to expand
@@ -493,7 +494,7 @@ sourceLocation:
`
-In the local filesystem, the client jar and config.yaml file exist in the current directory, along with a directory in which to contain the .csv outputs:
+In the local filesystem, the OSR jar and config.yaml file exist in the current directory, along with a directory in which to contain the .csv outputs:
~~~~
cloud-data-access-client-1.0.jar
config.yaml
@@ -521,11 +522,11 @@ java -jar cloud-data-access-client-1.0.jar -c "config.yaml"
~~~~
-After the CDA Client completes writing, the contents of cda_client_output looks like so:
+After the OSR completes writing, the contents of cda_client_output look like this:

Each table has a corresponding folder. The .csv file in a folder contains the table's data, and the schema.yaml contains information about the columns, namely the name, dataType, and nullable boolean for each column.
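A schema.yaml might look roughly like this (illustrative only — the actual layout may differ, but it records the name, dataType, and nullable flag per column as described above):

```yaml
# Hypothetical schema.yaml shape; column names here are examples.
- name: id
  dataType: long
  nullable: false
- name: publicid
  dataType: string
  nullable: true
```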
-When rerunning the utility, the client will resume from the savepoints written in the savepoints.json file from the previous. The existing .csv file is deleted, and a new .csv file containing new data will be written in its place.
+When rerunning the utility, the OSR resumes from the savepoints written to the savepoints.json file during the previous run. The existing .csv file is deleted, and a new .csv file containing the new data is written in its place.
diff --git a/build.gradle b/build.gradle
index dc8c588..630fa9e 100644
--- a/build.gradle
+++ b/build.gradle
@@ -16,6 +16,16 @@ repositories {
}
dependencies {
+ constraints {
+ implementation ("org.apache.parquet:parquet-avro") {
+ version {
+ prefer '1.15.1'
+ strictly '[1.15.1,2.0.0]'
+ }
+ because 'CVE-2025-30065 : Apache Parquet Remote Code Execution Vulnerability'
+ }
+ }
+
implementation "org.scala-lang:scala-library:$scalaVersion.$scalaBuild"
testImplementation "org.scalatest:scalatest_$scalaVersion:$scalaTestVersion"
testImplementation "junit:junit:$jUnitVersion"
diff --git a/src/main/scala/gw/cda/api/CloudDataAccessClient.scala b/src/main/scala/gw/cda/api/CloudDataAccessClient.scala
index 35a6918..4cb213d 100644
--- a/src/main/scala/gw/cda/api/CloudDataAccessClient.scala
+++ b/src/main/scala/gw/cda/api/CloudDataAccessClient.scala
@@ -19,7 +19,6 @@ object CloudDataAccessClient {
// Log the name of the config file, so if there is a problem processing it, you will know the name of the file
log.info(s"Loading config file '$configFilePath'")
val clientConfig: ClientConfig = ClientConfigReader.processConfigFile(configFilePath)
- log.info(s"The config file has been loaded - $clientConfig")
// Moved processConfig() outside of TableReader, parsing it is an unnecessary responsibility of the TableReader
val tableReader = new TableReader(clientConfig)