
Spark35 upgrade #222

Open
BalamuruganDE wants to merge 16 commits into homeaway:master from BalamuruganDE:spark35_upgrade

@BalamuruganDE

DataPull PR

Upgrade DataPull core to Spark 3.5.0, Scala 2.12, Java 11

Dependency upgrades (pom.xml):

  • Spark 2.4.6 -> 3.5.0, Scala 2.11 -> 2.12, Hadoop 2.10.1 -> 3.2.1
  • MongoDB Spark Connector 10.4.0 (new format API)
  • Cassandra Spark Connector 3.5.0 (driver 4.x)
  • Iceberg 1.5.0 (new platform support)
  • Elasticsearch 7.17.0 with REST clients
  • mssql-jdbc 11.2.3.jre11, ojdbc8 21.9.0.0, terajdbc4 17.20.00.12
  • ABRis 6.4.0, Snowflake 2.10.0, PostgreSQL 42.6.0
  • Guava shade plugin for Spark 3.5 classloader compatibility
  • Log4j 2.17.1 (CVE-2021-44228 fix retained)
  • Added expediahotelloader with Spark/Scala exclusions

MongoDB modernization (DataFrameFromTo.scala):

  • Replaced MongoSpark/ReadConfig/WriteConfig with format("mongodb") API
  • Replaced MongoClient/MongoClientURI with MongoClients.create()
  • Updated mongodbToDataFrame, dataFrameToMongodb, mongoRunCommand
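
For reviewers unfamiliar with the 10.x connector, the shape of the new calls is roughly as follows. This is a minimal sketch, not the code in DataFrameFromTo.scala: the object name `MongoFormatSketch` and its helper names are hypothetical, and it assumes the connector's `connection.uri`, `database`, and `collection` options plus the sync driver's `MongoClients.create()`.

```scala
import com.mongodb.client.MongoClients
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.bson.Document

// Hypothetical helper object; the names are illustrative, not DataPull's own.
object MongoFormatSketch {

  // Options understood by the MongoDB Spark Connector 10.x "mongodb" format.
  def mongoOptions(uri: String, database: String, collection: String): Map[String, String] =
    Map("connection.uri" -> uri, "database" -> database, "collection" -> collection)

  // Read path: replaces MongoSpark.load(...) with a ReadConfig.
  def readCollection(spark: SparkSession, opts: Map[String, String]): DataFrame =
    spark.read.format("mongodb").options(opts).load()

  // Write path: replaces MongoSpark.save(...) with a WriteConfig.
  def writeCollection(df: DataFrame, opts: Map[String, String]): Unit =
    df.write.format("mongodb").mode(SaveMode.Append).options(opts).save()

  // Command path: replaces new MongoClient(new MongoClientURI(uri)).
  def runCommand(uri: String, database: String, commandJson: String): Document = {
    val client = MongoClients.create(uri)
    try client.getDatabase(database).runCommand(Document.parse(commandJson))
    finally client.close() // always release the connection pool
  }
}
```

The save mode here is fixed to append for brevity; the actual method may choose it from the input JSON.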

New Iceberg support (DataFrameFromTo.scala, Migration.scala):

  • Added dataFrameToIceberg with MERGE INTO SQL support
  • Added icebergToDataFrame for SQL-based reads
  • Added iceberg as source/destination platform in Migration
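
To illustrate the MERGE INTO path, the statement could be assembled along these lines. This is a sketch under assumptions: the object and function names are hypothetical, the key columns are assumed to exist on both sides, and the real dataFrameToIceberg may build its SQL differently.

```scala
// Hypothetical helper; not the implementation in DataFrameFromTo.scala.
object IcebergMergeSketch {

  /** Builds an upsert statement from a staged view into an Iceberg table. */
  def mergeIntoSql(targetTable: String, sourceView: String, keyColumns: Seq[String]): String = {
    require(keyColumns.nonEmpty, "at least one key column is needed for the ON clause")
    // Join target (t) and source (s) on every key column.
    val onClause = keyColumns.map(k => s"t.$k = s.$k").mkString(" AND ")
    s"""MERGE INTO $targetTable t
       |USING $sourceView s
       |ON $onClause
       |WHEN MATCHED THEN UPDATE SET *
       |WHEN NOT MATCHED THEN INSERT *""".stripMargin
  }
}
```

A caller would typically register the DataFrame with `df.createOrReplaceTempView(sourceView)` and pass the generated string to `spark.sql(...)`. MERGE INTO on Iceberg tables also needs the Iceberg SQL extensions enabled on the Spark session.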

Spark 3.5 compatibility (DataPull.scala):

  • Cassandra UUIDs -> Uuids (driver 4.x)
  • Binary type from org.bson.types.Binary
  • DataPull object extends Serializable
  • Hive caseSensitiveInferenceMode: INFER_ONLY -> NEVER_INFER
  • Hive metastore.version config commented out
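
The Hive inference change amounts to one session setting. The sketch below shows one way it could be applied; `SparkCompatConfSketch` and `applyTo` are hypothetical names, and DataPull.scala may set the value elsewhere.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical helper; illustrates the setting only.
object SparkCompatConfSketch {

  // NEVER_INFER tells Spark to trust the metastore schema instead of
  // inferring a case-sensitive schema from the underlying data files.
  val hiveCompatConf: Map[String, String] = Map(
    "spark.sql.hive.caseSensitiveInferenceMode" -> "NEVER_INFER"
  )

  // Applies each entry to a session builder without creating the session.
  def applyTo(builder: SparkSession.Builder): SparkSession.Builder =
    hiveCompatConf.foldLeft(builder) { case (b, (key, value)) => b.config(key, value) }
}
```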

IMDSv2 and MSSQL fixes (Helper.scala):

  • Added imdsv2Token() for EC2 metadata token-based access
  • Updated GetEC2pkcs7() and GetEC2Role() with IMDSv2 headers
  • MSSQL JDBC URL: added encrypt and trustServerCertificate params
  • Commented out URI logging to prevent credential exposure
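
For context, the IMDSv2 flow is a PUT for a session token followed by a GET that presents it. The sketch below uses the Java 11 HTTP client and is not the code in Helper.scala: `Imdsv2Sketch` and its function names are hypothetical, and the timeout and TTL values are assumptions. It also shows how the two MSSQL URL parameters could be appended.

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}
import java.time.Duration

// Hypothetical helper; names, timeout, and TTL are illustrative assumptions.
object Imdsv2Sketch {
  private val MetadataBase = "http://169.254.169.254/latest"

  // Step 1: PUT to the token endpoint, stating how long the token should live.
  def tokenRequest(ttlSeconds: Int = 21600): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$MetadataBase/api/token"))
      .header("X-aws-ec2-metadata-token-ttl-seconds", ttlSeconds.toString)
      .timeout(Duration.ofSeconds(2))
      .PUT(HttpRequest.BodyPublishers.noBody())
      .build()

  // Step 2: GET a metadata path, presenting the token from step 1.
  def metadataRequest(path: String, token: String): HttpRequest =
    HttpRequest.newBuilder(URI.create(s"$MetadataBase/$path"))
      .header("X-aws-ec2-metadata-token", token)
      .timeout(Duration.ofSeconds(2))
      .GET()
      .build()

  // Runs both steps; only works on an EC2 instance.
  def fetch(path: String): String = {
    val client = HttpClient.newHttpClient()
    val token = client.send(tokenRequest(), HttpResponse.BodyHandlers.ofString()).body()
    client.send(metadataRequest(path, token), HttpResponse.BodyHandlers.ofString()).body()
  }

  // MSSQL JDBC URL with the two TLS-related properties made explicit.
  def mssqlJdbcUrl(host: String, port: Int, database: String,
                   encrypt: Boolean, trustServerCertificate: Boolean): String =
    s"jdbc:sqlserver://$host:$port;databaseName=$database;" +
      s"encrypt=$encrypt;trustServerCertificate=$trustServerCertificate"
}
```

For example, `Imdsv2Sketch.fetch("dynamic/instance-identity/pkcs7")` would return the PKCS7 signature that GetEC2pkcs7() needs. Which values the PR passes for `encrypt` and `trustServerCertificate` is not stated in the description, so they are left as parameters here.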

All homeaway bug fixes preserved (v0.1.83-0.1.90):

  • ConcurrentHashMap for thread-safe stepPipelineMap
  • Subnet NULL/invalid validation with default pool fallback
  • Credentials display fix, duplicate tags fix
  • Default 'datapullemr' application tag
  • url parameter in RDBMS methods, KMS encryption support
  • setExternalSparkConf, ReplaceInlineExpressions
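
As a reminder of why the ConcurrentHashMap fix matters, here is a small sketch of concurrent registration into a shared map. The key and value types and the `register` helper are assumptions; the real stepPipelineMap may hold different types.

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical stand-in for the shared map; types are assumed, not taken from the code.
object StepPipelineRegistrySketch {
  private val stepPipelineMap = new ConcurrentHashMap[String, String]()

  // Atomically records the first step id seen for a pipeline and returns
  // whichever id ends up stored, even when several threads race on one key.
  def register(pipelineName: String, stepId: String): String = {
    val previous = stepPipelineMap.putIfAbsent(pipelineName, stepId)
    if (previous == null) stepId else previous
  }

  def size: Int = stepPipelineMap.size()
}
```

With a plain mutable HashMap, the same racing writes could lose entries or corrupt the table; `putIfAbsent` makes the check and the insert a single atomic step.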

Changed

core/pom.xml
core/src/main/scala/core/DataFrameFromTo.scala
core/src/main/scala/core/DataPull.scala
core/src/main/scala/core/Migration.scala
core/src/main/scala/core/Controller.scala
core/src/main/scala/helper/Helper.scala
core/src/main/resources/Samples/Input_Json_Specification.json

PR Checklist Forms

  • CHANGELOG.md updated
  • Reviewer assigned
  • PR assigned (presumably to submitter)
  • Labels added (enhancement, bug, documentation)
