Hello,

I've been working on getting the Spark job definition running locally, and although I've come a long way, I'm not quite there yet. That's why I'm reaching out here, in the hope that someone on the development team can help me. My ultimate goal is to develop Spark job definitions locally, run unit tests on the created functions, and deploy the locally developed Spark job definition to Fabric.

I've set up VS Code based on the following YouTube video and documentation:

https://www.youtube.com/watch?v=A9SjAyZ_JSc
https://learn.microsoft.com/en-us/fabric/data-engineering/setup-vs-code-extension
https://learn.microsoft.com/en-us/fabric/data-engineering/author-sjd-with-vs-code
One error I ran into and was able to fix:

```
[ERROR] 2024-07-23 16:13:06.614 [Thread-3] SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Invalid Spark URL: spark://HeartbeatReceiver@HKCHG14_IL.home:59920
```

I fixed it by adding the following config to sparkconf.py:

```python
conf.set("spark.driver.host", "localhost")
```

I'm wondering: is this a known error, and should this setting be added to the sparkconf.py file by default?
And now I'm getting a new error that I'm not sure how to fix:

```
[ForkJoinPool.commonPool-worker-25] PublicClientApplication: [Correlation ID: ae95b07b-05cf-4474-8657-33986c324f1e] Execution of class com.microsoft.aad.msal4j.AcquireTokenByDeviceCodeFlowSupplier failed.
com.microsoft.aad.msal4j.MsalServiceException: AADSTS70020: The provided value for the input parameter 'device_code' is not valid. This device code has expired. Trace ID: 1785d0b5-7034-40f6-b231-1f4f816b6300 Correlation ID: ae95b07b-05cf-4474-8657-33986c324f1e Timestamp: 2024-07-2
```

Do you know how to fix this error?
Two other observations:

When I run my code locally, the following line leads to an error, while the same code works in Fabric:

```python
print("spark.synapse.pool.name : " + spark_context.getConf().get("spark.synapse.pool.name"))
```

I'm getting the following error:

```
Traceback (most recent call last):
  File "c:\dev\fabric_vscode\28157445-4999-4c43-8d01-1d94f21dba1c\SparkJobDefinition\15e8cdfd-3ccd-45c1-8c61-9367de8b672b\ETL\createTablefromCSV.py", line 24, in <module>
    print("spark.synapse.pool.name : " + spark_context.getConf().get("spark.synapse.pool.name"))
TypeError: can only concatenate str (not "NoneType") to str
```
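One way to make that line work both locally and on Fabric is to fall back to a default when the key is missing: `spark.synapse.pool.name` is set by the Fabric runtime, so a local session returns `None` and the string concatenation fails. The helper below is hypothetical (not part of the Fabric tooling) and only assumes the conf object offers a dict-style `get(key, default)`:

```python
# Sketch: guard against Fabric-only config keys such as spark.synapse.pool.name.
# Locally the key is absent, .get() returns None, and the "+" raises a TypeError.
def get_conf_or_default(conf, key, default="<not set locally>"):
    """Return the config value for key, or a readable default when it is absent."""
    value = conf.get(key, default)
    return value if value is not None else default

# Usage (works both locally and on Fabric):
# print("spark.synapse.pool.name : "
#       + get_conf_or_default(spark_context.getConf(), "spark.synapse.pool.name"))
```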
Another thing that sparked my interest is this warning message:

```
[WARN ] 2024-07-23 17:02:49.508 [main] Shell: Did not find winutils.exe: java.io.FileNotFoundException: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset. -see https://wiki.apache.org/hadoop/WindowsProblems
Setting default log level to "WARN"
```

Is there a way to fix this warning?
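For what it's worth, on Windows this warning can usually be silenced by pointing `HADOOP_HOME` at a directory containing `bin\winutils.exe` before the SparkContext is created. A sketch, with the caveats that `C:\hadoop` is only an example path and winutils.exe must be downloaded separately (it is not shipped with the Fabric extension):

```python
# Sketch: set HADOOP_HOME so Spark's Hadoop shim can locate winutils.exe.
# These environment variables must be set before the SparkContext is created.
import os

hadoop_home = r"C:\hadoop"  # example path; must contain bin\winutils.exe
os.environ["HADOOP_HOME"] = hadoop_home
# Prepend the bin directory so the native helpers are found on PATH as well.
os.environ["PATH"] = os.path.join(hadoop_home, "bin") + os.pathsep + os.environ.get("PATH", "")
```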
Keep up the good work. I'm hoping to develop a structured and modular Spark job definition, instead of the multitude of notebooks that we're using right now.
This is the file that I'm running: createTablefromCSV.py.txt
And the CSV that I'm referencing in the file: dimension_customer.csv
Kind regards
Martijn