Collecting a dataframe does not work on a Spark cluster deployed by the radanalytics operator #350
Description
I deployed the radanalytics/spark-operator on OKD 4.6 (using OpenDataHub; the full ODH manifests we use are here: https://github.com/MaastrichtU-IDS/odh-manifests).
From this spark-operator I started a Spark cluster (1 master, 10 workers, no resource limits).
When I try to create and collect a simple dataframe on this Spark cluster, creating works, but collecting gets stuck.
Creating runs in about 3s:
```python
from pyspark.sql import SparkSession
from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row

spark_cluster_url = "spark://spark-cluster:7077"
spark = SparkSession.builder.master(spark_cluster_url).getOrCreate()
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
])
```

Collecting just gets stuck:
```python
df.collect()
```

I made sure to use the exact same Spark version (3.0.1) everywhere (Spark cluster images, local Spark executable, pyspark version: everything is 3.0.1).
The logs displayed by the master and worker nodes do not seem to contain any interesting information.
The workers repeat this exact block every 3 seconds (with only the IDs changing). The words are English and readable, but the sentences give no relevant information about what is happening:
```
22/04/26 15:51:14 INFO Worker: Executor app-20220426152543-0006/6194 finished with state EXITED message Command exited with code 1 exitStatus 1
22/04/26 15:51:14 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 6194
22/04/26 15:51:14 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220426152543-0006, execId=6194)
22/04/26 15:51:14 INFO Worker: Asked to launch executor app-20220426152543-0006/6206 for pyspark-shell
22/04/26 15:51:14 INFO SecurityManager: Changing view acls to: 185
22/04/26 15:51:14 INFO SecurityManager: Changing modify acls to: 185
22/04/26 15:51:14 INFO SecurityManager: Changing view acls groups to:
22/04/26 15:51:14 INFO SecurityManager: Changing modify acls groups to:
22/04/26 15:51:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185); groups with view permissions: Set(); users with modify permissions: Set(185); groups with modify permissions: Set()
22/04/26 15:51:14 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=42663" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@jupyterhub-nb-vemonet:42663" "--executor-id" "6206" "--hostname" "10.131.2.142" "--cores" "1" "--app-id" "app-20220426152543-0006" "--worker-url" "spark://Worker@10.131.2.142:46655"
```

The master gives even less information:
```
22/04/26 15:54:53 INFO Master: Removing executor app-20220426152543-0006/7079 because it is EXITED
22/04/26 15:54:53 INFO Master: Launching executor app-20220426152543-0006/7089 on worker worker-20220426144350-10.128.7.31-45619
```

I tried different numbers of cores and resource limits when deploying the Spark cluster, but no Spark cluster deployed with the radanalytics operator ever manages to `.collect()` anything. We can always connect to the cluster and seemingly create a dataframe on it (at least it looks that way; we are not sure the dataframe is really created).
Steps to reproduce:
- Get an OKD 4.6 cluster
- Deploy the Spark operator. You can use the latest version from the alpha channel; as usual with Operators there is no easy way to share exactly which version we use (which makes reproducibility hard to achieve), but you can check the OpenDataHub subscription: https://github.com/MaastrichtU-IDS/odh-manifests/blob/dsri/radanalyticsio/spark/cluster/base/subscription.yaml
- Deploy the Spark cluster:
```bash
cat <<EOF | oc apply -f -
apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: spark-cluster
spec:
  customImage: quay.io/radanalyticsio/openshift-spark:3.0.1-2
  worker:
    instances: '10'
  master:
    instances: '1'
EOF
```

- Try to create and collect a basic dataframe on this cluster:
```python
from pyspark.sql import SparkSession
from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row

spark_cluster_url = "spark://spark-cluster:7077"
spark = SparkSession.builder.master(spark_cluster_url).getOrCreate()
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
])
df.collect()
```

Questions
- Is it normal that every `.collect()` gets stuck when using Spark clusters deployed with the radanalytics operator?
- Does someone have a snippet of Python code that creates and `.collect()`s a dataframe on a radanalytics Spark cluster? (Maybe the problem comes from our testing Python code rather than the cluster, but we could not find a provided example to test whether the Spark cluster works as expected.)
Does anyone have an idea what this could be due to? @elmiko
Is there anyone here who has actually used a Spark cluster deployed by the radanalytics operator to run real pySpark computations? I would be interested to see the code to get some inspiration!