Collecting a dataframe does not work on a Spark cluster deployed by the radanalytics operator #350
Description
I deployed the radanalytics/spark-operator on OKD 4.6 (using OpenDataHub; the full ODH manifests we use are here: https://github.com/MaastrichtU-IDS/odh-manifests).
From this spark-operator I started a Spark cluster (1 master, 10 workers, no resource limits).
When I try to create and collect a simple dataframe on this Spark cluster, creating works, but collecting gets stuck.
Creating runs in about 3s:
```python
from pyspark.sql import SparkSession
from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row

spark_cluster_url = "spark://spark-cluster:7077"
spark = SparkSession.builder.master(spark_cluster_url).getOrCreate()
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
])
```

Collecting just gets stuck:
```python
df.collect()
```

I made sure to use the exact same Spark version (3.0.1) everywhere (Spark cluster images, local Spark executable, pyspark version: everything is 3.0.1).
The logs displayed by the master and worker nodes do not seem to contain any interesting information.
The workers repeat this exact block every 3 seconds (with only the IDs changing). The words are English and readable, but the sentences give no relevant information about what is happening:
```
22/04/26 15:51:14 INFO Worker: Executor app-20220426152543-0006/6194 finished with state EXITED message Command exited with code 1 exitStatus 1
22/04/26 15:51:14 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 6194
22/04/26 15:51:14 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20220426152543-0006, execId=6194)
22/04/26 15:51:14 INFO Worker: Asked to launch executor app-20220426152543-0006/6206 for pyspark-shell
22/04/26 15:51:14 INFO SecurityManager: Changing view acls to: 185
22/04/26 15:51:14 INFO SecurityManager: Changing modify acls to: 185
22/04/26 15:51:14 INFO SecurityManager: Changing view acls groups to:
22/04/26 15:51:14 INFO SecurityManager: Changing modify acls groups to:
22/04/26 15:51:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185); groups with view permissions: Set(); users with modify permissions: Set(185); groups with modify permissions: Set()
22/04/26 15:51:14 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-11-openjdk-11.0.8.10-0.el8_2.x86_64/bin/java" "-cp" "/opt/spark/conf/:/opt/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=42663" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@jupyterhub-nb-vemonet:42663" "--executor-id" "6206" "--hostname" "10.131.2.142" "--cores" "1" "--app-id" "app-20220426152543-0006" "--worker-url" "spark://Worker@10.131.2.142:46655"
```

The master gives even less information:
```
22/04/26 15:54:53 INFO Master: Removing executor app-20220426152543-0006/7079 because it is EXITED
22/04/26 15:54:53 INFO Master: Launching executor app-20220426152543-0006/7089 on worker worker-20220426144350-10.128.7.31-45619
```

I tried different numbers of cores and resource limits when deploying the Spark cluster, but no Spark cluster deployed with the radanalytics operator ever manages to `.collect()` anything. We can always connect to the cluster and seemingly create a dataframe on it (at least it looks that way; we are not sure the dataframe is really created).
Steps to reproduce:
- Get an OKD 4.6 cluster
- Deploy the Spark operator. You can use the latest version from the alpha channel; as usual with Operators there is no easy way to share exactly which version we use (which makes reproducibility hard to achieve), but you can check the OpenDataHub subscription: https://github.com/MaastrichtU-IDS/odh-manifests/blob/dsri/radanalyticsio/spark/cluster/base/subscription.yaml
- Deploy the Spark cluster:
```bash
cat <<EOF | oc apply -f -
apiVersion: radanalytics.io/v1
kind: SparkCluster
metadata:
  name: spark-cluster
spec:
  customImage: quay.io/radanalyticsio/openshift-spark:3.0.1-2
  worker:
    instances: '10'
  master:
    instances: '1'
EOF
```

- Try to create and collect a basic dataframe on this cluster:
```python
from pyspark.sql import SparkSession
from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row

spark_cluster_url = "spark://spark-cluster:7077"
spark = SparkSession.builder.master(spark_cluster_url).getOrCreate()
df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0)),
])
df.collect()
```

Questions
- Is it normal that every `.collect()` gets stuck when using Spark clusters deployed with the radanalytics operator?
- Does someone have a snippet of Python code that creates and `.collect()`s a dataframe on a radanalytics Spark cluster? (Maybe the problem comes from our testing Python code rather than the cluster, but we could not find a provided example to test whether the Spark cluster works as expected.)
Does anyone have an idea what this could be due to? @elmiko
Is there anyone here who has actually used a Spark cluster deployed by the radanalytics operator to run real pySpark computations? I would be interested to see the code to get some inspiration!