Hi @jcrist. I'm trying to understand whether Skein is the right tool for my needs.
We have a fairly large static cluster on GCP, and we launch jobs on it from login/bastion/edge nodes (not on GCP). Along with distributed jobs like Hive and Spark, we also run a lot of single-machine jobs.
Currently we don't have a good solution for running these single-machine Python jobs on the cluster nodes, so we end up running them on the edge nodes themselves (ideally edge nodes should not be used for computation).
Skein really seems like a good solution for this problem.
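For context, my understanding is that a single-machine Python job would map onto a Skein application spec along these lines (a rough sketch; the name, resource amounts, and script below are placeholders I made up, not something from our setup):

```yaml
# Hypothetical Skein application spec: run one Python script in a
# single YARN container (sketch only; values are placeholders).
name: single-node-job
master:
  resources:
    vcores: 4        # cores for the single container
    memory: 8 GiB    # memory for the single container
  files:
    my_job.py: my_job.py   # ship the script with the application
  script: |
    python my_job.py
```

which, if I've read the docs right, would be submitted with something like `skein application submit spec.yaml`.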
However, we could also run such a Python job through PySpark by just adding the following lines at the top of the Python file:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
```

and then doing a `spark-submit` with `--driver-memory` and `--driver-cores` specified.
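Concretely, I imagine the submission would look something like this (a sketch; the script name and resource values are placeholders, and my assumption is that in `cluster` deploy mode the driver, and hence the single-machine workload, runs inside a YARN container rather than on the edge node):

```shell
# Hypothetical spark-submit for a single-machine Python job.
# --deploy-mode cluster places the driver (where the job's code runs)
# on a cluster node, keeping the edge node free of computation.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --driver-cores 4 \
  my_job.py
```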
- I'd really appreciate it if you and the community here could share your thoughts on this.
- Do you see any potential problems with the above approach?
- What are the use cases where people are using Skein?