
Where should one use Skein? #227


Description

@apoorvreddy

Hi @jcrist. I'm trying to understand whether Skein is the right tool for my needs.

We have a fairly large static cluster on GCP, and we launch jobs on it from login/bastion/edge nodes (not on GCP). Along with distributed jobs such as Hive and Spark, we also run a lot of single-machine jobs.

Currently we don't have a good solution for running these single-machine Python jobs on the cluster nodes, so we end up running them on the edge nodes themselves (ideally, edge nodes should not be used for computation).

Skein really seems like a good solution for this problem.
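To make the comparison concrete, here is a rough sketch of what I imagine such a single-machine job would look like as a Skein application spec. This is just my guess at a minimal `spec.yaml`; the service name, resource sizes, and `my_script.py` are placeholders, so the exact schema should be checked against the Skein docs:

```yaml
# Hypothetical Skein spec: one service -> one YARN container running a plain Python script.
name: single-node-python-job
queue: default                     # YARN queue; placeholder value
services:
  task:
    resources:
      memory: 4 GiB                # container memory; sized per job
      vcores: 2                    # container cores
    files:
      my_script.py: my_script.py   # ship the local script into the container
    script: |
      python my_script.py
```

which, as I understand it, would be submitted from an edge node with something like `skein application submit spec.yaml`.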

However, we could also run such a Python job via PySpark by just adding the following lines at the top of the Python file:

from pyspark.sql import SparkSession
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

and then doing a spark-submit with --driver-memory and --driver-cores specified.
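For reference, the spark-submit invocation I have in mind would be something along these lines (the file name and resource sizes are placeholders):

```shell
# Run my_job.py as a Spark driver on a cluster node; only the driver does real
# work here, since the script itself is single-machine Python.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --driver-memory 8g \
  --driver-cores 4 \
  my_job.py
```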

  • I'd really appreciate it if you and the community here could share your thoughts on this.
  • Do you see any potential problems with the above approach?
  • What are the use cases where people are using Skein?
