Use it with popular services #5

Open
mycaule opened this issue Sep 22, 2018 · 10 comments

@mycaule

mycaule commented Sep 22, 2018

Hello,

It would be nice if you could provide instructions on how to use it with AWS (AWS EMR, Flintrock on EC2) or GCP (Google Cloud Dataproc), and how to use it from IntelliJ as well.

This could be a great CLI alternative to Zeppelin.

@alexarchambault
Owner

FYI, this gist lists commands to get ammonite-spark up-and-running with an EMR cluster. I'd like to make it an actual tutorial, but didn't find the time to do that yet.

@alexarchambault
Owner

cc @mpacer, who was also interested in that (link in my previous comment)

@mycaule
Author

mycaule commented Nov 13, 2018

Thank you very much.

The script works on my side.

I found Heather Miller's tutorial on Flintrock + S3 quite cool, in case you write a tutorial from the gist one day.

> IMPORTANT!: Before we can go any further, we need to ensure that a handful of specific dependencies for S3 are available on our Spark cluster. As of the time of writing, this is a workaround, which can be solved by downloading a recent Hadoop 2.7.x distribution and a specific, older version of an AWS JAR (1.7.4) that is typically not available in the EC2 Maven Repository.

@mycaule
Author

mycaule commented Nov 13, 2018

I get this error message when trying to read Parquet on S3:

@ spark.read.parquet("s3a://bucket/path/to/parquet") 
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
  org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
  org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

This import wasn't enough:

@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`

@alexarchambault
Owner

@mycaule Did you add the extra dependencies before creating the Spark session? (Or, if you added them afterwards, did you call AmmoniteSparkSession.sync()?)
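To make the ordering concrete, here is a minimal REPL sketch of the two workable patterns (coordinates copied from this thread; this is an untested sketch, not a verified recipe):

```scala
import $ivy.`sh.almond::ammonite-spark:0.1.1`
import org.apache.spark.sql._

// Pattern 1: add the S3/AWS dependencies BEFORE creating the session,
// so they are part of the classpath shipped to the executors.
import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`
import $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`

val spark = AmmoniteSparkSession.builder()
  .master("yarn")
  .getOrCreate()

// Pattern 2: if a dependency is only added AFTER getOrCreate(),
// re-sync the session's classpath with the REPL's:
import $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`
AmmoniteSparkSession.sync()
```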

@mycaule
Author

mycaule commented Nov 13, 2018

I added them after; I'll try this afternoon to add them before, or to use sync. Thanks.

@mycaule
Author

mycaule commented Nov 13, 2018

After adding the imports in the correct place,

@ import $ivy.`com.sun.jersey:jersey-client:1.9.1`, $ivy.`org.apache.spark::spark-sql:2.3.1`, $ivy.`sh.almond::ammonite-spark:0.1.1`
@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`
@ val spark = {
             AmmoniteSparkSession.builder()
               .progressBars()
               .master("yarn")
               .config("spark.executor.instances", "4")
               .config("spark.executor.memory", "2g")
               .getOrCreate()
           }
@ ...

... I get another error now; making progress...

I am using EMR 5.16 with the latest versions available and supported by the platform.

@ spark.read.parquet("s3a://bucket/path/to/parquet") 
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
  org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
  org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
  org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
  org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
  org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
  org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
  org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
  org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
  org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
  org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
  org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
  org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)

And without s3a (using the s3:// scheme instead):

@ spark.read.parquet("s3://bucket/path/to/parquet") 
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
  org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)

The same command works fine in the default spark-shell available with EMR (without imports).

scala> spark.read.parquet("s3://bucket/path/to/parquet")
18/11/13 17:05:56 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
18/11/13 17:05:56 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
18/11/13 17:05:57 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = [id: int, idannonce: int ... 198 more fields]

@mycaule
Author

mycaule commented Nov 16, 2018

To solve the Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found error, two related issues tell us to add these to the classpath:

/usr/share/aws/emr/emrfs/lib/*
/usr/share/aws/emr/emrfs/auxlib/*
/usr/share/aws/emr/emr-metrics/lib/*
/usr/share/aws/emr/emrfs/conf

or 

[
  {
    "classification":"spark-defaults",
    "properties": {
      "spark.executor.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.driver.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
    }
  }
]
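For completeness, a classification like the one above is usually supplied at cluster creation time via the real `aws emr create-cluster --configurations` flag. A hedged, untested sketch (release label and instance settings are placeholders; note the EMR configuration API capitalizes the JSON keys):

```shell
# Sketch (untested): pass the spark-defaults classification above when
# creating the EMR cluster. Instance settings are placeholders.
cat > emrfs-classpath.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.executor.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
    }
  }
]
EOF
aws emr create-cluster \
  --release-label emr-5.16.0 \
  --applications Name=Spark \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://emrfs-classpath.json
```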

spark-notebook/spark-notebook#368
https://forums.aws.amazon.com/thread.jspa?messageID=699917

https://github.com/alexarchambault/ammonite-spark/blob/develop/INTERNALS.md

@kyprifog

kyprifog commented Mar 26, 2020

@mycaule I am getting the same
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong
error when trying to access S3 from spark-notebook.

I think it's a Spark/Hadoop version issue, though.

@kyprifog

Downgrading to

import $ivy.`org.apache.hadoop:hadoop-aws:2.7.5`

fixed the issue for me (using Spark 2.4.2).
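That matches the usual diagnosis: the hadoop-aws artifact has to line up with the Hadoop version already on the Spark/cluster classpath. A quick way to check from the REPL, assuming a session with Hadoop on the classpath (VersionInfo is a real Hadoop API; the printed value depends on your distribution):

```scala
// Print the Hadoop version Spark is actually running against, then pick
// the matching org.apache.hadoop:hadoop-aws:<version> artifact.
println(org.apache.hadoop.util.VersionInfo.getVersion)
```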
