Use it with popular services #5

Open
mycaule opened this issue Sep 22, 2018 · 10 comments

@mycaule

mycaule commented Sep 22, 2018

Hello,

It would be nice if you could provide instructions on how to use it with AWS (AWS EMR, Flintrock on EC2) or GCP (Google Cloud Dataproc), and how to use it from IntelliJ as well.

This could be a great CLI alternative to Zeppelin.

@alexarchambault
Owner

FYI, this gist lists commands to get ammonite-spark up-and-running with an EMR cluster. I'd like to make it an actual tutorial, but didn't find the time to do that yet.

@alexarchambault
Owner

cc @mpacer, who was also interested in that (link in my previous comment)

@mycaule
Author

mycaule commented Nov 13, 2018

Thank you very much.

The script works on my side.

I found Heather Miller's tutorial on Flintrock + S3 quite cool, in case you write a tutorial from the gist one day.

> IMPORTANT!: Before we can go any further, we need to ensure that a handful of specific dependencies for S3 are available on our Spark cluster. As of the time of writing, this is a workaround, which can be solved by downloading a recent Hadoop 2.7.x distribution and a specific, older version of an AWS JAR (1.7.4) that is typically not available in the EC2 Maven Repository.

@mycaule
Author

mycaule commented Nov 13, 2018

I get this error message when trying to read Parquet on S3:

@ spark.read.parquet("s3a://bucket/path/to/parquet") 
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
  org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
  org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)

This import wasn't enough:

@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`

@alexarchambault
Owner

@mycaule Did you add the extra dependencies before creating the Spark session? (Or, if you added them afterwards, did you call AmmoniteSparkSession.sync()?)
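To make the ordering concrete, here is a minimal REPL sketch of the two workable patterns (coordinates copied from this thread; this is an untested sketch, not a verified recipe):

```scala
import $ivy.`sh.almond::ammonite-spark:0.1.1`
import org.apache.spark.sql._

// Pattern 1: add the S3/AWS dependencies BEFORE creating the session,
// so they are part of the classpath shipped to the executors.
import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`
import $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`

val spark = AmmoniteSparkSession.builder()
  .master("yarn")
  .getOrCreate()

// Pattern 2: if a dependency is only added AFTER getOrCreate(),
// re-sync the session's classpath with the REPL's:
import $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`
AmmoniteSparkSession.sync()
```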

@mycaule
Author

mycaule commented Nov 13, 2018

I added them after; I'll try this afternoon to add them before, or to use sync. Thanks.

@mycaule
Author

mycaule commented Nov 13, 2018

After adding the imports in the correct place,

@ import $ivy.`com.sun.jersey:jersey-client:1.9.1`, $ivy.`org.apache.spark::spark-sql:2.3.1`, $ivy.`sh.almond::ammonite-spark:0.1.1`
@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`
@ val spark = {
             AmmoniteSparkSession.builder()
               .progressBars()
               .master("yarn")
               .config("spark.executor.instances", "4")
               .config("spark.executor.memory", "2g")
               .getOrCreate()
           }
@ ...

... I get another error now; making progress...

I am using EMR 5.16 with the latest versions available and supported by the platform.

@ spark.read.parquet("s3a://bucket/path/to/parquet") 
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
  org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
  org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
  org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
  org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
  org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
  org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
  org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
  org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
  org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
  org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
  org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
  org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
  org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
  org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
  org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)

And without s3a (using the s3:// scheme instead):

@ spark.read.parquet("s3://bucket/path/to/parquet") 
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
  org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)

The same command works fine in the default spark-shell available with EMR (without imports).

scala> spark.read.parquet("s3://bucket/path/to/parquet")
18/11/13 17:05:56 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
18/11/13 17:05:56 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
18/11/13 17:05:57 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = [id: int, idannonce: int ... 198 more fields]

@mycaule
Author

mycaule commented Nov 16, 2018

To solve the Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found error, two related issues tell us to add these to the classpath:

/usr/share/aws/emr/emrfs/lib/*
/usr/share/aws/emr/emrfs/auxlib/*
/usr/share/aws/emr/emr-metrics/lib/*
/usr/share/aws/emr/emrfs/conf

or 

[
  {
    "classification":"spark-defaults",
    "properties": {
      "spark.executor.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.driver.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
    }
  }
]
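For completeness, a classification like the one above is usually supplied at cluster creation time via the real `aws emr create-cluster --configurations` flag. A hedged, untested sketch (release label and instance settings are placeholders; note the EMR configuration API capitalizes the JSON keys):

```shell
# Sketch (untested): pass the spark-defaults classification above when
# creating the EMR cluster. Instance settings are placeholders.
cat > emrfs-classpath.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
      "spark.executor.extraClassPath": "/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
    }
  }
]
EOF
aws emr create-cluster \
  --release-label emr-5.16.0 \
  --applications Name=Spark \
  --instance-type m4.large \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://emrfs-classpath.json
```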

spark-notebook/spark-notebook#368
https://forums.aws.amazon.com/thread.jspa?messageID=699917

https://github.com/alexarchambault/ammonite-spark/blob/develop/INTERNALS.md

@kyprifog

kyprifog commented Mar 26, 2020

@mycaule I am getting the same
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong
error when trying to access S3 from spark-notebook.

I think it's a Spark/Hadoop version issue, though.

@kyprifog

Downgrading to

import $ivy.`org.apache.hadoop:hadoop-aws:2.7.5`

fixed the issue for me (using Spark 2.4.2).
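That matches the usual diagnosis: the hadoop-aws artifact has to line up with the Hadoop version already on the Spark/cluster classpath. A quick way to check from the REPL, assuming a session with Hadoop on the classpath (VersionInfo is a real Hadoop API; the printed value depends on your distribution):

```scala
// Print the Hadoop version Spark is actually running against, then pick
// the matching org.apache.hadoop:hadoop-aws:<version> artifact.
println(org.apache.hadoop.util.VersionInfo.getVersion)
```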
