Use it with popular services #5
Comments
FYI, this gist lists commands to get ammonite-spark up and running with an EMR cluster. I'd like to turn it into an actual tutorial, but haven't found the time yet. |
cc @mpacer, who was also interested in that (link in my previous comment) |
Thank you very much. The script works on my side. I found Heather Miller's tutorial on Flintrock + S3 quite good; it could serve as a model if you one day write a tutorial from the gist.
|
I get this error message when trying to read Parquet on S3:
@ spark.read.parquet("s3a://bucket/path/to/parquet")
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
  org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
  org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2593)
  org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
This import wasn't enough:
@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336` |
@mycaule Did you add the extra dependencies before creating the Spark session? (or call |
I added it after, will try this afternoon to add them before or use |
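The question above (were the dependencies added before the session was created?) can be checked from the REPL by asking whether a class is actually visible to a given class loader. The following is a minimal sketch, not part of ammonite-spark: `ClasspathCheck` and `isLoadable` are hypothetical names introduced here, and defaulting to the thread context class loader is an assumption about which loader matters in a given session.

```scala
// Hypothetical helper (not an ammonite-spark API): probes whether a class
// can be resolved by a class loader without initializing it.
object ClasspathCheck {
  def isLoadable(
      className: String,
      // Assumption: the thread context class loader is the relevant one.
      loader: ClassLoader = Thread.currentThread().getContextClassLoader
  ): Boolean =
    try {
      // initialize = false: resolve the class but do not run static initializers.
      Class.forName(className, false, loader)
      true
    } catch {
      // Missing class, or a class whose own dependencies cannot be linked.
      case _: ClassNotFoundException | _: LinkageError => false
    }
}
```

For example, `ClasspathCheck.isLoadable("org.apache.hadoop.fs.s3a.S3AFileSystem")` could be evaluated before and after the `$ivy` imports to see whether the S3A classes have become visible.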
After adding the imports at the correct place,
@ import $ivy.`com.sun.jersey:jersey-client:1.9.1`, $ivy.`org.apache.spark::spark-sql:2.3.1`, $ivy.`sh.almond::ammonite-spark:0.1.1`
@ import $ivy.`org.apache.hadoop:hadoop-aws:2.8.4`, $ivy.`com.amazonaws:aws-java-sdk-s3:1.11.336`, $ivy.`com.amazonaws:aws-java-sdk-emr:1.11.336`
@ val spark = {
    AmmoniteSparkSession.builder()
      .progressBars()
      .master("yarn")
      .config("spark.executor.instances", "4")
      .config("spark.executor.memory", "2g")
      .getOrCreate()
  }
@ ... ...
I get another error now, so I am making progress. I am using EMR 5.16 with the latest versions available and supported by the platform.
@ spark.read.parquet("s3a://bucket/path/to/parquet")
java.lang.IllegalAccessError: tried to access method org.apache.hadoop.metrics2.lib.MutableCounterLong.<init>(Lorg/apache/hadoop/metrics2/MetricsInfo;J)V from class org.apache.hadoop.fs.s3a.S3AInstrumentation
org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:194)
org.apache.hadoop.fs.s3a.S3AInstrumentation.streamCounter(S3AInstrumentation.java:216)
org.apache.hadoop.fs.s3a.S3AInstrumentation.<init>(S3AInstrumentation.java:139)
org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:174)
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2598)
org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2632)
org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2614)
org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:45)
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:354)
org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)
And without s3a:
@ spark.read.parquet("s3://bucket/path/to/parquet")
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2154)
org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2580)
The same command works fine in the default spark-shell available with EMR (without imports).
scala> spark.read.parquet("s3://bucket/path/to/parquet")
18/11/13 17:05:56 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
18/11/13 17:05:56 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
18/11/13 17:05:57 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
res0: org.apache.spark.sql.DataFrame = [id: int, idannonce: int ... 198 more fields] |
To solve the
spark-notebook/spark-notebook#368 https://github.com/alexarchambault/ammonite-spark/blob/develop/INTERNALS.md |
@mycaule I am getting the same. I think it's a Spark/Hadoop version issue, though. |
Downgrading to
fixed the issue for me (using spark 2.4.2) |
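The version-mismatch hypothesis in the comments above can be sanity-checked before importing anything: Hadoop's own documentation advises using a `hadoop-aws` artifact whose version matches the `hadoop-common` already on the classpath. The sketch below is only an illustration of that comparison. `HadoopVersions`, `majorMinor`, and `sameLine` are hypothetical names, comparing only major and minor numbers is a coarse heuristic rather than a guarantee, and the versions are passed in as plain strings so the block has no Hadoop dependency (in a live session the runtime version might be read from `org.apache.hadoop.util.VersionInfo.getVersion`, which is not called here).

```scala
import scala.util.Try

// Hypothetical helper: coarse check that two Hadoop-related artifacts
// belong to the same major.minor release line.
object HadoopVersions {
  // Extracts (major, minor) from strings such as "2.8.4" or "2.8.4-amzn-1".
  // Returns None when the first two components are not integers.
  def majorMinor(version: String): Option[(Int, Int)] =
    version.split('.').toList match {
      case major :: minor :: _ =>
        for {
          a <- Try(major.toInt).toOption
          b <- Try(minor.toInt).toOption
        } yield (a, b)
      case _ => None
    }

  // True only when both versions parse and share major and minor numbers.
  def sameLine(a: String, b: String): Boolean =
    (majorMinor(a), majorMinor(b)) match {
      case (Some(x), Some(y)) => x == y
      case _                  => false
    }
}
```

A call such as `HadoopVersions.sameLine("2.8.4", runtimeHadoopVersion)`, where `runtimeHadoopVersion` is whatever version the cluster reports, would flag an obviously mismatched `hadoop-aws` before it is loaded.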
Hello,
It would be nice if you could provide instructions on how to use it with AWS (AWS EMR, Flintrock on EC2) or GCP (Google Cloud Dataproc), and how to use it from IntelliJ as well.
This could be a great CLI alternative to Zeppelin.