http://www.cs.berkeley.edu/~jey/ampcamp6/training/data-exploration-using-spark.html
For example, the tutorial says:
"If you look closely at the terminal, the console log is pretty chatty and tells you the progress of the tasks."
"If you examine the console log closely, you will see lines like this, indicating some data was added to the cache"
But my console seems to be hiding the detailed log output:
17:46 steve@fisher:~/work/ampcamp6/ampcamp6$ spark/bin/pyspark
Python 2.7.9 (default, Apr 2 2015, 15:33:21)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/steve/work/ampcamp6/ampcamp6/spark/lib/ampcamp-keystoneml.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/steve/work/ampcamp6/ampcamp6/spark/lib/spark-assembly-1.5.1-hadoop2.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/
Using Python version 2.7.9 (default, Apr 2 2015 15:33:21)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sc
<pyspark.context.SparkContext object at 0x7efd7d5eac10>
>>> pagecounts = sc.textFile('data/pagecounts')
>>> pagecounts
MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:-2
>>> pagecounts.take(10)
[u'20090507-040000 aa Main_Page 7 51309', u'20090507-040000 ab %D0%90%D0%B8%D0%BD%D1%82%D0%B5%D1%80%D0%BD%D0%B5%D1%82 1 34069', u'20090507-040000 ab %D0%98%D1%85%D0%B0%D0%B4%D0%BE%D1%83_%D0%B0%D0%B4%D0%B0%D2%9F%D1%8C%D0%B0 3 65763', u'20090507-040000 af.b Tuisblad 1 36231', u'20090507-040000 af.d Tuisblad 1 58960', u'20090507-040000 af.q Tuisblad 1 44265', u'20090507-040000 af Afrikaans 3 80838', u'20090507-040000 af Australi%C3%AB 1 132433', u'20090507-040000 af Ensiklopedie 2 60584', u'20090507-040000 af Internet 1 48816']
>>> print '\n'.join(pagecounts.take(10))
20090507-040000 aa Main_Page 7 51309
20090507-040000 ab %D0%90%D0%B8%D0%BD%D1%82%D0%B5%D1%80%D0%BD%D0%B5%D1%82 1 34069
20090507-040000 ab %D0%98%D1%85%D0%B0%D0%B4%D0%BE%D1%83_%D0%B0%D0%B4%D0%B0%D2%9F%D1%8C%D0%B0 3 65763
20090507-040000 af.b Tuisblad 1 36231
20090507-040000 af.d Tuisblad 1 58960
20090507-040000 af.q Tuisblad 1 44265
20090507-040000 af Afrikaans 3 80838
20090507-040000 af Australi%C3%AB 1 132433
20090507-040000 af Ensiklopedie 2 60584
20090507-040000 af Internet 1 48816
>>> pagecounts.count()
1398882
>>> enPages = pagecounts.filter(lambda x: x.split(' ')[1] == 'en').cache()
>>> enPages.count()
970545
>>> enPages.count()
970545
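One thing that might be worth trying, as a sketch rather than a confirmed fix: if the log4j configuration bundled with this download sets the level above INFO (an assumption, I have not checked), the task-progress and cache messages would be suppressed. `SparkContext.setLogLevel` has been available since Spark 1.4, so the level could be raised from inside the shell. The helper name `show_task_logs` below is hypothetical and only wraps that call with a check on the level name.

```python
# Level names accepted by SparkContext.setLogLevel.
VALID_LEVELS = {"ALL", "DEBUG", "ERROR", "FATAL", "INFO", "OFF", "TRACE", "WARN"}


def show_task_logs(spark_context, level="INFO"):
    """Ask Spark to log at `level` (hypothetical helper, not part of the tutorial).

    `spark_context` is expected to be the `sc` object that the pyspark shell
    creates. Returns the normalized level name that was applied.
    """
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError("unknown log level: %s" % level)
    spark_context.setLogLevel(level)
    return level


# Intended use inside the pyspark shell, where `sc` and `enPages` already exist:
#   show_task_logs(sc)
#   enPages.count()
```

If the chatty output then shows up on the next `enPages.count()`, the tutorial text is fine and only the default log level in this distribution differs from what it describes.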