Instructors: Professor Juliana Freire, Dr. Erin Carson, Dr. Nick Knight
Text: Mining of Massive Datasets by Anand Rajaraman, Jure Leskovec and Jeff Ullman.
The objective of this course is to study the foundations of data storage and processing at scale.
Concepts and tools used in the course include:
- Relational algebra
- SQL
- Distributed File Systems and MapReduce
- Apache Hadoop and Apache Spark
- Amazon Web Services
- Algorithms for finding similar items and frequent itemsets
You can find an overview and details on the course website: https://vgc.poly.edu/~juliana/courses/BigData2016/
- Assignment 1: Querying NYC Taxi data using SQL
- Assignment 2: NYC Taxi data processing using MapReduce (Hadoop)
Group: Maria Leonor Zamora Maass (mzm239@nyu.edu), Luisa Eugenia Quispe Ortiz (lqo202@nyu.edu)
The objective of the term project was to analyze a massive dataset using the concepts learned in the course. We chose to analyze taxi data, focusing on short trips (those that could have been made on foot or by bike).
The final report for this project can be found here.
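As a rough illustration of the short-trip focus described above, the sketch below filters trip records by distance. It is not the project's code: the `Trip` record, the `short_trips` helper, and the 1.0-mile cutoff are all hypothetical, since the actual definition of a short trip is not given here.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the project's actual "short trip" definition is not stated here.
SHORT_TRIP_MILES = 1.0


@dataclass
class Trip:
    # Illustrative record; real taxi data has many more fields.
    trip_id: str
    distance_miles: float


def short_trips(trips, max_miles=SHORT_TRIP_MILES):
    """Return trips with a positive distance no greater than max_miles."""
    return [t for t in trips if 0 < t.distance_miles <= max_miles]


# Made-up sample records, for demonstration only.
sample = [Trip("a", 0.4), Trip("b", 3.2), Trip("c", 1.0), Trip("d", 0.0)]
print([t.trip_id for t in short_trips(sample)])  # prints ['a', 'c']
```

Zero-distance records are excluded on the assumption that they are recording errors rather than real trips.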