Data Engineering course (2023-1)
Hadoop is an Apache top-level open-source project for distributed processing of large-scale data; it runs the processing code on the nodes where the data resides. This repository is a collection of MapReduce programs written for the Hadoop ecosystem. It includes Projection, Selection, Join, K-Means, and Top-K. I also wrote code that performs matrix multiplication in a single MapReduce step, using a composite-key comparator, a partitioner, and a grouping comparator, and I implemented matrix multiplication with Spark as well. The HW folders contain completed Hadoop programming assignments, such as listing movies by genre in descending order of review rating.
- Programming Language: Java
- OS: Linux CLI (command-line interface)
- File System: HDFS API
- Dataset: IMDB, Uber dataset, and MNIST digit dataset
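
The single-step matrix multiplication mentioned above can be illustrated with a small in-memory simulation in plain Java. This is only a sketch, not the code in this repository: it uses no Hadoop classes, it assumes dense matrices, and the names `OneStepMatMul`, `Entry`, and `SORT_ORDER` are hypothetical. The composite key layout (output row, output column, join index `j`, source tag) is also an assumption about one way such a key could be organized. In a real job, a partitioner and a grouping comparator on (row, column) would send every record for one output cell to the same reduce call; here a single sorted list stands in for that shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class OneStepMatMul {

    // Hypothetical composite key plus value: output cell (row, col),
    // join index j, and source tag (0 = from A, 1 = from B).
    static final class Entry {
        final int row, col, j, tag;
        final double value;

        Entry(int row, int col, int j, int tag, double value) {
            this.row = row;
            this.col = col;
            this.j = j;
            this.tag = tag;
            this.value = value;
        }
    }

    // Stand-in for the composite-key sort comparator: order by output cell,
    // then by j, then by tag, so A(i,j) lands right before B(j,k).
    static final Comparator<Entry> SORT_ORDER =
            Comparator.<Entry>comparingInt(e -> e.row)
                    .thenComparingInt(e -> e.col)
                    .thenComparingInt(e -> e.j)
                    .thenComparingInt(e -> e.tag);

    // Map phase: replicate each A(i,j) to every output column k and
    // each B(j,k) to every output row i.
    static List<Entry> map(double[][] a, double[][] b) {
        int m = a.length, n = b.length, p = b[0].length;
        List<Entry> out = new ArrayList<>();
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < p; k++)
                    out.add(new Entry(i, k, j, 0, a[i][j]));
        for (int j = 0; j < n; j++)
            for (int k = 0; k < p; k++)
                for (int i = 0; i < m; i++)
                    out.add(new Entry(i, k, j, 1, b[j][k]));
        return out;
    }

    // Shuffle + reduce: after sorting, records arrive as adjacent
    // (A(i,j), B(j,k)) pairs, so the reducer multiplies each pair and
    // accumulates the product into C(i,k). Assumes dense inputs.
    static double[][] multiply(double[][] a, double[][] b) {
        List<Entry> shuffled = map(a, b);
        shuffled.sort(SORT_ORDER);
        double[][] c = new double[a.length][b[0].length];
        for (int idx = 0; idx + 1 < shuffled.size(); idx += 2) {
            Entry fromA = shuffled.get(idx);
            Entry fromB = shuffled.get(idx + 1);
            c[fromA.row][fromA.col] += fromA.value * fromB.value;
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        // prints [[19.0, 22.0], [43.0, 50.0]]
        System.out.println(Arrays.deepToString(multiply(a, b)));
    }
}
```

Sorting on the full composite key while grouping only on (row, column) is what lets the reducer stream through its values pairwise instead of buffering one matrix's entries in memory; the price is that every input element is replicated once per output row or column during the map phase.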