chaewonkwak/DataEngineering

Hadoop & Spark

Data Engineering course (2023-1)

Overview

Hadoop is an Apache top-level open-source project for distributed processing of large-scale data. It runs the processing code on the nodes where the data resides. This project is a collection of MapReduce programs written for the Hadoop ecosystem. It includes Projection, Selection, Join, K-Means, and Top-K. Additionally, I wrote code that performs matrix multiplication in a single MapReduce step, using a composite key comparator, a partitioner, and a grouping comparator. I also implemented matrix multiplication with Spark. The HW folders contain completed Hadoop programming assignments, such as listing movies by genre in descending order of review rating.
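To give a rough idea of how matrix multiplication can fit into one MapReduce step, here is a minimal sketch in plain Java. It is not the code from this repository and does not use the Hadoop API; it only simulates the map, sort, and reduce phases in memory. The class `OneStepMatMul`, the `Entry` type, and the `(i, j, k)` key layout are assumptions made for illustration. The sketch also assumes dense matrices, so every key `(i, j, k)` receives exactly one value from A and one from B.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical in-memory simulation of a one-step MapReduce matrix multiply.
class OneStepMatMul {

    // Composite key (i, j, k) together with the matrix value it carries.
    // (i, j) identifies the output cell, k is the shared inner index.
    static final class Entry {
        final int i, j, k;
        final double value;

        Entry(int i, int j, int k, double value) {
            this.i = i;
            this.j = j;
            this.k = k;
            this.value = value;
        }

        int i() { return i; }
        int j() { return j; }
        int k() { return k; }
    }

    // Map phase: replicate A[i][k] to every output column j,
    // and B[k][j] to every output row i.
    static List<Entry> map(double[][] a, double[][] b) {
        int n = a.length, m = b.length, p = b[0].length;
        List<Entry> out = new ArrayList<>();
        for (int i = 0; i < n; i++)
            for (int k = 0; k < m; k++)
                for (int j = 0; j < p; j++) {
                    out.add(new Entry(i, j, k, a[i][k])); // from A
                    out.add(new Entry(i, j, k, b[k][j])); // from B
                }
        return out;
    }

    static double[][] multiply(double[][] a, double[][] b) {
        List<Entry> entries = map(a, b);

        // Sort phase: order by the full composite key. In a real job the
        // partitioner and grouping comparator would look only at (i, j).
        entries.sort(Comparator.comparingInt(Entry::i)
                .thenComparingInt(Entry::j)
                .thenComparingInt(Entry::k));

        // Reduce phase: after sorting, the two values that share (i, j, k)
        // are adjacent, so multiply each pair and add it to cell (i, j).
        double[][] c = new double[a.length][b[0].length];
        for (int idx = 0; idx + 1 < entries.size(); idx += 2) {
            Entry x = entries.get(idx);
            Entry y = entries.get(idx + 1);
            c[x.i][x.j] += x.value * y.value;
        }
        return c;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4}};
        double[][] b = {{5, 6}, {7, 8}};
        // prints [[19.0, 22.0], [43.0, 50.0]]
        System.out.println(Arrays.deepToString(multiply(a, b)));
    }
}
```

Sorting on the inner index `k` is what lets the reducer pair values without buffering a whole group in memory. That is one common reason to use a composite key with a custom comparator, though the repository's actual key design may differ.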

Development Environment

  • Programming Language: Java
  • OS: Linux CLI (command-line interface)
  • File System: HDFS API
  • Dataset: IMDB, Uber dataset, and MNIST digit dataset
