Hadoop & Spark

Data Engineering course (2023, Semester 1)

Overview

Hadoop is an Apache top-level open-source project for distributed processing of large-scale data; it runs the processing code on the nodes where the data resides. This repository is a collection of MapReduce programs written for the Hadoop ecosystem. It includes Projection, Selection, Join, K-Means, and Top-K. I also wrote code that performs matrix multiplication in a single MapReduce step, using a composite key comparator, a partitioner, and a grouping comparator (Grouper), and I implemented matrix multiplication with Spark. The HW folders contain completed Hadoop programming projects, such as listing movies by genre in descending order of review rating.
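To illustrate the single-step matrix multiplication idea, here is a minimal plain-Java sketch. It is not the code in MatrixMulSingleStep.java and does not use the Hadoop API. It only simulates the map, shuffle, and reduce phases in memory. The class name `SingleStepMatMulSketch`, the `Rec` record type, and the key layout (row, col, j, tag) are assumptions made for this example, and the matrices are assumed to be dense and non-empty. Each entry is replicated to every output cell that needs it. Partitioning and grouping look only at (row, col), while sorting uses the full composite key, so each A[row][j] reaches the reducer immediately before its matching B[j][col].

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SingleStepMatMulSketch {

    /** One shuffled record: composite key (row, col, j, tag) plus the matrix entry. */
    static final class Rec {
        final int row, col, j, tag; // tag: 0 = entry came from A, 1 = entry came from B
        final long value;

        Rec(int row, int col, int j, int tag, long value) {
            this.row = row; this.col = col; this.j = j; this.tag = tag; this.value = value;
        }
    }

    // Sort comparator stand-in: orders by the full composite key.
    static final Comparator<Rec> SORT = Comparator
            .comparingInt((Rec r) -> r.row)
            .thenComparingInt(r -> r.col)
            .thenComparingInt(r -> r.j)
            .thenComparingInt(r -> r.tag);

    // Partitioner stand-in: only (row, col) picks the reducer, so all records
    // for one output cell land in the same partition.
    static int partition(Rec r, int numReducers) {
        return Math.floorMod(31 * r.row + r.col, numReducers);
    }

    // Grouping comparator stand-in: same (row, col) means same reduce call.
    static boolean sameGroup(Rec a, Rec b) {
        return a.row == b.row && a.col == b.col;
    }

    static void emit(List<List<Rec>> partitions, Rec r) {
        partitions.get(partition(r, partitions.size())).add(r);
    }

    /** Computes C = A x B for dense matrices (A is m x n, B is n x p). */
    static long[][] multiply(long[][] a, long[][] b, int numReducers) {
        int m = a.length, n = b.length, p = b[0].length;
        List<List<Rec>> partitions = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) partitions.add(new ArrayList<>());

        // Map over A: entry A[i][j] is needed by every output cell (i, k).
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < p; k++)
                    emit(partitions, new Rec(i, k, j, 0, a[i][j]));

        // Map over B: entry B[j][k] is needed by every output cell (i, k).
        for (int j = 0; j < n; j++)
            for (int k = 0; k < p; k++)
                for (int i = 0; i < m; i++)
                    emit(partitions, new Rec(i, k, j, 1, b[j][k]));

        // Reduce: after sorting, each A record is directly followed by the
        // B record with the same j, so the sum can be streamed per group.
        long[][] c = new long[m][p];
        for (List<Rec> part : partitions) {
            part.sort(SORT);
            Rec current = null;
            long sum = 0, pendingA = 0;
            for (Rec rec : part) {
                if (current == null || !sameGroup(current, rec)) {
                    if (current != null) c[current.row][current.col] = sum;
                    current = rec;
                    sum = 0;
                }
                if (rec.tag == 0) pendingA = rec.value;
                else sum += pendingA * rec.value;
            }
            if (current != null) c[current.row][current.col] = sum;
        }
        return c;
    }

    public static void main(String[] args) {
        long[][] a = {{1, 2}, {3, 4}};
        long[][] b = {{5, 6}, {7, 8}};
        // prints [[19, 22], [43, 50]]
        System.out.println(Arrays.deepToString(multiply(a, b, 2)));
    }
}
```

Because the records for one cell arrive already paired by j, the reducer in this sketch keeps only one pending A value at a time instead of buffering a whole row and column. In a real Hadoop job the three stand-ins would correspond to a sort comparator, a Partitioner, and a grouping comparator configured on the Job.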

Development Environment

Programming Language: Java
OS: Linux CLI (command-line interface)
File System: HDFS API
Dataset: IMDB, Uber dataset, and MNIST digit dataset

Folders and files

HW1
HW2
HW3
Python
Spark
KMeans.java
KMeansWithCombiner.java
MatrixAdd.java
MatrixMul.java
MatrixMulSingleStep.java
ProSel.java
README.md
ReduceSideJoin.java
ReduceSideJoin2.java
TopK.java

chaewonkwak/DataEngineering
