Bosch-Production-Line-Kaggle

The main objective of this project is to understand the manufacturing process at Bosch and to explore its production lines and stations with advanced analytics in order to improve productivity.

DATASET

The data set for this project is taken from the Kaggle competition Bosch Production Line Performance. The data records measurements of parts as they pass through the many stations of the production line. The manufacturing system assigns a unique Id to each part and records its response (1 = fail, 0 = pass), which indicates whether the part failed or passed the quality control test. The features are anonymized, and each feature name encodes the line number, station number and feature number to which the measurement belongs.
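
As an illustration of how such anonymized names can be decoded, the sketch below assumes the column names follow the pattern `L<line>_S<station>_F<feature>`. The pattern, the helper name `parse_feature_name` and the example name are assumptions made for illustration; they are not taken from the scripts in this repository.

```python
import re
from typing import NamedTuple


class FeatureId(NamedTuple):
    line: int
    station: int
    feature: int


# Assumed naming pattern: L<line>_S<station>_F<feature>
_PATTERN = re.compile(r"^L(\d+)_S(\d+)_F(\d+)$")


def parse_feature_name(name: str) -> FeatureId:
    """Split an anonymized feature name into line, station and feature numbers."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected feature name: {name!r}")
    line, station, feature = (int(group) for group in match.groups())
    return FeatureId(line, station, feature)


# Hypothetical column name, used only to show the call
print(parse_feature_name("L1_S24_F1846"))
```

Parsing the names this way makes it possible to group features by line or station before plotting or modeling.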

PREPROCESSING AND FEATURE EXTRACTION

A simple exploration of the data shows that the class label has 6879 positive labels compared to 1176868 negative labels, a ratio of about 1:171, so the dataset is heavily class imbalanced. To address this, we combine SMOTE (Synthetic Minority Over-sampling Technique) with undersampling to create a hybrid dataset. SMOTE increases the share of the minority class by generating new synthetic samples rather than duplicating existing ones. We apply an oversampling of 100% to double the number of minority-class samples. Similarly, we apply an undersampling of 200% so that the retained majority-class samples number twice the minority-class samples. This blend of both sampling techniques shifts the learner's initial bias so that it also accounts for the minority class.
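
The resampling idea can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the code used in this project: the function names, the choice of `k`, and the toy data are assumptions. In particular, the sketch assumes that the 200% undersampling is measured relative to the number of synthetic samples (two majority samples kept per synthetic sample); other tools define these percentages differently.

```python
import math
import random


def smote_oversample(minority, k=5, rng=None):
    """Create one synthetic sample per minority sample (100% oversampling).

    Each synthetic point lies on the segment between a minority sample and
    one of its k nearest minority neighbours. Needs at least 2 samples.
    """
    if rng is None:
        rng = random.Random(0)
    synthetic = []
    for i, x in enumerate(minority):
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic


def hybrid_resample(minority, majority, k=5, seed=0):
    """Oversample the minority class, then randomly undersample the majority."""
    rng = random.Random(seed)
    synthetic = smote_oversample(minority, k=k, rng=rng)
    # Assumption: keep two majority samples per synthetic sample (200%)
    n_keep = min(len(majority), 2 * len(synthetic))
    kept_majority = rng.sample(majority, n_keep)
    return minority + synthetic, kept_majority


# Toy data, invented for illustration only
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
majority = [[float(i), 5.0] for i in range(20)]
new_minority, kept_majority = hybrid_resample(minority, majority, k=2, seed=42)
print(len(new_minority), len(kept_majority))
```

Because the synthetic points are interpolated between real minority samples, they stay inside the region already occupied by the minority class instead of repeating existing rows.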

The numeric data contains 970 variables, not all of which may be important for determining the quality of the manufactured product. We use XGBoost to estimate feature importance, on the numeric data only. It ranks the features by gain, and we select the top 20 features from the entire dataset for further analysis.
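
The selection step itself is a simple ranking. The sketch below assumes the gain values have already been exported from a trained model into a mapping from feature name to gain; the function name `top_features_by_gain`, the feature names and the gain values are made up for illustration.

```python
def top_features_by_gain(gain_by_feature, k=20):
    """Return the k feature names with the highest gain, best first."""
    ranked = sorted(gain_by_feature.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Invented gain values, only to show the call
example_gain = {
    "feature_a": 0.12,
    "feature_b": 0.40,
    "feature_c": 0.05,
    "feature_d": 0.31,
}
print(top_features_by_gain(example_gain, k=2))
```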

EXPLORATORY DATA ANALYSIS

This plot shows how failed and accepted components are distributed over time. To avoid overlapping points, we add a small amount of random noise, called “jitter”, to the data so that the points scatter and are easier to see. Such plots let us identify when failures are most frequent. For instance, for features in Line 1, most failures occur in the middle of the day rather than at its start or end, whereas in Line 0 failures are spread almost equally over the entire period.
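
Jitter can be sketched as adding a small uniform random offset to each value before plotting. The function name `add_jitter`, the noise range and the sample values below are assumptions for illustration; the plots in this project may have used a different amount or kind of noise.

```python
import random


def add_jitter(values, amount=0.1, seed=None):
    """Return a copy of values with uniform noise in [-amount, amount] added."""
    rng = random.Random(seed)
    return [v + rng.uniform(-amount, amount) for v in values]


# Responses are 0 (pass) or 1 (fail); without jitter they would overlap
responses = [0, 0, 0, 1, 0, 1]
jittered = add_jitter(responses, amount=0.2, seed=1)
print(jittered)
```

Only the plotted coordinates are perturbed; the underlying response values stay unchanged for modeling.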

The plot shows feature values against timestamps, with failed and accepted parts marked in different colors. First, the feature values are spread over the entire time range, indicating that the measurements were taken continuously with few or no breaks. Additionally, the points for Line 0 and Line 2 are comparatively sparse, whereas Line 1 and Line 3 have abundant data points. This is consistent with our result that the top 20 features come from stations of Line 1 and Line 3.

CLASSIFICATION BY NON-SCALABLE PLATFORM

We use two supervised classification techniques for this purpose: XGBoost and Support Vector Machines (SVM). On a system with 4 cores, a 3.2 GHz processor and 16 GB RAM, XGBoost ran in approximately 720 seconds, while SVM took 36000 seconds, about 50 times longer.

CLASSIFICATION BY SCALABLE PLATFORM APACHE SPARK

Apache Spark provides a distributed and parallel computing environment for large scale data processing.

Hence we present a more efficient methodology for dealing with the volume, variability and velocity of the data. This could benefit advanced data analysis of the manufacturing system at Bosch and support more effective quality control.

FILES:

xgboost.R = modeling with R
plots.R = plotting data
