MGT819 Elective at Yale
Cheap storage and computing power have enabled the gathering and analysis of an unprecedented amount of data on everything from genetic health risk profiles to real-time Wall Street diaper consumption. To take advantage of these massive datasets, new statistical tools and ideas have been developed, and this body of knowledge is sometimes referred to as Data Science. The aim of this course is to provide a gentle tour of the business and industry applications of data science. Course concepts will be illustrated using the free, open-source statistical language R. R and the general-purpose programming language Python are the industry standards for data science.
Classification trees (decision trees) and linear regression were used to analyze an employee attrition dataset. The dataset had 35 numerical and categorical features, so significant pre-processing was needed before it could be modeled. On this dataset, the classification tree model achieved much higher accuracy (around 84%) than the linear regression model. Although the linear regression model was not suitable here, an exhaustive search for the best 12-variable subset using the leaps package led to interesting insights into which factors contribute the most to employee attrition.
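The workflow described above can be sketched in R roughly as follows. This is a minimal illustration, not the project's actual code: the data frame `hr`, its column names, and the synthetic attrition labels are all hypothetical stand-ins, since the real 35-feature dataset is not reproduced here. It assumes the rpart and leaps packages are installed, and it uses `nvmax = 4` only because the toy data has four predictors (the project searched up to 12 variables).

```r
library(rpart)  # classification trees
library(leaps)  # best-subset search via regsubsets()

set.seed(1)

# Hypothetical stand-in data; column names are assumptions, not the real features.
n <- 500
hr <- data.frame(
  Age            = round(runif(n, 20, 60)),
  MonthlyIncome  = round(runif(n, 2000, 15000)),
  OverTime       = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  YearsAtCompany = round(runif(n, 0, 30))
)

# Synthetic attrition label, generated only so the example runs end to end.
p_leave <- plogis(-1 + 1.5 * (hr$OverTime == "Yes") - 0.05 * (hr$Age - 40))
hr$Attrition <- factor(ifelse(runif(n) < p_leave, "Yes", "No"),
                       levels = c("No", "Yes"))

# Hold out 30% of the rows to estimate accuracy.
train_idx <- sample(n, round(0.7 * n))
train <- hr[train_idx, ]
test  <- hr[-train_idx, ]

# Classification tree on all predictors.
tree_fit <- rpart(Attrition ~ ., data = train, method = "class")
pred     <- predict(tree_fit, newdata = test, type = "class")
tree_acc <- mean(pred == test$Attrition)  # share of correct predictions

# Exhaustive best-subset search on a 0/1 coding of the outcome.
lm_data <- train[, setdiff(names(train), "Attrition")]
lm_data$AttritionNum <- as.numeric(train$Attrition == "Yes")
subsets <- regsubsets(AttritionNum ~ ., data = lm_data,
                      nvmax = 4, method = "exhaustive")
best <- summary(subsets)

# Variables included in the subset with the highest adjusted R-squared.
chosen <- best$which[which.max(best$adjr2), ]

print(tree_acc)
print(names(chosen)[chosen])
```

On the real data, the subset sizes reported by `summary(subsets)` can be compared using adjusted R-squared, Cp, or BIC (all returned in the summary object) to see which variables enter the model first.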