python/project1.py at main · HridyeshKumar/python · GitHub

#Project on Regression and Random Forest Regression
#Regression Problems in Machine Learning
'''Machine Learning is a branch of Artificial Intelligence that enables computer programs to automatically learn and improve from experience.
Machine Learning Algorithms learn from datasets and then based on the patterns identified from the datasets make predictions on unseen data.
ML algorithms can be broadly categorized into two types:
1. Supervised Learning
2. Unsupervised Learning
Supervised ML algorithms are those algorithms where the input dataset and the corresponding output or true prediction is available and the algorithms try to find the relationship between inputs and outputs.
In unsupervised ML algorithms, the true labels for the outputs are not known. Rather, the algorithms try to find similar patterns in the data. E.g., Clustering.
Supervised learning algorithms are further divided into two types:
1. Regression Algorithms
2. Classification Algorithms
Regression algorithms predict a continuous value, e.g., the price of a house.
Classification algorithms predict a discrete value, e.g., whether an incoming email is Spam/Ham.'''

import pandas as pd
import numpy as np
import seaborn as sns
#sns.get_dataset_names()
#Importing the dataset and printing the dataset header
tips_df=sns.load_dataset("tips")
tips_df.head()

'''We will be using machine learning algorithms to predict the tip for a particular record based on the remaining features such as total_bill, sex, day, time, etc.
Dividing Data into Features and Labels'''
x=tips_df.drop(['tip'],axis=1)
y=tips_df["tip"]
x.head()
y.head()

#Converting Categorical Data to Numbers
'''Most ML algorithms can only work with numbers, so it is important to convert categorical data into a numeric format.'''
#Numeric Variables
numerical=x.drop(['sex','smoker','day','time'],axis=1)
numerical.head()
#DataFrame that contains only categorical columns
categorical=x.filter(['sex','smoker','day','time'])
categorical.head()
categorical["day"].value_counts()

'''One of the most common approaches to convert a categorical column to a numeric one is one-hot encoding.
In one-hot encoding, a new column is created for every unique value in the original column.'''

cat_numerical=pd.get_dummies(categorical)
cat_numerical.head()
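
'''A minimal sketch of what one-hot encoding produces, using a hypothetical toy column rather than the tips data.
The names toy and toy_encoded are illustrative only. Each row ends up with exactly one 1 across the new columns.'''

```python
import pandas as pd

# Hypothetical toy column (an assumption for illustration, not the tips dataset).
toy = pd.DataFrame({"day": ["Thur", "Fri", "Sat", "Fri"]})

# get_dummies creates one indicator column per unique value,
# named <original column>_<value>.
toy_encoded = pd.get_dummies(toy)

print(toy_encoded.columns.tolist())
# Cast to int so the indicators print as 0/1 regardless of pandas version.
print(toy_encoded.astype(int))
```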
'''The final step is to join the numerical columns with the one-hot encoded columns.'''
x=pd.concat([numerical,cat_numerical],axis=1)
x.head()
#Divide Data into Training and Test Sets
'''We divide the dataset into two sets, i.e., a train set and a test set.
The model is trained on the train set and evaluated on the test set.'''

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.20,random_state=0)

#Data Scaling/Normalization
'''The final step before data is passed to an ML algorithm is to scale the data.
Some columns of the dataset contain small values, while others contain very large values. It is better to convert all values to a uniform scale.'''

from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
x_train=sc.fit_transform(x_train)
x_test=sc.transform(x_test)
'''We have converted the data into a format that can be used to train ML algorithms for regression.'''
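
'''A small sketch of what StandardScaler does, on hypothetical toy arrays (train_demo and test_demo are illustrative names, not part of this project).
The scaler learns the mean and standard deviation from the training data only and reuses them for the test data, which is why the test set goes through transform and not fit_transform.'''

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two columns on very different scales.
train_demo = np.array([[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]])
test_demo = np.array([[2.0, 4000.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train_demo)  # learns mean/std from the training rows
test_scaled = scaler_demo.transform(test_demo)        # reuses the training mean/std

print(train_scaled.mean(axis=0))  # each column is centred at (approximately) zero
print(train_scaled.std(axis=0))   # each column has unit standard deviation
print(test_scaled)                # test values can fall outside the training range
```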

#Linear Regression
'''Linear Regression is a linear model that assumes a linear relationship between inputs and outputs. The LinearRegression class in sklearn fits its coefficients by ordinary least squares, i.e., by minimizing the sum of squared differences between the predicted and actual outputs.'''

#Advantages
'''Linear Regression is simple to implement.
It takes little time to train, even for huge datasets.
Linear Regression coefficients are easy to interpret.
Importing the Linear Regression model from sklearn.'''

from sklearn.linear_model import LinearRegression
lin_reg=LinearRegression()
#Use the lowercase x_train/x_test defined above (X_train/X_test were never defined)
regressor=lin_reg.fit(x_train,y_train)
y_pred=regressor.predict(x_test)
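
'''sklearn's LinearRegression fits by ordinary least squares. The sketch below shows the same idea with plain NumPy on hypothetical data generated from y = 1 + 2*x, so the recovered intercept and slope are known in advance.
The names x_demo, y_demo and design are assumptions made for this illustration.'''

```python
import numpy as np

# Hypothetical noise-free data lying exactly on the line y = 1 + 2*x.
x_demo = np.array([0.0, 1.0, 2.0, 3.0])
y_demo = 1.0 + 2.0 * x_demo

# Design matrix: a column of ones (for the intercept) next to the feature.
design = np.column_stack([np.ones_like(x_demo), x_demo])

# Least-squares solution: minimizes the sum of squared residuals.
coef, *_ = np.linalg.lstsq(design, y_demo, rcond=None)
intercept, slope = coef

print('Intercept:', intercept)
print('Slope:', slope)
```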

'''Once you have trained a model and have made predictions on the test set, the next step is to know how well your model has performed for making predictions on the unknown test set.
There are various metrics to check that.
Mean Absolute Error (MAE) is the average of the absolute differences between the predicted values and the real values.
Mean Squared Error (MSE) is similar to MAE. However, the error for each record is squared in the case of MSE.
Root Mean Squared Error (RMSE) is the square root of the mean squared error.'''

from sklearn import metrics
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred))
print('Mean Squared Error:',metrics.mean_squared_error(y_test,y_pred))
#RMSE is the square root of MSE
print('Root Mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
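
'''To make the three metrics concrete, here is a worked sketch that computes them by hand with NumPy.
The actual and predicted arrays are hypothetical values chosen for easy arithmetic, not results from this project.'''

```python
import numpy as np

# Hypothetical actual and predicted tips, for illustration only.
actual = np.array([3.0, 2.0, 4.0, 5.0])
predicted = np.array([2.5, 2.0, 5.0, 3.0])

errors = predicted - actual      # [-0.5, 0.0, 1.0, -2.0]
mae = np.mean(np.abs(errors))    # (0.5 + 0 + 1 + 2) / 4 = 0.875
mse = np.mean(errors ** 2)       # (0.25 + 0 + 1 + 4) / 4 = 1.3125
rmse = np.sqrt(mse)              # square root of the MSE

print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
```

'''Because the errors are squared before averaging, the single large miss (-2.0) dominates MSE and RMSE more than it does MAE.'''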

'''Looking at the MAE, it can be concluded that the predictions are off by about 0.70 on average, which means that the predicted tip values are about $0.70 more or less than the actual tip values.'''

#Random Forest Regression
'''Random Forest Regression is a tree-based algorithm.
It is an ensemble modelling technique: it combines the predictions of many decision trees.'''
#Advantages
'''It can be useful when you have lots of missing data or an imbalanced dataset (e.g., 200 records of class 0 and 1000 records of class 1).
With a large number of trees or models, you can reduce overfitting. Overfitting occurs when an ML model performs well on the training set but worse on the test set.'''

from sklearn.ensemble import RandomForestRegressor
rf_reg=RandomForestRegressor(random_state=42,n_estimators=500)
#Use the lowercase x_train/x_test defined above (X_train/X_test were never defined)
regressor=rf_reg.fit(x_train,y_train)
y_pred=regressor.predict(x_test)
from sklearn import metrics
print('Mean Absolute Error:',metrics.mean_absolute_error(y_test,y_pred))
print('Mean Squared Error:',metrics.mean_squared_error(y_test,y_pred))
print('Root Mean Squared Error:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))
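
'''One simple way to check for overfitting, as described above, is to compare the error on the training set with the error on the test set.
The sketch below does this on hypothetical synthetic data (a linear signal plus noise), not on the tips dataset. All names in it are illustrative assumptions.'''

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data: one informative feature plus Gaussian noise.
rng = np.random.default_rng(0)
features = rng.uniform(0, 10, size=(200, 3))
target = 0.5 * features[:, 0] + rng.normal(0, 1.0, size=200)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.20, random_state=0)

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(f_train, t_train)

# A training error far below the test error is a sign of overfitting.
train_mae = mean_absolute_error(t_train, forest.predict(f_train))
test_mae = mean_absolute_error(t_test, forest.predict(f_test))
print('Train MAE:', train_mae)
print('Test MAE:', test_mae)
```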