68 changes: 63 additions & 5 deletions README.md
@@ -1,8 +1,66 @@
# CS584_Project1
Midterm project for CS 584

Rebecca Thomson
A20548618

CS584 – Machine Learning

Purpose:
Prediction using Regularized Linear Discriminant Analysis for classification. This program should be used when a classification LDA model with automatic tuning of the gamma regularization parameter is wanted.

Description:
This program performs a Regularized Linear Discriminant Analysis (as described in Section 4.3.1 of Elements of Statistical Learning, 2nd Edition), implemented from first principles, for purposes of classification.

This program takes in a clean array of attributes and a dependent classifier variable.
-Data with NA or other missing values will not be accepted (an error check exists).
-Data will be checked for a sufficient number of data points.

The dependent classifier variable:
-Can have 2 or more classes.
-Classes do not need to be balanced.
-Each class should be identified by a unique number.
-Class-identifying numbers need not be sequential.
-Data MUST be in list or DataFrame format.

The attributes:
-All attributes should be numerical (continuous or one-hot encoded).
-The weighted covariance matrix (sigma-hat) of the attributes must be invertible.
-Data MUST be in list or DataFrame format.
-To provide a linear decision boundary, the number of classes should be at least the number of attributes plus one; otherwise QDA should be used. (A sketch of acceptable inputs follows this list.)
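
For concreteness, a minimal sketch of acceptable inputs (names and values here are illustrative only):

```python
# X: list of attribute rows; y: one class label per row.
X = [[5.1, 3.5], [4.9, 3.0], [6.2, 2.9], [5.9, 3.1]]
y = [1, 1, 7, 7]  # unique numbers per class; need not be sequential
```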

Program Procedure:
When fit(Y, X) is called:
1. The program automatically separates the classes for all calculations. At this stage it uses the 'sys' library to provide a soft exit if the data set is insufficient in size. The given data are split 80/20 into training and testing sets for tuning the gamma penalty value.
2. The program then calculates a weighted sigma-hat covariance matrix from the training set. This sigma-hat is the starting point of the regularization.
3. Next, this sigma-hat is used to find a regularized LDA with Sigma(gamma). Since LDA requires all classes to share the same sigma-hat, the gamma regularization of equation 4.14 of Elements of Statistical Learning, 2nd Edition was chosen (the alpha regularization of equation 4.13 instead shrinks QDA towards LDA). The program sweeps gamma downward from the default of 1.0 in steps of 0.01, saving each gamma and its error rate: for every candidate gamma it predicts the class of each point in the testing set using the model with the modified sigma-hat, and adds 1 to the error count whenever a prediction is wrong. Each prediction evaluates equation 4.10 with each class' values and picks the class with the highest value; the equation gives a probability up to proportionality, so the argmax over all classes predicts the class. The first gamma with the lowest error on the test set is chosen (this could be the default, gamma = 1.0). This gamma's sigma-hat overwrites the original, so all future predictions from this model use the regularized sigma-hat. (Steps 2 and 3 are sketched below.)
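
As a rough sketch of steps 2 and 3 (helper names here are assumptions, not the class methods themselves): the weighted sigma-hat, the gamma shrinkage as implemented in this program (off-diagonal covariances scaled by gamma, variances kept), and the equation 4.10 discriminant look approximately like this:

```python
import numpy as np

def pooled_sigma(cov_list, pi_list):
    # Step 2: weighted sigma-hat = sum over classes k of pi_k * Sigma_k
    return sum(p * np.asarray(c) for p, c in zip(pi_list, cov_list))

def shrink_sigma(sigma_hat, gamma):
    # Step 3: scale off-diagonal covariances by gamma, keep the variances
    s = np.asarray(sigma_hat, dtype=float)
    return gamma * s + (1.0 - gamma) * np.diag(np.diag(s))

def discriminant(x, mu, sigma_inv, pi):
    # Eq. 4.10: delta_k(x) = x' S^-1 mu - 0.5 mu' S^-1 mu + log(pi);
    # the argmax of delta_k over all classes k is the predicted class.
    x, mu = np.asarray(x), np.asarray(mu)
    return x @ sigma_inv @ mu - 0.5 * mu @ sigma_inv @ mu + np.log(pi)
```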

When the confusion_matrix(Y_predict, Y_actual) command is called:
A series of confusion matrices can be created by calling model.confusion_matrix(Y_Predicted, Y_Actual) on a fitted model. A 2x2 one-vs-all matrix is created and printed for each class.
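
For example (variable names are illustrative):

```python
y_pred = model.predict(X_test)
model.confusion_matrix(y_pred, y_test)  # prints one one-vs-all 2x2 table per class
```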

When the predict(X) command is called:
1. First, create and fit your model with the training data.
2. This command returns a list of the predicted values for the input X, which is in the format of a list of lists of attribute values.
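
A basic usage sketch, assuming X_train/y_train and X_new are formatted as described above:

```python
model = regLDA()
model.fit(y_train, X_train)    # note the (Y, X) argument order; gamma is tuned automatically
y_pred = model.predict(X_new)  # X_new: list of lists of attribute values
```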

Problems:
1. Unfortunately, the program must be given data that produces an invertible sigma-hat. Failing to produce an invertible sigma-hat causes an error. The program does print a line summarizing the problem, but it cannot work around it. (A rough pre-check is sketched after this list.)
2. The program currently needs the input dependent values in list format. Attribute input is converted automatically from DataFrame to list, and a single-column dependent DataFrame is flattened to a plain list; other dependent-variable formats are not handled and should be converted by the user.
3. The gamma chosen can be gamma = 1.0 if the set is easily fit without overfitting problems.
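
For problem 1, one rough pre-check (a sketch, not part of the program) is to inspect the condition number of the overall covariance before fitting:

```python
import numpy as np

sigma = np.cov(np.asarray(X_train).T)
if np.linalg.cond(sigma) > 1e12:  # near-singular: inversion will likely fail
    print("Covariance is ill-conditioned; consider dropping collinear attributes.")
```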

Global variables:
These are the global variables of each fitted model that can be called for examination:

| Variable | Description |
| --- | --- |
| cov_list | List of individual class covariance matrices |
| mu_list | List of individual class means for each attribute |
| pi_list | List of pi values (# per class / totalN) |
| sigma_hat | Sigma value for the model, automatically tuned |
| gamma | The gamma value with the lowest error on the test set (float) |
| X_by_class | A list of lists of the data points separated by class |
| totalN | Total training data points in the input model (int) |
| total_i_per_class | Number of data points per class (list) |
| classes_given | List of all possible classes by identifying number in the input data |
| classes_given_no | Total number of classes (int) |
| attributes_no | Total number of independent attributes (int) |
| sigma_hat_unmodified | Unmodified value of sigma, for checking |
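
For example, after fitting, the tuning results can be inspected directly (continuing the illustrative sketches above):

```python
print(model.gamma)                # chosen shrinkage value
print(model.pi_list)              # class priors, Nk / totalN
print(np.array(model.sigma_hat))  # regularized weighted covariance
```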
309 changes: 309 additions & 0 deletions R_Thomson.py
@@ -0,0 +1,309 @@

"""
Author: Rebecca Thomson
IIT #A20548618

Oct. 10, 2024

Project Description: To preform a Regularized Linear Discriminant Analysis on classification (as described in
section 4.3.1 of Elements of Statistical Learning, 2nd Edition) from first principals.

This program takes in the clean array of attributes and a dependent classifier variable.
-Data with NA or other missing values will not be accepted (an error check exists)
The dependent classifier variable:
-Can be 2 or more classes
-Do not need to be ballanced
-Each class should be identified by a unique number
-Class identifying numbers need not be sequential.
The attributes:
-All attributes should be numerical and sequential or one-hot encoded.
-The weighted covariance matrix (sigma-hat) of the attributes must be invertable.

The program will automatically seperate the classes for all calculations.
The program will then calculate a weighted sigma hat covarience matrix.
The program will then used this sigma hat to find a regularized LDA with Sigma(alpha, gamma)
Unfortunately, the program must be given data that produces an invertable sigma-hat.



"""
'''libraries'''
# numpy (referenced both as `numpy` and as `np` below)
import numpy
import numpy as np
# pandas (referenced as `pandas` below)
import pandas
# system, for a soft exit on fatal input errors
import sys
# train/test split used to tune gamma
from sklearn.model_selection import train_test_split

class regLDA():
    def __init__(self):
        # All model attributes are created in fit().
        pass


    def update_sigma_step_a(self, sigma_given, cov_given, alpha):
        # Eq. 4.13 of TEXT: Sigma_k(alpha) = alpha*Sigma_k + (1-alpha)*Sigma.
        # (Not called by fit(); kept for shrinking QDA towards LDA.)
        update_sigma = []
        inv_alpha = 1 - alpha
        # Must iterate element-wise to build the blended matrix.
        for j in range(len(self.sigma_hat)):
            s_temp_row = []
            for k in range(len(self.sigma_hat[j])):
                s_temp = (cov_given[j][k] * alpha) + (sigma_given[j][k] * inv_alpha)
                s_temp_row.append(s_temp)
            # Build the new sigma-hat row by row
            update_sigma.append(s_temp_row)
        # This results in a sigma_hat(alpha)
        return update_sigma
    def update_sigma_step_g(self, sigma_given, gamma):
        # Gamma shrinkage (a diagonal-target variant of Eq. 4.14 of TEXT):
        # off-diagonal covariances are scaled by gamma, variances are kept.
        update_sigma = []
        for j in range(len(self.sigma_hat)):
            s_temp_row = []
            for k in range(len(self.sigma_hat)):
                if k == j:
                    # Diagonal entry is unchanged: gamma*s + (1-gamma)*s = s.
                    s_temp_row.append(sigma_given[j][k])
                else:
                    s_temp_row.append(sigma_given[j][k] * gamma)
            # Build the new sigma-hat row by row
            update_sigma.append(s_temp_row)
        return update_sigma

    def log_loss(self, sigma_used, K_class, x_list):
        # Evaluates the discriminant function of Eq. 4.10 of TEXT for one class:
        # delta_k(x) = x' S^-1 mu_k - (1/2) mu_k' S^-1 mu_k + log(pi_k).
        # This is a probability up to proportionality for the given class,
        # using that class' mu and pi with the given sigma.
        mu = pandas.DataFrame(self.mu_list[K_class]).iloc[:, 0]
        x = pandas.DataFrame(x_list).iloc[:, 0]
        pi = self.pi_list[K_class]
        try:
            sigma_inv = numpy.linalg.inv(numpy.asarray(sigma_used))
        except numpy.linalg.LinAlgError:
            print("An error occurred inverting the sigma-hat matrix.")
            sys.exit(1)
        # Original formula is Eq. 4.10 of TEXT.
        term_three = numpy.log(pi)
        term_two = -1 / 2 * (numpy.matmul(mu.transpose(), numpy.matmul(sigma_inv, mu)))
        term_one = numpy.matmul(x.transpose(), numpy.matmul(sigma_inv, mu))
        term_for_this_class = term_one + term_two + term_three
        return term_for_this_class
    def predict_single(self, temp_s, x):
        # Because only the winning category matters (not the exact probability),
        # return the class with the highest proportional probability.
        temp_guess = []
        for i in range(len(self.classes_given)):
            temp_guess.append(self.log_loss(temp_s, i, x))
        winner = self.classes_given[temp_guess.index(max(temp_guess))]
        return winner
    def predict(self, X_list):
        # Check that input is a list (convert from DataFrame if needed).
        if isinstance(X_list, pandas.core.frame.DataFrame):
            X_list = X_list.iloc[:, :].values.tolist()
        if not isinstance(X_list, list):
            print("Wrong input format. Not List or DataFrame. Try again.")
            print(type(X_list))
            sys.exit(1)
        # Build the predicted list for 'outside' use.
        predict_list = []
        for i in range(len(X_list)):
            predict_list.append(self.predict_single(self.sigma_hat, X_list[i]))
        return predict_list
    def tuning_predict(self, testxs, testys):
        # To determine the best gamma, sweep it from 1.0 down to 0.01 in steps
        # of 0.01 and keep the (first) value with the lowest test-set error.
        best_gamma_list = []
        temp_gamma = self.gamma  # starts at the default, 1.0
        modified_sigma = self.sigma_hat
        for i in range(100):
            Error_rate = 0
            # Shrink sigma-hat with the current gamma
            temp_sigma = self.update_sigma_step_g(modified_sigma, temp_gamma)
            # Predict the test set with the shrunk sigma-hat
            predict_list = []
            for j in range(len(testxs)):
                predict_list.append(self.predict_single(temp_sigma, testxs[j]))
            # Compare predictions to the given labels: +1 error per mismatch
            for k in range(len(testxs)):
                if testys[k] != predict_list[k]:
                    Error_rate += 1
            best_gamma_list.append([temp_gamma, Error_rate])  # record gamma and its error count
            temp_gamma -= 0.01
        # Find the best gamma: idxmin() returns the first row with the minimum
        # error count, so ties go to the earliest (largest) gamma.
        gamma_df = pandas.DataFrame(best_gamma_list, columns=["temp_gamma", "Errors"])
        winner_gamma = gamma_df.loc[gamma_df["Errors"].idxmin(), "temp_gamma"]
        # Set the gamma and sigma for the whole model
        self.gamma = winner_gamma
        self.sigma_hat = self.update_sigma_step_g(self.sigma_hat, winner_gamma)
    def fit(self, fullys, fullxs):
        # Check that inputs are lists (convert from DataFrame if needed).
        if isinstance(fullxs, pandas.core.frame.DataFrame):
            fullxs = fullxs.iloc[:, :].values.tolist()
        if isinstance(fullys, pandas.core.frame.DataFrame):
            # Flatten a single-column DataFrame to a plain list of labels.
            fullys = fullys.iloc[:, 0].tolist()
        if not isinstance(fullxs, list):
            print("Wrong input format on X. Not List or DataFrame. Try again.")
            print(type(fullxs))
            sys.exit(1)
        if not isinstance(fullys, list):
            print("Wrong input format on Y. Not List or DataFrame. Try again.")
            print(type(fullys))
            sys.exit(1)

        # Split the given data into training and testing sets for gamma tuning.
        xs, testXS, ys, testYS = train_test_split(fullxs, fullys, test_size=0.20, random_state=0)

        self.totalN = len(ys)  # total no. of training data points (rows)
        self.total_i_per_class = []
        self.classes_given = list(set(ys))  # list of possible classes
        self.classes_given_no = len(self.classes_given)  # no. of classes
        self.attributes_no = len(xs[0])  # no. of attributes (columns in xs)
        self.X_by_class = []  # each class' xs values, listed separately
        self.mu_list = []
        self.cov_list = []
        self.pi_list = []
        self.gamma = 1.0
        row_template = [0] * self.attributes_no
        self.sigma_hat = [row_template[:] for _ in range(self.attributes_no)]
        self.sigma_hat_unmodified = [row_template[:] for _ in range(self.attributes_no)]
        # Check input data for NA errors
        if pandas.isna(ys).any() or pandas.isnull(ys).any():
            print("NA found in response variable. Check data. Analysis stopped.")
            sys.exit(1)
        for i in range(self.totalN):
            if pandas.isna(xs[i]).any() or pandas.isnull(xs[i]).any():
                print("NA found in attribute variables. Check data. Analysis stopped.")
                sys.exit(1)
        if self.totalN != len(xs):
            print("Attribute and response variables not of equal length. Check data. Analysis stopped.")
            sys.exit(1)
        # Error check: there must be more points than classes (# classes < N).
        if self.classes_given_no == self.totalN:
            print("Data points are not separable into classes. Check data. Stop analysis.")
            sys.exit(1)

        # Calculate initial variable values
        for i in range(self.classes_given_no):  # iterate through classes to divide the data by class
            temp_list = []
            # This creates a filtered list of [row][col] for class i
            for j in range(len(ys)):
                if ys[j] == self.classes_given[i]:
                    temp_list.append(xs[j])

            # Build the master list of [class][row][col] and record pi per class
            self.X_by_class.append(temp_list)
            self.total_i_per_class.append(len(temp_list))  # count Nk per class for the weighted sigma
            temp_pi = len(temp_list) / self.totalN
            self.pi_list.append(temp_pi)  # pi for each class

            # Check that each class individually has enough data points
            if self.attributes_no > len(temp_list):
                print("Not enough data points for this class. Stop analysis. Class at stop:")
                print(self.classes_given[i])
                sys.exit(1)

            # Per-class mean of each attribute (axis=0 averages over rows)
            temp_mu = numpy.mean(temp_list, axis=0)
            self.mu_list.append(temp_mu)

            # Per-class covariance matrix, kept for examination
            df = np.array(temp_list)
            cov_temp = numpy.cov(df.transpose())
            self.cov_list.append(cov_temp)

            # Add this class' pi-weighted covariance into the running sigma-hat
            temp_sigma = []
            for j in range(len(cov_temp)):
                s_temp_row = []
                for k in range(len(cov_temp[j])):
                    temp = cov_temp[j][k] * temp_pi
                    s_temp_row.append(temp + self.sigma_hat[j][k])
                temp_sigma.append(s_temp_row)

            # This accumulates into a weighted sigma-hat
            self.sigma_hat = temp_sigma
            self.sigma_hat_unmodified = temp_sigma
        # Tune the gamma
        self.tuning_predict(testXS, testYS)
    def Single_confusion_matrix(self, T_pos, T_neg, F_pos, F_neg, class_name):
        # Print one one-vs-all 2x2 confusion matrix as fixed-width text.
        text1 = class_name + " vs. All Others. Actual".rjust(40)
        text2 = "Pos".rjust(30)
        text3 = "Neg".rjust(20)
        text_title = text2 + text3
        text_line = "- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - "
        text4 = format("Pred. Pos", '<10')
        text5 = format("Pred. Neg", '<10')
        textTP = format(T_pos, '>20')
        textFP = format(F_pos, '>20')
        textFN = format(F_neg, '>20')
        textTN = format(T_neg, '>20')
        p_text = text4 + textTP + textFP
        n_text = text5 + textFN + textTN
        results = (text_line + '\n' + text1 + '\n' + text_line + '\n' + text_title + '\n'
                   + text_line + '\n' + p_text + '\n' + text_line + '\n' + n_text + '\n' + text_line + '\n')
        print(results)

    def confusion_matrix(self, Y_predict, Y_actual):
        # Iterate through the classes to create an individual one-vs-all
        # confusion matrix per class.
        for i in range(self.classes_given_no):
            T_pos = 0
            T_neg = 0
            F_pos = 0
            F_neg = 0
            for j in range(len(Y_predict)):
                if Y_predict[j] == Y_actual[j]:
                    if Y_predict[j] == self.classes_given[i]:
                        T_pos += 1
                    else:
                        T_neg += 1
                else:
                    if Y_predict[j] == self.classes_given[i]:
                        F_pos += 1
                    else:
                        F_neg += 1
            self.Single_confusion_matrix(T_pos, T_neg, F_pos, F_neg, str(self.classes_given[i]))
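

# ---------------------------------------------------------------------------
# Minimal usage sketch (illustrative, with synthetic data): two Gaussian
# classes in two dimensions, labelled 1 and 5 (non-sequential on purpose).
# ---------------------------------------------------------------------------
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
                   rng.normal(3.0, 1.0, size=(40, 2))]).tolist()
    y = [1] * 40 + [5] * 40
    model = regLDA()
    model.fit(y, X)                  # note the (Y, X) argument order
    preds = model.predict(X)
    model.confusion_matrix(preds, y)
    print("chosen gamma:", model.gamma)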