Linear regression is one of the first algorithms machine learning students learn, due to its fundamental nature. It is a supervised learning algorithm: supervised learning is a paradigm in machine learning where input objects (for example, a vector of predictor variables) and a desired output value (also known as the human-labeled supervisory signal) train a model.
Linear regression is also a linear model: it models the output as a linear function of the input, so it can only capture linear relationships between the variables.
Linear regression may be simple or multiple; on this occasion we will limit ourselves to simple linear regression.
Simple linear regression is useful for finding the relationship between two continuous variables, an independent variable and a dependent (or predicted) one.
Given as input a set of data points, or vectors, containing a series of features, the goal is to obtain a line that best fits the data. The best fit line is the one for which total prediction error is as small as possible. The error represents the vertical distance between each point and its predicted value on the regression line or, in other words, the difference between actual and predicted values.
In the case of simple linear regression, the regression line is represented as y = ax + b, where x is the independent feature, y is the dependent feature, a is the slope and b the y-intercept, or constant.
In this context, "fitting the data" means finding the parameters of a model (in this case, the regression line) that best represent the relationship between the input features (independent variables) and the output (dependent variable) in the data. Specifically, it involves determining the slope and intercept of the line that minimize the discrepancies between the observed values and the values predicted by the model.
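As a small illustration with made-up numbers (not data from the project), here is how a candidate line produces predictions and residuals for a handful of points:

```python
# Hypothetical data points (x, y) and a candidate line y = a*x + b.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 4.2, 5.9, 8.1]
a, b = 2.0, 0.0  # slope and y-intercept of the candidate line

# Predicted values, and residuals (actual - predicted) for each point.
preds = [a * x + b for x in xs]
residuals = [round(y - p, 2) for y, p in zip(ys, preds)]
print(preds)      # [2.0, 4.0, 6.0, 8.0]
print(residuals)  # [0.1, 0.2, -0.1, 0.1]
```

The best-fit line is the one whose residuals are collectively as small as possible.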
The whole concept of linear regression is based on the equation of a line, written as the hypothesis $h(x) = \theta_0 + \theta_1 x$.
The goal of the algorithm is to learn theta0 and theta1, the parameters of the hypothesis h(x).
The cost function is a quadratic function (it traces a parabola in each parameter) that evaluates the performance of the model by computing a single scalar value representing the total error across all data points. In linear regression, the most commonly used cost functions are the Mean Squared Error (MSE): $$ MSE = \frac{1}{m}\sum_{i=1}^{m} (y_i - \hat{y}_i)^2 $$
and the Sum of Squared Errors (SSE): $$ SSE = \sum_{i=1}^{m} (y_i - \hat{y}_i)^2 $$
In the train.py program we use the Mean Squared Error function, whose (optimized) formula may be written as follows:
or, to simplify:
where m represents the total number of examples in the dataset.
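The two cost functions can be sketched in a few lines of plain Python (a minimal version, not the exact code of train.py):

```python
def sse(ys, preds):
    """Sum of Squared Errors: total squared error over all points."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))

def mse(ys, preds):
    """Mean Squared Error: SSE divided by the number of examples m."""
    return sse(ys, preds) / len(ys)

ys, preds = [1.0, 2.0, 3.0], [1.0, 2.5, 2.5]
print(sse(ys, preds))  # 0.5
print(mse(ys, preds))  # 0.16666666666666666
```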
A cost function should ideally be differentiable and convex to facilitate efficient optimization, particularly when using gradient-based methods.
A function is differentiable if it has a derivative at every point of its domain, as in the following examples:
whereas functions which have a cusp or a discontinuity are non-differentiable:
On the other hand, a univariate function is convex if the line segment connecting any two points of the function lies on or above its curve (it does not cross it). If the segment crosses the curve, the function has a local minimum which is not a global one.
Intuitively, the gradient is the slope of a curve at a given point in a specified direction.
In the case of a univariate function, it is simply the first derivative at a selected point. In the case of a multivariate function, it is a vector of derivatives in each main direction (along the variable axes). Because each of these derivatives measures the slope along one axis only, ignoring the others, they are called partial derivatives.
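To make this concrete, a partial derivative can be approximated numerically by nudging one variable at a time (an illustrative sketch with a made-up function, not part of train.py):

```python
def partial(f, point, i, h=1e-6):
    """Approximate the partial derivative of f along axis i at `point`
    using a central finite difference."""
    lo, hi = list(point), list(point)
    lo[i] -= h
    hi[i] += h
    return (f(hi) - f(lo)) / (2 * h)

# Example: f(a, b) = a^2 + 3b, whose gradient is (2a, 3).
f = lambda p: p[0] ** 2 + 3 * p[1]
grad = [partial(f, [2.0, 1.0], i) for i in range(2)]
print([round(g, 4) for g in grad])  # [4.0, 3.0]
```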
Gradient Descent is an algorithm used in machine learning to minimize the error by iteratively adjusting the model parameters.
A machine learning model aims to minimize this error to achieve high accuracy. Gradient Descent works in three steps:
1. Randomly Initialize Values: Start with random parameter values;
2. Update Values: Adjust the parameters using the gradient of the error with respect to each parameter;
3. Repeat Until Convergence: Continue updating until the slope (derivative) is zero, indicating a minimum error.
- Derivative: Represents the slope of the error function at a particular point, guiding the direction to adjust parameters.
- Learning Rate (Alpha): Determines the size of the steps taken during each update.
- If too small, the model learns slowly.
- If too large, the model may overshoot the minimum and fail to converge.
This process ensures that the model parameters are adjusted in the direction that reduces the error, leading to an optimal solution.
To measure the goodness of our regression line, we will use the R-squared value, or coefficient of determination.
The R-squared value is a statistical measure of how close the data are to the fitted regression line.
To calculate the R-squared value we will use the following formula: $$ R^2 = \frac{\sum (y_{pred} - \bar{y})^2}{\sum (y - \bar{y})^2} $$
where $y$ is the actual value, $y_{pred}$ is the predicted value and $\bar{y}$ is the mean of the actual values.
Basically, we divide the sum of squared differences between the predicted values and the mean by the sum of squared differences between the actual values and the mean.
The higher the R-squared value the better our model performance will be. So as the R-squared value gradually increases, the distance of actual points from the regression line decreases, and the performance of the model increases.
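Following the formula described above (predicted-vs-mean over actual-vs-mean), a minimal R-squared computation might look like this (an illustrative sketch, not the exact code of train.py):

```python
def r_squared(ys, preds):
    """Coefficient of determination, as described above."""
    mean = sum(ys) / len(ys)
    ss_reg = sum((p - mean) ** 2 for p in preds)  # predicted vs. mean
    ss_tot = sum((y - mean) ** 2 for y in ys)     # actual vs. mean
    return ss_reg / ss_tot

# A perfect fit yields R² = 1.
print(r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0
```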
To conclude, the goal is to find the parameters a and b of a linear function f(x)=ax+b. To achieve this, you consider a function T(a,b) which represents a type of average error between the estimated value f(x) and the real values (more specifically, a variance). You aim to minimize this error.
The method used to find the minimum of T(a,b) is called the gradient descent algorithm. If you delve into the details and calculate the gradient of T(a,b), you will derive the formulas provided in the project subject: if you take T(a,b)=(1/N)∑(pi−(a⋅ki+b))^2 where N is the number of price/kilometer pairs (pi,ki) obtained from a .csv file, and a and b are the parameters of T.
Conceptually, T is just a surface function, and you seek the coordinates of the point where the surface's altitude is the lowest. You start with an initial point (in the subject, it's (a0=0, b0=0)), calculate the gradient at this point, and then follow the direction opposite to the gradient (to minimize the function).
From there, you have an initial point (a0,b0) and a direction (−x0,−y0), where (x0,y0) is the gradient of T at that point. You move from the starting point in the opposite direction of the gradient to find a new point: (a0,b0)+coef⋅(−x0,−y0)=(a1,b1).
This completes the first iteration of the algorithm. The point (a1,b1) provides a lower altitude than (a0,b0), meaning T(a1,b1)<T(a0,b0). You then repeat the same process, replacing (a0,b0) with (a1,b1), and continue until the difference between T(an,bn) and T(an+1,bn+1) is sufficiently small.
The coefficient in the formula is the convergence coefficient (learning rate). There is an optimal way to find this value, but that is a different topic. With a sufficiently small value, the algorithm will work, although it may not be the fastest.
The learning rate, denoted coef, scales the updates to a and b that minimize the error. Specifically, in each iteration you compute temporary values _atemp_ and _btemp_ (which follow the direction opposite to the gradient), then update a and b as follows:
a1 = a0 + learning_rate⋅atemp, b1 = b0 + learning_rate⋅btemp
You repeat this process until the algorithm converges to the minimum error.
The function T(a,b) compares the estimated price for a given mileage with the actual price. The model is defined as f(x)=a⋅x+b; this is the function you aim to find through the algorithm. To compare, you calculate: actualPrice − f(actualMileage)
Since you have multiple data points, you take the average of the differences: $$ \frac{1}{m}\sum_{i=1}^{m} (\text{actualPrice}_i − f(\text{actualMileage}_i)) $$
To prevent positive and negative errors from cancelling out in this measure, you square each term, resulting in: $$ \frac{1}{m}\sum_{i=1}^{m} (\text{actualPrice}_i − f(\text{actualMileage}_i))^2 $$
Expanding f in this formula gives: $$ \frac{1}{m}\sum_{i=1}^{m} (\text{actualPrice}_i − (a⋅\text{actualMileage}_i+b))^2 $$ This final formula is T(a,b), representing the average error between the function a⋅x+b and the real data. Minimizing T(a,b) helps find a and b such that the line a⋅x+b best fits the data points.
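The final formula can be evaluated directly on a small made-up set of price/mileage pairs (illustrative values, not the project's .csv data):

```python
def T(a, b, mileages, prices):
    """Average squared error between the line a*x + b and the data."""
    m = len(prices)
    return sum((p - (a * k + b)) ** 2 for p, k in zip(prices, mileages)) / m

mileages = [10.0, 20.0, 30.0]
prices = [90.0, 80.0, 70.0]  # lies exactly on price = -1*km + 100
print(T(-1.0, 100.0, mileages, prices))  # 0.0 (perfect fit)
print(T(0.0, 0.0, mileages, prices))     # 6466.666666666667
```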
The gradient descent algorithm is a mathematical technique to find a local minimum of a function, ensuring the function meets certain properties.
I approached machine learning from a YouTube video playlist by Machine Learnia (in French).
The foundations of my train.py program are inspired by Sindhu Seelam's article "Linear Regression From Scratch in Python WITHOUT Scikit-learn" published on Medium.
Other articles I used for the explanations and the writing of the present README file are:
- Daksh Trehan's articles "Linear Regression Explained" and "Gradient Descent Explained";
- Robert Kwiatkowski's article "Gradient Descent Algorithm — a deep dive";
- Jatin Mehra's article "Understanding Gradient Descent: A Beginner’s Guide".
I used Desmos Graphing Calculator to graphically display the functions for my examples.
To normalize the values of my arrays of mileage and prices I followed the tip of Cina on StackOverflow.
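A common way to normalize such arrays is min-max scaling, sketched below in plain Python (my own illustration; the StackOverflow tip may differ in detail):

```python
def normalize(values):
    """Min-max scaling: map values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(normalize([10.0, 20.0, 40.0]))  # [0.0, 0.3333333333333333, 1.0]
```

Normalizing mileages and prices keeps the gradients on a comparable scale, which makes gradient descent converge with a reasonable learning rate.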
To add argument flags to the program, I used the argparse library; in particular, I followed the article "Managing arguments in Python with argparse", especially its section on managing boolean arguments.
To add mathematical notation to the present README file, I followed the leads of Cheat Sheet: Adding Math Notation to Markdown.