A beginner-friendly guide to Linear Regression explaining hyperplanes, loss functions, and SGD optimization with practical intuition.
What is Linear Regression?
| Linear Regression |
Linear regression is a statistical algorithm used to predict a continuous output. Suppose company X has released ten series of a vehicle, priced between $100,000 and $300,000; we can predict the price of the 11th series using linear regression as our model. Framed that way it may sound like a time-series problem, but it becomes a regression problem once we describe each vehicle by its features (engine type, wheel size, body material, seating capacity, transmission system, curb weight, cargo space, and so on). All we have to do is pass these features as inputs to our model and set the price of the 11th series as the target.
| Linear Regression Prediction |
How Does Linear Regression Predict in Machine Learning?
Linear regression uses the equation of a line (y = mx + c). This is the simple line equation from high school math that we used to find y by plugging in the values of x, m, and c. In the table above, for predicting the price of a vehicle, we have four features per vehicle and one target: the price. Suppose we had only one feature, vehicle horsepower (hp). That would make our data a two-dimensional graph, with horsepower on the x-axis and price on the y-axis. In real life this will not work, because it makes the price depend only on engine horsepower, regardless of cabin space or vehicle weight. The price of a vehicle depends on many factors: the more relevant features we have, the more accurate our model becomes, though too many irrelevant features may lead to overfitting. Here we are using four features and one target, which makes our model five-dimensional.
In 2D: the model is a line (y = mx + c).
In 3D: the model becomes a plane, like a flat sheet of paper.
In 5D: the model becomes a hyperplane. We can't visualize it, but mathematically it is treated the same way and used for prediction, just like finding y for a given value of x.
Our aim is to find the target price of the series-11 vehicle, given the four feature values for that series.
In high school geometry we used y = mx + c to describe a line in the 2D plane. In machine learning we use the same function, written as the hypothesis function:
y-hat = w1x + w0
Looking at this equation, we can see the similarity between the equation of a line and the hypothesis function: they are the same equation with different notation. Here y becomes y-hat, and both are the predicted output for a given value of x; w1 plays the role of m, the slope (weight) of the line; and w0 plays the role of c, the bias term: the value of y-hat (or y) when x is zero.
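To make the notation concrete, here is a minimal sketch of the one-feature hypothesis function; the slope and bias values are made up for illustration.

```python
def predict(x, w1, w0):
    """The hypothesis function y-hat = w1*x + w0 -- the same as y = mx + c."""
    return w1 * x + w0

# With weight w1 = 2 and bias w0 = 5, an input of x = 3 gives 2*3 + 5 = 11.
print(predict(3, 2, 5))  # 11
```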
In our data with four features, our equation will look something like:
y-hat = w1(HP) + w2(curb weight) + w3(seats) + w4(cargo space) + w0
If we apply this equation to the series-1 vehicle, the known values are HP, curb weight, seats, cargo space, and the actual price; the unknowns are w1, w2, w3, w4, and w0. If we put all the known values from series 1 through series 10 into this equation, the model starts adjusting its weights iteratively to minimize the error; this is how supervised learning works. With the learned weights and the known feature values, we plug the series-11 car into the same equation to predict its price.
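The fit-then-predict workflow above can be sketched with scikit-learn's LinearRegression. Note that every number here is invented for illustration: the feature columns follow the article's example (HP, curb weight, seats, cargo space), but the values and prices are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical past series: each row is [HP, curb weight (kg), seats, cargo space (L)].
X = np.array([
    [150, 1400, 5, 400],
    [165, 1430, 5, 410],
    [180, 1460, 5, 420],
    [200, 1500, 5, 430],
    [220, 1550, 7, 450],
    [240, 1600, 7, 470],
])
# Hypothetical prices for those past series.
y = np.array([110_000, 125_000, 140_000, 160_000, 185_000, 205_000])

model = LinearRegression()
model.fit(X, y)  # learns w1..w4 (model.coef_) and w0 (model.intercept_)

# Predict the next series from its (made-up) feature values.
next_series = np.array([[260, 1650, 7, 490]])
print(model.predict(next_series))
```

The learned `coef_` array holds one weight per feature, which is exactly the w1..w4 in the equation above.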
As you can see, machine learning is not magic; it is pure mathematics predicting the price of a company X vehicle from its given features. This analogy applies to many fields beyond automobiles: you can use it to predict the sales of your store, or even stock prices (which I don't recommend, since stock prices are sensitive to human emotions and become volatile in situations like war). If I had to use it for stock markets, I would use regression to find out which stocks are sensitive to war and other factors (for example, given a trade deal, which stocks move up or down when international tariffs rise or fall) and then put my money in for the long term. There are no shortcuts for making money.
Criterion: Minimizing error with Mean Squared Error (MSE)
In our vehicle example, the relationship is relatively stable (more HP means more money). But what happens when the relationship is not stable? With stock prices, geopolitics, human emotions, quarterly earnings, and interest rates create volatility that can make a simple linear regression model fail. More powerful models like Recurrent Neural Networks (RNNs) are used for such time-series predictions, but that is a topic for another day. For now, the fundamental question remains: how will the model know if it is right or wrong? To answer this, we need some criterion to evaluate whether the predicted weights and biases are good. The Mean Squared Error (MSE for short) does exactly that: it measures the error, which is why it is also called the loss function.
Basically it works like this: the model predicts y_hat for the series-1 vehicle using some initial weights, compares the prediction with the actual value, subtracts the predicted value from the actual value, and squares the result to get the error. If you are wondering why we square: suppose in one example the actual value is 2 above the prediction, and in another example the prediction is 2 above the actual value. Without squaring, the two errors cancel each other out and sum to zero, so the model would think it is perfect when it is in fact missing the mark twice. To be clear, MSE does not directly cause overfitting or underfitting; rather, high MSE on training data usually means underfitting (the model is too simple), while low MSE on training data but high MSE on new data means overfitting (the model is memorizing the training data instead of generalizing).
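The cancellation problem described above is easy to demonstrate with two toy values: one prediction is 2 too low, the other 2 too high.

```python
import numpy as np

actual    = np.array([100, 200])
predicted = np.array([102, 198])  # off by -2 and +2

# Without squaring, the +2 and -2 errors cancel and the mean error is 0,
# which makes the model look perfect when it missed twice.
raw_mean_error = np.mean(actual - predicted)         # 0.0

# Squaring makes every miss count, revealing the true loss.
mse = np.mean((actual - predicted) ** 2)             # 4.0

print(raw_mean_error, mse)
```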
The Convex Loss Surface: Why we Love "The Bowl"
Now that we know how to calculate the error, the next question is: how does the model use these numbers to get better?
Imagine a U shape in a two-dimensional plane, or a bowl in three dimensions, and a ball placed on its rim. Rolling down under gravity, the ball settles at the lowest point possible. In machine learning we call this point the global minimum: the place where our error is smallest.
In the physical world, the ball might wobble before settling at the bottom as its speed decreases. Our model, however, is precise: it doesn't let the ball roll randomly, but uses vectors (mathematical arrows that indicate direction and distance) to decide the most direct path downward.
The Optimizer: Why we use Stochastic Gradient Descent (SGD)
Now we have our bowl and we know we have to move down to reach the Global Minimum. But in the real world, calculating the exact slope for every single piece of data at once can be incredibly slow and "heavy" for a computer.
This is where Stochastic Gradient Descent(SGD) comes in.
Stochastic is just a fancy word for "random". SGD updates the model parameters using a single example or a small random batch of data at a time, instead of the entire dataset. That makes computation cheap for large datasets and helps escape shallow local minima during training.
Ordinary (batch) gradient descent first looks at the entire dataset, mapping out the whole mountain or bowl before taking a single step; it is accurate but very slow. SGD instead picks one random data point (one car) at a time, calculates the error, and takes a small step immediately.
There are two huge advantages to this:
1. Speed: The model starts learning immediately. It doesn't have to wait to read the whole "library" of data before it starts improving.
2. Memory Efficiency: You don't need a supercomputer to process millions of rows at once; you just need enough memory for one row at a time.
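The one-point-at-a-time loop can be sketched in a few lines of NumPy. The data here is synthetic: a single feature x with a target generated from y = 3x + 2 plus a little noise, so we know what weights SGD should recover.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one feature, true relationship y = 3x + 2 plus noise.
x = rng.uniform(0, 1, 100)
y = 3 * x + 2 + rng.normal(0, 0.1, 100)

w, b = 0.0, 0.0   # start with arbitrary weights
lr = 0.1          # learning rate (alpha)

for epoch in range(50):
    for i in rng.permutation(len(x)):     # one random point at a time: the "stochastic" part
        y_hat = w * x[i] + b              # predict with current weights
        error = y_hat - y[i]              # how far off are we?
        w -= lr * error * x[i]            # gradient of the squared error w.r.t. w
        b -= lr * error                   # gradient of the squared error w.r.t. b

print(w, b)  # should land close to the true values 3 and 2
```

Each update touches only one row of data, which is exactly why SGD needs so little memory.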
A Note on the "Learning Rate" (α):
To make SGD work, we have to decide how big of a step the model takes. We call this the Learning Rate.
If the step is too big, the ball might overshoot the bottom and fly out the other side of the bowl.
If the step is too small, the model will take forever to reach the bottom.
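Both failure modes can be seen on the simplest possible bowl, f(w) = w², whose gradient is 2w and whose minimum sits at w = 0. The step counts and learning-rate values below are chosen purely for illustration.

```python
def descend(lr, steps=20, w=10.0):
    """Run plain gradient descent on f(w) = w**2, starting from w = 10."""
    for _ in range(steps):
        w -= lr * 2 * w   # gradient of w**2 is 2*w
    return w

print(descend(0.01))  # too small: after 20 steps we are still far from 0
print(descend(0.5))   # well chosen: reaches the bottom
print(descend(1.1))   # too big: each step overshoots further; w diverges
```

With lr = 1.1 each step multiplies w by -1.2, so the "ball" flies out the other side of the bowl a little further every time.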
Conclusion
Linear regression might look like a very simple model at first glance, but it forms the foundation of many machine learning techniques. What starts as a simple line equation from high school mathematics becomes a powerful predictive tool when combined with optimization methods like Stochastic Gradient Descent.
Understanding concepts like hyperplanes, loss functions, and convex optimization helps us see that machine learning is not magic—it is structured mathematics solving real-world problems.
Whether we are predicting vehicle prices, sales numbers, or other measurable outcomes, the core idea remains the same: learn patterns from known data and use them to estimate unknown values.
In future posts we will move beyond linear models and explore more complex algorithms, but mastering these fundamentals is what builds the intuition needed for deeper machine learning systems.