Are you working on supervised learning and searching for a way to reduce error? Do you want to find better values for your parameters? Then this blog is for you.

You can reduce error and obtain better parameter values using a cost function and gradient descent. So, without wasting precious time, let's discuss these concepts and how they solve these problems.

### Introduction:

Before discussing the actual topic, let's understand the "hypothesis" (h). A hypothesis is a function that predicts values (h) from given input values (x). For example, suppose an ML (Machine Learning) algorithm has to be built to predict the selling price of a house, given the size of the plot and some training data. (This example is used throughout this blog to explain the concepts.)

Now we need to generate a hypothesis function, denoted by h(x). It is a polynomial function that takes one or more parameters (θ). For the house-price example, a simple linear hypothesis is

h(x) = θ₀ + θ₁x

The same idea applies to all polynomial hypotheses. For every input value x, the hypothesis function h predicts an output. The general form of the hypothesis function is

h(x) = θ₀x₀ + θ₁x₁ + … + θᴺxᴺ = Σᵢ θᵢxᵢ, where i = 0, 1, …, N (and x₀ = 1)

Therefore h is a vector of predicted values, one for each training input.
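The single-feature hypothesis above can be sketched in a few lines of Python (a sketch alongside the blog's MATLAB code; the trial θ values here are assumptions chosen purely for illustration):

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis: predicted price for a house of size x (feet^2)."""
    return theta0 + theta1 * x

# Training inputs from the example (house sizes in feet^2).
sizes = [2104, 1416, 1534, 852]

# Trial parameter values (assumed for illustration only).
predictions = [hypothesis(50.0, 0.2, s) for s in sizes]
```

Here `predictions` plays the role of the vector h: one predicted value per training input.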

Our objective is to get the best possible line: the line for which the average squared vertical distance of the scattered points from the line is the least. Ideally, the line would pass through all the points of our training data set; in that case, the value of the cost function (J) would be zero. For the example mentioned above, let's consider the figure below.

In the above figure, you can observe that the actual values (y, the outputs in the training data) and the predicted values (h) differ. To make the algorithm optimal, we need to reduce this difference during training. The quantity that measures this difference is called the "cost function", and the method used to minimize it is called "gradient descent".

## Cost Function:

It is the average of the squared differences between the results of the hypothesis (h) and the actual outputs (y). The cost function is denoted by "J". It is also called the "squared error function" or "mean squared error". The mathematical equation is given by

J(θ₀, θ₁) = (1/2N) Σᵢ₌₁ᴺ (h(xᵢ) − yᵢ)²

where:

N --> number of training examples

h --> predicted output

y --> actual output

The mean is halved as a convenience for the computation of gradient descent: differentiating the squared term produces a factor of 2 that cancels the ½.
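As a cross-check of the formula, here is a minimal Python sketch of J using the blog's training data (the trial θ values are assumptions for illustration):

```python
def cost(theta0, theta1, xs, ys):
    """Halved mean squared error: J = (1/2N) * sum((h - y)^2)."""
    N = len(xs)
    return sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * N)

# Training data from the example: house size (x) and price (y).
xs = [2104, 1416, 1534, 852]
ys = [460, 232, 315, 178]

J = cost(0.0, 0.2, xs, ys)  # cost at an assumed trial theta
```

If the hypothesis fit every training point exactly, `cost` would return 0, matching the "line passes through all points" case described earlier.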

So from the above example, you can see that θ₀ and θ₁ play an important role in prediction. Thus we have to improve these θ values so that the cost function reaches its minimum. The gradient descent algorithm is the standard way to achieve this.

## Gradient Descent:

It is an optimization approach for determining the parameter values that minimize a cost function. It works with the derivative (the slope of the tangent line) of the cost function: the slope at the current point gives us a direction to move in. We make steps down the cost function in the direction of steepest descent. The size of each step is determined by the parameter α, which is called the "learning rate".

For example, the distance between each "star" in the graph above represents a step determined by our parameter α. A smaller α results in a smaller step, and a larger α results in a larger step. The direction in which the step is taken is determined by the partial derivative of J(θ₀, θ₁).

The gradient descent algorithm is given by:

Repeat until convergence:

θⱼ := θⱼ − α · ∂J(θ₀, θ₁)/∂θⱼ

But

∂J(θ₀, θ₁)/∂θⱼ = (1/N) Σᵢ₌₁ᴺ (h(xᵢ) − yᵢ) · xᵢⱼ

Therefore, repeat until convergence:

θⱼ := θⱼ − (α/N) Σᵢ₌₁ᴺ (h(xᵢ) − yᵢ) · xᵢⱼ

where j = 0, 1 represents the feature index number (with xᵢ₀ = 1).

At each iteration, one should update the parameters θ₀ and θ₁ simultaneously. Updating one parameter and then using its new value to compute the update for the other within the same iteration would yield a wrong implementation.
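One such simultaneous update can be sketched in Python (the helper name `gradient_step` is my own, not from the blog):

```python
def gradient_step(theta0, theta1, xs, ys, alpha):
    """One gradient descent step: both gradients are computed from the
    current parameters before either parameter is updated."""
    N = len(xs)
    errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
    grad0 = sum(errors) / N                              # d J / d theta0
    grad1 = sum(e * x for e, x in zip(errors, xs)) / N   # d J / d theta1
    # Simultaneous update: return both new values at once.
    return theta0 - alpha * grad0, theta1 - alpha * grad1
```

Computing `grad1` only after overwriting `theta0` would mix old and new parameter values within one iteration, which is exactly the wrong implementation the paragraph above warns about.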

Here our objective is to make the cost function converge to its minimum value. This convergence depends on the parameter α.

If α is very small, gradient descent can be slow (it may take a long time to reach the minimum).

If α is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge.
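The effect of α can be illustrated with a short Python sketch (the rescaled inputs and the two α values are assumptions chosen to make both behaviours visible):

```python
def run_descent(xs, ys, alpha, iters):
    """Gradient descent from theta = 0; returns the history of J values."""
    t0 = t1 = 0.0
    N = len(xs)
    history = []
    for _ in range(iters):
        errs = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        history.append(sum(e * e for e in errs) / (2 * N))  # record J first
        g0 = sum(errs) / N
        g1 = sum(e * x for e, x in zip(errs, xs)) / N
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1           # simultaneous update
    return history

# Sizes rescaled to thousands of feet^2 so a fixed alpha behaves sensibly.
xs = [2.104, 1.416, 1.534, 0.852]
ys = [460, 232, 315, 178]

small = run_descent(xs, ys, 0.1, 100)  # small alpha: J decreases steadily
large = run_descent(xs, ys, 1.0, 100)  # too-large alpha: J grows (diverges)
```

Note the feature rescaling: with raw sizes in the thousands, almost any fixed α either crawls or diverges, which is why scaling inputs before gradient descent is common practice.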

### MATLAB CODE:

% Author: Sneha G.K
% Topic : Cost Function in MATLAB
% Company: MATLAB Helper
% Website: https://MATLABHelper.com
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% MATLAB code to find and improve the parameter (theta value).
clc;
clear;
close all;

% Training data: size of house (x) and price of house (y).
x = [2104 1416 1534 852];
y = [460 232 315 178];

% Theta value to be inserted in the hypothesis function.
% Theta can be a 1x1 scalar or an nx1 vector.
theta = input("Enter the theta1 value:");
N = length(y);
h = theta*x; % hypothesis function.

% Cost function before training.
Error = (h - y).^2;
J = (1/(2*N))*sum(Error);
disp(J);

% Learning rate for gradient descent.
alpha = input("Enter alpha value:");

% Number of times the parameter is to be updated.
num_iter = input("Enter the number of times the parameter needs to be updated:");

% The logic behind gradient descent.
% If the parameters are an nx1 vector, add an inner loop to update
% every theta value simultaneously.
for i = 1:num_iter
    Error = h - y;
    delta = Error*x'; % inner product: equivalent to sum((h - y).*x)
    theta = theta - (alpha/N)*delta;
    h = theta*x;
end

% Cost function after training.
Error = (h - y).^2;
J = (1/(2*N))*sum(Error);
disp(J);

% Plot the training data, the fitted line, and the final (theta, J) point.
subplot(2,1,1);
plot(x, y, "ro", "MarkerFaceColor", "r");
hold on;
plot(x, h, "g-");
xlabel("Size (in feet^2)");
ylabel("Predicted value");
hold off;

subplot(2,1,2);
plot(theta, J, "bo", "MarkerFaceColor", "b");
xlabel("theta");
ylabel("Cost J");

### Conclusion:

- The cost function measures the amount of error in prediction, computed on the training data.
- The cost function plays an essential role in training an ML (Machine Learning) algorithm.
- The gradient descent algorithm is used to minimize the cost function for regression.

**Did you find some helpful content from our video or article and are now looking for its code, model, or application? You can purchase the specific Title, if available, and instantly get the download link.**

Thank you for reading this blog. Do share this blog if you found it helpful. If you have any queries, post them in the comments or contact us by emailing your questions to [email protected]. Follow us on LinkedIn and Facebook, and Subscribe to our YouTube Channel. *If you find any bug or error on this or any other page on our website, please inform us & we will correct it.*

If you are looking for free help, you can post your comment below & wait for any community member to respond, which is not guaranteed. You can book Expert Help, a paid service, and get assistance in your requirement. If your timeline allows, we recommend you book the **Research Assistance** plan. If you want to get trained in MATLAB or Simulink, you may join one of our **training** modules.

*If you are ready for the paid service, share your requirement with necessary attachments & inform us about any **Service** preference along with the timeline. Once evaluated, we will revert to you with more details and the next suggested step.*

*Education is our future. MATLAB is our feature. Happy MATLABing!*

How do you determine the number of iterations? Is it the same as the size of the data set? I’m assuming theta is the straight line gradient? Is alpha the intercept value of the regression that’s been previously calculated?

1. It is a great question, but unfortunately the number of iterations is found by trial and error: you try a set of values until you reach your destination. This tuning burden is one reason people opt for neural network frameworks.

2. No, it depends on the theta value you give at the initial stage.

3. No, alpha is the learning rate (the step size of gradient descent), not the intercept of the previously calculated regression.

Thanks for your compliment on the blog and for asking these questions.

Very nice blog, but this is my question. I'm unclear about the source of num_iter, theta, and alpha. I'm thinking that you could use a least-squares algorithm to generate a model. Then, you could use the gradient of the model as an estimate for theta. The size of the data set could be used as an estimate for num_iter, and finally, the intercept of the model could be used as an estimate for alpha in gradient descent. Am I right?

You can use a least-squares fit to initialize theta. But num_iter doesn't depend on the data set size; it depends on the initial theta value you provide (and on alpha). The intercept should not be used as alpha: alpha is the learning rate, a separate tuning parameter.