Regression Analysis: How Does It Work?

Multicollinearity (strong correlation among the independent variables) is a problem because it makes it hard to rank variables by their importance, or to select the most important independent variable. Heteroscedasticity occurs when the variability of the target variable is not constant across values of an independent variable. For example, a poorer person will spend a fairly constant amount by always eating inexpensive food, while a wealthier person may occasionally buy inexpensive food and at other times eat expensive meals. Those with higher incomes therefore display greater variability in food consumption.
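To make this concrete, here is a minimal sketch of how one might test for heteroscedasticity with the Breusch-Pagan test from statsmodels; the income and spending figures are simulated purely to mimic the example above:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)

# Simulated data in which the noise grows with income,
# mimicking the food-spending example above.
income = rng.uniform(20, 120, 500)
spending = 5 + 0.1 * income + rng.normal(0, 0.02 * income)

X = sm.add_constant(income)
res = sm.OLS(spending, X).fit()

# Breusch-Pagan test: the null hypothesis is constant error variance.
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(res.resid, X)
print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")  # small value flags heteroscedasticity
```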

Using unnecessary explanatory variables can lead to overfitting. Overfitting means that our algorithm works well on the training set but performs poorly on the test set; this is also known as a problem of high variance. When our algorithm works so poorly that it cannot fit even the training set well, it is said to underfit the data; this is also known as a problem of high bias.
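A small sketch of overfitting versus underfitting on simulated data, using plain NumPy polynomial fits (the degrees 1, 3 and 9 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1, 1, 40))
y = np.sin(3 * x) + rng.normal(0, 0.2, 40)   # noisy, mildly non-linear ground truth

x_train, y_train = x[::2], y[::2]            # even indices as the training set
x_test, y_test = x[1::2], y[1::2]            # odd indices as the test set

def train_test_mse(degree):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 9):
    train_mse, test_mse = train_test_mse(degree)
    # Degree 1 underfits (high error on both sets: high bias); degree 9
    # tends to chase the noise (low training error, higher test error:
    # high variance); degree 3 sits in between.
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```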

For each type of regression analysis, there are assumptions that need to be considered, along with an understanding of the nature of the variables and their distributions. The simplest regression type is Linear Regression, which tries to establish a relationship between the independent and dependent variables. The dependent variable considered here is always continuous. Linear Regression is a predictive model used to find a linear relationship between a dependent variable and one or more independent variables.

If there is a single independent variable, the model is known as Simple Linear Regression. If there are multiple independent variables, it is called Multiple Linear Regression. As the model is used to predict the dependent variable, the relationship between the variables can be written in the following form:

Y = β₀ + β₁X₁ + β₂X₂ + … + βₙXₙ + ε

Here Y is the dependent variable, X₁ … Xₙ are the independent variables, β₀ … βₙ are the coefficients to be estimated, and ε is the error term (with n = 1 in the simple case).
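As an illustrative sketch, fitting simple and multiple linear regressions with scikit-learn might look like this (the intercept 3 and slope 2 in the simulated data are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)

# Simple Linear Regression: a single independent variable.
X = rng.uniform(0, 10, (100, 1))
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 1, 100)   # simulated: Y = 3 + 2X + noise

simple = LinearRegression().fit(X, y)
print(simple.intercept_, simple.coef_)            # roughly 3 and [2]

# Multiple Linear Regression: the same call, just with more columns in X.
X_multi = np.column_stack([X[:, 0], rng.uniform(0, 5, 100)])
multi = LinearRegression().fit(X_multi, y)
print(multi.coef_)                                # one coefficient per variable
```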

A central concern in regression analysis is understanding the variance between the variables, which in turn requires understanding the measures of variation. With all of this taken into consideration, before we start assessing whether the model is performing well, we need to consider the assumptions of Linear Regression.

Since Linear Regression assesses whether one or more predictor variables explain the dependent variable, it rests on five assumptions: a linear relationship between the dependent and independent variables, normally distributed residuals, little or no multicollinearity among the independent variables, no autocorrelation in the residuals, and homoscedasticity (constant error variance). With these assumptions considered while building the model, we can fit it and make our predictions for the dependent variable. For any type of machine learning model, we need to check whether the variables considered for the model are correct, using an evaluation metric. In the case of regression analysis, the statistical measure that evaluates the model is called the coefficient of determination, written r².

The coefficient of determination is the portion of the total variation in the dependent variable that is explained by variation in the independent variables. The higher the value of r², the better the model with the chosen independent variables explains the data.
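That definition translates directly into code. A minimal sketch, with the example arrays invented purely for illustration:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of the total variation in y explained by the model."""
    ss_res = np.sum((y_true - y_pred) ** 2)           # unexplained variation
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total variation
    return 1 - ss_res / ss_tot

y_true = np.array([3.1, 4.9, 7.2, 8.8, 11.1])
y_pred = np.array([3.0, 5.0, 7.0, 9.0, 11.0])  # predictions from some fitted line
print(r_squared(y_true, y_pred))               # close to 1: most variation explained
```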

Polynomial Regression is used to model nonlinear relationships by taking polynomial functions of the independent variables. (The original post includes a figure in which a red polynomial curve fits the data better than a green straight line.) Hence, in situations where the relationship between the dependent and independent variables seems to be non-linear, we can deploy Polynomial Regression models.

Here we can create new features such as X², X³, and so on. In the case of multiple variables, say X1 and X2, we can create a third feature X3 as their product, i.e. X3 = X1 × X2. The main drawback of this type of regression model is that creating unnecessary extra features or fitting higher-degree polynomials may lead to overfitting.
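A sketch of how these extra features might be generated automatically with scikit-learn's PolynomialFeatures, which produces exactly the squared and product (X3 = X1 × X2) terms described above; the simulated data is arbitrary:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)
X = rng.uniform(-2, 2, (200, 2))   # two raw features, X1 and X2
y = 1 + X[:, 0] ** 2 + 0.5 * X[:, 0] * X[:, 1] + rng.normal(0, 0.1, 200)

# degree=2 expands [X1, X2] into [1, X1, X2, X1^2, X1*X2, X2^2],
# i.e. it adds the squared terms and the X3 = X1 * X2 product feature.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)

poly = model.named_steps["polynomialfeatures"]
print(poly.get_feature_names_out(["X1", "X2"]))
```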

Logistic Regression, also known as the Logit or Maximum-Entropy classifier, is a supervised learning method for classification. It establishes a relationship between a categorical dependent variable and the independent variables using regression. The dependent variable is categorical, i.e. it takes one of a discrete set of class labels, and the probabilities describing the possible outcomes of a query point are modelled using a logistic function. This model belongs to the family of discriminative classifiers, which rely on attributes that discriminate the classes well, and it is used when the dependent variable has two classes.
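A minimal two-class sketch with scikit-learn, using simulated data in which class 1 becomes more likely as the single feature grows (the steepness factor of 2 is an arbitrary choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)

# One feature; class 1 becomes more likely as x grows.
x = rng.uniform(-3, 3, (300, 1))
prob = 1 / (1 + np.exp(-2 * x[:, 0]))   # a logistic function of x
y = rng.binomial(1, prob)               # categorical (0/1) dependent variable

clf = LogisticRegression().fit(x, y)
print(clf.predict_proba([[0.5]]))       # modelled class probabilities for a query point
print(clf.predict([[0.5]]))             # the predicted class label
```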

Typically, a regression analysis is done for one of two purposes: to predict the value of the dependent variable for individuals for whom some information concerning the explanatory variables is available, or to estimate the effect of some explanatory variable on the dependent variable. To see how much our prediction can be trusted, we use the standard error of the prediction [3] to construct confidence intervals for the prediction. If our goal is not to make a prediction for an individual, but rather to estimate the mean value of the dependent variable across a large pool of similar individuals, we use the standard error of the estimated mean instead when computing confidence intervals.
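One way to see both standard errors in practice is statsmodels' get_prediction, whose summary frame reports a confidence interval for the estimated mean (mean_ci_*) alongside the wider interval for an individual prediction (obs_ci_*). A sketch on simulated data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100)
y = 2 + 0.8 * x + rng.normal(0, 1, 100)   # simulated linear data

res = sm.OLS(y, sm.add_constant(x)).fit()

x_new = [[1.0, 5.0]]   # [constant, x] for the point we want to predict at
frame = res.get_prediction(x_new).summary_frame(alpha=0.05)

# mean_ci_* is built from the standard error of the estimated mean (narrower);
# obs_ci_* is built from the standard error of the prediction (wider),
# because a single individual also carries the noise of the error term.
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",
             "obs_ci_lower", "obs_ci_upper"]])
```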

To estimate the effect of an explanatory variable, we should always use the most complete model available, i.e. the model that includes all of the relevant explanatory variables. Our estimate of the impact of a unit difference in the targeted explanatory variable is its coefficient in the prediction equation. The extent to which that estimate can be trusted is measured by the standard error of the coefficient.

We take the standard approach of classical hypothesis testing: to see whether there is evidence supporting the inclusion of the variable in the model, we start by hypothesizing that it does not belong, i.e. that its true coefficient is 0. Dividing the estimated coefficient by the standard error of the coefficient yields the t-ratio of the variable, which shows how many standard-deviations'-worth of sampling error would have had to occur to yield an estimated coefficient so different from the hypothesized true value of 0.

We then ask how likely it is to have experienced that much sampling error. This yields the significance level (p-value) of the sample data with respect to the null hypothesis that 0 is the true value of the coefficient.
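The t-ratio and its significance level can be computed directly from the pieces described above; the sketch below does so by hand on simulated data and then checks the result against what statsmodels reports:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(6)
x = rng.uniform(0, 10, 50)
y = 1 + 0.5 * x + rng.normal(0, 2, 50)   # simulated data

res = sm.OLS(y, sm.add_constant(x)).fit()

coef = res.params[1]     # estimated coefficient of x
se = res.bse[1]          # standard error of the coefficient
t_ratio = coef / se      # standard errors' worth of distance from 0

# Two-sided significance level under the null hypothesis that the
# true coefficient is 0.
p_value = 2 * stats.t.sf(abs(t_ratio), df=res.df_resid)

print(t_ratio, p_value)
print(res.tvalues[1], res.pvalues[1])   # statsmodels reports the same numbers
```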

How do those factors interact with each other? And, perhaps most importantly, how certain are we about all of these factors? In regression analysis, those factors are called variables. You have your dependent variable, the main factor you are trying to understand or predict. And then you have your independent variables, the factors you suspect have an impact on your dependent variable. To conduct a regression analysis, you gather the data on the variables in question and plot all of that information on a chart, with the independent variable on the horizontal axis and the dependent variable on the vertical axis.

Glancing at this data, you would probably notice that sales are higher on days when it rains a lot. That is interesting to know, but by how much? If it rains 3 inches, how much will you sell? What about if it rains 4 inches? Now imagine drawing a line through the chart, one that runs roughly through the middle of all the data points.

This line will help you answer, with some degree of certainty, how much you typically sell when it rains a certain amount.

In addition to drawing the line, your statistics program also outputs a formula that describes the slope of the line, something like:

sales = baseline sales + 5 × (inches of rain) + error term

Ignore the error term for now, and just focus on the model: in the past, for every additional inch of rain, you made an average of five more sales. You might be tempted to say that rain has a big impact on sales if every inch gets you five more sales, but whether this variable is worth your attention will depend on the error term.

A regression line always has an error term because, in real life, independent variables are never perfect predictors of the dependent variable. Rather, the line is an estimate based on the available data. The error term tells you how certain you can be about the formula: the larger it is, the less certain the regression line.
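To tie the example together, here is a sketch that fits a line to a small, entirely invented rain-versus-sales data set (the numbers were chosen so that the slope comes out near five sales per inch) and looks at the spread of the residuals as a rough read on the error term:

```python
import numpy as np

# Entirely invented rain (inches) vs. sales data, for illustration only.
rain = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
sales = np.array([202, 207, 206, 212, 210, 216, 215, 221])

slope, intercept = np.polyfit(rain, sales, 1)
residuals = sales - (intercept + slope * rain)

print(f"sales ≈ {intercept:.0f} + {slope:.1f} × (inches of rain)")
# The residual spread plays the role of the error term: the larger it is,
# the less certain we should be about the fitted line.
print(f"residual standard deviation: {residuals.std(ddof=2):.2f}")
```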

The example above uses only one variable to predict the factor of interest, in this case rain to predict sales. Typically, though, you start a regression analysis wanting to understand the impact of several independent variables.


