The basic idea of regression in statistics is to model the relationship between a response and one or more explanatory variables. The explanatory variables are called the independent variables (often thought of as the cause), and the response is the dependent variable (the effect). Regression by itself describes association; reading it as cause and effect needs support from the study design. When we regress the dependent variable on the independent one(s) using a regression equation, we obtain regression coefficients: each slope estimates the change in the dependent variable per unit change in the corresponding independent variable, and it can take any real value. The degree of linear association is measured by a related quantity, the correlation coefficient, whose value always ranges from -1 to +1. A value close to or equal to +1 implies high or perfect positive linear association respectively. Likewise, a value close to or equal to -1 implies high or perfect negative linear association respectively. A value close to or equal to 0 implies little or no linear association respectively.
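To keep the two quantities apart, the short sketch below compares the regression slope with the correlation coefficient. The numbers in x and y are toy values invented purely for illustration. Rescaling x changes the slope but leaves the correlation untouched, which is why only the correlation is confined to the range -1 to +1.

```r
# Toy data, invented purely for illustration (not from any study)
x <- 1:10
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9)

slope <- unname(coef(lm(y ~ x))[2])  # regression slope of y on x
r <- cor(x, y)                       # correlation coefficient

# Express x in units 100 times smaller, so its values are 100 times larger
x_scaled <- 100 * x
slope_scaled <- unname(coef(lm(y ~ x_scaled))[2])
r_scaled <- cor(x_scaled, y)

c(slope = slope, slope_scaled = slope_scaled)  # slope shrinks by a factor of 100
c(r = r, r_scaled = r_scaled)                  # correlation is unchanged
```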
When dealing with real-life data, we resort to regression analysis for estimating the relationships among the variables. It includes a number of modeling techniques for analyzing the association between one dependent variable and one or more independent variables (also called predictors). Here we will deal with some of the techniques that the statistical software R offers for regression analysis.
Simple Linear Regression
Here, we investigate the relationship between one dependent variable and one independent variable. R provides a number of tools to achieve our objective in this regard.
Let us consider an example with the snout-vent length and the weight of alligators as the independent and the dependent variable respectively. Our objective is to quantify the linear relationship between them.
First, we create a data frame to store the data. Note that the observations have been transformed to the log scale.
R Code: alligator = data.frame(
lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76, 3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50, 3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25))
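Before plotting, it can help to confirm that the data frame was entered as intended. The sketch below repeats the data so that it runs on its own, inspects it, and back-transforms it to the original scale. The back-transformation assumes natural logarithms were used, which the variable names suggest but the text does not state; alligator_raw is a name introduced here for illustration.

```r
# Data from the text: log snout-vent length and log weight of 15 alligators
alligator <- data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
               3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
               3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)

str(alligator)      # structure: number of observations and column types
summary(alligator)  # minimum, quartiles, mean and maximum of each column

# Back-transform to the original scale (assumption: natural logarithms)
alligator_raw <- data.frame(
  Length = exp(alligator$lnLength),
  Weight = exp(alligator$lnWeight)
)
head(alligator_raw)
```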
Next, we perform some exploratory graphical analysis to visualize the data.
plot(lnWeight ~ lnLength, data = alligator,
xlab = "Snout vent length (inches) on log scale",
ylab = "Weight (pounds) on log scale",
main = "Figure 1: Alligators in Central Florida")
The graph suggests that weight (on the log scale) increases linearly with snout vent length (again on the log scale). Thus, we fit a simple linear regression model to the data and save the fitted model to an object for further analysis:
alli.mod1 = lm(lnWeight ~ lnLength, data = alligator)
The lm function takes the data stored in alligator and fits a linear model with log weight as the dependent variable and log length as the predictor (that is, it regresses weight on length).
Calling summary(alli.mod1) gives the five-number summary of the residuals (the estimation errors) as well as the estimated intercept and slope of the best-fit regression line, together with their standard errors:
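To see what lm estimates in the one-predictor case, the sketch below reproduces the least-squares slope and intercept from the textbook formulas and compares them with coef(alli.mod1). The names x, y, slope_manual and intercept_manual are introduced here for illustration only.

```r
# Data and model from the text
alligator <- data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
               3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
               3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
alli.mod1 <- lm(lnWeight ~ lnLength, data = alligator)

x <- alligator$lnLength
y <- alligator$lnWeight

# Least-squares estimates for simple linear regression:
#   slope = cov(x, y) / var(x),  intercept = mean(y) - slope * mean(x)
slope_manual <- cov(x, y) / var(x)
intercept_manual <- mean(y) - slope_manual * mean(x)

coef(alli.mod1)                     # estimates reported by lm
c(intercept_manual, slope_manual)   # the same values computed by hand
```

Because both variables are on the log scale, the slope describes how log weight changes per unit change in log length, not pounds per inch.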
Call:
lm(formula = lnWeight ~ lnLength, data = alligator)
Residuals:
     Min       1Q   Median       3Q      Max
-0.24348 -0.03186  0.03740  0.07727  0.12669
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -8.4761     0.5007  -16.93 3.08e-10 ***
lnLength      3.4311     0.1330   25.80 1.49e-12 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1229 on 13 degrees of freedom
Multiple R-squared: 0.9808, Adjusted R-squared: 0.9794
F-statistic: 665.8 on 1 and 13 DF, p-value: 1.495e-12
As we see, besides the aforementioned statistics, summary gives some other useful information as well. We get the values of R-square and Adjusted R-square. R-square is the proportion of the variation in the dependent variable that the model explains, which gives us an idea of the goodness of fit; with a single predictor it equals the square of the correlation coefficient between the two variables. Adjusted R-square additionally penalizes the number of predictors used, in order to avoid overestimating the strength of association. In this example the two values (0.9808 and 0.9794) differ little, as only one predictor has been used.
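Both measures can be recomputed from the fitted model, which makes their definitions concrete. The sketch below uses the standard formulas R-square = 1 - RSS/TSS and Adjusted R-square = 1 - (1 - R-square)(n - 1)/(n - p - 1), where p is the number of predictors. The names rss, tss, r2, adj_r2, n, p and r are introduced here for illustration.

```r
# Data and model from the text
alligator <- data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
               3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
               3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
alli.mod1 <- lm(lnWeight ~ lnLength, data = alligator)

rss <- sum(resid(alli.mod1)^2)                                  # residual sum of squares
tss <- sum((alligator$lnWeight - mean(alligator$lnWeight))^2)  # total sum of squares
r2 <- 1 - rss / tss

n <- nrow(alligator)  # number of observations
p <- 1                # number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With one predictor, R-square equals the squared correlation coefficient
r <- cor(alligator$lnLength, alligator$lnWeight)

c(r2 = r2, r_squared = r^2, adj_r2 = adj_r2)
```

The values agree with summary(alli.mod1)$r.squared and summary(alli.mod1)$adj.r.squared.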
plot(resid(alli.mod1) ~ fitted(alli.mod1),
xlab = "Fitted Values", ylab = "Residuals",
main = "Figure 2: Residual Diagnostic Plot")
Now, we use the above code to generate a scatterplot of the residuals against the fitted values to check for systematic patterns. A pattern or trend in the residual plot would indicate a poor fit, since the errors are assumed to be randomly scattered about a mean of 0.
The absence of any definite pattern in the residuals suggests that our model fits the data well. Had we found a pattern, we would need to modify the model (for example, by transforming a variable or adding a term) to obtain a better fit.
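A horizontal reference line at zero makes the residual plot easier to read, and a quick numerical check confirms that the residuals average to zero, as they must for a least-squares fit with an intercept. This is a minimal sketch; res is a name introduced here, and the dashed line style is only a presentational choice.

```r
# Data and model from the text
alligator <- data.frame(
  lnLength = c(3.87, 3.61, 4.33, 3.43, 3.81, 3.83, 3.46, 3.76,
               3.50, 3.58, 4.19, 3.78, 3.71, 3.73, 3.78),
  lnWeight = c(4.87, 3.93, 6.46, 3.33, 4.38, 4.70, 3.50, 4.50,
               3.58, 3.64, 5.90, 4.43, 4.38, 4.42, 4.25)
)
alli.mod1 <- lm(lnWeight ~ lnLength, data = alligator)

res <- resid(alli.mod1)
mean(res)  # zero up to floating-point rounding

# Residuals against fitted values, with a dashed reference line at zero
plot(res ~ fitted(alli.mod1),
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals with zero reference line")
abline(h = 0, lty = 2)

range(res)  # smallest and largest residual, matching the summary output
```

Points scattered evenly above and below the dashed line, with no curve or funnel shape, are what a well-fitting linear model would be expected to produce.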