Residuals:
Residuals are the differences between the observed values of the dependent variable and the corresponding values predicted by a statistical model. (Strictly speaking, the unobservable deviations from the true regression function are called errors; residuals are their observable estimates, though the two terms are often used interchangeably.) In the context of linear regression, residuals represent the vertical distances between the data points and the fitted regression line. Mathematically, the residual for the ith observation is calculated as follows:
Residual (eᵢ) = Observed Value (Yᵢ) − Fitted Value (Ŷᵢ)
Residuals are important because they provide insights into how well the regression model fits the data. If the model fits the data well, the residuals should be randomly scattered around zero with no clear patterns or trends. Systematic patterns in the residuals may indicate that the model is missing important explanatory variables, the relationship between the variables is not linear, or the assumptions of the model are violated.
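To make the definition concrete, here is a minimal sketch in Python using NumPy; the five (X, Y) pairs and variable names are purely illustrative:

```python
import numpy as np

# Hypothetical data: five (X, Y) pairs, purely illustrative.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 57.0, 66.0, 68.0, 77.0])

# Least-squares line fit; np.polyfit returns [slope, intercept] for deg=1.
slope, intercept = np.polyfit(x, y, deg=1)

y_hat = intercept + slope * x   # fitted values (Ŷᵢ)
resid = y - y_hat               # residuals (eᵢ = Yᵢ − Ŷᵢ)

print(resid)
print(resid.sum())  # sums to ~0 for a least-squares fit with an intercept
```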
Fitted Values:
Fitted values, also known as predicted values, are the values of the dependent variable (Y) that are estimated by the regression model for each corresponding value of the independent variable(s) (X). These values are calculated based on the estimated coefficients obtained from the regression analysis. In a simple linear regression model, the fitted value for the ith observation (Ŷᵢ) is calculated as:
Fitted Value (Ŷᵢ) = Intercept + Slope × Xᵢ
Fitted values represent the points on the regression line that correspond to the given values of the independent variable(s). They serve as the model’s predictions for the dependent variable.
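The same fitted values can be obtained directly from the closed-form least-squares estimates, as a small sketch with the same hypothetical data shows:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 57.0, 66.0, 68.0, 77.0])

# Closed-form least-squares estimates for simple linear regression:
#   slope = Cov(X, Y) / Var(X),  intercept = ȳ − slope · x̄
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
intercept = y.mean() - slope * x.mean()

fitted = intercept + slope * x  # Ŷᵢ = intercept + slope × Xᵢ
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print(fitted)
```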
Goodness of Fit:
Goodness of fit refers to how well a statistical model fits the observed data. It is a measure of the degree to which the model explains the variation in the dependent variable. Several methods are used to assess the goodness of fit for a regression model (a worked sketch computing most of them follows this list):
- Residual Analysis: As mentioned earlier, analyzing the residuals is a fundamental method to assess the goodness of fit. If the residuals are randomly scattered around zero with no patterns or trends, it indicates a good fit. However, systematic patterns in the residuals suggest that the model may not be capturing all the relevant information.
- R-squared (Coefficient of Determination): R-squared quantifies the proportion of the variation in the dependent variable that is explained by the independent variable(s) in the model. For a least-squares fit with an intercept it takes values between 0 and 1, where 1 indicates a perfect fit; a higher R-squared value indicates that the model accounts for more of the observed variation.
- Adjusted R-squared: Adjusted R-squared is a modified version of R-squared that penalizes the addition of unnecessary variables to the model. It is particularly useful when comparing models with different numbers of independent variables.
- F-test: The F-test is used to assess the overall significance of the regression model. It tests whether the coefficients of all the independent variables in the model are jointly significant. A significant F-test indicates that the model explains more variation than an intercept-only model; it does not by itself guarantee a good fit.
- Mean Square Error (MSE) or Root Mean Square Error (RMSE): MSE measures the average squared difference between the observed values and the fitted values; RMSE is its square root, expressed in the units of the dependent variable. Lower values of MSE or RMSE indicate a better fit of the model.
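The sketch below computes R-squared, adjusted R-squared, RMSE, and the overall F-statistic by hand for the hypothetical data used earlier; the formulas are the standard ones, and scipy.stats is assumed to be available for the F-distribution p-value:

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 57.0, 66.0, 68.0, 77.0])
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

n, p = len(x), 1                       # observations, predictors
ss_res = np.sum((y - y_hat) ** 2)      # residual sum of squares
ss_tot = np.sum((y - y.mean()) ** 2)   # total sum of squares

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
rmse = np.sqrt(ss_res / n)

# Overall F-test: does the model explain more than an intercept-only model?
f_stat = ((ss_tot - ss_res) / p) / (ss_res / (n - p - 1))
f_pval = stats.f.sf(f_stat, p, n - p - 1)

print(f"R²={r2:.3f}  adj R²={adj_r2:.3f}  RMSE={rmse:.3f}")
print(f"F={f_stat:.2f}, p={f_pval:.4f}")
```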
Probability Distribution of Residuals:
In statistics, the probability distribution of residuals, also known as the residual distribution, refers to the distribution of the differences between the observed values of the dependent variable and the values predicted by the model. The residual distribution provides valuable information about the quality of the model’s fit to the data and is essential for assessing the assumptions of the regression model.
Properties of the Residual Distribution:
- Mean of Residuals: For a least-squares fit that includes an intercept, the mean of the residuals is exactly zero by construction; more generally, a mean close to zero is expected if the model is correctly specified and unbiased. A non-zero mean of the residuals may indicate a bias in the model.
- Variability (Standard Deviation) of Residuals: The standard deviation of the residuals, also known as the residual standard error, measures the spread of the residuals around zero. A smaller residual standard error suggests a better fit of the model to the data (both the mean and this quantity are computed in the sketch after this list).
- Distribution Shape: The shape of the residual distribution is crucial for assessing the model’s assumptions. In linear regression, approximately normal residuals support the assumption of normally distributed errors, which underlies the usual t- and F-based inference; linearity and constant variance (homoscedasticity) must be checked separately, for example with residuals-vs-fitted plots. Marked deviations from normality may indicate potential problems with the model.
- Independence: The residuals should be independent of each other, meaning that the error terms for different observations should not be correlated. Autocorrelation in the residuals suggests that the model does not capture all the underlying patterns in the data.
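As a quick check of the first two properties, the sketch below computes the mean of the residuals and the residual standard error (with an n − p − 1 denominator) for the same hypothetical data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 57.0, 66.0, 68.0, 77.0])
slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (intercept + slope * x)

n, p = len(x), 1
print(resid.mean())                               # ~0 for OLS with an intercept
print(np.sqrt(np.sum(resid ** 2) / (n - p - 1)))  # residual standard error
```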
Assessing the Residual Distribution:
To assess the residual distribution, various graphical and statistical methods can be used (a combined sketch follows the list):
- Histogram or Density Plot: A histogram or density plot of the residuals can help visualize the shape and spread of the distribution. It can reveal skewness, kurtosis, or departures from normality.
- Normal Probability Plot (Q-Q Plot): A Q-Q plot compares the quantiles of the residuals to the quantiles of a theoretical normal distribution. If the residuals follow a normal distribution, the points on the Q-Q plot should roughly form a straight line.
- Residuals vs. Fitted Values Plot: This plot examines the relationship between the residuals and the fitted values. A random scatter of points around the zero line indicates a well-behaved residual distribution.
- Residuals vs. Independent Variable Plot: For multiple regression models, plotting residuals against each independent variable can help identify potential patterns or nonlinear relationships.
- Durbin-Watson Test: This test is used to detect autocorrelation in the residuals. A value close to 2 indicates no autocorrelation, while values significantly below 2 suggest positive autocorrelation, and values above 2 suggest negative autocorrelation.
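The sketch below combines several of these diagnostics in one figure. It assumes matplotlib, scipy, and statsmodels are installed, and uses simulated data so the plots have enough points to read:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
from statsmodels.stats.stattools import durbin_watson

# Simulated data: a linear trend plus normal noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 + 2.0 * x + rng.normal(0, 1.5, size=x.size)

slope, intercept = np.polyfit(x, y, deg=1)
fitted = intercept + slope * x
resid = y - fitted

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].hist(resid, bins=10)                      # shape of the distribution
axes[0].set_title("Histogram of residuals")
stats.probplot(resid, dist="norm", plot=axes[1])  # normal Q-Q plot
axes[1].set_title("Normal Q-Q plot")
axes[2].scatter(fitted, resid)                    # residuals vs. fitted values
axes[2].axhline(0.0, linestyle="--", color="gray")
axes[2].set_title("Residuals vs. fitted")
plt.tight_layout()
plt.show()

# Durbin-Watson statistic: ~2 suggests no first-order autocorrelation.
print("Durbin-Watson:", durbin_watson(resid))
```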