Correlation Coefficient, Assumptions of Correlation Coefficient
The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables. Its values range from -1.0 to 1.0. A calculated value greater than 1.0 or less than -1.0 means there is an error in the correlation measurement. A correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a perfect positive correlation. A correlation of 0.0 shows no linear relationship between the movements of the two variables.
While the correlation coefficient measures a degree of relation between two variables, it only measures the linear relationship between the variables. The correlation coefficient cannot capture nonlinear relationships between two variables.
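A small sketch can make this limitation concrete. The helper `pearson_r` and the sample data below are illustrative assumptions, not part of the original text. The second data set depends on x exactly, but through a curve, and its coefficient comes out at zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [-3, -2, -1, 0, 1, 2, 3]
linear = [2 * x + 1 for x in xs]   # exact straight-line dependence
parabola = [x ** 2 for x in xs]    # exact dependence, but nonlinear

r_linear = pearson_r(xs, linear)
r_parabola = pearson_r(xs, parabola)
print(r_linear)    # 1.0
print(r_parabola)  # 0.0
```

A coefficient near zero therefore rules out a straight-line relationship only. It does not show that the variables are unrelated.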
A value of exactly 1.0 means there is a perfect positive relationship between the two variables. For a positive increase in one variable, there is also a positive increase in the second variable. A value of -1.0 means there is a perfect negative relationship between the two variables. This shows the variables move in opposite directions: for a positive increase in one variable, there is a decrease in the second variable. If the correlation is 0, there is no linear relationship between the two variables.
The strength of the relationship varies in degree based on the value of the correlation coefficient. For example, a value of 0.2 shows there is a positive relationship between the two variables, but it is weak. As a rule of thumb, many practitioners do not consider a correlation strong until its absolute value surpasses about 0.8, and an absolute value of 0.9 or greater would represent a very strong relationship. These cutoffs are conventions that vary by field, and whether a correlation is statistically significant also depends on the sample size.
This statistic is useful in finance. For example, it can be helpful in determining how closely a mutual fund's returns move with those of its benchmark index, or of another fund or asset class. By adding a low or negatively correlated mutual fund to an existing portfolio, the investor gains diversification benefits.
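As a rough illustration of the finance use case, the sketch below compares a fund's returns with a benchmark and with a second asset. All three return series are made-up numbers chosen for the example, and the names `fund`, `benchmark`, and `other_asset` are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation of two equal-length sequences (illustrative helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly returns in percent (invented for illustration only)
benchmark = [1.2, -0.8, 2.1, 0.5, -1.5, 1.8]
fund = [1.0, -0.5, 1.9, 0.7, -1.2, 1.5]
other_asset = [-0.9, 0.6, -1.4, -0.1, 1.1, -1.0]

r_fund_benchmark = pearson_r(fund, benchmark)
r_fund_other = pearson_r(fund, other_asset)

print(f"fund vs benchmark:   {r_fund_benchmark:.3f}")
print(f"fund vs other asset: {r_fund_other:.3f}")
```

With these invented numbers the fund moves closely with its benchmark, while the second asset moves against it, which is the pattern an investor would look for when seeking diversification.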
Correlation Coefficient Formulas
One of the most commonly used formulas in stats is Pearson’s correlation coefficient formula. If you’re taking a basic stats class, this is the one you’ll probably use:
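The formula itself is not reproduced in this text. Assuming the standard computational form taught in introductory courses, which uses exactly the symbols r, x, y and n, it can be written as:

```latex
r = \frac{n\sum xy - \left(\sum x\right)\left(\sum y\right)}
         {\sqrt{\left[\,n\sum x^{2} - \left(\sum x\right)^{2}\right]
                \left[\,n\sum y^{2} - \left(\sum y\right)^{2}\right]}}
```

Each sum runs over all n paired values.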
r = Pearson correlation coefficient
x = Values in first set of data
y = Values in second set of data
n = Total number of values.
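As a minimal sketch, the symbols above can be plugged into the usual computational form of Pearson's formula, which is spelled out in the docstring below. That form is an assumption rather than something quoted from the text, and the function name and sample data are illustrative.

```python
import math

def pearson_from_sums(x, y):
    """Pearson's r via the computational form (assumed here):

    r = (n*Sxy - Sx*Sy) / sqrt((n*Sxx - Sx**2) * (n*Syy - Sy**2))
    """
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of values")
    n = len(x)                                  # total number of paired values
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))   # sum of products x*y
    sum_xx = sum(a * a for a in x)              # sum of squared x values
    sum_yy = sum(b * b for b in y)              # sum of squared y values
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    return numerator / denominator

# Illustrative data: n = 5, sum_x = 15, sum_y = 20, sum_xy = 66,
# sum_xx = 55, sum_yy = 86, so r = 30 / sqrt(50 * 30)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = pearson_from_sums(x, y)
print(round(r, 4))  # 0.7746
```

The function divides by zero if either data set is constant, since a constant variable has no variation to correlate with.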
The assumptions of the correlation coefficient are:
- Normality means that the data sets to be correlated should approximate the normal distribution. In such normally distributed data, most data points tend to hover close to the mean.
- Homoscedasticity comes from the Greek prefix homo (‘same’), along with the Greek word skedastikos, which means ‘able to disperse’. Homoscedasticity means ‘equal variances’. It means that the variance of the error term is the same for all values of the independent variable. If the variance of the error term is smaller for a particular range of values of the independent variable and larger for another range of values, then there is a violation of homoscedasticity. It is quite easy to check for homoscedasticity visually, by looking at a scatter plot. If the points are scattered about equally widely around the line of best fit along its whole length, then the data is homoscedastic.
- Linearity simply means that the data follows a linear relationship. Again, this can be examined by looking at a scatter plot. If the data points have a straight line (and not a curve) relationship, then the data satisfies the linearity assumption.
- Continuous variables are those that can take any value within an interval. Ratio variables are also continuous variables. To compute Karl Pearson’s Coefficient of Correlation, both data sets must contain continuous variables. If even one of the data sets is ordinal, then Spearman’s Coefficient of Rank Correlation would be a more appropriate measure.
- Paired observations mean that every data point must be in pairs. That is, for every observation of the independent variable, there must be a corresponding observation of the dependent variable. We cannot compute the correlation coefficient if one data set has 12 observations and the other has 10 observations.
- No outliers should be present in the data. Outliers do not prevent the coefficient from being computed, but they can significantly skew it and make it misleading. When does a data point become an outlier? In general, a data point that lies more than 3.29 standard deviations from the mean, in either direction, is considered to be an outlier. Outliers are often easy to spot visually on a scatter plot.
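Two of the assumptions above, paired observations and the absence of outliers, lend themselves to a quick programmatic check. The sketch below is one possible way to do it, using the 3.29 standard deviation cutoff quoted in the list. The function names and data are hypothetical, and the sample standard deviation is assumed.

```python
import statistics

Z_CUTOFF = 3.29  # cutoff in standard deviations, as quoted in the text

def check_paired(xs, ys):
    """Raise ValueError unless both data sets have the same number of observations."""
    if len(xs) != len(ys):
        raise ValueError(
            f"unpaired data: {len(xs)} observations versus {len(ys)}"
        )

def outlier_indices(values, z_cutoff=Z_CUTOFF):
    """Return positions of values lying more than z_cutoff standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    if sd == 0:
        return []                  # constant data has no outliers by this rule
    return [i for i, v in enumerate(values) if abs(v - mean) / sd > z_cutoff]

# Illustrative data: y follows x except for one extreme final value
xs = list(range(1, 21))
ys = list(range(1, 20)) + [200]

check_paired(xs, ys)        # passes: 20 observations each
flagged = outlier_indices(ys)
print(flagged)              # [19], the position of the value 200

try:
    check_paired(list(range(12)), list(range(10)))
except ValueError as err:
    print(err)              # unpaired data: 12 observations versus 10
```

Note that a z-score rule like this can only flag points in reasonably large samples. With about a dozen observations or fewer, no single value can lie 3.29 sample standard deviations from the mean, so a scatter plot remains the more reliable check for small data sets.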