Psychology 105
Richard Lowry
©1999-2002

Some Basic Statistical Concepts and Methods
for the Introductory Psychology Course
Part 7


The Measurement of Linear Correlation

   The primary measure of linear correlation is the Pearson product-moment correlation coefficient, symbolized by the lower-case Roman letter r, which ranges in value from r=+1.0 for a perfect positive correlation to r=-1.0 for a perfect negative correlation. The midpoint of its range, r=0.0, corresponds to a complete absence of correlation. Values falling between r=0.0 and r=+1.0 represent varying degrees of positive correlation, while those falling between r=0.0 and r=-1.0 represent varying degrees of negative correlation.

   A closely related companion measure of linear correlation is the coefficient of determination, symbolized as r2, which is simply the square of the correlation coefficient. The coefficient of determination can have only positive values, ranging from r2=+1.0 for a perfect correlation (positive or negative) down to r2=0.0 for a complete absence of correlation. The advantage of the correlation coefficient, r, is that it can have either a positive or a negative sign and thus provide an indication of the positive or negative direction of the correlation. The advantage of the coefficient of determination, r2, is that it provides an equal interval and ratio scale measure of the strength of the correlation. In effect, the correlation coefficient, r, gives you the true direction of the correlation (+ or -) but only the square root of the strength of the correlation; while the coefficient of determination, r2, gives you the true strength of the correlation but without an indication of its direction. Both of them together give you the whole works.

   We will examine the details of calculation for these two measures in a moment, but first a bit more by way of introducing the general concepts. Figure 7.1 shows four specific examples of r and r2, each produced by taking two very simple sets of X and Y values, namely

      Xi = {1, 2, 3, 4, 5, 6}  and  Yi = {2, 4, 6, 8, 10, 12}

and pairing them up in one or another of four different ways. In Example I they are paired in such a way as to produce a perfect positive correlation, resulting in a correlation coefficient of r=+1.0 and a coefficient of determination of r2=1.0. In Example II the pairing produces a somewhat looser positive correlation that yields a correlation coefficient of r=+0.66 and a coefficient of determination of r2=0.44. For purposes of interpretation, you can translate the coefficient of determination into terms of percentages (i.e., percentage = r2 x 100), which will then allow you to say such things as, for example, that the correlation in Example I (r2=1.0) is 100% as strong as it possibly could be, given these particular values of Xi and Yi, whereas the one in Example II (r2=0.44) is only 44% as strong as it possibly could be. Alternatively, you could say that the looser positive correlation of Example II is only 44% as strong as the perfect one shown in Example I. The essential meaning of "strength of correlation" in this context is that such-and-such percentage of the variability of Y is associated with (tied to, linked with, coupled with) variability in X, and vice versa. Thus, for Example I, 100% of the variability in Y is coupled with variability in X; whereas, in Example II, only 44% of the variability in Y is linked with variability in X.

Figure 7.1. Four Different Pairings of the Same Values of X and Y



   The correlations shown in Examples III and IV are obviously mirror images of the ones just described. For Example III the six values of Xi and Yi are paired in such a way as to produce a perfect negative correlation, which yields a correlation coefficient of r=-1.0 and a coefficient of determination of r2=1.0. In Example IV the pairing produces a looser negative correlation, resulting in a correlation coefficient of r=-0.66 and a coefficient of determination of r2=0.44. Here again you can say for Example III that 100% of the variability in Y is coupled with variability in X; whereas for Example IV only 44% of the variability in Y is linked with variability in X. You can also go further and say that the perfect positive and negative correlations in Examples I and III are of equal strength (for both, r2=1.0) but in opposite directions; and similarly, that the looser positive and negative correlations in Examples II and IV are of equal strength (for both, r2=0.44) but in opposite directions.

   To illustrate the next point in closer detail, we will focus for a moment on the particular pairing of Xi and Yi values that produced the positive correlation shown in Example II of Figure 7.1.

Pair    Xi    Yi
 a       1     6
 b       2     2
 c       3     4
 d       4    10
 e       5    12
 f       6     8


When you perform the computational procedures for linear correlation and regression, what you are essentially doing is defining the straight line that best fits the bivariate distribution of data points, as shown in the following version of the same graph. This line is spoken of as the regression line, or line of regression, and the criterion for "best fit" is that the sum of the squared vertical distances between the data points and the regression line must be as small as possible.


As it happens, this line of best fit will in every instance pass through the point at which the mean of X and the mean of Y intersect on the graph. In the present example, the mean of X is 3.5 and the mean of Y is 7.0, so the line of best fit passes through the point (3.5, 7.0).
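(If you happen to have Python with NumPy at hand, the following minimal sketch lets you check this claim for the Example II data; the code and variable names are ours, not part of the text. NumPy's polyfit routine with degree 1 finds precisely the line that minimizes the sum of squared vertical distances.)

    import numpy as np

    # Example II data: the six values of X paired with Y as in the table above
    x = np.array([1, 2, 3, 4, 5, 6], dtype=float)
    y = np.array([6, 2, 4, 10, 12, 8], dtype=float)

    # Degree-1 polyfit minimizes the sum of squared vertical distances,
    # i.e., it finds the least-squares line y = slope*x + intercept
    slope, intercept = np.polyfit(x, y, 1)

    mx, my = x.mean(), y.mean()                      # MX = 3.5, MY = 7.0
    print(slope, intercept)                          # the fitted slope and intercept
    print(np.isclose(slope * mx + intercept, my))    # True: the line passes through (MX, MY)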

   The details of this line—in particular, where it begins on the Y axis and the rate at which it slants either upward or downward—will not be explicitly drawn out until we consider the regression side of correlation and regression. Nonetheless, they are implicitly present when you perform the computational procedures for the correlation side of the coin. As indicated above, the slant of the line upward or downward is what determines the sign of the correlation coefficient (r), positive or negative; and the degree to which the data points are lined up along the line, or scattered away from it, determines the strength of the correlation (r2).

   You have already encountered the general concept of variance for the case where you are describing the variation that exists among the variate instances of a single variable. The measurement of linear correlation requires an extension of this concept to the case where you are describing the co-variation that exists among the paired bivariate instances of two variables, X and Y, together. We have already touched upon the general concept. In positive correlation, high values of X tend to be associated with high values of Y, and low values of X tend to be associated with low values of Y. In negative correlation, it is the opposite: high values of X tend to be associated with low values of Y, and low values of X tend to be associated with high values of Y. In both cases, the phrase "tend to be associated" is another way of saying that the variability in X tends to be coupled with variability in Y, and vice versa—or, in brief, that X and Y tend to vary together. The raw measure of the tendency of two variables, X and Y, to co-vary is a quantity known as the covariance. As it happens, you will not need to be able to calculate the quantity of covariance in and of itself, because the destination we are aiming for, the calculation of r and r2, can be reached by way of a simpler shortcut. However, you will need to have at least the general concept of it; so keep in mind as we proceed through the next few paragraphs that covariance is a measure of the degree to which two variables, X and Y, co-vary.

   In its underlying logic, the Pearson product-moment correlation coefficient comes down to a simple ratio between (i) the amount of covariation between X and Y that is actually observed, and (ii) the amount of covariation that would exist if X and Y had a perfect (100%) positive correlation. Thus

    r = (observed covariance) / (maximum possible positive covariance)

   As it turns out, the quantity listed above as "maximum possible positive covariance" is precisely determined by the two separate variances of X and Y. This is for the simple reason that X and Y can co-vary, together, only in the degree that they vary separately. If either of the variables had zero variability (for example, if the values of Xi were all the same), then clearly they could not co-vary. Specifically, the maximum possible positive covariance that can exist between two variables is equal to the geometric mean of the two separate variances.
For any n numerical values, a, b, c, etc., the geometric mean is the nth root of the product of those values. Thus, the geometric mean of a and b would be the square root of a x b; the geometric mean of a, b, and c would be the cube root of a x b x c; and so on.


So the structure of the relationship now comes down to

    r = (observed covariance) / sqrt[(varianceX) x (varianceY)]

Recall that "sqrt" means "the square root of."

   Although in principle this relationship involves two variances and a covariance, in practice, through the magic of algebraic manipulation, it boils down to something that is computationally much simpler. In the following formulation you will immediately recognize the meaning of SSX, which is the sum of squared deviates for X; by extension, you will also be able to recognize SSY, which is the sum of squared deviates for Y.
In order to get from the formula above to the one below, you will need to recall that the variance (s2) of a set of values is simply the average of the squared deviates: SS/N.


The third item, SCXY, denotes a quantity that we will speak of as the sum of co-deviates; and as you can no doubt surmise from the name, it is something very closely akin to a sum of squared deviates. SSX is the raw measure of the variability among the values of Xi; SSY is the raw measure of the variability among the values of Yi; and SCXY is the raw measure of the co-variability of X and Y together.

    r = SCXY / sqrt[SSX x SSY]



To understand this kinship, recall from Part 4 precisely what is meant by the term "deviate."

   For any particular item in a set of measures of the variable X,
    deviateX = Xi - MX
Similarly, for any particular item in a set of measures of the variable Y,
    deviateY = Yi - MY
As you have probably already guessed, a co-deviate pertaining to a particular pair of XY values involves the deviateX of the Xi member of the pair and the deviateY of the Yi member of the pair. The specific way in which these two are joined to form the co-deviate is

    co-deviateXY = (deviateX) x (deviateY)
And finally, the analogy between a co-deviate and a squared deviate:

For a value of Xi, the squared deviate is
   (deviateX) x (deviateX)
For a value of Yi it is
   (deviateY) x (deviateY)
And for a pair of Xi and Yi values, the co-deviate is
   (deviateX) x (deviateY)
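(Here, purely as an illustration and not as part of the original text, is a short plain-Python sketch that builds the deviates and co-deviates exactly as just defined, for the Example II data, and assembles them into r.)

    # Example II data
    x = [1, 2, 3, 4, 5, 6]
    y = [6, 2, 4, 10, 12, 8]
    n = len(x)

    mx = sum(x) / n                                   # MX = 3.5
    my = sum(y) / n                                   # MY = 7.0

    dev_x = [xi - mx for xi in x]                     # deviateX = Xi - MX
    dev_y = [yi - my for yi in y]                     # deviateY = Yi - MY

    ss_x = sum(d * d for d in dev_x)                         # sum of squared deviates for X: 17.5
    ss_y = sum(d * d for d in dev_y)                         # sum of squared deviates for Y: 70.0
    sc_xy = sum(dx * dy for dx, dy in zip(dev_x, dev_y))     # sum of co-deviates: 23.0

    r = sc_xy / (ss_x * ss_y) ** 0.5                  # 23.0 / sqrt(17.5 x 70.0)
    print(ss_x, ss_y, sc_xy, round(r, 2))             # 17.5 70.0 23.0 0.66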


   This should give you a sense of the underlying concepts. Just keep in mind, no matter what particular computational sequence you follow when you calculate the correlation coefficient, that what you are fundamentally calculating is the ratio

    r = (observed covariance) / (maximum possible positive covariance)

which, for computational purposes, comes down to

    r = SCXY / sqrt[SSX x SSY]


   Now for the nuts-and-bolts of it. Here, once again, is the particular pairing of Xi and Yi values that produced the positive correlation shown in Example II of Figure 7.1. But now we subject them to a bit of number-crunching, calculating the square of each value of Xi and Yi, along with the cross-product of each XiYi pair. These are the items that will be required for the calculation of the three summary quantities in the above formula: SSX, SSY, and SCXY.

Pair    Xi    Yi    Xi2    Yi2    XiYi
 a       1     6      1     36       6
 b       2     2      4      4       4
 c       3     4      9     16      12
 d       4    10     16    100      40
 e       5    12     25    144      60
 f       6     8     36     64      48
sums    21    42     91    364     170

SSX : sum of squared deviates for Xi values
   You saw in Part 4 that the sum of squared deviates for a set of Xi values can be calculated according to the computational formula
    SSX = ΣXi2 - (ΣXi)2/N
In the present example,
   N = 6  [because there are 6 values of Xi]
   ΣXi2 = 91
   ΣXi = 21
   (ΣXi)2 = (21)2 = 441
Thus:
   SSX = 91 - (441/6) = 17.5
SSY : sum of squared deviates for Yi values
   Similarly, the sum of squared deviates for a set of Yi values can be calculated according to the formula
    SSY = ΣYi2 - (ΣYi)2/N
In the present example,
   N = 6  [because there are 6 values of Yi]
   ΣYi2 = 364
   ΣYi = 42
   (ΣYi)2 = (42)2 = 1764
Thus:
   SSY = 364 - (1764/6) = 70.0
SCXY : sum of co-deviates for paired values of Xi and Yi
   A moment ago we observed that the sum of co-deviates for paired values of Xi and Yi is analogous to the sum of squared deviates for either of those variables separately. You will probably be able to see that this analogy also extends to the computational formula for the sum of co-deviates:
    SCXY = Σ(XiYi) - (ΣXi)(ΣYi)/N
Again, for the present example,
   N = 6  [because there are 6 XiYi pairs]
   ΣXi = 21
   ΣYi = 42
   (ΣXi)(ΣYi) = 21 x 42 = 882
   Σ(XiYi) = 170
Thus:
   SCXY = 170 - (882/6) = 23.0

Once you have these preliminaries,
   SSX = 17.5, SSY = 70.0, and SCXY = 23.0
you can then easily calculate the correlation coefficient as

    r = SCXY / sqrt[SSX x SSY] = 23.0 / sqrt[17.5 x 70.0] = +0.66

and the coefficient of determination as

r2 = (+0.66)2 = 0.44
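
(As a cross-check, here is the same worked example carried out in a few lines of Python, using the computational shortcut formulas above; the code is merely illustrative and the variable names are ours.)

    x = [1, 2, 3, 4, 5, 6]
    y = [6, 2, 4, 10, 12, 8]
    n = len(x)

    sum_x  = sum(x)                                   # 21
    sum_y  = sum(y)                                   # 42
    sum_x2 = sum(xi * xi for xi in x)                 # 91
    sum_y2 = sum(yi * yi for yi in y)                 # 364
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))     # 170

    ss_x  = sum_x2 - sum_x ** 2 / n                   # 91 - 441/6   = 17.5
    ss_y  = sum_y2 - sum_y ** 2 / n                   # 364 - 1764/6 = 70.0
    sc_xy = sum_xy - sum_x * sum_y / n                # 170 - 882/6  = 23.0

    r = sc_xy / (ss_x * ss_y) ** 0.5                  # 0.657..., reported as +0.66
    r_squared = round(r, 2) ** 2                      # squaring the rounded r: 0.4356, about 0.44
    print(round(r, 2), round(r_squared, 2))           # 0.66 0.44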

   To make sure you have a solid grasp of these matters, please take a moment to work your way through the details of Table 3.1, which will show the data and calculations for each of the examples of Figure 7.1. Recall that each example starts out with the same values of Xi and Yi; they differ only with respect to how these values are paired up with one another.

   Larger, more complex data sets will of course require something more laborious—but the general principles and specific computational procedures are precisely the same, either way. Consider, for example, the data set we referred to at the beginning of Part 6, pertaining to the correlation between the percentage of high school seniors taking the SAT versus average state score on the SAT.


   Table 3.2 shows the details of calculation for this data set. As you will see, it requires quite a large number of separate operations, many of which result in multi-digit numerical values. There was a time in the not too distant past when students of statistics had to perform calculations of this sort armed with nothing but paper, pencil, and patience, and it was a very laborious enterprise indeed. But that was then, and this is now. With a fairly inexpensive pocket calculator and a little practice in using it to its full advantage, you can perform complex calculations of this sort with a speed that would have made an earlier generation of statistics students weep with envy. With a computer spreadsheet, and again a little practice, you can do it even faster. With pre-packaged computer software, such as the linear correlation page on the VassarStats website, you can do it with as little time and effort as it takes you to enter the paired values of Xi and Yi.

   At any rate, once you perform the operations required to arrive at the following sums (from Table 3.2), all the rest is simple and straightforward:

Sums of       Xi        Yi        Xi2           Yi2          XiYi
           1,816    47,627    102,722    45,598,101     1,650,185

Given these results from the preliminary number-crunching, you could then (as shown in Table 3.2) easily calculate

    SSX = 36,764.88

    SSY = 231,478.42

    SCXY = -79,672.64

    r = -0.86

    r2 = 0.74
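
(The last step can again be checked with a couple of lines of Python; only the three summary quantities above are needed, and the negative sign of SCXY carries through to r. As before, this sketch is ours, not part of the original text.)

    ss_x  = 36764.88
    ss_y  = 231478.42
    sc_xy = -79672.64                         # negative: this is a negative correlation

    r = sc_xy / (ss_x * ss_y) ** 0.5
    print(round(r, 2))                        # -0.86
    print(round(round(r, 2) ** 2, 2))         # 0.74 (squaring the rounded r, as above)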





Go to Part 8 [The Interpretation of Correlation]