What is a correlation matrix?

This post gives a brief overview of correlation matrices and some best practices for building them, and tries to offer a simple way to understand the topic. We give an example of a correlation matrix, discuss some of its applications, and briefly touch on negative correlation and the coding of variables. Correlation statistics are important because they quantify how strongly variables are related and allow that relationship to be tested for statistical significance. Many statistical methods are available, and the right one can depend on how many variables you are studying and on how much data you have collected: a method suited to a large data set may not be appropriate for a small one.

The concept of correlation

In statistics, there is often a need to understand the relationship between two variables. This is where the concept of correlation comes in. Correlation can be defined as any statistical relationship between two random variables or bivariate data. For instance, a researcher may want to determine the relationship between intelligence quotient and academic performance. He or she can achieve this by using a correlation coefficient. 

There are several correlation coefficients used in correlation statistics, but the most popular is Pearson’s correlation coefficient (developed by Karl Pearson, usually denoted r, and the default in the corr function of several statistical packages), which measures the linear relationship between a pair of variables. The Pearson coefficient takes values from -1 to +1 and is usually reported to two or three decimal places. The relationship is said to be positive if an increase in variable A tends to be accompanied by an increase in variable B. It is negative if an increase in variable A tends to be accompanied by a decrease in variable B. A value of zero indicates that no linear correlation exists between the two variables. The further the correlation coefficient is from 0, the stronger the linear relationship between the two variables. Partial correlations, which measure the relationship between two variables after controlling for others, are also possible.
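As a rough illustration of the coefficient described above, the sketch below computes Pearson’s r directly from its definition (the sum of products of deviations divided by the square root of the product of the sums of squared deviations). The function name pearson_r and the three small data sets are hypothetical, invented only for this example.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r from its definition, for two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical data, made up for illustration only.
hours_studied = [1, 2, 3, 4, 5]
exam_score = [52, 58, 61, 70, 74]  # rises as hours_studied rises
hours_of_tv = [9, 7, 6, 4, 2]      # falls as hours_studied rises

r_pos = pearson_r(hours_studied, exam_score)   # close to +1
r_neg = pearson_r(hours_studied, hours_of_tv)  # close to -1
print(round(r_pos, 3), round(r_neg, 3))
```

NumPy’s np.corrcoef(x, y)[0, 1] should return the same value; the hand-written version is shown only to make the formula visible.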

The correlation coefficient possesses an important property that distinguishes it from an ordinary covariance: it has a mathematical lower bound of -1.0 and an upper bound of 1.0. This property makes it possible to compare correlation coefficients with one another, whereas ordinary covariances usually cannot be compared.

For example, if the correlation between the pair of variables A and B is 0.91 and that between A and C is 0.29, then it can be concluded that A is more strongly related to B than to C. However, if variables A and B have a covariance of 106.7 while A and C have a covariance of 14.4, no conclusion can be reached about the magnitude of the relationship. This is because the magnitude of a covariance depends upon the measurement scale of the variables. If the measurement scale for the variable C is shown to have significantly lower variance than that for variable B, then the variable A could actually be more strongly related to C than to B. The correlation coefficient avoids this interpretive problem by keeping all variables on the same measurement scale – the Z score with a mean of 0 and a standard deviation of 1.0.
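The scale dependence described above can be checked numerically. The sketch below uses synthetic data (the variables a and b and the factor of 100 are assumptions made for illustration) to show that changing the units of a variable changes the covariance but not the correlation, and that the correlation equals the covariance of the z-scored variables.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.5, size=200)  # b is strongly related to a

b_rescaled = b * 100.0  # the same variable in different units

cov_ab = np.cov(a, b)[0, 1]
cov_ab_rescaled = np.cov(a, b_rescaled)[0, 1]    # 100 times larger
r_ab = np.corrcoef(a, b)[0, 1]
r_ab_rescaled = np.corrcoef(a, b_rescaled)[0, 1]  # unchanged

def zscore(v):
    """Standardize to mean 0 and (sample) standard deviation 1."""
    return (v - v.mean()) / v.std(ddof=1)

# The covariance of the z-scores is the correlation coefficient.
cov_z = np.cov(zscore(a), zscore(b))[0, 1]

print(cov_ab, cov_ab_rescaled)
print(r_ab, r_ab_rescaled, cov_z)
```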

Sometimes, it may be necessary to scrutinize the relationship between more than two variables. This can be made possible through a correlation matrix.

Definition of a correlation matrix

A correlation matrix is a table of rows and columns that shows the extent of correlation between variables. Each cell holds the pairwise correlation coefficient of its row and column variables. Every cell on the table’s diagonal contains the number 1, because the correlation between a variable and itself is unity, i.e., perfect correlation (as depicted in the illustration below). The correlation matrix is symmetric and is a special type of covariance matrix: the covariance matrix of the standardized variables.

A correlation matrix

                            High exam  Studying   Attending  Participating  Purchasing
                            grades     for exams  lectures   in tutorials   textbooks
High exam grades            1.00       0.70       0.50       0.80           0.50
Studying for exams          0.70       1.00       0.70       0.95           0.50
Attending lectures          0.50       0.70       1.00       0.71           0.40
Participating in tutorials  0.80       0.95       0.71       1.00           0.77
Purchasing textbooks        0.50       0.50       0.40       0.77           1.00

Properties of a correlation matrix

  • Correlation matrices are symmetrical
  • The diagonal elements are all equal to 1, since the correlation between a variable and itself is 1
  • The numerical value of all the coefficients/elements must range from –1 to +1
  • All eigenvalues MUST be non-negative
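The four properties listed above can be checked mechanically. The following is a minimal sketch, not a definitive validator: the function name is_correlation_matrix and the tolerance of 1e-8 are assumptions chosen for illustration, and the data are random numbers.

```python
import numpy as np

def is_correlation_matrix(m, tol=1e-8):
    """Return True if m satisfies the four properties, within tolerance tol."""
    m = np.asarray(m, dtype=float)
    if m.ndim != 2 or m.shape[0] != m.shape[1]:
        return False
    symmetric = np.allclose(m, m.T, atol=tol)
    unit_diagonal = np.allclose(np.diag(m), 1.0, atol=tol)
    in_range = bool(np.all(np.abs(m) <= 1.0 + tol))
    if not (symmetric and unit_diagonal and in_range):
        return False
    # eigvalsh is intended for symmetric matrices and returns real eigenvalues.
    eigenvalues = np.linalg.eigvalsh(m)
    return bool(eigenvalues.min() >= -tol)

# A correlation matrix computed from (random, illustrative) data:
# 100 observations of 4 variables, one variable per column.
rng = np.random.default_rng(1)
data = rng.normal(size=(100, 4))
corr = np.corrcoef(data, rowvar=False)

print(is_correlation_matrix(corr))
print(is_correlation_matrix([[1.0, 0.5], [0.4, 1.0]]))  # not symmetric
```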

The scope of a correlation matrix

 In terms of usage, a correlation matrix mainly serves three purposes:

  • To quantitatively summarize large data in order to identify correlation patterns.
  • To support other quantitative analyses, e.g., linear regression, exploratory factor analysis, confirmatory factor analysis, and structural equation models.
  • To help diagnose possible shortcomings in other analyses. In linear regression, for example, a high correlation between predictor variables (multicollinearity) is an indication that the coefficient estimates may be unreliable.
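As one way to use a correlation matrix for the diagnostic purpose in the last bullet, the sketch below scans the upper triangle for pairs of predictors whose correlation is large in absolute value. The helper name highly_correlated_pairs, the threshold of 0.9, and the synthetic predictors x1, x2, and x3 are all assumptions made for this example; a suitable threshold depends on the analysis.

```python
import numpy as np

def highly_correlated_pairs(corr, names, threshold=0.9):
    """List (name_i, name_j, r) for every pair with |r| >= threshold."""
    corr = np.asarray(corr, dtype=float)
    pairs = []
    n = corr.shape[0]
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; the matrix is symmetric
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

rng = np.random.default_rng(2)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.1, size=300)  # nearly a copy of x1
x3 = rng.normal(size=300)                  # unrelated to x1 and x2

corr = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
flagged = highly_correlated_pairs(corr, ["x1", "x2", "x3"])
print(flagged)
```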

Pseudo correlation matrices and positive definite matrices

In correlation matrices, all eigenvalues are required to be non-negative. This requirement follows from the relationship between eigenvalues and variances. A given eigenvalue divided by the sum of all eigenvalues yields the proportion of variance associated with the specific direction or dimension defined by the associated eigenvector. Thus, a negative eigenvalue would imply a negative proportion of variance, which is impossible both conceptually and mathematically.
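To make the eigenvalue calculation above concrete, here is a small sketch that divides each eigenvalue by the sum of all eigenvalues. The 3 x 3 matrix is a hypothetical example chosen for illustration, not data from a real study.

```python
import numpy as np

# Hypothetical correlation matrix for three variables.
corr = np.array([
    [1.0, 0.6, 0.3],
    [0.6, 1.0, 0.5],
    [0.3, 0.5, 1.0],
])

# eigvalsh returns eigenvalues in ascending order; reverse to list the largest first.
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Proportion of variance associated with each eigenvector's direction.
proportions = eigenvalues / eigenvalues.sum()

print(eigenvalues)
print(proportions)
```

For a true correlation matrix every proportion is non-negative and the proportions sum to 1.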

However, it is not uncommon for a matrix to look like a correlation matrix because it has the first three mandatory properties (symmetry, perfect correlation between a variable and itself, and all elements ranging from -1 to +1) while lacking the fourth property, non-negative eigenvalues. Such a matrix is referred to as a pseudo-correlation matrix. For instance, a matrix with the eigenvalues 1.19, 2.01, 0.62, and -0.78 is a pseudo-correlation, or indefinite, matrix because it has at least one negative and at least one positive eigenvalue.
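A small constructed example may help here. The hypothetical matrix below is symmetric, has a unit diagonal, and keeps every element between -1 and +1, yet it is not a true correlation matrix: it claims that A is strongly positively related to B, and B to C, while A is strongly negatively related to C, which no real data set could produce.

```python
import numpy as np

# Hypothetical matrix for variables A, B, C (constructed for illustration).
pseudo = np.array([
    [ 1.0, 0.9, -0.9],
    [ 0.9, 1.0,  0.9],
    [-0.9, 0.9,  1.0],
])

# Eigenvalues in ascending order; these work out to -0.8, 1.9, and 1.9.
eigenvalues = np.linalg.eigvalsh(pseudo)
has_negative_eigenvalue = bool(eigenvalues.min() < 0)

print(eigenvalues)
print(has_negative_eigenvalue)
```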

Matrices that possess all four properties of a correlation matrix are known as true correlation matrices. In the language of matrix algebra, they are also known as positive semidefinite (PSD) matrices. While the eigenvalues of a PSD matrix are non-negative and may include zero, a stricter variant, the positive definite (PD) matrix, requires every eigenvalue to be strictly positive.

A correlation matrix with one or more eigenvalues that are exactly zero has directions or dimensions (defined by the corresponding eigenvectors) that explain zero proportion of the variance in the original variables. This situation may occur if a linear dependence exists among the variables in the correlation matrix, as can be the case if a researcher unintentionally includes a variable that is a linear combination of one or more other variables.

Though correlation matrices with one or more zero eigenvalues are true correlation matrices, they are problematic for many statistical software applications and may yield error messages for researchers trying to analyze them. To solve this problem, the researcher should scrutinize the data to find and remove the variable(s) that brought about the linear dependence.
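The linear-dependence problem described above can be reproduced with synthetic data. In the sketch below, the variables quiz, exam, and total are hypothetical, and the rank tolerance of 1e-8 is an assumption; total is deliberately constructed as the sum of the other two, so the correlation matrix has an eigenvalue that is zero up to rounding error. Dropping the redundant variable restores a positive definite matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
quiz = rng.normal(size=50)
exam = rng.normal(size=50)
total = quiz + exam  # a linear combination of the other two variables

data = np.column_stack([quiz, exam, total])
corr = np.corrcoef(data, rowvar=False)

eigenvalues = np.linalg.eigvalsh(corr)         # ascending; the first is ~0
rank = np.linalg.matrix_rank(corr, tol=1e-8)   # 2 rather than 3

# Remove the redundant column and recompute the correlation matrix.
corr_reduced = np.corrcoef(data[:, :2], rowvar=False)
eigenvalues_reduced = np.linalg.eigvalsh(corr_reduced)

print(eigenvalues, rank)
print(eigenvalues_reduced)
```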

Advanced analyses 

Let us know if you want more information on this topic. In the future, we might cover parametric correlations, the treatment of missing values, and computing correlations with statistical software or simpler tools like Microsoft Excel.